move $enable_debug down below checks for GCC to avoid setting CFLAGS

author Owen Taylor <otaylor@redhat.com>

Fri, 5 Jan 2001 21:22:47 +0000 (21:22 +0000)

committer Owen Taylor <otaylor@src.gnome.org>

Fri, 5 Jan 2001 21:22:47 +0000 (21:22 +0000)
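The gconvert.c hunks in this commit change how the iconv output buffer is sized: the initial size is rounded up to a multiple of 4 (plus one byte for the terminating nul), and on E2BIG the buffer is only grown when fewer than 16 bytes appear to remain. The following is a minimal standalone sketch of that arithmetic, not the code from the commit itself. The helper names initial_outbuf_size and grown_outbuf_size are hypothetical and exist only for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: initial output buffer size for an input of
 * len bytes. Rounds len up to a multiple of 4, then adds 1 byte
 * for the terminating nul, mirroring ((len + 3) & ~3) + 1. */
size_t initial_outbuf_size(size_t len)
{
    return ((len + 3) & ~(size_t)3) + 1;
}

/* Hypothetical helper: buffer size to use after iconv reports E2BIG.
 * If at least 16 bytes still look free, the size is left unchanged
 * (the E2BIG is assumed to come from an internal iconv buffer).
 * Otherwise the usable part (size minus the nul byte) is doubled. */
size_t grown_outbuf_size(size_t outbuf_size, size_t used)
{
    if (used + 16 > outbuf_size)
        return (outbuf_size - 1) * 2 + 1;
    return outbuf_size;
}
```

Under these assumptions a 5-byte input starts with a 9-byte buffer (8 usable bytes plus the nul). Doubling only the usable part keeps it a multiple of 4 after every growth step.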
diff --git a/ChangeLog b/ChangeLog

index 10269d5..6478f6a 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,40 @@
+Fri Jan  5 11:25:42 2001  Owen Taylor  <otaylor@redhat.com>
+
+       * configure.in (PACKAGE): move $enable_debug down below
+       checks for GCC to avoid setting CFLAGS prematurely,
+       change checks to avoid adding -g twice.
+
+       * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
+       0 termination.
+
+       * gutf8.c (g_utf8_to_ucs4): Terminate result with 0.
+
+       * tests/mainloop-test.c (main): Fix uses of 
+       g_main_loop_destroy().
+
+       * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
+       Tests for unicode-conversion code.
+
+       * gconvert.c (g_convert, g_convert_with_fallback): work around
+       a couple of GNU libc bugs.
+
+       * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
+       arguments to match g_convert(). Document.
+
+       * gunicode.[ch]: 
+         - Implement conversion functions to and from UTF-16
+         - Standardize unicode conversion functions on prototype like
+           g_convert.
+         - Add a lot of error checking to unicode conversion functions.
+
+       * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
+       variant of g_utf8_to_ucs4.
+
+       * gutf8.c (g_utf8_validate): 
+        - add g_return_if_fail (str != NULL).
+        - add checks for overlong strings, non-valid Unicode characters (>= 110000)
+          and single surrogates.
+
  2001-01-05  Tor Lillqvist  <tml@iki.fi>
  
         * testglib.c (main): Add test for g_path_skip_root().
diff --git a/ChangeLog.pre-2-0 b/ChangeLog.pre-2-0

index 10269d5..6478f6a 100644
--- a/ChangeLog.pre-2-0
+++ b/ChangeLog.pre-2-0
@@ -1,3 +1,40 @@
+Fri Jan  5 11:25:42 2001  Owen Taylor  <otaylor@redhat.com>
+
+       * configure.in (PACKAGE): move $enable_debug down below
+       checks for GCC to avoid setting CFLAGS prematurely,
+       change checks to avoid adding -g twice.
+
+       * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
+       0 termination.
+
+       * gutf8.c (g_utf8_to_ucs4): Terminate result with 0.
+
+       * tests/mainloop-test.c (main): Fix uses of 
+       g_main_loop_destroy().
+
+       * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
+       Tests for unicode-conversion code.
+
+       * gconvert.c (g_convert, g_convert_with_fallback): work around
+       a couple of GNU libc bugs.
+
+       * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
+       arguments to match g_convert(). Document.
+
+       * gunicode.[ch]: 
+         - Implement conversion functions to and from UTF-16
+         - Standardize unicode conversion functions on prototype like
+           g_convert.
+         - Add a lot of error checking to unicode conversion functions.
+
+       * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
+       variant of g_utf8_to_ucs4.
+
+       * gutf8.c (g_utf8_validate): 
+        - add g_return_if_fail (str != NULL).
+        - add checks for overlong strings, non-valid Unicode characters (>= 110000)
+          and single surrogates.
+
  2001-01-05  Tor Lillqvist  <tml@iki.fi>
  
         * testglib.c (main): Add test for g_path_skip_root().
diff --git a/ChangeLog.pre-2-10 b/ChangeLog.pre-2-10

index 10269d5..6478f6a 100644
--- a/ChangeLog.pre-2-10
+++ b/ChangeLog.pre-2-10
@@ -1,3 +1,40 @@
+Fri Jan  5 11:25:42 2001  Owen Taylor  <otaylor@redhat.com>
+
+       * configure.in (PACKAGE): move $enable_debug down below
+       checks for GCC to avoid setting CFLAGS prematurely,
+       change checks to avoid adding -g twice.
+
+       * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
+       0 termination.
+
+       * gutf8.c (g_utf8_to_ucs4): Terminate result with 0.
+
+       * tests/mainloop-test.c (main): Fix uses of 
+       g_main_loop_destroy().
+
+       * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
+       Tests for unicode-conversion code.
+
+       * gconvert.c (g_convert, g_convert_with_fallback): work around
+       a couple of GNU libc bugs.
+
+       * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
+       arguments to match g_convert(). Document.
+
+       * gunicode.[ch]: 
+         - Implement conversion functions to and from UTF-16
+         - Standardize unicode conversion functions on prototype like
+           g_convert.
+         - Add a lot of error checking to unicode conversion functions.
+
+       * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
+       variant of g_utf8_to_ucs4.
+
+       * gutf8.c (g_utf8_validate): 
+        - add g_return_if_fail (str != NULL).
+        - add checks for overlong strings, non-valid Unicode characters (>= 110000)
+          and single surrogates.
+
  2001-01-05  Tor Lillqvist  <tml@iki.fi>
  
         * testglib.c (main): Add test for g_path_skip_root().
diff --git a/ChangeLog.pre-2-12 b/ChangeLog.pre-2-12

index 10269d5..6478f6a 100644
--- a/ChangeLog.pre-2-12
+++ b/ChangeLog.pre-2-12
@@ -1,3 +1,40 @@
+Fri Jan  5 11:25:42 2001  Owen Taylor  <otaylor@redhat.com>
+
+       * configure.in (PACKAGE): move $enable_debug down below
+       checks for GCC to avoid setting CFLAGS prematurely,
+       change checks to avoid adding -g twice.
+
+       * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
+       0 termination.
+
+       * gutf8.c (g_utf8_to_ucs4): Terminate result with 0.
+
+       * tests/mainloop-test.c (main): Fix uses of 
+       g_main_loop_destroy().
+
+       * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
+       Tests for unicode-conversion code.
+
+       * gconvert.c (g_convert, g_convert_with_fallback): work around
+       a couple of GNU libc bugs.
+
+       * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
+       arguments to match g_convert(). Document.
+
+       * gunicode.[ch]: 
+         - Implement conversion functions to and from UTF-16
+         - Standardize unicode conversion functions on prototype like
+           g_convert.
+         - Add a lot of error checking to unicode conversion functions.
+
+       * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
+       variant of g_utf8_to_ucs4.
+
+       * gutf8.c (g_utf8_validate): 
+        - add g_return_if_fail (str != NULL).
+        - add checks for overlong strings, non-valid Unicode characters (>= 110000)
+          and single surrogates.
+
  2001-01-05  Tor Lillqvist  <tml@iki.fi>
  
         * testglib.c (main): Add test for g_path_skip_root().
diff --git a/ChangeLog.pre-2-2 b/ChangeLog.pre-2-2

index 10269d5..6478f6a 100644
--- a/ChangeLog.pre-2-2
+++ b/ChangeLog.pre-2-2
@@ -1,3 +1,40 @@
+Fri Jan  5 11:25:42 2001  Owen Taylor  <otaylor@redhat.com>
+
+       * configure.in (PACKAGE): move $enable_debug down below
+       checks for GCC to avoid setting CFLAGS prematurely,
+       change checks to avoid adding -g twice.
+
+       * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
+       0 termination.
+
+       * gutf8.c (g_utf8_to_ucs4): Terminate result with 0.
+
+       * tests/mainloop-test.c (main): Fix uses of 
+       g_main_loop_destroy().
+
+       * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
+       Tests for unicode-conversion code.
+
+       * gconvert.c (g_convert, g_convert_with_fallback): work around
+       a couple of GNU libc bugs.
+
+       * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
+       arguments to match g_convert(). Document.
+
+       * gunicode.[ch]: 
+         - Implement conversion functions to and from UTF-16
+         - Standardize unicode conversion functions on prototype like
+           g_convert.
+         - Add a lot of error checking to unicode conversion functions.
+
+       * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
+       variant of g_utf8_to_ucs4.
+
+       * gutf8.c (g_utf8_validate): 
+        - add g_return_if_fail (str != NULL).
+        - add checks for overlong strings, non-valid Unicode characters (>= 110000)
+          and single surrogates.
+
  2001-01-05  Tor Lillqvist  <tml@iki.fi>
  
         * testglib.c (main): Add test for g_path_skip_root().
diff --git a/ChangeLog.pre-2-4 b/ChangeLog.pre-2-4

index 10269d5..6478f6a 100644
--- a/ChangeLog.pre-2-4
+++ b/ChangeLog.pre-2-4
@@ -1,3 +1,40 @@
+Fri Jan  5 11:25:42 2001  Owen Taylor  <otaylor@redhat.com>
+
+       * configure.in (PACKAGE): move $enable_debug down below
+       checks for GCC to avoid setting CFLAGS prematurely,
+       change checks to avoid adding -g twice.
+
+       * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
+       0 termination.
+
+       * gutf8.c (g_utf8_to_ucs4): Terminate result with 0.
+
+       * tests/mainloop-test.c (main): Fix uses of 
+       g_main_loop_destroy().
+
+       * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
+       Tests for unicode-conversion code.
+
+       * gconvert.c (g_convert, g_convert_with_fallback): work around
+       a couple of GNU libc bugs.
+
+       * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
+       arguments to match g_convert(). Document.
+
+       * gunicode.[ch]: 
+         - Implement conversion functions to and from UTF-16
+         - Standardize unicode conversion functions on prototype like
+           g_convert.
+         - Add a lot of error checking to unicode conversion functions.
+
+       * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
+       variant of g_utf8_to_ucs4.
+
+       * gutf8.c (g_utf8_validate): 
+        - add g_return_if_fail (str != NULL).
+        - add checks for overlong strings, non-valid Unicode characters (>= 110000)
+          and single surrogates.
+
  2001-01-05  Tor Lillqvist  <tml@iki.fi>
  
         * testglib.c (main): Add test for g_path_skip_root().
diff --git a/ChangeLog.pre-2-6 b/ChangeLog.pre-2-6

index 10269d5..6478f6a 100644
--- a/ChangeLog.pre-2-6
+++ b/ChangeLog.pre-2-6
@@ -1,3 +1,40 @@
+Fri Jan  5 11:25:42 2001  Owen Taylor  <otaylor@redhat.com>
+
+       * configure.in (PACKAGE): move $enable_debug down below
+       checks for GCC to avoid setting CFLAGS prematurely,
+       change checks to avoid adding -g twice.
+
+       * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
+       0 termination.
+
+       * gutf8.c (g_utf8_to_ucs4): Terminate result with 0.
+
+       * tests/mainloop-test.c (main): Fix uses of 
+       g_main_loop_destroy().
+
+       * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
+       Tests for unicode-conversion code.
+
+       * gconvert.c (g_convert, g_convert_with_fallback): work around
+       a couple of GNU libc bugs.
+
+       * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
+       arguments to match g_convert(). Document.
+
+       * gunicode.[ch]: 
+         - Implement conversion functions to and from UTF-16
+         - Standardize unicode conversion functions on prototype like
+           g_convert.
+         - Add a lot of error checking to unicode conversion functions.
+
+       * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
+       variant of g_utf8_to_ucs4.
+
+       * gutf8.c (g_utf8_validate): 
+        - add g_return_if_fail (str != NULL).
+        - add checks for overlong strings, non-valid Unicode characters (>= 110000)
+          and single surrogates.
+
  2001-01-05  Tor Lillqvist  <tml@iki.fi>
  
         * testglib.c (main): Add test for g_path_skip_root().
diff --git a/ChangeLog.pre-2-8 b/ChangeLog.pre-2-8

index 10269d5..6478f6a 100644
--- a/ChangeLog.pre-2-8
+++ b/ChangeLog.pre-2-8
@@ -1,3 +1,40 @@
+Fri Jan  5 11:25:42 2001  Owen Taylor  <otaylor@redhat.com>
+
+       * configure.in (PACKAGE): move $enable_debug down below
+       checks for GCC to avoid setting CFLAGS prematurely,
+       change checks to avoid adding -g twice.
+
+       * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
+       0 termination.
+
+       * gutf8.c (g_utf8_to_ucs4): Terminate result with 0.
+
+       * tests/mainloop-test.c (main): Fix uses of 
+       g_main_loop_destroy().
+
+       * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
+       Tests for unicode-conversion code.
+
+       * gconvert.c (g_convert, g_convert_with_fallback): work around
+       a couple of GNU libc bugs.
+
+       * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
+       arguments to match g_convert(). Document.
+
+       * gunicode.[ch]: 
+         - Implement conversion functions to and from UTF-16
+         - Standardize unicode conversion functions on prototype like
+           g_convert.
+         - Add a lot of error checking to unicode conversion functions.
+
+       * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
+       variant of g_utf8_to_ucs4.
+
+       * gutf8.c (g_utf8_validate): 
+        - add g_return_if_fail (str != NULL).
+        - add checks for overlong strings, non-valid Unicode characters (>= 110000)
+          and single surrogates.
+
  2001-01-05  Tor Lillqvist  <tml@iki.fi>
  
         * testglib.c (main): Add test for g_path_skip_root().
diff --git a/configure.in b/configure.in

index f68f794..58562be 100644
--- a/configure.in
+++ b/configure.in
@@ -114,15 +114,6 @@ if test "x$enable_threads" != "xyes"; then
    enable_threads=no
  fi
  
-if test "x$enable_debug" = "xyes"; then
-  test "$cflags_set" = set || CFLAGS="$CFLAGS -g"
-  GLIB_DEBUG_FLAGS="-DG_ENABLE_DEBUG"
-else
-  if test "x$enable_debug" = "xno"; then
-    GLIB_DEBUG_FLAGS="-DG_DISABLE_ASSERT -DG_DISABLE_CHECKS"
-  fi
-fi
-
  AC_DEFINE_UNQUOTED(G_COMPILED_WITH_DEBUGGING, "${enable_debug}",
         [Whether glib was compiled with debugging enabled])
  
@@ -154,6 +145,21 @@ AC_PROG_CC
  AM_PROG_CC_STDC
  AC_PROG_INSTALL
  
+if test "x$enable_debug" = "xyes"; then
+  if test x$cflags_set != xset ; then
+      case " $CFLAGS " in
+      *[[\ \   ]]-g[[\ \       ]]*) ;;
+      *) CFLAGS="$CFLAGS -g" ;;
+      esac
+  fi
+       
+  GLIB_DEBUG_FLAGS="-DG_ENABLE_DEBUG"
+else
+  if test "x$enable_debug" = "xno"; then
+    GLIB_DEBUG_FLAGS="-DG_DISABLE_ASSERT -DG_DISABLE_CHECKS"
+  fi
+fi
+
  # define a MAINT-like variable REBUILD which is set if Perl
  # and awk are found, so autogenerated sources can be rebuilt
  AC_PROG_AWK
diff --git a/gconvert.c b/gconvert.c

index 2169b6d..344902f 100644
--- a/gconvert.c
+++ b/gconvert.c
@@ -170,7 +170,11 @@ g_convert (const gchar *str,
  
    p = str;
    inbytes_remaining = len;
-  outbuf_size = len + 1; /* + 1 for nul in case len == 1 */
+
+  /* Due to a GLIBC bug, round outbuf_size up to a multiple of 4 */
+  /* + 1 for nul in case len == 1 */
+  outbuf_size = ((len + 3) & ~3) + 1;
+  
    outbytes_remaining = outbuf_size - 1; /* -1 for nul */
    outp = dest = g_malloc (outbuf_size);
  
@@ -188,11 +192,20 @@ g_convert (const gchar *str,
         case E2BIG:
           {
             size_t used = outp - dest;
-           outbuf_size *= 2;
-           dest = g_realloc (dest, outbuf_size);
  
  
-           outp = dest + used;
-           outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */
+           /* glibc's iconv can return E2BIG even if there is space
+            * remaining if an internal buffer is exhausted. The
+            * folllowing is a heuristic to catch this. The 16 is
+            * pretty arbitrary.
+            */
+           if (used + 16 > outbuf_size)
+             {
+               outbuf_size = (outbuf_size - 1) * 2 + 1;
+               dest = g_realloc (dest, outbuf_size);
+               
+               outp = dest + used;
+               outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */
+             }
  
             goto again;
           }
@@ -353,7 +366,9 @@ g_convert_with_fallback (const gchar *str,
     * for the original string while we are converting the fallback
     */
    p = utf8;
-  outbuf_size = len + 1; /* + 1 for nul in case len == 1 */
+  /* Due to a GLIBC bug, round outbuf_size up to a multiple of 4 */
+  /* + 1 for nul in case len == 1 */
+  outbuf_size = ((len + 3) & ~3) + 1;
    outbytes_remaining = outbuf_size - 1; /* -1 for nul */
    outp = dest = g_malloc (outbuf_size);
  
@@ -373,11 +388,20 @@ g_convert_with_fallback (const gchar *str,
             case E2BIG:
               {
                 size_t used = outp - dest;
-               outbuf_size *= 2;
-               dest = g_realloc (dest, outbuf_size);
-               
-               outp = dest + used;
-               outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */
+
+               /* glibc's iconv can return E2BIG even if there is space
+                * remaining if an internal buffer is exhausted. The
+                * folllowing is a heuristic to catch this. The 16 is
+                * pretty arbitrary.
+                */
+               if (used + 16 > outbuf_size)
+                 {
+                   outbuf_size = (outbuf_size - 1) * 2 + 1;
+                   dest = g_realloc (dest, outbuf_size);
+                   
+                   outp = dest + used;
+                   outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */
+                 }
                 
                 break;
               }
@@ -458,18 +482,44 @@ g_convert_with_fallback (const gchar *str,
  /*
   * g_locale_to_utf8
   *
+ * 
+ */
+
+/**
+ * g_locale_to_utf8:
+ * @opsysstring:   a string in the encoding of the current locale
+ * @len:           the length of the string, or -1 if the string is
+ *                 NULL-terminated.
+ * @bytes_read:    location to store the number of bytes in the
+ *                 input string that were successfully converted, or %NULL.
+ *                 Even if the conversion was succesful, this may be 
+ *                 less than len if there were partial characters
+ *                 at the end of the input. If the error
+ *                 G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value
+ *                 stored will the byte fofset after the last valid
+ *                 input sequence.
+ * @bytes_written: the stored in the output buffer (not including the
+ *                 terminating nul.
+ * @error: location to store the error occuring, or %NULL to ignore
+ *                 errors. Any of the errors in #GConvertError may occur.
+ * 
   * Converts a string which is in the encoding used for strings by
   * the C runtime (usually the same as that used by the operating
   * system) in the current locale into a UTF-8 string.
- */
-
+ * 
+ * Return value: The converted string, or %NULL on an error.
+ **/
  gchar *
-g_locale_to_utf8 (const gchar *opsysstring, GError **error)
+g_locale_to_utf8 (const gchar  *opsysstring,
+                 gint          len,
+                 gint         *bytes_read,
+                 gint         *bytes_written,
+                 GError      **error)
  {
  #ifdef G_OS_WIN32
  
-  gint i, clen, wclen, first;
-  const gint len = strlen (opsysstring);
+  gint i, clen, total_len, wclen, first;
+  const gint len = len < 0 ? strlen (opsysstring) : len;
    wchar_t *wcs, wc;
    gchar *result, *bp;
    const wchar_t *wcp;
@@ -478,26 +528,26 @@ g_locale_to_utf8 (const gchar *opsysstring, GError **error)
    wclen = MultiByteToWideChar (CP_ACP, 0, opsysstring, len, wcs, len);
  
    wcp = wcs;
-  clen = 0;
+  total_len = 0;
    for (i = 0; i < wclen; i++)
      {
        wc = *wcp++;
  
        if (wc < 0x80)
-       clen += 1;
+       total_len += 1;
        else if (wc < 0x800)
-       clen += 2;
+       total_len += 2;
        else if (wc < 0x10000)
-       clen += 3;
+       total_len += 3;
        else if (wc < 0x200000)
-       clen += 4;
+       total_len += 4;
        else if (wc < 0x4000000)
-       clen += 5;
+       total_len += 5;
        else
-       clen += 6;
+       total_len += 6;
      }
  
-  result = g_malloc (clen + 1);
+  result = g_malloc (total_len + 1);
    
    wcp = wcs;
    bp = result;
@@ -553,6 +603,11 @@ g_locale_to_utf8 (const gchar *opsysstring, GError **error)
  
    g_free (wcs);
  
+  if (bytes_read)
+    *bytes_read = len;
+  if (bytes_written)
+    *bytes_written = total_len;
+  
    return result;
  
  #else
@@ -562,26 +617,48 @@ g_locale_to_utf8 (const gchar *opsysstring, GError **error)
    if (g_get_charset (&charset))
      return g_strdup (opsysstring);
  
-  str = g_convert (opsysstring, strlen (opsysstring), 
-                  "UTF-8", charset, NULL, NULL, error);
+  str = g_convert (opsysstring, len, 
+                  "UTF-8", charset, bytes_read, bytes_written, error);
    
    return str;
  #endif
  }
  
-/*
- * g_locale_from_utf8
- *
- * The reverse of g_locale_to_utf8.
- */
-
+/**
+ * g_locale_from_utf8:
+ * @utf8string:    a UTF-8 encoded string 
+ * @len:           the length of the string, or -1 if the string is
+ *                 NULL-terminated.
+ * @bytes_read:    location to store the number of bytes in the
+ *                 input string that were successfully converted, or %NULL.
+ *                 Even if the conversion was succesful, this may be 
+ *                 less than len if there were partial characters
+ *                 at the end of the input. If the error
+ *                 G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value
+ *                 stored will the byte fofset after the last valid
+ *                 input sequence.
+ * @bytes_written: the stored in the output buffer (not including the
+ *                 terminating nul.
+ * @error: location to store the error occuring, or %NULL to ignore
+ *                 errors. Any of the errors in #GConvertError may occur.
+ * 
+ * Converts a string from UTF-8 to the encoding used for strings by
+ * the C runtime (usually the same as that used by the operating
+ * system) in the current locale.
+ * 
+ * Return value: The converted string, or %NULL on an error.
+ **/
  gchar *
-g_locale_from_utf8 (const gchar *utf8string, GError **error)
+g_locale_from_utf8 (const gchar *utf8string,
+                   gint         len,
+                   gint        *bytes_read,
+                   gint        *bytes_written,
+                   GError     **error)
  {
  #ifdef G_OS_WIN32
  
    gint i, mask, clen, mblen;
-  const gint len = strlen (utf8string);
+  const gint len = len < 0 ? strlen (utf8string) : len;
    wchar_t *wcs, *wcp;
    gchar *result;
    guchar *cp, *end, c;
@@ -671,6 +748,11 @@ g_locale_from_utf8 (const gchar *utf8string, GError **error)
    result[mblen] = 0;
    g_free (wcs);
  
+  if (bytes_read)
+    *bytes_read = len;
+  if (bytes_written)
+    *bytes_written = mblen;
+  
    return result;
  
  #else
@@ -681,39 +763,123 @@ g_locale_from_utf8 (const gchar *utf8string, GError **error)
      return g_strdup (utf8string);
  
    str = g_convert (utf8string, strlen (utf8string), 
-                  charset, "UTF-8", NULL, NULL, error);
+                  charset, "UTF-8", bytes_read, bytes_written, error);
  
    return str;
    
  #endif
  }
  
-/* Filenames are in UTF-8 unless specificially requested otherwise */
-
+/**
+ * g_filename_to_utf8:
+ * @opsysstring:   a string in the encoding for filenames
+ * @len:           the length of the string, or -1 if the string is
+ *                 NULL-terminated.
+ * @bytes_read:    location to store the number of bytes in the
+ *                 input string that were successfully converted, or %NULL.
+ *                 Even if the conversion was succesful, this may be 
+ *                 less than len if there were partial characters
+ *                 at the end of the input. If the error
+ *                 G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value
+ *                 stored will the byte fofset after the last valid
+ *                 input sequence.
+ * @bytes_written: the stored in the output buffer (not including the
+ *                 terminating nul.
+ * @error: location to store the error occuring, or %NULL to ignore
+ *                 errors. Any of the errors in #GConvertError may occur.
+ * 
+ * Converts a string which is in the encoding used for filenames
+ * into a UTF-8 string.
+ * 
+ * Return value: The converted string, or %NULL on an error.
+ **/
  gchar*
-g_filename_to_utf8 (const gchar *string, GError **error)
-
+g_filename_to_utf8 (const gchar *opsysstring, 
+                   gint         len,
+                   gint        *bytes_read,
+                   gint        *bytes_written,
+                   GError     **error)
  {
  #ifdef G_OS_WIN32
-  return g_locale_to_utf8 (string, error);
+  return g_locale_to_utf8 (opsysstring, len,
+                          bytes_read, bytes_written,
+                          error);
  #else
    if (getenv ("G_BROKEN_FILENAMES"))
-    return g_locale_to_utf8 (string, error);
+    return g_locale_to_utf8 (opsysstring, len,
+                            bytes_read, bytes_written,
+                            error);
  
  
-  return g_strdup (string);
+  if (bytes_read || bytes_written)
+    {
+      gint len = strlen (opsysstring);
+
+      if (bytes_read)
+       *bytes_read = len;
+      if (bytes_written)
+       *bytes_written = len;
+    }
+  
+  if (len < 0)
+    return g_strdup (opsysstring);
+  else
+    return g_strndup (opsysstring, len);
  #endif
  }
  
+/**
+ * g_filename_from_utf8:
+ * @utf8string:    a UTF-8 encoded string 
+ * @len:           the length of the string, or -1 if the string is
+ *                 NULL-terminated.
+ * @bytes_read:    location to store the number of bytes in the
+ *                 input string that were successfully converted, or %NULL.
+ *                 Even if the conversion was succesful, this may be 
+ *                 less than len if there were partial characters
+ *                 at the end of the input. If the error
+ *                 G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value
+ *                 stored will the byte fofset after the last valid
+ *                 input sequence.
+ * @bytes_written: the stored in the output buffer (not including the
+ *                 terminating nul.
+ * @error: location to store the error occuring, or %NULL to ignore
+ *                 errors. Any of the errors in #GConvertError may occur.
+ * 
+ * Converts a string from UTF-8 to the encoding used for filenames.
+ * 
+ * Return value: The converted string, or %NULL on an error.
+ **/
  gchar*
-g_filename_from_utf8 (const gchar *string, GError **error)
+g_filename_from_utf8 (const gchar *utf8string,
+                     gint         len,
+                     gint        *bytes_read,
+                     gint        *bytes_written,
+                     GError     **error)
  {
  #ifdef G_OS_WIN32
-  return g_locale_from_utf8 (string, error);
+  return g_locale_from_utf8 (utf8string, len,
+                            bytes_read, bytes_written,
+                            error);
  #else
    if (getenv ("G_BROKEN_FILENAMES"))
-    return g_locale_from_utf8 (string, error);
+    return g_locale_from_utf8 (utf8string, len,
+                              bytes_read, bytes_written,
+                              error);
+
+  if (bytes_read || bytes_written)
+    {
+      gint len = strlen (utf8string);
+
+      if (bytes_read)
+       *bytes_read = len;
+      if (bytes_written)
+       *bytes_written = len;
+    }
  
  
-  return g_strdup (string);
+  if (len < 0)
+    return g_strdup (utf8string);
+  else
+    return g_strndup (utf8string, len);
  #endif
  }
  
diff --git a/gconvert.h b/gconvert.h

index ce19b36..e11a106 100644
--- a/gconvert.h
+++ b/gconvert.h
@@ -76,14 +76,30 @@ gchar* g_convert_with_fallback (const gchar  *str,
  
  /* Convert between libc's idea of strings and UTF-8.
   */
-gchar*   g_locale_to_utf8 (const gchar *opsysstring, GError **error);
-gchar*   g_locale_from_utf8 (const gchar *utf8string, GError **error);
+gchar* g_locale_to_utf8   (const gchar  *opsysstring,
+                          gint          len,
+                          gint         *bytes_read,
+                          gint         *bytes_written,
+                          GError      **error);
+gchar* g_locale_from_utf8 (const gchar  *utf8string,
+                          gint          len,
+                          gint         *bytes_read,
+                          gint         *bytes_written,
+                          GError      **error);
  
  /* Convert between the operating system (or C runtime)
   * representation of file names and UTF-8.
   */
-gchar*   g_filename_to_utf8 (const gchar *opsysstring, GError **error);
-gchar*   g_filename_from_utf8 (const gchar *utf8string, GError **error);
+gchar* g_filename_to_utf8   (const gchar  *opsysstring,
+                            gint          len,
+                            gint         *bytes_read,
+                            gint         *bytes_written,
+                            GError      **error);
+gchar* g_filename_from_utf8 (const gchar  *utf8string,
+                            gint          len,
+                            gint         *bytes_read,
+                            gint         *bytes_written,
+                            GError      **error);
  
  G_END_DECLS
  
diff --git a/glib/gconvert.c b/glib/gconvert.c

index 2169b6d..344902f 100644 (file)
--- a/glib/gconvert.c
+++ b/glib/gconvert.c
@@ -170,7 +170,11 @@ g_convert (const gchar *str,
  
    p = str;
    inbytes_remaining = len;
-  outbuf_size = len + 1; /* + 1 for nul in case len == 1 */
+
+  /* Due to a GLIBC bug, round outbuf_size up to a multiple of 4 */
+  /* + 1 for nul in case len == 1 */
+  outbuf_size = ((len + 3) & ~3) + 1;
+  
    outbytes_remaining = outbuf_size - 1; /* -1 for nul */
    outp = dest = g_malloc (outbuf_size);
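The initial buffer sizing introduced in this hunk can be sketched in isolation. This is a minimal illustration, assuming only standard C; the helper name initial_outbuf_size is hypothetical and is not part of GLib.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper mirroring the patched expression
 *   outbuf_size = ((len + 3) & ~3) + 1;
 * It rounds len up to the next multiple of 4 and then adds one
 * byte for the terminating nul. */
size_t
initial_outbuf_size (size_t len)
{
  /* Guard against wrap-around in len + 3 (illustration only). */
  assert (len <= (size_t) -1 - 4);

  return ((len + 3) & ~(size_t) 3) + 1;
}
```

With this rounding, the usable space (outbuf_size - 1) is always a multiple of 4, which is what the comment in the patch says the glibc workaround needs.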
  
@@ -188,11 +192,20 @@ g_convert (const gchar *str,
         case E2BIG:
           {
             size_t used = outp - dest;
-           outbuf_size *= 2;
-           dest = g_realloc (dest, outbuf_size);
  
  
-           outp = dest + used;
-           outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */
+           /* glibc's iconv can return E2BIG even if there is space
+            * remaining if an internal buffer is exhausted. The
+            * following is a heuristic to catch this. The 16 is
+            * pretty arbitrary.
+            */
+           if (used + 16 > outbuf_size)
+             {
+               outbuf_size = (outbuf_size - 1) * 2 + 1;
+               dest = g_realloc (dest, outbuf_size);
+               
+               outp = dest + used;
+               outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */
+             }
  
             goto again;
           }
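The E2BIG handling above only grows the buffer when it looks nearly full. A rough sketch of that decision as a pure function follows; the name next_outbuf_size is hypothetical, and the 16-byte slack is taken from the patch, which itself calls it arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given how many bytes have been written (used)
 * and the current allocation (outbuf_size, including the nul byte),
 * return the allocation to use before retrying iconv(). */
size_t
next_outbuf_size (size_t used, size_t outbuf_size)
{
  assert (outbuf_size >= 1 && used < outbuf_size);

  /* Fewer than 16 bytes of slack: treat E2BIG as genuine and double
   * the usable space, keeping one extra byte for the nul. */
  if (used + 16 > outbuf_size)
    return (outbuf_size - 1) * 2 + 1;

  /* Plenty of room left: assume iconv's internal buffer filled up,
   * so retry with the same allocation. */
  return outbuf_size;
}
```

Because the usable space is doubled rather than the whole allocation, a buffer whose usable size starts as a multiple of 4 keeps that property after every growth step.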
@@ -353,7 +366,9 @@ g_convert_with_fallback (const gchar *str,
     * for the original string while we are converting the fallback
     */
    p = utf8;
-  outbuf_size = len + 1; /* + 1 for nul in case len == 1 */
+  /* Due to a GLIBC bug, round outbuf_size up to a multiple of 4 */
+  /* + 1 for nul in case len == 1 */
+  outbuf_size = ((len + 3) & ~3) + 1;
    outbytes_remaining = outbuf_size - 1; /* -1 for nul */
    outp = dest = g_malloc (outbuf_size);
  
@@ -373,11 +388,20 @@ g_convert_with_fallback (const gchar *str,
             case E2BIG:
               {
                 size_t used = outp - dest;
-               outbuf_size *= 2;
-               dest = g_realloc (dest, outbuf_size);
-               
-               outp = dest + used;
-               outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */
+
+               /* glibc's iconv can return E2BIG even if there is space
+                * remaining if an internal buffer is exhausted. The
+                * following is a heuristic to catch this. The 16 is
+                * pretty arbitrary.
+                */
+               if (used + 16 > outbuf_size)
+                 {
+                   outbuf_size = (outbuf_size - 1) * 2 + 1;
+                   dest = g_realloc (dest, outbuf_size);
+                   
+                   outp = dest + used;
+                   outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */
+                 }
                 
                 break;
               }
@@ -458,18 +482,44 @@ g_convert_with_fallback (const gchar *str,
  /*
   * g_locale_to_utf8
   *
+ * 
+ */
+
+/**
+ * g_locale_to_utf8:
+ * @opsysstring:   a string in the encoding of the current locale
+ * @len:           the length of the string, or -1 if the string is
+ *                 nul-terminated.
+ * @bytes_read:    location to store the number of bytes in the
+ *                 input string that were successfully converted, or %NULL.
+ *                 Even if the conversion was successful, this may be
+ *                 less than @len if there were partial characters
+ *                 at the end of the input. If the error
+ *                 G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value
+ *                 stored will be the byte offset after the last valid
+ *                 input sequence.
+ * @bytes_written: the number of bytes stored in the output buffer (not
+ *                 including the terminating nul).
+ * @error:         location to store the error occurring, or %NULL to ignore
+ *                 errors. Any of the errors in #GConvertError may occur.
+ * 
   * Converts a string which is in the encoding used for strings by
   * the C runtime (usually the same as that used by the operating
   * system) in the current locale into a UTF-8 string.
- */
-
+ * 
+ * Return value: The converted string, or %NULL on an error.
+ **/
  gchar *
-g_locale_to_utf8 (const gchar *opsysstring, GError **error)
+g_locale_to_utf8 (const gchar  *opsysstring,
+                 gint          len,
+                 gint         *bytes_read,
+                 gint         *bytes_written,
+                 GError      **error)
  {
  #ifdef G_OS_WIN32
  
-  gint i, clen, wclen, first;
-  const gint len = strlen (opsysstring);
+  gint i, clen, total_len, wclen, first;
+  const gint len = len < 0 ? strlen (opsysstring) : len;
    wchar_t *wcs, wc;
    gchar *result, *bp;
    const wchar_t *wcp;
@@ -478,26 +528,26 @@ g_locale_to_utf8 (const gchar *opsysstring, GError **error)
    wclen = MultiByteToWideChar (CP_ACP, 0, opsysstring, len, wcs, len);
  
    wcp = wcs;
-  clen = 0;
+  total_len = 0;
    for (i = 0; i < wclen; i++)
      {
        wc = *wcp++;
  
        if (wc < 0x80)
-       clen += 1;
+       total_len += 1;
        else if (wc < 0x800)
-       clen += 2;
+       total_len += 2;
        else if (wc < 0x10000)
-       clen += 3;
+       total_len += 3;
        else if (wc < 0x200000)
-       clen += 4;
+       total_len += 4;
        else if (wc < 0x4000000)
-       clen += 5;
+       total_len += 5;
        else
-       clen += 6;
+       total_len += 6;
      }
  
-  result = g_malloc (clen + 1);
+  result = g_malloc (total_len + 1);
    
    wcp = wcs;
    bp = result;
@@ -553,6 +603,11 @@ g_locale_to_utf8 (const gchar *opsysstring, GError **error)
  
    g_free (wcs);
  
+  if (bytes_read)
+    *bytes_read = len;
+  if (bytes_written)
+    *bytes_written = total_len;
+  
    return result;
  
  #else
@@ -562,26 +617,48 @@ g_locale_to_utf8 (const gchar *opsysstring, GError **error)
    if (g_get_charset (&charset))
      return g_strdup (opsysstring);
  
-  str = g_convert (opsysstring, strlen (opsysstring), 
-                  "UTF-8", charset, NULL, NULL, error);
+  str = g_convert (opsysstring, len, 
+                  "UTF-8", charset, bytes_read, bytes_written, error);
    
    return str;
  #endif
  }
  
-/*
- * g_locale_from_utf8
- *
- * The reverse of g_locale_to_utf8.
- */
-
+/**
+ * g_locale_from_utf8:
+ * @utf8string:    a UTF-8 encoded string 
+ * @len:           the length of the string, or -1 if the string is
+ *                 nul-terminated.
+ * @bytes_read:    location to store the number of bytes in the
+ *                 input string that were successfully converted, or %NULL.
+ *                 Even if the conversion was successful, this may be
+ *                 less than @len if there were partial characters
+ *                 at the end of the input. If the error
+ *                 G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value
+ *                 stored will be the byte offset after the last valid
+ *                 input sequence.
+ * @bytes_written: the number of bytes stored in the output buffer (not
+ *                 including the terminating nul).
+ * @error:         location to store the error occurring, or %NULL to ignore
+ *                 errors. Any of the errors in #GConvertError may occur.
+ * 
+ * Converts a string from UTF-8 to the encoding used for strings by
+ * the C runtime (usually the same as that used by the operating
+ * system) in the current locale.
+ * 
+ * Return value: The converted string, or %NULL on an error.
+ **/
  gchar *
-g_locale_from_utf8 (const gchar *utf8string, GError **error)
+g_locale_from_utf8 (const gchar *utf8string,
+                   gint         len,
+                   gint        *bytes_read,
+                   gint        *bytes_written,
+                   GError     **error)
  {
  #ifdef G_OS_WIN32
  
    gint i, mask, clen, mblen;
-  const gint len = strlen (utf8string);
+  const gint len = len < 0 ? strlen (utf8string) : len;
    wchar_t *wcs, *wcp;
    gchar *result;
    guchar *cp, *end, c;
@@ -671,6 +748,11 @@ g_locale_from_utf8 (const gchar *utf8string, GError **error)
    result[mblen] = 0;
    g_free (wcs);
  
+  if (bytes_read)
+    *bytes_read = len;
+  if (bytes_written)
+    *bytes_written = mblen;
+  
    return result;
  
  #else
@@ -681,39 +763,123 @@ g_locale_from_utf8 (const gchar *utf8string, GError **error)
      return g_strdup (utf8string);
  
    str = g_convert (utf8string, strlen (utf8string), 
-                  charset, "UTF-8", NULL, NULL, error);
+                  charset, "UTF-8", bytes_read, bytes_written, error);
  
    return str;
    
  #endif
  }
  
-/* Filenames are in UTF-8 unless specificially requested otherwise */
-
+/**
+ * g_filename_to_utf8:
+ * @opsysstring:   a string in the encoding for filenames
+ * @len:           the length of the string, or -1 if the string is
+ *                 nul-terminated.
+ * @bytes_read:    location to store the number of bytes in the
+ *                 input string that were successfully converted, or %NULL.
+ *                 Even if the conversion was successful, this may be
+ *                 less than @len if there were partial characters
+ *                 at the end of the input. If the error
+ *                 G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value
+ *                 stored will be the byte offset after the last valid
+ *                 input sequence.
+ * @bytes_written: the number of bytes stored in the output buffer (not
+ *                 including the terminating nul).
+ * @error:         location to store the error occurring, or %NULL to ignore
+ *                 errors. Any of the errors in #GConvertError may occur.
+ * 
+ * Converts a string which is in the encoding used for filenames
+ * into a UTF-8 string.
+ * 
+ * Return value: The converted string, or %NULL on an error.
+ **/
  gchar*
-g_filename_to_utf8 (const gchar *string, GError **error)
-
+g_filename_to_utf8 (const gchar *opsysstring, 
+                   gint         len,
+                   gint        *bytes_read,
+                   gint        *bytes_written,
+                   GError     **error)
  {
  #ifdef G_OS_WIN32
-  return g_locale_to_utf8 (string, error);
+  return g_locale_to_utf8 (opsysstring, len,
+                          bytes_read, bytes_written,
+                          error);
  #else
    if (getenv ("G_BROKEN_FILENAMES"))
-    return g_locale_to_utf8 (string, error);
+    return g_locale_to_utf8 (opsysstring, len,
+                            bytes_read, bytes_written,
+                            error);
  
  
-  return g_strdup (string);
+  if (bytes_read || bytes_written)
+    {
+      gint len = strlen (opsysstring);
+
+      if (bytes_read)
+       *bytes_read = len;
+      if (bytes_written)
+       *bytes_written = len;
+    }
+  
+  if (len < 0)
+    return g_strdup (opsysstring);
+  else
+    return g_strndup (opsysstring, len);
  #endif
  }
  
+/**
+ * g_filename_from_utf8:
+ * @utf8string:    a UTF-8 encoded string 
+ * @len:           the length of the string, or -1 if the string is
+ *                 nul-terminated.
+ * @bytes_read:    location to store the number of bytes in the
+ *                 input string that were successfully converted, or %NULL.
+ *                 Even if the conversion was successful, this may be
+ *                 less than @len if there were partial characters
+ *                 at the end of the input. If the error
+ *                 G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value
+ *                 stored will be the byte offset after the last valid
+ *                 input sequence.
+ * @bytes_written: the number of bytes stored in the output buffer (not
+ *                 including the terminating nul).
+ * @error:         location to store the error occurring, or %NULL to ignore
+ *                 errors. Any of the errors in #GConvertError may occur.
+ * 
+ * Converts a string from UTF-8 to the encoding used for filenames.
+ * 
+ * Return value: The converted string, or %NULL on an error.
+ **/
  gchar*
-g_filename_from_utf8 (const gchar *string, GError **error)
+g_filename_from_utf8 (const gchar *utf8string,
+                     gint         len,
+                     gint        *bytes_read,
+                     gint        *bytes_written,
+                     GError     **error)
  {
  #ifdef G_OS_WIN32
-  return g_locale_from_utf8 (string, error);
+  return g_locale_from_utf8 (utf8string, len,
+                            bytes_read, bytes_written,
+                            error);
  #else
    if (getenv ("G_BROKEN_FILENAMES"))
-    return g_locale_from_utf8 (string, error);
+    return g_locale_from_utf8 (utf8string, len,
+                              bytes_read, bytes_written,
+                              error);
+
+  if (bytes_read || bytes_written)
+    {
+      gint len = strlen (utf8string);
+
+      if (bytes_read)
+       *bytes_read = len;
+      if (bytes_written)
+       *bytes_written = len;
+    }
  
  
-  return g_strdup (string);
+  if (len < 0)
+    return g_strdup (utf8string);
+  else
+    return g_strndup (utf8string, len);
  #endif
  }
  
diff --git a/glib/gconvert.h b/glib/gconvert.h

index ce19b36..e11a106 100644 (file)
--- a/glib/gconvert.h
+++ b/glib/gconvert.h
@@ -76,14 +76,30 @@ gchar* g_convert_with_fallback (const gchar  *str,
  
  /* Convert between libc's idea of strings and UTF-8.
   */
-gchar*   g_locale_to_utf8 (const gchar *opsysstring, GError **error);
-gchar*   g_locale_from_utf8 (const gchar *utf8string, GError **error);
+gchar* g_locale_to_utf8   (const gchar  *opsysstring,
+                          gint          len,
+                          gint         *bytes_read,
+                          gint         *bytes_written,
+                          GError      **error);
+gchar* g_locale_from_utf8 (const gchar  *utf8string,
+                          gint          len,
+                          gint         *bytes_read,
+                          gint         *bytes_written,
+                          GError      **error);
  
  /* Convert between the operating system (or C runtime)
   * representation of file names and UTF-8.
   */
-gchar*   g_filename_to_utf8 (const gchar *opsysstring, GError **error);
-gchar*   g_filename_from_utf8 (const gchar *utf8string, GError **error);
+gchar* g_filename_to_utf8   (const gchar  *opsysstring,
+                            gint          len,
+                            gint         *bytes_read,
+                            gint         *bytes_written,
+                            GError      **error);
+gchar* g_filename_from_utf8 (const gchar  *utf8string,
+                            gint          len,
+                            gint         *bytes_read,
+                            gint         *bytes_written,
+                            GError      **error);
  
  G_END_DECLS
  
diff --git a/glib/gunicode.h b/glib/gunicode.h

index 93f3683..db4800a 100644 (file)
--- a/glib/gunicode.h
+++ b/glib/gunicode.h
@@ -206,18 +206,39 @@ gchar *g_utf8_strchr  (const gchar *p,
  gchar *g_utf8_strrchr (const gchar *p,
                        gunichar     c);
  
-gunichar2 *g_utf8_to_utf16 (const gchar     *str,
-                           gint             len);
-gunichar * g_utf8_to_ucs4  (const gchar     *str,
-                           gint             len);
-gunichar * g_utf16_to_ucs4 (const gunichar2 *str,
-                           gint             len);
-gchar *    g_utf16_to_utf8 (const gunichar2 *str,
-                           gint             len);
-gunichar * g_ucs4_to_utf16 (const gunichar  *str,
-                           gint             len);
-gchar *    g_ucs4_to_utf8  (const gunichar  *str,
-                           gint             len);
+gunichar2 *g_utf8_to_utf16     (const gchar      *str,
+                               gint              len,
+                               gint             *items_read,
+                               gint             *items_written,
+                               GError          **error);
+gunichar * g_utf8_to_ucs4      (const gchar      *str,
+                               gint              len,
+                               gint             *items_read,
+                               gint             *items_written,
+                               GError          **error);
+gunichar * g_utf8_to_ucs4_fast (const gchar      *str,
+                               gint              len,
+                               gint             *items_written);
+gunichar * g_utf16_to_ucs4     (const gunichar2  *str,
+                               gint              len,
+                               gint             *items_read,
+                               gint             *items_written,
+                               GError          **error);
+gchar *    g_utf16_to_utf8     (const gunichar2  *str,
+                               gint              len,
+                               gint             *items_read,
+                               gint             *items_written,
+                               GError          **error);
+gunichar2 *g_ucs4_to_utf16     (const gunichar   *str,
+                               gint              len,
+                               gint             *items_read,
+                               gint             *items_written,
+                               GError          **error);
+gchar *    g_ucs4_to_utf8      (const gunichar   *str,
+                               gint              len,
+                               gint             *items_read,
+                               gint             *items_written,
+                               GError          **error);
  
  /* Convert a single character into UTF-8. outbuf must have at
   * least 6 bytes of space. Returns the number of bytes in the
diff --git a/glib/gutf8.c b/glib/gutf8.c

index f584080..788b74a 100644 (file)
--- a/glib/gutf8.c
+++ b/glib/gutf8.c
@@ -33,6 +33,8 @@
  #include <windows.h>
  #endif
  
+#define _(s) (s)
+
  #define UTF8_COMPUTE(Char, Mask, Len)                                        \
    if (Char < 128)                                                            \
      {                                                                        \
@@ -67,6 +69,14 @@
    else                                                                       \
      Len = -1;
  
+#define UTF8_LENGTH(Char)              \
+  ((Char) < 0x80 ? 1 :                 \
+   ((Char) < 0x800 ? 2 :               \
+    ((Char) < 0x10000 ? 3 :            \
+     ((Char) < 0x200000 ? 4 :          \
+      ((Char) < 0x4000000 ? 5 : 6)))))
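The UTF8_LENGTH macro maps a code point to the number of bytes its encoding needs under the original six-byte UTF-8 scheme. A small function equivalent, written here only as an illustration (the name utf8_length is hypothetical), makes the thresholds easier to check:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical function form of UTF8_LENGTH: number of bytes needed
 * to encode code point c, for values up to 0x7FFFFFFF. */
int
utf8_length (uint32_t c)
{
  assert (c < 0x80000000u);   /* same upper bound g_ucs4_to_utf8 enforces */

  if (c < 0x80)       return 1;
  if (c < 0x800)      return 2;
  if (c < 0x10000)    return 3;
  if (c < 0x200000)   return 4;
  if (c < 0x4000000)  return 5;
  return 6;
}
```

Each threshold is the first value that no longer fits in the payload bits of the shorter form: 7, 11, 16, 21 and 26 bits respectively.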
+   
+
  #define UTF8_GET(Result, Chars, Count, Mask, Len)                            \
    (Result) = (Chars)[0] & (Mask);                                            \
    for ((Count) = 1; (Count) < (Len); ++(Count))                                      \
@@ -79,6 +89,13 @@
        (Result) <<= 6;                                                        \
        (Result) |= ((Chars)[(Count)] & 0x3f);                                 \
      }
+
+#define UNICODE_VALID(Char)                   \
+    ((Char) < 0x110000 &&                     \
+     ((Char) < 0xD800 || (Char) >= 0xE000) && \
+     (Char) != 0xFFFE && (Char) != 0xFFFF)
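The UNICODE_VALID macro rejects three classes of values: anything at or above 0x110000, the surrogate range 0xD800..0xDFFF, and the two noncharacters 0xFFFE and 0xFFFF. The following function form is a sketch for illustration; the name unicode_valid is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical function form of the UNICODE_VALID macro. */
bool
unicode_valid (uint32_t c)
{
  if (c >= 0x110000)                /* beyond the Unicode code space */
    return false;
  if (c >= 0xD800 && c < 0xE000)    /* UTF-16 surrogate halves */
    return false;
  if (c == 0xFFFE || c == 0xFFFF)   /* noncharacters */
    return false;
  return true;
}
```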
+   
+     
  gchar g_utf8_skip[256] = {
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
@@ -473,33 +490,272 @@ unicode_strrchr (const char *p, gunichar c)
  #endif
  
  
+/* Like g_utf8_get_char, but take a maximum length
+ * and return (gunichar)-2 on incomplete trailing character
+ */
+static inline gunichar
+g_utf8_get_char_extended (const gchar *p, int max_len)
+{
+  gint i, len;
+  gunichar wc = (guchar) *p;
+
+  if (wc < 0x80)
+    {
+      return wc;
+    }
+  else if (wc < 0xc0)
+    {
+      return (gunichar)-1;
+    }
+  else if (wc < 0xe0)
+    {
+      len = 2;
+      wc &= 0x1f;
+    }
+  else if (wc < 0xf0)
+    {
+      len = 3;
+      wc &= 0x0f;
+    }
+  else if (wc < 0xf8)
+    {
+      len = 4;
+      wc &= 0x07;
+    }
+  else if (wc < 0xfc)
+    {
+      len = 5;
+      wc &= 0x03;
+    }
+  else if (wc < 0xfe)
+    {
+      len = 6;
+      wc &= 0x01;
+    }
+  else
+    {
+      return (gunichar)-1;
+    }
+  
+  if (len == -1)
+    return (gunichar)-1;
+  if (max_len >= 0 && len > max_len)
+    {
+      for (i = 1; i < max_len; i++)
+       {
+         if ((((guchar *)p)[i] & 0xc0) != 0x80)
+           return (gunichar)-1;
+       }
+      return (gunichar)-2;
+    }
+
+  for (i = 1; i < len; ++i)
+    {
+      gunichar ch = ((guchar *)p)[i];
+      
+      if ((ch & 0xc0) != 0x80)
+       {
+         if (ch)
+           return (gunichar)-1;
+         else
+           return (gunichar)-2;
+       }
+
+      wc <<= 6;
+      wc |= (ch & 0x3f);
+    }
+
+  if (UTF8_LENGTH(wc) != len)
+    return (gunichar)-1;
+  
+  return wc;
+}
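The three outcomes of g_utf8_get_char_extended (a decoded character, -1 for an invalid sequence, -2 for an incomplete trailing character) can be tried out without GLib. The sketch below restates the same logic with standard C types; decode_utf8, DECODE_INVALID and DECODE_PARTIAL are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DECODE_INVALID ((uint32_t) -1)  /* malformed or overlong sequence */
#define DECODE_PARTIAL ((uint32_t) -2)  /* sequence cut off by end of input */

/* Bytes needed to encode c; same thresholds as UTF8_LENGTH. */
static int
utf8_length (uint32_t c)
{
  return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3
       : c < 0x200000 ? 4 : c < 0x4000000 ? 5 : 6;
}

/* Decode one character at p. max_len is the number of bytes available,
 * or negative if the input is nul-terminated. */
uint32_t
decode_utf8 (const unsigned char *p, int max_len)
{
  uint32_t wc;
  int i, len;

  assert (p != NULL);
  wc = p[0];

  if (wc < 0x80)
    return wc;                                   /* plain ASCII */
  else if (wc < 0xc0)
    return DECODE_INVALID;                       /* stray continuation byte */
  else if (wc < 0xe0) { len = 2; wc &= 0x1f; }
  else if (wc < 0xf0) { len = 3; wc &= 0x0f; }
  else if (wc < 0xf8) { len = 4; wc &= 0x07; }
  else if (wc < 0xfc) { len = 5; wc &= 0x03; }
  else if (wc < 0xfe) { len = 6; wc &= 0x01; }
  else
    return DECODE_INVALID;                       /* 0xfe and 0xff never occur */

  if (max_len >= 0 && len > max_len)
    {
      /* Not enough bytes: partial only if what we do have is well formed. */
      for (i = 1; i < max_len; i++)
        if ((p[i] & 0xc0) != 0x80)
          return DECODE_INVALID;
      return DECODE_PARTIAL;
    }

  for (i = 1; i < len; i++)
    {
      uint32_t ch = p[i];

      if ((ch & 0xc0) != 0x80)
        return ch ? DECODE_INVALID : DECODE_PARTIAL;  /* nul means truncated */

      wc = (wc << 6) | (ch & 0x3f);
    }

  /* Reject overlong encodings such as C0 80 for U+0000. */
  if (utf8_length (wc) != len)
    return DECODE_INVALID;

  return wc;
}
```

Returning two distinct sentinel values lets a caller such as g_utf8_to_ucs4 stop quietly at a partial character when the caller asked for items_read, and report an error otherwise.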
+
  /**
- * g_utf8_to_ucs4:
- * @str: a UTF-8 encoded strnig
- * @len: the length of @
- * 
+ * g_utf8_to_ucs4_fast:
+ * @str: a UTF-8 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ *       the string is nul-terminated.
+ * @items_written: location to store the number of characters in the
+ *                 result, or %NULL.
+ *
   * Convert a string from UTF-8 to a 32-bit fixed width
- * representation as UCS-4.
+ * representation as UCS-4, assuming valid UTF-8 input.
+ * This function is roughly twice as fast as g_utf8_to_ucs4()
+ * but does no error checking on the input.
   * 
   * Return value: a pointer to a newly allocated UCS-4 string.
   *               This value must be freed with g_free()
   **/
  gunichar *
-g_utf8_to_ucs4 (const char *str, int len)
+g_utf8_to_ucs4_fast (const gchar *str,
+                    gint         len,
+                    gint        *items_written)
  {
+  gint j, charlen;
    gunichar *result;
    gint n_chars, i;
    const gchar *p;
+
+  g_return_val_if_fail (str != NULL, NULL);
+
+  p = str;
+  n_chars = 0;
+  if (len < 0)
+    {
+      while (*p)
+       {
+         p = g_utf8_next_char (p);
+         ++n_chars;
+       }
+    }
+  else
+    {
+      while (*p && p < str + len)
+       {
+         p = g_utf8_next_char (p);
+         ++n_chars;
+       }
+    }
    
    
-  n_chars = g_utf8_strlen (str, len);
-  result = g_new (gunichar, n_chars);
+  result = g_new (gunichar, n_chars + 1);
    
    p = str;
    for (i=0; i < n_chars; i++)
      {
-      result[i] = g_utf8_get_char (p);
-      p = g_utf8_next_char (p);
+      gunichar wc = ((unsigned char *)p)[0];
+
+      if (wc < 0x80)
+       {
+         result[i] = wc;
+         p++;
+       }
+      else
+       { 
+         if (wc < 0xe0)
+           {
+             charlen = 2;
+             wc &= 0x1f;
+           }
+         else if (wc < 0xf0)
+           {
+             charlen = 3;
+             wc &= 0x0f;
+           }
+         else if (wc < 0xf8)
+           {
+             charlen = 4;
+             wc &= 0x07;
+           }
+         else if (wc < 0xfc)
+           {
+             charlen = 5;
+             wc &= 0x03;
+           }
+         else
+           {
+             charlen = 6;
+             wc &= 0x01;
+           }
+
+         for (j = 1; j < charlen; j++)
+           {
+             wc <<= 6;
+             wc |= ((unsigned char *)p)[j] & 0x3f;
+           }
+
+         result[i] = wc;
+         p += charlen;
+       }
      }
+  result[i] = 0;
+
+  if (items_written)
+    *items_written = i;
+
+  return result;
+}
+
+/**
+ * g_utf8_to_ucs4:
+ * @str: a UTF-8 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ *       the string is nul-terminated.
+ * @items_read: location to store number of bytes read, or %NULL.
+ *              If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be
+ *              returned in case @str contains a trailing partial
+ *              character. If an error occurs then the index of the
+ *              invalid input is stored here.
+ * @items_written: location to store number of characters written or %NULL.
+ *                 The value stored here does not include the trailing 0
+ *                 character.
+ * @error: location to store the error occurring, or %NULL to ignore
+ *         errors. Any of the errors in #GConvertError other than
+ *         %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UTF-8 to a 32-bit fixed width
+ * representation as UCS-4. A trailing 0 will be added to the
+ * string after the converted text.
+ * 
+ * Return value: a pointer to a newly allocated UCS-4 string.
+ *               This value must be freed with g_free(). If an
+ *               error occurs, %NULL will be returned and
+ *               @error set.
+ **/
+gunichar *
+g_utf8_to_ucs4 (const gchar *str,
+               gint         len,
+               gint        *items_read,
+               gint        *items_written,
+               GError     **error)
+{
+  gunichar *result = NULL;
+  gint n_chars, i;
+  const gchar *in;
+  
+  in = str;
+  n_chars = 0;
+  while ((len < 0 || str + len - in > 0) && *in)
+    {
+      gunichar wc = g_utf8_get_char_extended (in, str + len - in);
+      if (wc & 0x80000000)
+       {
+         if (wc == (gunichar)-2)
+           {
+             if (items_read)
+               break;
+             else
+               g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT,
+                            _("Partial character sequence at end of input"));
+           }
+         else
+           g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                        _("Invalid byte sequence in conversion input"));
+
+         goto err_out;
+       }
+
+      n_chars++;
+
+      in = g_utf8_next_char (in);
+    }
+
+  result = g_new (gunichar, n_chars + 1);
+  
+  in = str;
+  for (i=0; i < n_chars; i++)
+    {
+      result[i] = g_utf8_get_char (in);
+      in = g_utf8_next_char (in);
+    }
+  result[i] = 0;
+
+  if (items_written)
+    *items_written = n_chars;
+
+ err_out:
+  if (items_read)
+    *items_read = in - str;
  
    return result;
  }
@@ -507,35 +763,569 @@ g_utf8_to_ucs4 (const char *str, int len)
  /**
   * g_ucs4_to_utf8:
   * @str: a UCS-4 encoded string
- * @len: the length of @
- * 
+ * @len: the maximum length of @str to use. If < 0, then
+ *       the string is terminated with a 0 character.
+ * @items_read: location to store number of characters read, or %NULL.
+ * @items_written: location to store number of bytes written or %NULL.
+ *                 The value stored here does not include the trailing 0
+ *                 byte.
+ * @error: location to store the error occurring, or %NULL to ignore
+ *         errors. Any of the errors in #GConvertError other than
+ *         %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
   * Convert a string from a 32-bit fixed width representation as UCS-4.
- * to UTF-8.
+ * to UTF-8. The result will be terminated with a 0 byte.
   * 
   * Return value: a pointer to a newly allocated UTF-8 string.
- *               This value must be freed with g_free()
+ *               This value must be freed with g_free(). If an
+ *               error occurs, %NULL will be returned and
+ *               @error set.
   **/
  gchar *
-g_ucs4_to_utf8 (const gunichar *str, int len)
+g_ucs4_to_utf8 (const gunichar *str,
+               gint            len,
+               gint           *items_read,
+               gint           *items_written,
+               GError        **error)
  {
    gint result_length;
-  gchar *result, *p;
+  gchar *result = NULL;
+  gchar *p;
    gint i;
  
    result_length = 0;
-  for (i = 0; i < len ; i++)
-    result_length += g_unichar_to_utf8 (str[i], NULL);
+  for (i = 0; len < 0 || i < len ; i++)
+    {
+      if (!str[i])
+       break;
  
  
-  result_length++;
+      if (str[i] >= 0x80000000)
+       {
+         if (items_read)
+           *items_read = i;
+         
+         g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                      _("Character out of range for UTF-8"));
+         goto err_out;
+       }
+      
+      result_length += UTF8_LENGTH (str[i]);
+    }
  
    result = g_malloc (result_length + 1);
    p = result;
  
-  for (i = 0; i < len ; i++)
-    p += g_unichar_to_utf8 (str[i], p);
+  i = 0;
+  while (p < result + result_length)
+    p += g_unichar_to_utf8 (str[i++], p);
    
    *p = '\0';
  
+  if (items_written)
+    *items_written = p - result;
+
+ err_out:
+  if (items_read)
+    *items_read = i;
+
+  return result;
+}
+
+#define SURROGATE_VALUE(h,l) (((h) - 0xd800) * 0x400 + (l) - 0xdc00 + 0x10000)
+
+/**
+ * g_utf16_to_utf8:
+ * @str: a UTF-16 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ *       the string is terminated with a 0 character.
+ * @items_read: location to store number of words read, or %NULL.
+ *              If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be
+ *              returned in case @str contains a trailing partial
+ *              character. If an error occurs then the index of the
+ *              invalid input is stored here.
+ * @items_written: location to store number of bytes written, or %NULL.
+ *                 The value stored here does not include the trailing
+ *                 0 byte.
+ * @error: location to store the error occurring, or %NULL to ignore
+ *         errors. Any of the errors in #GConvertError other than
+ *         %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UTF-16 to UTF-8. The result will be
+ * terminated with a 0 byte.
+ * 
+ * Return value: a pointer to a newly allocated UTF-8 string.
+ *               This value must be freed with g_free(). If an
+ *               error occurs, %NULL will be returned and
+ *               @error set.
+ **/
+gchar *
+g_utf16_to_utf8 (const gunichar2  *str,
+                gint              len,
+                gint             *items_read,
+                gint             *items_written,
+                GError          **error)
+{
+  /* This function and g_utf16_to_ucs4 are almost exactly identical - The lines that differ
+   * are marked.
+   */
+  const gunichar2 *in;
+  gchar *out;
+  gchar *result = NULL;
+  gint n_bytes;
+  gunichar high_surrogate;
+
+  g_return_val_if_fail (str != 0, NULL);
+
+  n_bytes = 0;
+  in = str;
+  high_surrogate = 0;
+  while ((len < 0 || in - str < len) && *in)
+    {
+      gunichar2 c = *in;
+      gunichar wc;
+
+      if (c >= 0xdc00 && c < 0xe000) /* low surrogate */
+       {
+         if (high_surrogate)
+           {
+             wc = SURROGATE_VALUE (high_surrogate, c);
+             high_surrogate = 0;
+           }
+         else
+           {
+             g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                          _("Invalid sequence in conversion input"));
+             goto err_out;
+           }
+       }
+      else
+       {
+         if (high_surrogate)
+           {
+             g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                          _("Invalid sequence in conversion input"));
+             goto err_out;
+           }
+
+         if (c >= 0xd800 && c < 0xdc00) /* high surrogate */
+           {
+             high_surrogate = c;
+             goto next1;
+           }
+         else
+           wc = c;
+       }
+
+      /********** DIFFERENT for UTF8/UCS4 **********/
+      n_bytes += UTF8_LENGTH (wc);
+
+    next1:
+      in++;
+    }
+
+  if (high_surrogate && !items_read)
+    {
+      g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT,
+                  _("Partial character sequence at end of input"));
+      goto err_out;
+    }
+  
+  /* At this point, everything is valid, and we just need to convert
+   */
+  /********** DIFFERENT for UTF8/UCS4 **********/
+  result = g_malloc (n_bytes + 1);
+  
+  high_surrogate = 0;
+  out = result;
+  in = str;
+  while (out < result + n_bytes)
+    {
+      gunichar2 c = *in;
+      gunichar wc;
+
+      if (c >= 0xdc00 && c < 0xe000) /* low surrogate */
+       {
+         wc = SURROGATE_VALUE (high_surrogate, c);
+         high_surrogate = 0;
+       }
+      else if (c >= 0xd800 && c < 0xdc00) /* high surrogate */
+       {
+         high_surrogate = c;
+         goto next2;
+       }
+      else
+       wc = c;
+
+      /********** DIFFERENT for UTF8/UCS4 **********/
+      out += g_unichar_to_utf8 (wc, out);
+
+    next2:
+      in++;
+    }
+  
+  /********** DIFFERENT for UTF8/UCS4 **********/
+  *out = '\0';
+
+  if (items_written)
+    /********** DIFFERENT for UTF8/UCS4 **********/
+    *items_written = out - result;
+
+ err_out:
+  if (items_read)
+    *items_read = in - str;
+
+  return result;
+}
+
+/**
+ * g_utf16_to_ucs4:
+ * @str: a UTF-16 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ *       the string is terminated with a 0 character.
+ * @items_read: location to store number of words read, or %NULL.
+ *              If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be
+ *              returned in case @str contains a trailing partial
+ *              character. If an error occurs then the index of the
+ *              invalid input is stored here.
+ * @items_written: location to store number of characters written, or %NULL.
+ *                 The value stored here does not include the trailing
+ *                 0 character.
+ * @error: location to store the error occurring, or %NULL to ignore
+ *         errors. Any of the errors in #GConvertError other than
+ *         %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UTF-16 to UCS-4. The result will be
+ * terminated with a 0 character.
+ * 
+ * Return value: a pointer to a newly allocated UCS-4 string.
+ *               This value must be freed with g_free(). If an
+ *               error occurs, %NULL will be returned and
+ *               @error set.
+ **/
+gunichar *
+g_utf16_to_ucs4 (const gunichar2  *str,
+                gint              len,
+                gint             *items_read,
+                gint             *items_written,
+                GError          **error)
+{
+  const gunichar2 *in;
+  gchar *out;
+  gchar *result = NULL;
+  gint n_bytes;
+  gunichar high_surrogate;
+
+  g_return_val_if_fail (str != 0, NULL);
+
+  n_bytes = 0;
+  in = str;
+  high_surrogate = 0;
+  while ((len < 0 || in - str < len) && *in)
+    {
+      gunichar2 c = *in;
+      gunichar wc;
+
+      if (c >= 0xdc00 && c < 0xe000) /* low surrogate */
+       {
+         if (high_surrogate)
+           {
+             wc = SURROGATE_VALUE (high_surrogate, c);
+             high_surrogate = 0;
+           }
+         else
+           {
+             g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                          _("Invalid sequence in conversion input"));
+             goto err_out;
+           }
+       }
+      else
+       {
+         if (high_surrogate)
+           {
+             g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                          _("Invalid sequence in conversion input"));
+             goto err_out;
+           }
+
+         if (c >= 0xd800 && c < 0xdc00) /* high surrogate */
+           {
+             high_surrogate = c;
+             goto next1;
+           }
+         else
+           wc = c;
+       }
+
+      /********** DIFFERENT for UTF8/UCS4 **********/
+      n_bytes += sizeof (gunichar);
+
+    next1:
+      in++;
+    }
+
+  if (high_surrogate && !items_read)
+    {
+      g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT,
+                  _("Partial character sequence at end of input"));
+      goto err_out;
+    }
+  
+  /* At this point, everything is valid, and we just need to convert
+   */
+  /********** DIFFERENT for UTF8/UCS4 **********/
+  result = g_malloc (n_bytes + 4);
+  
+  high_surrogate = 0;
+  out = result;
+  in = str;
+  while (out < result + n_bytes)
+    {
+      gunichar2 c = *in;
+      gunichar wc;
+
+      if (c >= 0xdc00 && c < 0xe000) /* low surrogate */
+       {
+         wc = SURROGATE_VALUE (high_surrogate, c);
+         high_surrogate = 0;
+       }
+      else if (c >= 0xd800 && c < 0xdc00) /* high surrogate */
+       {
+         high_surrogate = c;
+         goto next2;
+       }
+      else
+       wc = c;
+
+      /********** DIFFERENT for UTF8/UCS4 **********/
+      *(gunichar *)out = wc;
+      out += sizeof (gunichar);
+
+    next2:
+      in++;
+    }
+
+  /********** DIFFERENT for UTF8/UCS4 **********/
+  *(gunichar *)out = 0;
+
+  if (items_written)
+    /********** DIFFERENT for UTF8/UCS4 **********/
+    *items_written = (out - result) / sizeof (gunichar);
+
+ err_out:
+  if (items_read)
+    *items_read = in - str;
+
+  return (gunichar *)result;
+}
+
+/**
+ * g_utf8_to_utf16:
+ * @str: a UTF-8 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ *       the string is nul-terminated.
+ *
+ * @items_read: location to store number of bytes read, or %NULL.
+ *              If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be
+ *              returned in case @str contains a trailing partial
+ *              character. If an error occurs then the index of the
+ *              invalid input is stored here.
+ * @items_written: location to store number of words written, or %NULL.
+ *                 The value stored here does not include the trailing
+ *                 0 word.
+ * @error: location to store the error occurring, or %NULL to ignore
+ *         errors. Any of the errors in #GConvertError other than
+ *         %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UTF-8 to UTF-16. A 0 word will be
+ * added to the result after the converted text.
+ * 
+ * Return value: a pointer to a newly allocated UTF-16 string.
+ *               This value must be freed with g_free(). If an
+ *               error occurs, %NULL will be returned and
+ *               @error set.
+ **/
+gunichar2 *
+g_utf8_to_utf16 (const gchar *str,
+                gint         len,
+                gint        *items_read,
+                gint        *items_written,
+                GError     **error)
+{
+  gunichar2 *result = NULL;
+  gint n16;
+  const gchar *in;
+  gint i;
+
+  g_return_val_if_fail (str != NULL, NULL);
+
+  in = str;
+  n16 = 0;
+  while ((len < 0 || str + len - in > 0) && *in)
+    {
+      gunichar wc = g_utf8_get_char_extended (in, str + len - in);
+      if (wc & 0x80000000)
+       {
+         if (wc == (gunichar)-2)
+           {
+             if (items_read)
+               break;
+             else
+               g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT,
+                            _("Partial character sequence at end of input"));
+           }
+         else
+           g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                        _("Invalid byte sequence in conversion input"));
+
+         goto err_out;
+       }
+
+      if (wc < 0xd800)
+       n16 += 1;
+      else if (wc < 0xe000)
+       {
+         g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                      _("Invalid sequence in conversion input"));
+
+         goto err_out;
+       }
+      else if (wc < 0x10000)
+       n16 += 1;
+      else if (wc < 0x110000)
+       n16 += 2;
+      else
+       {
+         g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                      _("Character out of range for UTF-16"));
+
+         goto err_out;
+       }
+      
+      in = g_utf8_next_char (in);
+    }
+
+  result = g_new (gunichar2, n16 + 1);
+  
+  in = str;
+  for (i = 0; i < n16;)
+    {
+      gunichar wc = g_utf8_get_char (in);
+
+      if (wc < 0x10000)
+       {
+         result[i++] = wc;
+       }
+      else
+       {
+         result[i++] = (wc - 0x10000) / 0x400 + 0xd800;
+         result[i++] = (wc - 0x10000) % 0x400 + 0xdc00;
+       }
+      
+      in = g_utf8_next_char (in);
+    }
+
+  result[i] = 0;
+
+  if (items_written)
+    *items_written = n16;
+
+ err_out:
+  if (items_read)
+    *items_read = in - str;
+  
+  return result;
+}
+
+/**
+ * g_ucs4_to_utf16:
+ * @str: a UCS-4 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ *       the string is terminated with a zero character.
+ * @items_read: location to store number of characters read, or %NULL.
+ *              If an error occurs then the index of the invalid input
+ *              is stored here.
+ * @items_written: location to store number of words written, or %NULL.
+ *                 The value stored here does not include the trailing
+ *                 0 word.
+ * @error: location to store the error occurring, or %NULL to ignore
+ *         errors. Any of the errors in #GConvertError other than
+ *         %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UCS-4 to UTF-16. A 0 word will be
+ * added to the result after the converted text.
+ * 
+ * Return value: a pointer to a newly allocated UTF-16 string.
+ *               This value must be freed with g_free(). If an
+ *               error occurs, %NULL will be returned and
+ *               @error set.
+ **/
+gunichar2 *
+g_ucs4_to_utf16 (const gunichar  *str,
+                gint             len,
+                gint            *items_read,
+                gint            *items_written,
+                GError         **error)
+{
+  gunichar2 *result = NULL;
+  gint n16;
+  gint i, j;
+
+  n16 = 0;
+  i = 0;
+  while ((len < 0 || i < len) && str[i])
+    {
+      gunichar wc = str[i];
+
+      if (wc < 0xd800)
+       n16 += 1;
+      else if (wc < 0xe000)
+       {
+         g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                      _("Invalid sequence in conversion input"));
+
+         goto err_out;
+       }
+      else if (wc < 0x10000)
+       n16 += 1;
+      else if (wc < 0x110000)
+       n16 += 2;
+      else
+       {
+         g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                      _("Character out of range for UTF-16"));
+
+         goto err_out;
+       }
+
+      i++;
+    }
+  
+  result = g_new (gunichar2, n16 + 1);
+  
+  for (i = 0, j = 0; j < n16; i++)
+    {
+      gunichar wc = str[i];
+
+      if (wc < 0x10000)
+       {
+         result[j++] = wc;
+       }
+      else
+       {
+         result[j++] = (wc - 0x10000) / 0x400 + 0xd800;
+         result[j++] = (wc - 0x10000) % 0x400 + 0xdc00;
+       }
+    }
+  result[j] = 0;
+
+  if (items_written)
+    *items_written = n16;
+  
+ err_out:
+  if (items_read)
+    *items_read = i;
+  
    return result;
  }
  
@@ -567,6 +1357,8 @@ g_utf8_validate (const gchar  *str,
  {
  
    const gchar *p;
+
+  g_return_val_if_fail (str != NULL, FALSE);
    
    if (end)
      *end = str;
@@ -591,8 +1383,14 @@ g_utf8_validate (const gchar  *str,
          
        UTF8_GET (result, p, i, mask, len);
  
+      if (UTF8_LENGTH (result) != len) /* Check for overlong UTF-8 */
+       break;
+
        if (result == (gunichar)-1)
          break;
+
+      if (!UNICODE_VALID (result))
+       break;
        
        p += len;
      }
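
The surrogate arithmetic behind SURROGATE_VALUE and the two UTF-16 encoders in the hunks above can be checked in isolation. What follows is a minimal standalone sketch, not part of this commit: it assumes plain <stdint.h> types in place of gunichar and gunichar2, and the sketch_ function names are hypothetical.

```c
#include <stdint.h>

/* Combine a high surrogate (0xD800..0xDBFF) and a low surrogate
 * (0xDC00..0xDFFF) into one code point. Same arithmetic as the
 * SURROGATE_VALUE macro; the caller is assumed to have range-checked
 * both halves, as the first pass of the decoders does. */
uint32_t sketch_surrogate_value(uint16_t high, uint16_t low)
{
    return ((uint32_t)high - 0xd800) * 0x400
         + ((uint32_t)low - 0xdc00)
         + 0x10000;
}

/* Split a code point in 0x10000..0x10FFFF into a surrogate pair,
 * mirroring the division and modulo used when filling the UTF-16
 * result buffer. Values outside that range are the caller's problem. */
void sketch_to_surrogates(uint32_t wc, uint16_t *high, uint16_t *low)
{
    *high = (uint16_t)((wc - 0x10000) / 0x400 + 0xd800);
    *low  = (uint16_t)((wc - 0x10000) % 0x400 + 0xdc00);
}
```

Splitting U+1F600 this way gives the pair 0xD83D 0xDE00, and feeding that pair back through sketch_surrogate_value returns 0x1F600. The round trip is what lets the decoders validate in a first pass and convert without checks in the second.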
diff --git a/gunicode.h b/gunicode.h

index 93f3683..db4800a 100644
--- a/gunicode.h
+++ b/gunicode.h
@@ -206,18 +206,39 @@ gchar *g_utf8_strchr  (const gchar *p,
  gchar *g_utf8_strrchr (const gchar *p,
                        gunichar     c);
  
-gunichar2 *g_utf8_to_utf16 (const gchar     *str,
-                           gint             len);
-gunichar * g_utf8_to_ucs4  (const gchar     *str,
-                           gint             len);
-gunichar * g_utf16_to_ucs4 (const gunichar2 *str,
-                           gint             len);
-gchar *    g_utf16_to_utf8 (const gunichar2 *str,
-                           gint             len);
-gunichar * g_ucs4_to_utf16 (const gunichar  *str,
-                           gint             len);
-gchar *    g_ucs4_to_utf8  (const gunichar  *str,
-                           gint             len);
+gunichar2 *g_utf8_to_utf16     (const gchar      *str,
+                               gint              len,
+                               gint             *items_read,
+                               gint             *items_written,
+                               GError          **error);
+gunichar * g_utf8_to_ucs4      (const gchar      *str,
+                               gint              len,
+                               gint             *items_read,
+                               gint             *items_written,
+                               GError          **error);
+gunichar * g_utf8_to_ucs4_fast (const gchar      *str,
+                               gint              len,
+                               gint             *items_written);
+gunichar * g_utf16_to_ucs4     (const gunichar2  *str,
+                               gint              len,
+                               gint             *items_read,
+                               gint             *items_written,
+                               GError          **error);
+gchar *    g_utf16_to_utf8     (const gunichar2  *str,
+                               gint              len,
+                               gint             *items_read,
+                               gint             *items_written,
+                               GError          **error);
+gunichar2 *g_ucs4_to_utf16     (const gunichar   *str,
+                               gint              len,
+                               gint             *items_read,
+                               gint             *items_written,
+                               GError          **error);
+gchar *    g_ucs4_to_utf8      (const gunichar   *str,
+                               gint              len,
+                               gint             *items_read,
+                               gint             *items_written,
+                               GError          **error);
  
  /* Convert a single character into UTF-8. outbuf must have at
   * least 6 bytes of space. Returns the number of bytes in the
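
The new prototypes share one shape: a first pass validates the input and measures the output, a second pass allocates and fills, and progress is reported through items_read and items_written. The sketch below illustrates that pattern for UCS-4 to UTF-8 without GLib. It is an assumption-laden illustration, not the commit's code: it uses malloc and long instead of g_malloc and gint, has no GError, rejects surrogates and values above 0x10FFFF (stricter than the g_ucs4_to_utf8 in this commit, which only rejects values >= 0x80000000), and every sketch_ name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Bytes needed for one code point, limited to the 1..4 byte forms. */
int sketch_utf8_length(uint32_t c)
{
    return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
}

/* Encode one code point into out; returns the number of bytes written. */
int sketch_utf8_encode(uint32_t c, char *out)
{
    static const unsigned char first[] = { 0, 0x00, 0xc0, 0xe0, 0xf0 };
    int len = sketch_utf8_length(c);

    /* Fill continuation bytes from the end, six bits at a time. */
    for (int i = len - 1; i > 0; i--) {
        out[i] = (char)(0x80 | (c & 0x3f));
        c >>= 6;
    }
    out[0] = (char)(first[len] | c);
    return len;
}

/* Two-pass conversion. len < 0 means "stop at the first 0 character".
 * Returns a malloc'd, NUL-terminated string, or NULL on invalid input
 * or allocation failure. On invalid input *items_read is the index of
 * the offending character. */
char *sketch_ucs4_to_utf8(const uint32_t *str, long len,
                          long *items_read, long *items_written)
{
    size_t n_bytes = 0;
    char *result = NULL;
    char *p;
    long i;

    /* Pass 1: validate and measure. */
    for (i = 0; (len < 0 || i < len) && str[i]; i++) {
        if (str[i] >= 0x110000 || (str[i] >= 0xd800 && str[i] < 0xe000))
            goto out;
        n_bytes += (size_t)sketch_utf8_length(str[i]);
    }

    result = malloc(n_bytes + 1);
    if (result == NULL)
        goto out;

    /* Pass 2: everything is known to be valid, so just convert. */
    p = result;
    for (long j = 0; j < i; j++)
        p += sketch_utf8_encode(str[j], p);
    *p = '\0';

    if (items_written)
        *items_written = (long)(p - result);

out:
    if (items_read)
        *items_read = i;
    return result;
}
```

Measuring first costs a second walk over the input, but it means the result is allocated exactly once and never has to be freed or resized on the error path.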
diff --git a/gutf8.c b/gutf8.c

index f584080..788b74a 100644
--- a/gutf8.c
+++ b/gutf8.c
@@ -33,6 +33,8 @@
  #include <windows.h>
  #endif
  
+#define _(s) (s)
+
  #define UTF8_COMPUTE(Char, Mask, Len)                                        \
    if (Char < 128)                                                            \
      {                                                                        \
@@ -67,6 +69,14 @@
    else                                                                       \
      Len = -1;
  
+#define UTF8_LENGTH(Char)              \
+  ((Char) < 0x80 ? 1 :                 \
+   ((Char) < 0x800 ? 2 :               \
+    ((Char) < 0x10000 ? 3 :            \
+     ((Char) < 0x200000 ? 4 :          \
+      ((Char) < 0x4000000 ? 5 : 6)))))
+   
+
  #define UTF8_GET(Result, Chars, Count, Mask, Len)                            \
    (Result) = (Chars)[0] & (Mask);                                            \
    for ((Count) = 1; (Count) < (Len); ++(Count))                                      \
@@ -79,6 +89,13 @@
        (Result) <<= 6;                                                        \
        (Result) |= ((Chars)[(Count)] & 0x3f);                                 \
      }
+
+#define UNICODE_VALID(Char)                   \
+    ((Char) < 0x110000 &&                     \
+     ((Char) < 0xD800 || (Char) >= 0xE000) && \
+     (Char) != 0xFFFE && (Char) != 0xFFFF)
+   
+     
  gchar g_utf8_skip[256] = {
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
@@ -473,33 +490,272 @@ unicode_strrchr (const char *p, gunichar c)
  #endif
  
  
+/* Like g_utf8_get_char, but take a maximum length
+ * and return (gunichar)-2 on incomplete trailing character
+ */
+static inline gunichar
+g_utf8_get_char_extended (const gchar *p, int max_len)
+{
+  gint i, len;
+  gunichar wc = (guchar) *p;
+
+  if (wc < 0x80)
+    {
+      return wc;
+    }
+  else if (wc < 0xc0)
+    {
+      return (gunichar)-1;
+    }
+  else if (wc < 0xe0)
+    {
+      len = 2;
+      wc &= 0x1f;
+    }
+  else if (wc < 0xf0)
+    {
+      len = 3;
+      wc &= 0x0f;
+    }
+  else if (wc < 0xf8)
+    {
+      len = 4;
+      wc &= 0x07;
+    }
+  else if (wc < 0xfc)
+    {
+      len = 5;
+      wc &= 0x03;
+    }
+  else if (wc < 0xfe)
+    {
+      len = 6;
+      wc &= 0x01;
+    }
+  else
+    {
+      return (gunichar)-1;
+    }
+  
+  if (len == -1)
+    return (gunichar)-1;
+  if (max_len >= 0 && len > max_len)
+    {
+      for (i = 1; i < max_len; i++)
+       {
+         if ((((guchar *)p)[i] & 0xc0) != 0x80)
+           return (gunichar)-1;
+       }
+      return (gunichar)-2;
+    }
+
+  for (i = 1; i < len; ++i)
+    {
+      gunichar ch = ((guchar *)p)[i];
+      
+      if ((ch & 0xc0) != 0x80)
+       {
+         if (ch)
+           return (gunichar)-1;
+         else
+           return (gunichar)-2;
+       }
+
+      wc <<= 6;
+      wc |= (ch & 0x3f);
+    }
+
+  if (UTF8_LENGTH(wc) != len)
+    return (gunichar)-1;
+  
+  return wc;
+}
+
  /**
- * g_utf8_to_ucs4:
- * @str: a UTF-8 encoded strnig
- * @len: the length of @
- * 
+ * g_utf8_to_ucs4_fast:
+ * @str: a UTF-8 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ *       the string is nul-terminated.
+ * @items_written: location to store the number of characters in the
+ *                 result, or %NULL.
+ *
   * Convert a string from UTF-8 to a 32-bit fixed width
- * representation as UCS-4.
+ * representation as UCS-4, assuming valid UTF-8 input.
+ * This function is roughly twice as fast as g_utf8_to_ucs4()
+ * but does no error checking on the input.
   * 
   * Return value: a pointer to a newly allocated UCS-4 string.
   *               This value must be freed with g_free()
   **/
  gunichar *
-g_utf8_to_ucs4 (const char *str, int len)
+g_utf8_to_ucs4_fast (const gchar *str,
+                    gint         len,
+                    gint        *items_written)
  {
+  gint j, charlen;
    gunichar *result;
    gint n_chars, i;
    const gchar *p;
+
+  g_return_val_if_fail (str != NULL, NULL);
+
+  p = str;
+  n_chars = 0;
+  if (len < 0)
+    {
+      while (*p)
+       {
+         p = g_utf8_next_char (p);
+         ++n_chars;
+       }
+    }
+  else
+    {
+      while (*p && p < str + len)
+       {
+         p = g_utf8_next_char (p);
+         ++n_chars;
+       }
+    }
    
    
-  n_chars = g_utf8_strlen (str, len);
-  result = g_new (gunichar, n_chars);
+  result = g_new (gunichar, n_chars + 1);
    
    p = str;
    for (i=0; i < n_chars; i++)
      {
-      result[i] = g_utf8_get_char (p);
-      p = g_utf8_next_char (p);
+      gunichar wc = ((unsigned char *)p)[0];
+
+      if (wc < 0x80)
+       {
+         result[i] = wc;
+         p++;
+       }
+      else
+       { 
+         if (wc < 0xe0)
+           {
+             charlen = 2;
+             wc &= 0x1f;
+           }
+         else if (wc < 0xf0)
+           {
+             charlen = 3;
+             wc &= 0x0f;
+           }
+         else if (wc < 0xf8)
+           {
+             charlen = 4;
+             wc &= 0x07;
+           }
+         else if (wc < 0xfc)
+           {
+             charlen = 5;
+             wc &= 0x03;
+           }
+         else
+           {
+             charlen = 6;
+             wc &= 0x01;
+           }
+
+         for (j = 1; j < charlen; j++)
+           {
+             wc <<= 6;
+             wc |= ((unsigned char *)p)[j] & 0x3f;
+           }
+
+         result[i] = wc;
+         p += charlen;
+       }
      }
+  result[i] = 0;
+
+  if (items_written)
+    *items_written = i;
+
+  return result;
+}
+
+/**
+ * g_utf8_to_ucs4:
+ * @str: a UTF-8 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ *       the string is nul-terminated.
+ * @items_read: location to store number of bytes read, or %NULL.
+ *              If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be
+ *              returned in case @str contains a trailing partial
+ *              character. If an error occurs then the index of the
+ *              invalid input is stored here.
+ * @items_written: location to store number of characters written, or %NULL.
+ *                 The value stored here does not include the trailing 0
+ *                 character.
+ * @error: location to store the error occurring, or %NULL to ignore
+ *         errors. Any of the errors in #GConvertError other than
+ *         %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UTF-8 to a 32-bit fixed width
+ * representation as UCS-4. A trailing 0 will be added to the
+ * string after the converted text.
+ * 
+ * Return value: a pointer to a newly allocated UCS-4 string.
+ *               This value must be freed with g_free(). If an
+ *               error occurs, %NULL will be returned and
+ *               @error set.
+ **/
+gunichar *
+g_utf8_to_ucs4 (const gchar *str,
+               gint         len,
+               gint        *items_read,
+               gint        *items_written,
+               GError     **error)
+{
+  gunichar *result = NULL;
+  gint n_chars, i;
+  const gchar *in;
+  
+  in = str;
+  n_chars = 0;
+  while ((len < 0 || str + len - in > 0) && *in)
+    {
+      gunichar wc = g_utf8_get_char_extended (in, str + len - in);
+      if (wc & 0x80000000)
+       {
+         if (wc == (gunichar)-2)
+           {
+             if (items_read)
+               break;
+             else
+               g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT,
+                            _("Partial character sequence at end of input"));
+           }
+         else
+           g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                        _("Invalid byte sequence in conversion input"));
+
+         goto err_out;
+       }
+
+      n_chars++;
+
+      in = g_utf8_next_char (in);
+    }
+
+  result = g_new (gunichar, n_chars + 1);
+  
+  in = str;
+  for (i=0; i < n_chars; i++)
+    {
+      result[i] = g_utf8_get_char (in);
+      in = g_utf8_next_char (in);
+    }
+  result[i] = 0;
+
+  if (items_written)
+    *items_written = n_chars;
+
+ err_out:
+  if (items_read)
+    *items_read = in - str;
  
    return result;
  }
@@ -507,35 +763,569 @@ g_utf8_to_ucs4 (const char *str, int len)
  /**
   * g_ucs4_to_utf8:
   * @str: a UCS-4 encoded string
- * @len: the length of @
- * 
+ * @len: the maximum length of @str to use. If < 0, then
+ *       the string is terminated with a 0 character.
+ * @items_read: location to store number of characters read, or %NULL.
+ * @items_written: location to store number of bytes written, or %NULL.
+ *                 The value stored here does not include the trailing 0
+ *                 byte.
+ * @error: location to store the error occurring, or %NULL to ignore
+ *         errors. Any of the errors in #GConvertError other than
+ *         %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
   * Convert a string from a 32-bit fixed width representation as UCS-4.
- * to UTF-8.
+ * to UTF-8. The result will be terminated with a 0 byte.
   * 
   * Return value: a pointer to a newly allocated UTF-8 string.
- *               This value must be freed with g_free()
+ *               This value must be freed with g_free(). If an
+ *               error occurs, %NULL will be returned and
+ *               @error set.
   **/
  gchar *
-g_ucs4_to_utf8 (const gunichar *str, int len)
+g_ucs4_to_utf8 (const gunichar *str,
+               gint            len,
+               gint           *items_read,
+               gint           *items_written,
+               GError        **error)
  {
    gint result_length;
-  gchar *result, *p;
+  gchar *result = NULL;
+  gchar *p;
    gint i;
  
    result_length = 0;
-  for (i = 0; i < len ; i++)
-    result_length += g_unichar_to_utf8 (str[i], NULL);
+  for (i = 0; len < 0 || i < len ; i++)
+    {
+      if (!str[i])
+       break;
  
  
-  result_length++;
+      if (str[i] >= 0x80000000)
+       {
+         if (items_read)
+           *items_read = i;
+         
+         g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                      _("Character out of range for UTF-8"));
+         goto err_out;
+       }
+      
+      result_length += UTF8_LENGTH (str[i]);
+    }
  
    result = g_malloc (result_length + 1);
    p = result;
  
-  for (i = 0; i < len ; i++)
-    p += g_unichar_to_utf8 (str[i], p);
+  i = 0;
+  while (p < result + result_length)
+    p += g_unichar_to_utf8 (str[i++], p);
    
    *p = '\0';
  
+  if (items_written)
+    *items_written = p - result;
+
+ err_out:
+  if (items_read)
+    *items_read = i;
+
+  return result;
+}
+
+#define SURROGATE_VALUE(h,l) (((h) - 0xd800) * 0x400 + (l) - 0xdc00 + 0x10000)
+
+/**
+ * g_utf16_to_utf8:
+ * @str: a UTF-16 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ *       the string is terminated with a 0 character.
+ * @items_read: location to store number of words read, or %NULL.
+ *              If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be
+ *              returned in case @str contains a trailing partial
+ *              character. If an error occurs then the index of the
+ *              invalid input is stored here.
+ * @items_written: location to store number of bytes written, or %NULL.
+ *                 The value stored here does not include the trailing
+ *                 0 byte.
+ * @error: location to store the error occurring, or %NULL to ignore
+ *         errors. Any of the errors in #GConvertError other than
+ *         %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UTF-16 to UTF-8. The result will be
+ * terminated with a 0 byte.
+ * 
+ * Return value: a pointer to a newly allocated UTF-8 string.
+ *               This value must be freed with g_free(). If an
+ *               error occurs, %NULL will be returned and
+ *               @error set.
+ **/
+gchar *
+g_utf16_to_utf8 (const gunichar2  *str,
+                gint              len,
+                gint             *items_read,
+                gint             *items_written,
+                GError          **error)
+{
+  /* This function and g_utf16_to_ucs4 are almost exactly identical - The lines that differ
+   * are marked.
+   */
+  const gunichar2 *in;
+  gchar *out;
+  gchar *result = NULL;
+  gint n_bytes;
+  gunichar high_surrogate;
+
+  g_return_val_if_fail (str != 0, NULL);
+
+  n_bytes = 0;
+  in = str;
+  high_surrogate = 0;
+  while ((len < 0 || in - str < len) && *in)
+    {
+      gunichar2 c = *in;
+      gunichar wc;
+
+      if (c >= 0xdc00 && c < 0xe000) /* low surrogate */
+       {
+         if (high_surrogate)
+           {
+             wc = SURROGATE_VALUE (high_surrogate, c);
+             high_surrogate = 0;
+           }
+         else
+           {
+             g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                          _("Invalid sequence in conversion input"));
+             goto err_out;
+           }
+       }
+      else
+       {
+         if (high_surrogate)
+           {
+             g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                          _("Invalid sequence in conversion input"));
+             goto err_out;
+           }
+
+         if (c >= 0xd800 && c < 0xdc00) /* high surrogate */
+           {
+             high_surrogate = c;
+             goto next1;
+           }
+         else
+           wc = c;
+       }
+
+      /********** DIFFERENT for UTF8/UCS4 **********/
+      n_bytes += UTF8_LENGTH (wc);
+
+    next1:
+      in++;
+    }
+
+  if (high_surrogate && !items_read)
+    {
+      g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT,
+                  _("Partial character sequence at end of input"));
+      goto err_out;
+    }
+  
+  /* At this point, everything is valid, and we just need to convert
+   */
+  /********** DIFFERENT for UTF8/UCS4 **********/
+  result = g_malloc (n_bytes + 1);
+  
+  high_surrogate = 0;
+  out = result;
+  in = str;
+  while (out < result + n_bytes)
+    {
+      gunichar2 c = *in;
+      gunichar wc;
+
+      if (c >= 0xdc00 && c < 0xe000) /* low surrogate */
+       {
+         wc = SURROGATE_VALUE (high_surrogate, c);
+         high_surrogate = 0;
+       }
+      else if (c >= 0xd800 && c < 0xdc00) /* high surrogate */
+       {
+         high_surrogate = c;
+         goto next2;
+       }
+      else
+       wc = c;
+
+      /********** DIFFERENT for UTF8/UCS4 **********/
+      out += g_unichar_to_utf8 (wc, out);
+
+    next2:
+      in++;
+    }
+  
+  /********** DIFFERENT for UTF8/UCS4 **********/
+  *out = '\0';
+
+  if (items_written)
+    /********** DIFFERENT for UTF8/UCS4 **********/
+    *items_written = out - result;
+
+ err_out:
+  if (items_read)
+    *items_read = in - str;
+
+  return result;
+}
+
+/**
+ * g_utf16_to_ucs4:
+ * @str: a UTF-16 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ *       the string is terminated with a 0 character.
+ * @items_read: location to store number of words read, or %NULL.
+ *              If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be
+ *              returned in case @str contains a trailing partial
+ *              character. If an error occurs then the index of the
+ *              invalid input is stored here.
+ * @items_written: location to store number of characters written, or %NULL.
+ *                 The value stored here does not include the trailing
+ *                 0 character.
+ * @error: location to store the error occurring, or %NULL to ignore
+ *         errors. Any of the errors in #GConvertError other than
+ *         %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UTF-16 to UCS-4. The result will be
+ * terminated with a 0 character.
+ * 
+ * Return value: a pointer to a newly allocated UCS-4 string.
+ *               This value must be freed with g_free(). If an
+ *               error occurs, %NULL will be returned and
+ *               @error set.
+ **/
+gunichar *
+g_utf16_to_ucs4 (const gunichar2  *str,
+                gint              len,
+                gint             *items_read,
+                gint             *items_written,
+                GError          **error)
+{
+  const gunichar2 *in;
+  gchar *out;
+  gchar *result = NULL;
+  gint n_bytes;
+  gunichar high_surrogate;
+
+  g_return_val_if_fail (str != 0, NULL);
+
+  n_bytes = 0;
+  in = str;
+  high_surrogate = 0;
+  while ((len < 0 || in - str < len) && *in)
+    {
+      gunichar2 c = *in;
+      gunichar wc;
+
+      if (c >= 0xdc00 && c < 0xe000) /* low surrogate */
+       {
+         if (high_surrogate)
+           {
+             wc = SURROGATE_VALUE (high_surrogate, c);
+             high_surrogate = 0;
+           }
+         else
+           {
+             g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                          _("Invalid sequence in conversion input"));
+             goto err_out;
+           }
+       }
+      else
+       {
+         if (high_surrogate)
+           {
+             g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                          _("Invalid sequence in conversion input"));
+             goto err_out;
+           }
+
+         if (c >= 0xd800 && c < 0xdc00) /* high surrogate */
+           {
+             high_surrogate = c;
+             goto next1;
+           }
+         else
+           wc = c;
+       }
+
+      /********** DIFFERENT for UTF8/UCS4 **********/
+      n_bytes += sizeof (gunichar);
+
+    next1:
+      in++;
+    }
+
+  if (high_surrogate && !items_read)
+    {
+      g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT,
+                  _("Partial character sequence at end of input"));
+      goto err_out;
+    }
+  
+  /* At this point, everything is valid, and we just need to convert
+   */
+  /********** DIFFERENT for UTF8/UCS4 **********/
+  result = g_malloc (n_bytes + 4);
+  
+  high_surrogate = 0;
+  out = result;
+  in = str;
+  while (out < result + n_bytes)
+    {
+      gunichar2 c = *in;
+      gunichar wc;
+
+      if (c >= 0xdc00 && c < 0xe000) /* low surrogate */
+       {
+         wc = SURROGATE_VALUE (high_surrogate, c);
+         high_surrogate = 0;
+       }
+      else if (c >= 0xd800 && c < 0xdc00) /* high surrogate */
+       {
+         high_surrogate = c;
+         goto next2;
+       }
+      else
+       wc = c;
+
+      /********** DIFFERENT for UTF8/UCS4 **********/
+      *(gunichar *)out = wc;
+      out += sizeof (gunichar);
+
+    next2:
+      in++;
+    }
+
+  /********** DIFFERENT for UTF8/UCS4 **********/
+  *(gunichar *)out = 0;
+
+  if (items_written)
+    /********** DIFFERENT for UTF8/UCS4 **********/
+    *items_written = (out - result) / sizeof (gunichar);
+
+ err_out:
+  if (items_read)
+    *items_read = in - str;
+
+  return (gunichar *)result;
+}
+
+/**
+ * g_utf8_to_utf16:
+ * @str: a UTF-8 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ *       the string is %NULL terminated.
+ 
+ * @items_read: location to store number of bytes read, or %NULL.
+ *              If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be
+ *              returned in case @str contains a trailing partial
+ *              character. If an error occurs then the index of the
+ *              invalid input is stored here.
+ * @items_written: location to store number of words written, or %NULL.
+ *                 The value stored here does not include the trailing
+ *                 0 word.
+ * @error: location to store the error occurring, or %NULL to ignore
+ *         errors. Any of the errors in #GConvertError other than
+ *         %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UTF-8 to UTF-16. A 0 word will be
+ * added to the result after the converted text.
+ * 
+ * Return value: a pointer to a newly allocated UTF-16 string.
+ *               This value must be freed with g_free(). If an
+ *               error occurs, %NULL will be returned and
+ *               @error set.
+ **/
+gunichar2 *
+g_utf8_to_utf16 (const gchar *str,
+                gint         len,
+                gint        *items_read,
+                gint        *items_written,
+                GError     **error)
+{
+  gunichar2 *result = NULL;
+  gint n16;
+  const gchar *in;
+  gint i;
+
+  g_return_val_if_fail (str != NULL, NULL);
+
+  in = str;
+  n16 = 0;
+  while ((len < 0 || str + len - in > 0) && *in)
+    {
+      gunichar wc = g_utf8_get_char_extended (in, str + len - in);
+      if (wc & 0x80000000)
+       {
+         if (wc == (gunichar)-2)
+           {
+             if (items_read)
+               break;
+             else
+               g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT,
+                            _("Partial character sequence at end of input"));
+           }
+         else
+           g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                        _("Invalid byte sequence in conversion input"));
+
+         goto err_out;
+       }
+
+      if (wc < 0xd800)
+       n16 += 1;
+      else if (wc < 0xe000)
+       {
+         g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                      _("Invalid sequence in conversion input"));
+
+         goto err_out;
+       }
+      else if (wc < 0x10000)
+       n16 += 1;
+      else if (wc < 0x110000)
+       n16 += 2;
+      else
+       {
+         g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                      _("Character out of range for UTF-16"));
+
+         goto err_out;
+       }
+      
+      in = g_utf8_next_char (in);
+    }
+
+  result = g_new (gunichar2, n16 + 1);
+  
+  in = str;
+  for (i = 0; i < n16;)
+    {
+      gunichar wc = g_utf8_get_char (in);
+
+      if (wc < 0x10000)
+       {
+         result[i++] = wc;
+       }
+      else
+       {
+         result[i++] = (wc - 0x10000) / 0x400 + 0xd800;
+         result[i++] = (wc - 0x10000) % 0x400 + 0xdc00;
+       }
+      
+      in = g_utf8_next_char (in);
+    }
+
+  result[i] = 0;
+
+  if (items_written)
+    *items_written = n16;
+
+ err_out:
+  if (items_read)
+    *items_read = in - str;
+  
+  return result;
+}
+
+/**
+ * g_ucs4_to_utf16:
+ * @str: a UCS-4 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ *       the string is terminated with a zero character.
+ * @items_read: location to store number of characters read, or %NULL.
+ *              If an error occurs then the index of the invalid input
+ *              is stored here.
+ * @items_written: location to store number of words written, or %NULL.
+ *                 The value stored here does not include the trailing
+ *                 0 word.
+ * @error: location to store the error occurring, or %NULL to ignore
+ *         errors. Any of the errors in #GConvertError other than
+ *         %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UCS-4 to UTF-16. A 0 word will be
+ * added to the result after the converted text.
+ * 
+ * Return value: a pointer to a newly allocated UTF-16 string.
+ *               This value must be freed with g_free(). If an
+ *               error occurs, %NULL will be returned and
+ *               @error set.
+ **/
+gunichar2 *
+g_ucs4_to_utf16 (const gunichar  *str,
+                gint             len,
+                gint            *items_read,
+                gint            *items_written,
+                GError         **error)
+{
+  gunichar2 *result = NULL;
+  gint n16;
+  gint i, j;
+
+  n16 = 0;
+  i = 0;
+  while ((len < 0 || i < len) && str[i])
+    {
+      gunichar wc = str[i];
+
+      if (wc < 0xd800)
+       n16 += 1;
+      else if (wc < 0xe000)
+       {
+         g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                      _("Invalid sequence in conversion input"));
+
+         goto err_out;
+       }
+      else if (wc < 0x10000)
+       n16 += 1;
+      else if (wc < 0x110000)
+       n16 += 2;
+      else
+       {
+         g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+                      _("Character out of range for UTF-16"));
+
+         goto err_out;
+       }
+
+      i++;
+    }
+  
+  result = g_new (gunichar2, n16 + 1);
+  
+  for (i = 0, j = 0; j < n16; i++)
+    {
+      gunichar wc = str[i];
+
+      if (wc < 0x10000)
+       {
+         result[j++] = wc;
+       }
+      else
+       {
+         result[j++] = (wc - 0x10000) / 0x400 + 0xd800;
+         result[j++] = (wc - 0x10000) % 0x400 + 0xdc00;
+       }
+    }
+  result[j] = 0;
+
+  if (items_written)
+    *items_written = n16;
+  
+ err_out:
+  if (items_read)
+    *items_read = i;
+  
    return result;
  }
  
@@ -567,6 +1357,8 @@ g_utf8_validate (const gchar  *str,
  {
  
    const gchar *p;
+
+  g_return_val_if_fail (str != NULL, FALSE);
    
    if (end)
      *end = str;
@@ -591,8 +1383,14 @@ g_utf8_validate (const gchar  *str,
          
        UTF8_GET (result, p, i, mask, len);
  
+      if (UTF8_LENGTH (result) != len) /* Check for overlong UTF-8 */
+       break;
+
        if (result == (gunichar)-1)
          break;
+
+      if (!UNICODE_VALID (result))
+       break;
        
        p += len;
      }
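
Note on the surrogate arithmetic in the gutf8.c changes above: the SURROGATE_VALUE macro (UTF-16 pair to code point) and the divide/modulo by 0x400 in g_utf8_to_utf16() and g_ucs4_to_utf16() (code point to UTF-16 pair) should be inverses of each other. The following is a minimal standalone sketch of that round trip, not code from the patch. It uses no GLib, and the names surrogate_value and utf16_encode are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the SURROGATE_VALUE macro in the patch, written as
 * a function (hypothetical name). Both halves must be in range. */
static uint32_t
surrogate_value (uint16_t high, uint16_t low)
{
  assert (high >= 0xd800 && high < 0xdc00);
  assert (low >= 0xdc00 && low < 0xe000);

  return ((uint32_t) high - 0xd800) * 0x400
         + ((uint32_t) low - 0xdc00) + 0x10000;
}

/* Encode one code point as UTF-16 (hypothetical helper).
 * Returns the number of 16-bit words written (1 or 2), or 0 if the
 * value is a surrogate or lies beyond 0x10ffff, mirroring the range
 * checks in the patch's counting loops. */
static int
utf16_encode (uint32_t cp, uint16_t out[2])
{
  if (cp < 0xd800 || (cp >= 0xe000 && cp < 0x10000))
    {
      out[0] = (uint16_t) cp;
      return 1;
    }

  if (cp >= 0x10000 && cp < 0x110000)
    {
      out[0] = (uint16_t) ((cp - 0x10000) / 0x400 + 0xd800);
      out[1] = (uint16_t) ((cp - 0x10000) % 0x400 + 0xdc00);
      return 2;
    }

  return 0;
}
```

For example, U+10000 encodes as the pair d800 dc00 and U+10FFFF as dbff dfff, and surrogate_value() maps both pairs back to the original code points.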
diff --git a/tests/Makefile.am b/tests/Makefile.am

index 756e1b7..1d8ce8a 100644 (file)
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -33,7 +33,8 @@ test_programs = \
         thread-test     \
         threadpool-test \
         tree-test       \
-       type-test
+       type-test       \
+       unicode-encoding
  
  test_scripts = run-markup-tests.sh
  
@@ -71,6 +72,7 @@ thread_test_LDADD = $(thread_LDADD)
  threadpool_test_LDADD = $(thread_LDADD)
  tree_test_LDADD = $(progs_LDADD)
  type_test_LDADD = $(progs_LDADD)
+unicode_encoding_LDADD = $(progs_LDADD)
  
  lib_LTLIBRARIES = libmoduletestplugin_a.la libmoduletestplugin_b.la
  
diff --git a/tests/mainloop-test.c b/tests/mainloop-test.c

index 2652d63..422a669 100644 (file)
--- a/tests/mainloop-test.c
+++ b/tests/mainloop-test.c
@@ -155,7 +155,7 @@ adder_thread (gpointer data)
  
    g_free (channels);
    
-  g_main_loop_destroy (addr_data.loop);
+  g_main_loop_unref (addr_data.loop);
  
    g_print ("Timeout run %d times\n", addr_data.count);
  
@@ -393,7 +393,7 @@ main (int   argc,
    g_timeout_add (RECURSER_TIMEOUT, recurser_start, NULL);
  
    g_main_loop_run (main_loop);
-  g_main_loop_destroy (main_loop);
+  g_main_loop_unref (main_loop);
  
  #endif
    return 0;
diff --git a/tests/unicode-encoding.c b/tests/unicode-encoding.c

new file mode 100644 (file)

index 0000000..498137b
--- /dev/null
+++ b/tests/unicode-encoding.c
@@ -0,0 +1,411 @@
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <glib.h>
+
+static gint exit_status = 0;
+
+void
+croak (char *format, ...)
+{
+  va_list va;
+  
+  va_start (va, format);
+  vfprintf (stderr, format, va);
+  va_end (va);
+
+  exit (1);
+}
+
+void
+fail (char *format, ...)
+{
+  va_list va;
+  
+  va_start (va, format);
+  vfprintf (stderr, format, va);
+  va_end (va);
+
+  exit_status |= 1;
+}
+
+typedef enum
+{
+  VALID,
+  INCOMPLETE,
+  NOTUNICODE,
+  OVERLONG,
+  MALFORMED
+} Status;
+
+static gboolean
+ucs4_equal (gunichar *a, gunichar *b)
+{
+  while (*a && *b && (*a == *b))
+    {
+      a++;
+      b++;
+    }
+
+  return (*a == *b);
+}
+
+static gboolean
+utf16_equal (gunichar2 *a, gunichar2 *b)
+{
+  while (*a && *b && (*a == *b))
+    {
+      a++;
+      b++;
+    }
+
+  return (*a == *b);
+}
+
+static gint
+utf16_count (gunichar2 *a)
+{
+  gint result = 0;
+  
+  while (a[result])
+    result++;
+
+  return result;
+}
+
+static void
+process (gint      line,
+        gchar    *utf8,
+        Status    status,
+        gunichar *ucs4,
+        gint      ucs4_len)
+{
+  const gchar *end;
+  gboolean is_valid = g_utf8_validate (utf8, -1, &end);
+  GError *error = NULL;
+  gint items_read, items_written;
+
+  switch (status)
+    {
+    case VALID:
+      if (!is_valid)
+       {
+         fail ("line %d: valid but g_utf8_validate returned FALSE\n", line);
+         return;
+       }
+      break;
+    case NOTUNICODE:
+    case INCOMPLETE:
+    case OVERLONG:
+    case MALFORMED:
+      if (is_valid)
+       {
+         fail ("line %d: invalid but g_utf8_validate returned TRUE\n", line);
+         return;
+       }
+      break;
+    }
+
+  if (status == INCOMPLETE)
+    {
+      gunichar *ucs4_result;      
+
+      ucs4_result = g_utf8_to_ucs4 (utf8, -1, NULL, NULL, &error);
+
+      if (!error || !g_error_matches (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT))
+       {
+         fail ("line %d: incomplete input not properly detected\n", line);
+         return;
+       }
+      g_clear_error (&error);
+
+      ucs4_result = g_utf8_to_ucs4 (utf8, -1, &items_read, NULL, &error);
+
+      if (!ucs4_result || items_read == strlen (utf8))
+       {
+         fail ("line %d: incomplete input not properly detected\n", line);
+         return;
+       }
+
+      g_free (ucs4_result);
+    }
+
+  if (status == VALID || status == NOTUNICODE)
+    {
+      gunichar *ucs4_result;
+      gchar *utf8_result;
+
+      ucs4_result = g_utf8_to_ucs4 (utf8, -1, &items_read, &items_written, &error);
+      if (!ucs4_result)
+       {
+         fail ("line %d: conversion to ucs4 failed: %s\n", line, error->message);
+         return;
+       }
+      
+      if (!ucs4_equal (ucs4_result, ucs4) ||
+         items_read != strlen (utf8) ||
+         items_written != ucs4_len)
+       {
+         fail ("line %d: results of conversion to ucs4 do not match expected.\n", line);
+         return;
+       }
+
+      g_free (ucs4_result);
+
+      ucs4_result = g_utf8_to_ucs4_fast (utf8, -1, &items_written);
+      
+      if (!ucs4_equal (ucs4_result, ucs4) ||
+         items_written != ucs4_len)
+       {
+         fail ("line %d: results of conversion to ucs4 do not match expected.\n", line);
+         return;
+       }
+
+      utf8_result = g_ucs4_to_utf8 (ucs4_result, -1, &items_read, &items_written, &error);
+      if (!utf8_result)
+       {
+         fail ("line %d: conversion back to utf8 failed: %s\n", line, error->message);
+         return;
+       }
+
+      if (strcmp (utf8_result, utf8) != 0 ||
+         items_read != ucs4_len ||
+         items_written != strlen (utf8))
+       {
+         fail ("line %d: conversion back to utf8 did not match original\n", line);
+         return;
+       }
+
+      g_free (utf8_result);
+      g_free (ucs4_result);
+    }
+
+  if (status == VALID)
+    {
+      gunichar2 *utf16_expected_tmp;
+      gunichar2 *utf16_expected;
+      gunichar2 *utf16_from_utf8;
+      gunichar2 *utf16_from_ucs4;
+      gunichar *ucs4_result;
+      gint bytes_written;
+      gint n_chars;
+      gchar *utf8_result;
+
+      if (!(utf16_expected_tmp = (gunichar2 *)g_convert (utf8, -1, "UTF-16", "UTF-8",
+                                                        NULL, &bytes_written, NULL)))
+       {
+         fail ("line %d: could not convert to UTF-16 via g_convert\n", line);
+         return;
+       }
+
+      /* zero-terminate and remove BOM
+       */
+      n_chars = bytes_written / 2;
+      if (utf16_expected_tmp[0] == 0xfeff) /* BOM */
+       {
+         n_chars--;
+         utf16_expected = g_new (gunichar2, n_chars + 1);
+         memcpy (utf16_expected, utf16_expected_tmp + 1, sizeof(gunichar2) * n_chars);
+       }
+      else if (utf16_expected_tmp[0] == 0xfffe) /* ANTI-BOM */
+       {
+         fail ("line %d: conversion via iconv to \"UTF-16\" is not native-endian\n", line);
+         return;
+       }
+      else
+       {
+         utf16_expected = g_new (gunichar2, n_chars + 1);
+         memcpy (utf16_expected, utf16_expected_tmp, sizeof(gunichar2) * n_chars);
+       }
+
+      utf16_expected[n_chars] = '\0';
+      
+      if (!(utf16_from_utf8 = g_utf8_to_utf16 (utf8, -1, &items_read, &items_written, &error)))
+       {
+         fail ("line %d: conversion to utf16 failed: %s\n", line, error->message);
+         return;
+       }
+
+      if (items_read != strlen (utf8) ||
+         utf16_count (utf16_from_utf8) != items_written)
+       {
+         fail ("line %d: length error in conversion to utf16\n", line);
+         return;
+       }
+
+      if (!(utf16_from_ucs4 = g_ucs4_to_utf16 (ucs4, -1, &items_read, &items_written, &error)))
+       {
+         fail ("line %d: conversion to utf16 failed: %s\n", line, error->message);
+         return;
+       }
+
+      if (items_read != ucs4_len ||
+         utf16_count (utf16_from_ucs4) != items_written)
+       {
+         fail ("line %d: length error in conversion to utf16\n", line);
+         return;
+       }
+
+      if (!utf16_equal (utf16_from_utf8, utf16_expected) ||
+         !utf16_equal (utf16_from_ucs4, utf16_expected))
+       {
+         fail ("line %d: results of conversion to utf16 do not match\n", line);
+         return;
+       }
+
+      if (!(utf8_result = g_utf16_to_utf8 (utf16_from_utf8, -1, &items_read, &items_written, &error)))
+       {
+         fail ("line %d: conversion back to utf8 failed: %s\n", line, error->message);
+         return;
+       }
+
+      if (items_read != utf16_count (utf16_from_utf8) ||
+         items_written != strlen (utf8))
+       {
+         fail ("line %d: length error in conversion from utf16 to utf8\n", line);
+         return;
+       }
+
+      if (!(ucs4_result = g_utf16_to_ucs4 (utf16_from_ucs4, -1, &items_read, &items_written, &error)))
+       {
+         fail ("line %d: conversion back to utf8/ucs4 failed\n", line);
+         return;
+       }
+
+      if (items_read != utf16_count (utf16_from_utf8) ||
+         items_written != ucs4_len)
+       {
+         fail ("line %d: length error in conversion from utf16 to ucs4\n", line);
+         return;
+       }
+
+      if (strcmp (utf8, utf8_result) != 0 ||
+         !ucs4_equal (ucs4, ucs4_result))
+       {
+         fail ("line %d: conversion back to utf8/ucs4 did not match original\n", line);
+         return;
+       }
+      
+      g_free (utf16_expected_tmp);
+      g_free (utf16_expected);
+      g_free (utf16_from_utf8);
+      g_free (utf16_from_ucs4);
+      g_free (utf8_result);
+      g_free (ucs4_result);
+    }
+}
+
+int
+main (int argc, char **argv)
+{
+  gchar *srcdir = getenv ("srcdir");
+  gchar *testfile;
+  gchar *contents;
+  GError *error = NULL;
+  gchar *p, *end;
+  char *tmp;
+  gint state = 0;
+  gint line = 1;
+  gint start_line = 0;         /* Quiet GCC */
+  gchar *utf8 = NULL;          /* Quiet GCC */
+  GArray *ucs4;
+  Status status = VALID;       /* Quiet GCC */
+
+  if (!srcdir)
+    srcdir = ".";
+  
+  testfile = g_strconcat (srcdir, "/", "utf8.txt", NULL);
+  
+  g_file_get_contents (testfile, &contents, NULL, &error);
+  if (error)
+    croak ("Cannot open utf8.txt: %s", error->message);
+
+  ucs4 = g_array_new (TRUE, FALSE, sizeof(gunichar));
+
+  p = contents;
+
+  /* Loop over lines */
+  while (*p)
+    {
+      while (*p && (*p == ' ' || *p == '\t'))
+       p++;
+
+      end = p;
+      while (*end && *end != '\n')
+       end++;
+      
+      if (!*p || *p == '#' || *p == '\n')
+       goto next_line;
+
+      tmp = g_strstrip (g_strndup (p, end - p));
+      
+      switch (state)
+       {
+       case 0:
+         /* UTF-8 string */
+         start_line = line;
+         utf8 = tmp;
+         tmp = NULL;
+         break;
+         
+       case 1:
+         /* Status */
+         if (!strcmp (tmp, "VALID"))
+           status = VALID;
+         else if (!strcmp (tmp, "INCOMPLETE"))
+           status = INCOMPLETE;
+         else if (!strcmp (tmp, "NOTUNICODE"))
+           status = NOTUNICODE;
+         else if (!strcmp (tmp, "OVERLONG"))
+           status = OVERLONG;
+         else if (!strcmp (tmp, "MALFORMED"))
+           status = MALFORMED;
+         else
+           croak ("Invalid status on line %d\n", line);
+
+         if (status != VALID && status != NOTUNICODE)
+           state++;            /* No UCS-4 data */
+         
+         break;
+         
+       case 2:
+         /* UCS-4 version */
+
+         p = strtok (tmp, " \t");
+         while (p)
+           {
+             gchar *endptr;
+             
+             gunichar ch = strtoul (p, &endptr, 16);
+             if (*endptr != '\0')
+               croak ("Invalid UCS-4 character on line %d\n", line);
+
+             g_array_append_val (ucs4, ch);
+             
+             p = strtok (NULL, " \t");
+           }
+
+         break;
+       }
+
+      g_free (tmp);
+      state = (state + 1) % 3;
+
+      if (state == 0)
+       {
+         process (start_line, utf8, status, (gunichar *)ucs4->data, ucs4->len);
+         g_array_set_size (ucs4, 0);
+         g_free (utf8);
+       }
+      
+    next_line:
+      p = end;
+      if (*p && *p == '\n')
+       p++;
+      
+      line++;
+    }
+
+  return exit_status;
+}
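
Note on the round-trip check in process(): for VALID and NOTUNICODE cases the test converts UCS-4 back to UTF-8 and requires that the bytes match the input and that items_written equals strlen (utf8). The two-pass shape of g_ucs4_to_utf8() (measure, allocate, convert) can be sketched without GLib as below. This is an illustration under stated assumptions, not the patch's implementation: encode_utf8 and ucs4_to_utf8_sketch are hypothetical names, malloc() stands in for g_malloc(), and errors are reported only by returning NULL.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Encode one code point in the 1..6 byte UTF-8 scheme.
 * With out == NULL only the length is returned. */
static int
encode_utf8 (uint32_t cp, char *out)
{
  unsigned char first;
  int len, i;

  if (cp < 0x80)           { first = 0x00; len = 1; }
  else if (cp < 0x800)     { first = 0xc0; len = 2; }
  else if (cp < 0x10000)   { first = 0xe0; len = 3; }
  else if (cp < 0x200000)  { first = 0xf0; len = 4; }
  else if (cp < 0x4000000) { first = 0xf8; len = 5; }
  else                     { first = 0xfc; len = 6; }

  if (out)
    {
      /* Fill continuation bytes from the end, six bits at a time. */
      for (i = len - 1; i > 0; i--)
        {
          out[i] = (char) ((cp & 0x3f) | 0x80);
          cp >>= 6;
        }
      out[0] = (char) (cp | first);
    }

  return len;
}

/* Convert a 0-terminated UCS-4 string to a newly allocated,
 * 0-terminated UTF-8 string. Returns NULL if a value is >= 0x80000000
 * or if allocation fails. */
static char *
ucs4_to_utf8_sketch (const uint32_t *str, size_t *bytes_written)
{
  size_t total = 0, i;
  char *result, *p;

  assert (str != NULL);

  /* Pass 1: validate and measure. */
  for (i = 0; str[i]; i++)
    {
      if (str[i] >= 0x80000000u)
        return NULL;
      total += (size_t) encode_utf8 (str[i], NULL);
    }

  result = malloc (total + 1);
  if (!result)
    return NULL;

  /* Pass 2: convert into the exactly sized buffer. */
  p = result;
  for (i = 0; str[i]; i++)
    p += encode_utf8 (str[i], p);
  *p = '\0';

  if (bytes_written)
    *bytes_written = total;  /* excludes the trailing 0 byte */

  return result;
}
```

Measuring first means the buffer never has to grow, at the cost of walking the input twice.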
diff --git a/tests/utf8.txt b/tests/utf8.txt

new file mode 100644 (file)

index 0000000..8197d0b
--- /dev/null
+++ b/tests/utf8.txt
@@ -0,0 +1,297 @@
+# This file is derived from 
+#
+#    http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
+#    
+# Which was created by   Markus Kuhn <mkuhn@acm.org> - 2000-09-02 
+#
+# lines beginning with # and blank lines are ignored
+#
+# Beyond that, this file consists of a series of test cases. Each test case consists of
+# 2 or 3 lines:
+#
+#  1. A UTF-8 string
+#  2. A status
+#      VALID      : The string is a valid UTF-8 representation of valid Unicode
+#      INCOMPLETE : The string has a partial character at the end
+#      NOTUNICODE : The string is valid UTF-8, but the characters represented
+#                   are not valid Unicode
+#      OVERLONG   : The string includes overlong sequences
+#      MALFORMED  : The string is not valid UTF-8
+# 3. If the status is VALID or NOTUNICODE, the UCS-4 representation of the string,
+#    as a series of hex numbers.
+
+# 1  Some correct UTF-8 text
+κόσμε
+VALID
+03ba 1f79 03c3 03bc 03b5
+
+# 2.1  First possible sequence of a certain length
+#
+# FIXME - handle NULLS?
+#
+# [ NULL BYTE ]
+#VALID
+#0000
+
+\80
+VALID
+0080
+
+ࠀ
+VALID
+0800
+
+𐀀
+VALID
+00010000
+
+�����
+NOTUNICODE
+00200000
+
+������
+NOTUNICODE
+04000000
+
+\7f
+VALID
+0000007f
+
+߿
+VALID
+000007ff
+
+
+NOTUNICODE
+0000ffff
+
+����
+NOTUNICODE
+001fffff
+
+�����
+NOTUNICODE
+03ffffff
+
+������
+NOTUNICODE
+7fffffff
+
+# 2.3  Other boundary conditions
+
+퟿
+VALID
+d7ff
+
+
+VALID
+e000
+
+�
+VALID
+fffd
+
+􏿿
+VALID
+0010ffff
+
+����
+NOTUNICODE
+00110000
+
+# 3.1  Unexpected continuation bytes
+
+\80
+MALFORMED
+¿
+MALFORMED
+\80¿
+MALFORMED
+\80¿\80
+MALFORMED
+\80¿\80¿
+MALFORMED
+\80¿\80¿\80
+MALFORMED
+\80¿\80¿\80¿
+MALFORMED
+\80¿\80¿\80¿\80
+MALFORMED
+\80\81\82\83\84\85\86\87\88\89\8a\8b\8c\8d\8e\8f\90\91\92\93\94\95\96\97\98\99\9a\9b\9c\9d\9e\9f¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿
+MALFORMED
+
+# 3.2  Lonely start characters
+
+À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß 
+MALFORMED
+à á â ã ä å æ ç è é ê ë ì í î ï 
+MALFORMED
+ð ñ ò ó ô õ ö ÷ 
+MALFORMED
+ø ù ú û 
+MALFORMED
+ü ý 
+MALFORMED
+
+# 3.3  Sequences with last continuation byte missing
+
+À
+INCOMPLETE
+à\80
+INCOMPLETE
+ð\80\80
+INCOMPLETE
+ø\80\80\80
+INCOMPLETE
+ü\80\80\80\80
+INCOMPLETE
+ß
+INCOMPLETE
+ï¿
+INCOMPLETE
+÷¿¿
+INCOMPLETE
+û¿¿¿
+INCOMPLETE
+ý¿¿¿¿
+INCOMPLETE
+
+# 3.4  Concatenation of incomplete sequences
+
+Àà\80ð\80\80ø\80\80\80ü\80\80\80\80ßï¿÷¿¿û¿¿¿ý¿¿¿¿
+MALFORMED
+
+# 3.5  Impossible bytes
+
+þ
+MALFORMED
+ÿ
+MALFORMED
+þþÿÿ
+MALFORMED
+
+#  Examples of an overlong ASCII character
+
+À¯
+OVERLONG
+à\80¯
+OVERLONG
+ð\80\80¯
+OVERLONG
+ø\80\80\80¯
+OVERLONG
+ü\80\80\80\80¯
+OVERLONG
+
+#  Maximum overlong sequences
+
+Á¿
+OVERLONG
+à\9f¿
+OVERLONG
+ð\8f¿¿
+OVERLONG
+ø\87¿¿¿
+OVERLONG
+ü\83¿¿¿¿
+OVERLONG
+
+# Overlong representation of the NUL character
+
+À\80
+OVERLONG
+à\80\80
+OVERLONG
+ð\80\80\80
+OVERLONG
+ø\80\80\80\80
+OVERLONG
+ü\80\80\80\80\80
+OVERLONG
+
+# Illegal code positions
+
+# Single UTF-16 surrogates
+
+���
+NOTUNICODE
+d800
+
+���
+NOTUNICODE
+db7f
+
+���
+NOTUNICODE
+db80
+
+���
+NOTUNICODE
+dbff
+
+���
+NOTUNICODE
+dc00
+
+���
+NOTUNICODE
+df80
+
+���
+NOTUNICODE
+dfff
+
+# Paired UTF-16 surrogates
+
+������
+NOTUNICODE
+d800 dc00
+
+������
+NOTUNICODE
+d800 dfff
+
+������
+NOTUNICODE
+db7f dc00
+
+������
+NOTUNICODE
+db7f dfff
+
+������
+NOTUNICODE
+db80 dc00
+
+������
+NOTUNICODE
+db80 dfff
+
+������
+NOTUNICODE
+dbff dc00
+
+������
+NOTUNICODE
+dbff dfff
+
+# Other illegal code positions
+
+
+NOTUNICODE
+fffe
+
+
+NOTUNICODE
+ffff
+
+################
+#
+# Some more tests, not from Markus Kuhn's file
+#
+
+# Mixed plane 0 and higher planes
+
+A𐀀B􏿿C
+VALID
+41 00010000 42 10ffff 43
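
Note on the status values used in utf8.txt: the five categories (VALID, INCOMPLETE, NOTUNICODE, OVERLONG, MALFORMED) can be told apart by decoding a single sequence and applying three checks in order: structure, minimal length, and code point range. The sketch below is a standalone illustration, not the patch's g_utf8_validate(). The names SeqStatus, utf8_length and classify_sequence are hypothetical, and since the definition of UNICODE_VALID is not shown in this diff, the range check simply follows what utf8.txt marks as NOTUNICODE (surrogates, fffe, ffff, and values from 110000 up).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum
{
  SEQ_VALID,
  SEQ_INCOMPLETE,
  SEQ_NOTUNICODE,
  SEQ_OVERLONG,
  SEQ_MALFORMED
} SeqStatus;

/* Minimal number of bytes needed for cp in the 1..6 byte scheme. */
static int
utf8_length (uint32_t cp)
{
  if (cp < 0x80)      return 1;
  if (cp < 0x800)     return 2;
  if (cp < 0x10000)   return 3;
  if (cp < 0x200000)  return 4;
  if (cp < 0x4000000) return 5;
  return 6;
}

/* Classify the first sequence in the n bytes at s. On VALID and
 * NOTUNICODE the decoded value is stored in *cp_out if non-NULL. */
static SeqStatus
classify_sequence (const unsigned char *s, size_t n, uint32_t *cp_out)
{
  uint32_t cp;
  int len, i;

  assert (s != NULL);

  if (n == 0)
    return SEQ_INCOMPLETE;

  /* Structure: the lead byte determines the sequence length. */
  if (s[0] < 0x80)                { cp = s[0];        len = 1; }
  else if ((s[0] & 0xe0) == 0xc0) { cp = s[0] & 0x1f; len = 2; }
  else if ((s[0] & 0xf0) == 0xe0) { cp = s[0] & 0x0f; len = 3; }
  else if ((s[0] & 0xf8) == 0xf0) { cp = s[0] & 0x07; len = 4; }
  else if ((s[0] & 0xfc) == 0xf8) { cp = s[0] & 0x03; len = 5; }
  else if ((s[0] & 0xfe) == 0xfc) { cp = s[0] & 0x01; len = 6; }
  else
    return SEQ_MALFORMED;  /* stray continuation byte, 0xfe or 0xff */

  for (i = 1; i < len; i++)
    {
      if ((size_t) i >= n)
        return SEQ_INCOMPLETE;      /* input ends inside the sequence */
      if ((s[i] & 0xc0) != 0x80)
        return SEQ_MALFORMED;       /* not a continuation byte */
      cp = (cp << 6) | (s[i] & 0x3f);
    }

  /* Minimal length: a longer encoding than necessary is overlong. */
  if (utf8_length (cp) != len)
    return SEQ_OVERLONG;

  if (cp_out)
    *cp_out = cp;

  /* Range: assumption based on the NOTUNICODE cases in utf8.txt. */
  if (cp >= 0x110000 || (cp >= 0xd800 && cp < 0xe000)
      || cp == 0xfffe || cp == 0xffff)
    return SEQ_NOTUNICODE;

  return SEQ_VALID;
}
```

The overlong check reuses the length function in the same way the patch compares UTF8_LENGTH (result) against the decoded length.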