+Fri Jan 5 11:25:42 2001 Owen Taylor <otaylor@redhat.com>
+
+ * configure.in (PACKAGE): move $enable_debug down below
+ checks for GCC to avoid setting CFLAGS prematurely,
+ change checks to avoid adding -g twice.
+
+ * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
+ 0 termination.
+
+ * gutf8.c (g_utf8_to_ucs4): Terminate result with 0.
+
+ * tests/mainloop-test.c (main): Fix uses of
+ g_main_loop_destroy().
+
+ * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
+ Tests for unicode-conversion code.
+
+ * gconvert.c (g_convert, g_convert_with_fallback): work around
+ a couple of GNU libc bugs.
+
+ * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
+ arguments to match g_convert(). Document.
+
+ * gunicode.[ch]:
+ - Implement conversion functions to and from UTF-16
+ - Standardize unicode conversion functions on prototype like
+ g_convert.
+ - Add a lot of error checking to unicode conversion functions.
+
+ * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
+ variant of g_utf8_to_ucs4.
+
+ * gutf8.c (g_utf8_validate):
+ - add g_return_if_fail (str != NULL).
+ - add checks for overlong strings, non-valid Unicode characters (>= 110000)
+ and single surrogates.
+
2001-01-05 Tor Lillqvist <tml@iki.fi>
* testglib.c (main): Add test for g_path_skip_root().
+Fri Jan 5 11:25:42 2001 Owen Taylor <otaylor@redhat.com>
+
+ * configure.in (PACKAGE): move $enable_debug down below
+ checks for GCC to avoid setting CFLAGS prematurely,
+ change checks to avoid adding -g twice.
+
+ * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
+ 0 termination.
+
+ * gutf8.c (g_utf8_to_ucs4): Terminate result with 0.
+
+ * tests/mainloop-test.c (main): Fix uses of
+ g_main_loop_destroy().
+
+ * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
+ Tests for unicode-conversion code.
+
+ * gconvert.c (g_convert, g_convert_with_fallback): work around
+ a couple of GNU libc bugs.
+
+ * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
+ arguments to match g_convert(). Document.
+
+ * gunicode.[ch]:
+ - Implement conversion functions to and from UTF-16
+ - Standardize unicode conversion functions on prototype like
+ g_convert.
+ - Add a lot of error checking to unicode conversion functions.
+
+ * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
+ variant of g_utf8_to_ucs4.
+
+ * gutf8.c (g_utf8_validate):
+ - add g_return_if_fail (str != NULL).
+ - add checks for overlong strings, non-valid Unicode characters (>= 110000)
+ and single surrogates.
+
2001-01-05 Tor Lillqvist <tml@iki.fi>
* testglib.c (main): Add test for g_path_skip_root().
+Fri Jan 5 11:25:42 2001 Owen Taylor <otaylor@redhat.com>
+
+ * configure.in (PACKAGE): move $enable_debug down below
+ checks for GCC to avoid setting CFLAGS prematurely,
+ change checks to avoid adding -g twice.
+
+ * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
+ 0 termination.
+
+ * gutf8.c (g_utf8_to_ucs4): Terminate result with 0.
+
+ * tests/mainloop-test.c (main): Fix uses of
+ g_main_loop_destroy().
+
+ * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
+ Tests for unicode-conversion code.
+
+ * gconvert.c (g_convert, g_convert_with_fallback): work around
+ a couple of GNU libc bugs.
+
+ * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
+ arguments to match g_convert(). Document.
+
+ * gunicode.[ch]:
+ - Implement conversion functions to and from UTF-16
+ - Standardize unicode conversion functions on prototype like
+ g_convert.
+ - Add a lot of error checking to unicode conversion functions.
+
+ * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
+ variant of g_utf8_to_ucs4.
+
+ * gutf8.c (g_utf8_validate):
+ - add g_return_if_fail (str != NULL).
+ - add checks for overlong strings, non-valid Unicode characters (>= 110000)
+ and single surrogates.
+
2001-01-05 Tor Lillqvist <tml@iki.fi>
* testglib.c (main): Add test for g_path_skip_root().
+Fri Jan 5 11:25:42 2001 Owen Taylor <otaylor@redhat.com>
+
+ * configure.in (PACKAGE): move $enable_debug down below
+ checks for GCC to avoid setting CFLAGS prematurely,
+ change checks to avoid adding -g twice.
+
+ * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
+ 0 termination.
+
+ * gutf8.c (g_utf8_to_ucs4): Terminate result with 0.
+
+ * tests/mainloop-test.c (main): Fix uses of
+ g_main_loop_destroy().
+
+ * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
+ Tests for unicode-conversion code.
+
+ * gconvert.c (g_convert, g_convert_with_fallback): work around
+ a couple of GNU libc bugs.
+
+ * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
+ arguments to match g_convert(). Document.
+
+ * gunicode.[ch]:
+ - Implement conversion functions to and from UTF-16
+ - Standardize unicode conversion functions on prototype like
+ g_convert.
+ - Add a lot of error checking to unicode conversion functions.
+
+ * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
+ variant of g_utf8_to_ucs4.
+
+ * gutf8.c (g_utf8_validate):
+ - add g_return_if_fail (str != NULL).
+ - add checks for overlong strings, non-valid Unicode characters (>= 110000)
+ and single surrogates.
+
2001-01-05 Tor Lillqvist <tml@iki.fi>
* testglib.c (main): Add test for g_path_skip_root().
+Fri Jan 5 11:25:42 2001 Owen Taylor <otaylor@redhat.com>
+
+ * configure.in (PACKAGE): move $enable_debug down below
+ checks for GCC to avoid setting CFLAGS prematurely,
+ change checks to avoid adding -g twice.
+
+ * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
+ 0 termination.
+
+ * gutf8.c (g_utf8_to_ucs4): Terminate result with 0.
+
+ * tests/mainloop-test.c (main): Fix uses of
+ g_main_loop_destroy().
+
+ * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
+ Tests for unicode-conversion code.
+
+ * gconvert.c (g_convert, g_convert_with_fallback): work around
+ a couple of GNU libc bugs.
+
+ * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
+ arguments to match g_convert(). Document.
+
+ * gunicode.[ch]:
+ - Implement conversion functions to and from UTF-16
+ - Standardize unicode conversion functions on prototype like
+ g_convert.
+ - Add a lot of error checking to unicode conversion functions.
+
+ * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
+ variant of g_utf8_to_ucs4.
+
+ * gutf8.c (g_utf8_validate):
+ - add g_return_if_fail (str != NULL).
+ - add checks for overlong strings, non-valid Unicode characters (>= 110000)
+ and single surrogates.
+
2001-01-05 Tor Lillqvist <tml@iki.fi>
* testglib.c (main): Add test for g_path_skip_root().
+Fri Jan 5 11:25:42 2001 Owen Taylor <otaylor@redhat.com>
+
+ * configure.in (PACKAGE): move $enable_debug down below
+ checks for GCC to avoid setting CFLAGS prematurely,
+ change checks to avoid adding -g twice.
+
+ * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
+ 0 termination.
+
+ * gutf8.c (g_utf8_to_ucs4): Terminate result with 0.
+
+ * tests/mainloop-test.c (main): Fix uses of
+ g_main_loop_destroy().
+
+ * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
+ Tests for unicode-conversion code.
+
+ * gconvert.c (g_convert, g_convert_with_fallback): work around
+ a couple of GNU libc bugs.
+
+ * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
+ arguments to match g_convert(). Document.
+
+ * gunicode.[ch]:
+ - Implement conversion functions to and from UTF-16
+ - Standardize unicode conversion functions on prototype like
+ g_convert.
+ - Add a lot of error checking to unicode conversion functions.
+
+ * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
+ variant of g_utf8_to_ucs4.
+
+ * gutf8.c (g_utf8_validate):
+ - add g_return_if_fail (str != NULL).
+ - add checks for overlong strings, non-valid Unicode characters (>= 110000)
+ and single surrogates.
+
2001-01-05 Tor Lillqvist <tml@iki.fi>
* testglib.c (main): Add test for g_path_skip_root().
+Fri Jan 5 11:25:42 2001 Owen Taylor <otaylor@redhat.com>
+
+ * configure.in (PACKAGE): move $enable_debug down below
+ checks for GCC to avoid setting CFLAGS prematurely,
+ change checks to avoid adding -g twice.
+
+ * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
+ 0 termination.
+
+ * gutf8.c (g_utf8_to_ucs4): Terminate result with 0.
+
+ * tests/mainloop-test.c (main): Fix uses of
+ g_main_loop_destroy().
+
+ * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
+ Tests for unicode-conversion code.
+
+ * gconvert.c (g_convert, g_convert_with_fallback): work around
+ a couple of GNU libc bugs.
+
+ * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
+ arguments to match g_convert(). Document.
+
+ * gunicode.[ch]:
+ - Implement conversion functions to and from UTF-16
+ - Standardize unicode conversion functions on prototype like
+ g_convert.
+ - Add a lot of error checking to unicode conversion functions.
+
+ * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
+ variant of g_utf8_to_ucs4.
+
+ * gutf8.c (g_utf8_validate):
+ - add g_return_if_fail (str != NULL).
+ - add checks for overlong strings, non-valid Unicode characters (>= 110000)
+ and single surrogates.
+
2001-01-05 Tor Lillqvist <tml@iki.fi>
* testglib.c (main): Add test for g_path_skip_root().
+Fri Jan 5 11:25:42 2001 Owen Taylor <otaylor@redhat.com>
+
+ * configure.in (PACKAGE): move $enable_debug down below
+ checks for GCC to avoid setting CFLAGS prematurely,
+ change checks to avoid adding -g twice.
+
+ * gutf8.c (g_ucs4_to_utf8): Support len < 0 to mean
+ 0 termination.
+
+ * gutf8.c (g_utf8_to_ucs4): Terminate result with 0.
+
+ * tests/mainloop-test.c (main): Fix uses of
+ g_main_loop_destroy().
+
+ * tests/unicode-encoding.c tests/Makefile.am tests/utf8.txt:
+ Tests for unicode-conversion code.
+
+ * gconvert.c (g_convert, g_convert_with_fallback): work around
+ a couple of GNU libc bugs.
+
+ * gconvert.[ch] (g_{locale,filename}_{to,from}_utf8): Standardize
+ arguments to match g_convert(). Document.
+
+ * gunicode.[ch]:
+ - Implement conversion functions to and from UTF-16
+ - Standardize unicode conversion functions on prototype like
+ g_convert.
+ - Add a lot of error checking to unicode conversion functions.
+
+ * gunicode.[ch] (g_utf8_to_ucs4_fast): Add fast, non-checking
+ variant of g_utf8_to_ucs4.
+
+ * gutf8.c (g_utf8_validate):
+ - add g_return_if_fail (str != NULL).
+ - add checks for overlong strings, non-valid Unicode characters (>= 110000)
+ and single surrogates.
+
2001-01-05 Tor Lillqvist <tml@iki.fi>
* testglib.c (main): Add test for g_path_skip_root().
enable_threads=no
fi
-if test "x$enable_debug" = "xyes"; then
- test "$cflags_set" = set || CFLAGS="$CFLAGS -g"
- GLIB_DEBUG_FLAGS="-DG_ENABLE_DEBUG"
-else
- if test "x$enable_debug" = "xno"; then
- GLIB_DEBUG_FLAGS="-DG_DISABLE_ASSERT -DG_DISABLE_CHECKS"
- fi
-fi
-
AC_DEFINE_UNQUOTED(G_COMPILED_WITH_DEBUGGING, "${enable_debug}",
[Whether glib was compiled with debugging enabled])
AM_PROG_CC_STDC
AC_PROG_INSTALL
+if test "x$enable_debug" = "xyes"; then
+ if test x$cflags_set != xset ; then
+ case " $CFLAGS " in
+ *[[\ \ ]]-g[[\ \ ]]*) ;;
+ *) CFLAGS="$CFLAGS -g" ;;
+ esac
+ fi
+
+ GLIB_DEBUG_FLAGS="-DG_ENABLE_DEBUG"
+else
+ if test "x$enable_debug" = "xno"; then
+ GLIB_DEBUG_FLAGS="-DG_DISABLE_ASSERT -DG_DISABLE_CHECKS"
+ fi
+fi
+
# define a MAINT-like variable REBUILD which is set if Perl
# and awk are found, so autogenerated sources can be rebuilt
AC_PROG_AWK
p = str;
inbytes_remaining = len;
- outbuf_size = len + 1; /* + 1 for nul in case len == 1 */
+
+ /* Due to a GLIBC bug, round outbuf_size up to a multiple of 4 */
+ /* + 1 for nul in case len == 1 */
+ outbuf_size = ((len + 3) & ~3) + 1;
+
outbytes_remaining = outbuf_size - 1; /* -1 for nul */
outp = dest = g_malloc (outbuf_size);
case E2BIG:
{
size_t used = outp - dest;
- outbuf_size *= 2;
- dest = g_realloc (dest, outbuf_size);
- outp = dest + used;
- outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */
+ /* glibc's iconv can return E2BIG even if there is space
+ * remaining if an internal buffer is exhausted. The
+ * folllowing is a heuristic to catch this. The 16 is
+ * pretty arbitrary.
+ */
+ if (used + 16 > outbuf_size)
+ {
+ outbuf_size = (outbuf_size - 1) * 2 + 1;
+ dest = g_realloc (dest, outbuf_size);
+
+ outp = dest + used;
+ outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */
+ }
goto again;
}
* for the original string while we are converting the fallback
*/
p = utf8;
- outbuf_size = len + 1; /* + 1 for nul in case len == 1 */
+ /* Due to a GLIBC bug, round outbuf_size up to a multiple of 4 */
+ /* + 1 for nul in case len == 1 */
+ outbuf_size = ((len + 3) & ~3) + 1;
outbytes_remaining = outbuf_size - 1; /* -1 for nul */
outp = dest = g_malloc (outbuf_size);
case E2BIG:
{
size_t used = outp - dest;
- outbuf_size *= 2;
- dest = g_realloc (dest, outbuf_size);
-
- outp = dest + used;
- outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */
+
+ /* glibc's iconv can return E2BIG even if there is space
+ * remaining if an internal buffer is exhausted. The
+ * folllowing is a heuristic to catch this. The 16 is
+ * pretty arbitrary.
+ */
+ if (used + 16 > outbuf_size)
+ {
+ outbuf_size = (outbuf_size - 1) * 2 + 1;
+ dest = g_realloc (dest, outbuf_size);
+
+ outp = dest + used;
+ outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */
+ }
break;
}
/*
* g_locale_to_utf8
*
+ *
+ */
+
+/**
+ * g_locale_to_utf8:
+ * @opsysstring: a string in the encoding of the current locale
+ * @len: the length of the string, or -1 if the string is
+ * NULL-terminated.
+ * @bytes_read: location to store the number of bytes in the
+ * input string that were successfully converted, or %NULL.
+ * Even if the conversion was succesful, this may be
+ * less than len if there were partial characters
+ * at the end of the input. If the error
+ * G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value
+ * stored will the byte fofset after the last valid
+ * input sequence.
+ * @bytes_written: the stored in the output buffer (not including the
+ * terminating nul.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError may occur.
+ *
* Converts a string which is in the encoding used for strings by
* the C runtime (usually the same as that used by the operating
* system) in the current locale into a UTF-8 string.
- */
-
+ *
+ * Return value: The converted string, or %NULL on an error.
+ **/
gchar *
-g_locale_to_utf8 (const gchar *opsysstring, GError **error)
+g_locale_to_utf8 (const gchar *opsysstring,
+ gint len,
+ gint *bytes_read,
+ gint *bytes_written,
+ GError **error)
{
#ifdef G_OS_WIN32
- gint i, clen, wclen, first;
- const gint len = strlen (opsysstring);
+ gint i, clen, total_len, wclen, first;
+ const gint len = len < 0 ? strlen (opsysstring) : len;
wchar_t *wcs, wc;
gchar *result, *bp;
const wchar_t *wcp;
wclen = MultiByteToWideChar (CP_ACP, 0, opsysstring, len, wcs, len);
wcp = wcs;
- clen = 0;
+ total_len = 0;
for (i = 0; i < wclen; i++)
{
wc = *wcp++;
if (wc < 0x80)
- clen += 1;
+ total_len += 1;
else if (wc < 0x800)
- clen += 2;
+ total_len += 2;
else if (wc < 0x10000)
- clen += 3;
+ total_len += 3;
else if (wc < 0x200000)
- clen += 4;
+ total_len += 4;
else if (wc < 0x4000000)
- clen += 5;
+ total_len += 5;
else
- clen += 6;
+ total_len += 6;
}
- result = g_malloc (clen + 1);
+ result = g_malloc (total_len + 1);
wcp = wcs;
bp = result;
g_free (wcs);
+ if (bytes_read)
+ *bytes_read = len;
+ if (bytes_written)
+ *bytes_written = total_len;
+
return result;
#else
if (g_get_charset (&charset))
return g_strdup (opsysstring);
- str = g_convert (opsysstring, strlen (opsysstring),
- "UTF-8", charset, NULL, NULL, error);
+ str = g_convert (opsysstring, len,
+ "UTF-8", charset, bytes_read, bytes_written, error);
return str;
#endif
}
-/*
- * g_locale_from_utf8
- *
- * The reverse of g_locale_to_utf8.
- */
-
+/**
+ * g_locale_from_utf8:
+ * @utf8string: a UTF-8 encoded string
+ * @len: the length of the string, or -1 if the string is
+ * NULL-terminated.
+ * @bytes_read: location to store the number of bytes in the
+ * input string that were successfully converted, or %NULL.
+ * Even if the conversion was succesful, this may be
+ * less than len if there were partial characters
+ * at the end of the input. If the error
+ * G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value
+ * stored will the byte fofset after the last valid
+ * input sequence.
+ * @bytes_written: the stored in the output buffer (not including the
+ * terminating nul.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError may occur.
+ *
+ * Converts a string from UTF-8 to the encoding used for strings by
+ * the C runtime (usually the same as that used by the operating
+ * system) in the current locale.
+ *
+ * Return value: The converted string, or %NULL on an error.
+ **/
gchar *
-g_locale_from_utf8 (const gchar *utf8string, GError **error)
+g_locale_from_utf8 (const gchar *utf8string,
+ gint len,
+ gint *bytes_read,
+ gint *bytes_written,
+ GError **error)
{
#ifdef G_OS_WIN32
gint i, mask, clen, mblen;
- const gint len = strlen (utf8string);
+ const gint len = len < 0 ? strlen (utf8string) : len;
wchar_t *wcs, *wcp;
gchar *result;
guchar *cp, *end, c;
result[mblen] = 0;
g_free (wcs);
+ if (bytes_read)
+ *bytes_read = len;
+ if (bytes_written)
+ *bytes_written = mblen;
+
return result;
#else
return g_strdup (utf8string);
str = g_convert (utf8string, strlen (utf8string),
- charset, "UTF-8", NULL, NULL, error);
+ charset, "UTF-8", bytes_read, bytes_written, error);
return str;
#endif
}
-/* Filenames are in UTF-8 unless specificially requested otherwise */
-
+/**
+ * g_filename_to_utf8:
+ * @opsysstring: a string in the encoding for filenames
+ * @len: the length of the string, or -1 if the string is
+ * NULL-terminated.
+ * @bytes_read: location to store the number of bytes in the
+ * input string that were successfully converted, or %NULL.
+ * Even if the conversion was succesful, this may be
+ * less than len if there were partial characters
+ * at the end of the input. If the error
+ * G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value
+ * stored will the byte fofset after the last valid
+ * input sequence.
+ * @bytes_written: the stored in the output buffer (not including the
+ * terminating nul.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError may occur.
+ *
+ * Converts a string which is in the encoding used for filenames
+ * into a UTF-8 string.
+ *
+ * Return value: The converted string, or %NULL on an error.
+ **/
gchar*
-g_filename_to_utf8 (const gchar *string, GError **error)
-
+g_filename_to_utf8 (const gchar *opsysstring,
+ gint len,
+ gint *bytes_read,
+ gint *bytes_written,
+ GError **error)
{
#ifdef G_OS_WIN32
- return g_locale_to_utf8 (string, error);
+ return g_locale_to_utf8 (opsysstring, len,
+ bytes_read, bytes_written,
+ error);
#else
if (getenv ("G_BROKEN_FILENAMES"))
- return g_locale_to_utf8 (string, error);
+ return g_locale_to_utf8 (opsysstring, len,
+ bytes_read, bytes_written,
+ error);
- return g_strdup (string);
+ if (bytes_read || bytes_written)
+ {
+ gint len = strlen (opsysstring);
+
+ if (bytes_read)
+ *bytes_read = len;
+ if (bytes_written)
+ *bytes_written = len;
+ }
+
+ if (len < 0)
+ return g_strdup (opsysstring);
+ else
+ return g_strndup (opsysstring, len);
#endif
}
+/**
+ * g_filename_from_utf8:
+ * @utf8string: a UTF-8 encoded string
+ * @len: the length of the string, or -1 if the string is
+ * NULL-terminated.
+ * @bytes_read: location to store the number of bytes in the
+ * input string that were successfully converted, or %NULL.
+ * Even if the conversion was succesful, this may be
+ * less than len if there were partial characters
+ * at the end of the input. If the error
+ * G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value
+ * stored will the byte fofset after the last valid
+ * input sequence.
+ * @bytes_written: the stored in the output buffer (not including the
+ * terminating nul.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError may occur.
+ *
+ * Converts a string from UTF-8 to the encoding used for filenames.
+ *
+ * Return value: The converted string, or %NULL on an error.
+ **/
gchar*
-g_filename_from_utf8 (const gchar *string, GError **error)
+g_filename_from_utf8 (const gchar *utf8string,
+ gint len,
+ gint *bytes_read,
+ gint *bytes_written,
+ GError **error)
{
#ifdef G_OS_WIN32
- return g_locale_from_utf8 (string, error);
+ return g_locale_from_utf8 (utf8string, len,
+ bytes_read, bytes_written,
+ error);
#else
if (getenv ("G_BROKEN_FILENAMES"))
- return g_locale_from_utf8 (string, error);
+ return g_locale_from_utf8 (utf8string, len,
+ bytes_read, bytes_written,
+ error);
+
+ if (bytes_read || bytes_written)
+ {
+ gint len = strlen (utf8string);
+
+ if (bytes_read)
+ *bytes_read = len;
+ if (bytes_written)
+ *bytes_written = len;
+ }
- return g_strdup (string);
+ if (len < 0)
+ return g_strdup (utf8string);
+ else
+ return g_strndup (utf8string, len);
#endif
}
/* Convert between libc's idea of strings and UTF-8.
*/
-gchar* g_locale_to_utf8 (const gchar *opsysstring, GError **error);
-gchar* g_locale_from_utf8 (const gchar *utf8string, GError **error);
+gchar* g_locale_to_utf8 (const gchar *opsysstring,
+ gint len,
+ gint *bytes_read,
+ gint *bytes_written,
+ GError **error);
+gchar* g_locale_from_utf8 (const gchar *utf8string,
+ gint len,
+ gint *bytes_read,
+ gint *bytes_written,
+ GError **error);
/* Convert between the operating system (or C runtime)
* representation of file names and UTF-8.
*/
-gchar* g_filename_to_utf8 (const gchar *opsysstring, GError **error);
-gchar* g_filename_from_utf8 (const gchar *utf8string, GError **error);
+gchar* g_filename_to_utf8 (const gchar *opsysstring,
+ gint len,
+ gint *bytes_read,
+ gint *bytes_written,
+ GError **error);
+gchar* g_filename_from_utf8 (const gchar *utf8string,
+ gint len,
+ gint *bytes_read,
+ gint *bytes_written,
+ GError **error);
G_END_DECLS
p = str;
inbytes_remaining = len;
- outbuf_size = len + 1; /* + 1 for nul in case len == 1 */
+
+ /* Due to a GLIBC bug, round outbuf_size up to a multiple of 4 */
+ /* + 1 for nul in case len == 1 */
+ outbuf_size = ((len + 3) & ~3) + 1;
+
outbytes_remaining = outbuf_size - 1; /* -1 for nul */
outp = dest = g_malloc (outbuf_size);
case E2BIG:
{
size_t used = outp - dest;
- outbuf_size *= 2;
- dest = g_realloc (dest, outbuf_size);
- outp = dest + used;
- outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */
+ /* glibc's iconv can return E2BIG even if there is space
+ * remaining if an internal buffer is exhausted. The
+ * folllowing is a heuristic to catch this. The 16 is
+ * pretty arbitrary.
+ */
+ if (used + 16 > outbuf_size)
+ {
+ outbuf_size = (outbuf_size - 1) * 2 + 1;
+ dest = g_realloc (dest, outbuf_size);
+
+ outp = dest + used;
+ outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */
+ }
goto again;
}
* for the original string while we are converting the fallback
*/
p = utf8;
- outbuf_size = len + 1; /* + 1 for nul in case len == 1 */
+ /* Due to a GLIBC bug, round outbuf_size up to a multiple of 4 */
+ /* + 1 for nul in case len == 1 */
+ outbuf_size = ((len + 3) & ~3) + 1;
outbytes_remaining = outbuf_size - 1; /* -1 for nul */
outp = dest = g_malloc (outbuf_size);
case E2BIG:
{
size_t used = outp - dest;
- outbuf_size *= 2;
- dest = g_realloc (dest, outbuf_size);
-
- outp = dest + used;
- outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */
+
+ /* glibc's iconv can return E2BIG even if there is space
+ * remaining if an internal buffer is exhausted. The
+ * folllowing is a heuristic to catch this. The 16 is
+ * pretty arbitrary.
+ */
+ if (used + 16 > outbuf_size)
+ {
+ outbuf_size = (outbuf_size - 1) * 2 + 1;
+ dest = g_realloc (dest, outbuf_size);
+
+ outp = dest + used;
+ outbytes_remaining = outbuf_size - used - 1; /* -1 for nul */
+ }
break;
}
/*
* g_locale_to_utf8
*
+ *
+ */
+
+/**
+ * g_locale_to_utf8:
+ * @opsysstring: a string in the encoding of the current locale
+ * @len: the length of the string, or -1 if the string is
+ * NULL-terminated.
+ * @bytes_read: location to store the number of bytes in the
+ * input string that were successfully converted, or %NULL.
+ * Even if the conversion was succesful, this may be
+ * less than len if there were partial characters
+ * at the end of the input. If the error
+ * G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value
+ * stored will the byte fofset after the last valid
+ * input sequence.
+ * @bytes_written: the stored in the output buffer (not including the
+ * terminating nul.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError may occur.
+ *
* Converts a string which is in the encoding used for strings by
* the C runtime (usually the same as that used by the operating
* system) in the current locale into a UTF-8 string.
- */
-
+ *
+ * Return value: The converted string, or %NULL on an error.
+ **/
gchar *
-g_locale_to_utf8 (const gchar *opsysstring, GError **error)
+g_locale_to_utf8 (const gchar *opsysstring,
+ gint len,
+ gint *bytes_read,
+ gint *bytes_written,
+ GError **error)
{
#ifdef G_OS_WIN32
- gint i, clen, wclen, first;
- const gint len = strlen (opsysstring);
+ gint i, clen, total_len, wclen, first;
+ const gint len = len < 0 ? strlen (opsysstring) : len;
wchar_t *wcs, wc;
gchar *result, *bp;
const wchar_t *wcp;
wclen = MultiByteToWideChar (CP_ACP, 0, opsysstring, len, wcs, len);
wcp = wcs;
- clen = 0;
+ total_len = 0;
for (i = 0; i < wclen; i++)
{
wc = *wcp++;
if (wc < 0x80)
- clen += 1;
+ total_len += 1;
else if (wc < 0x800)
- clen += 2;
+ total_len += 2;
else if (wc < 0x10000)
- clen += 3;
+ total_len += 3;
else if (wc < 0x200000)
- clen += 4;
+ total_len += 4;
else if (wc < 0x4000000)
- clen += 5;
+ total_len += 5;
else
- clen += 6;
+ total_len += 6;
}
- result = g_malloc (clen + 1);
+ result = g_malloc (total_len + 1);
wcp = wcs;
bp = result;
g_free (wcs);
+ if (bytes_read)
+ *bytes_read = len;
+ if (bytes_written)
+ *bytes_written = total_len;
+
return result;
#else
if (g_get_charset (&charset))
return g_strdup (opsysstring);
- str = g_convert (opsysstring, strlen (opsysstring),
- "UTF-8", charset, NULL, NULL, error);
+ str = g_convert (opsysstring, len,
+ "UTF-8", charset, bytes_read, bytes_written, error);
return str;
#endif
}
-/*
- * g_locale_from_utf8
- *
- * The reverse of g_locale_to_utf8.
- */
-
+/**
+ * g_locale_from_utf8:
+ * @utf8string: a UTF-8 encoded string
+ * @len: the length of the string, or -1 if the string is
+ * NULL-terminated.
+ * @bytes_read: location to store the number of bytes in the
+ * input string that were successfully converted, or %NULL.
+ * Even if the conversion was succesful, this may be
+ * less than len if there were partial characters
+ * at the end of the input. If the error
+ * G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value
+ * stored will the byte fofset after the last valid
+ * input sequence.
+ * @bytes_written: the stored in the output buffer (not including the
+ * terminating nul.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError may occur.
+ *
+ * Converts a string from UTF-8 to the encoding used for strings by
+ * the C runtime (usually the same as that used by the operating
+ * system) in the current locale.
+ *
+ * Return value: The converted string, or %NULL on an error.
+ **/
gchar *
-g_locale_from_utf8 (const gchar *utf8string, GError **error)
+g_locale_from_utf8 (const gchar *utf8string,
+ gint len,
+ gint *bytes_read,
+ gint *bytes_written,
+ GError **error)
{
#ifdef G_OS_WIN32
gint i, mask, clen, mblen;
- const gint len = strlen (utf8string);
+ const gint len = len < 0 ? strlen (utf8string) : len;
wchar_t *wcs, *wcp;
gchar *result;
guchar *cp, *end, c;
result[mblen] = 0;
g_free (wcs);
+ if (bytes_read)
+ *bytes_read = len;
+ if (bytes_written)
+ *bytes_written = mblen;
+
return result;
#else
return g_strdup (utf8string);
str = g_convert (utf8string, strlen (utf8string),
- charset, "UTF-8", NULL, NULL, error);
+ charset, "UTF-8", bytes_read, bytes_written, error);
return str;
#endif
}
-/* Filenames are in UTF-8 unless specificially requested otherwise */
-
+/**
+ * g_filename_to_utf8:
+ * @opsysstring: a string in the encoding for filenames
+ * @len: the length of the string, or -1 if the string is
+ * NULL-terminated.
+ * @bytes_read: location to store the number of bytes in the
+ * input string that were successfully converted, or %NULL.
+ * Even if the conversion was succesful, this may be
+ * less than len if there were partial characters
+ * at the end of the input. If the error
+ * G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value
+ * stored will the byte fofset after the last valid
+ * input sequence.
+ * @bytes_written: the stored in the output buffer (not including the
+ * terminating nul.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError may occur.
+ *
+ * Converts a string which is in the encoding used for filenames
+ * into a UTF-8 string.
+ *
+ * Return value: The converted string, or %NULL on an error.
+ **/
gchar*
-g_filename_to_utf8 (const gchar *string, GError **error)
-
+g_filename_to_utf8 (const gchar *opsysstring,
+ gint len,
+ gint *bytes_read,
+ gint *bytes_written,
+ GError **error)
{
#ifdef G_OS_WIN32
- return g_locale_to_utf8 (string, error);
+ return g_locale_to_utf8 (opsysstring, len,
+ bytes_read, bytes_written,
+ error);
#else
if (getenv ("G_BROKEN_FILENAMES"))
- return g_locale_to_utf8 (string, error);
+ return g_locale_to_utf8 (opsysstring, len,
+ bytes_read, bytes_written,
+ error);
- return g_strdup (string);
+ if (bytes_read || bytes_written)
+ {
+ gint len = strlen (opsysstring);
+
+ if (bytes_read)
+ *bytes_read = len;
+ if (bytes_written)
+ *bytes_written = len;
+ }
+
+ if (len < 0)
+ return g_strdup (opsysstring);
+ else
+ return g_strndup (opsysstring, len);
#endif
}
+/**
+ * g_filename_from_utf8:
+ * @utf8string: a UTF-8 encoded string
+ * @len: the length of the string, or -1 if the string is
+ * NULL-terminated.
+ * @bytes_read: location to store the number of bytes in the
+ * input string that were successfully converted, or %NULL.
+ * Even if the conversion was succesful, this may be
+ * less than len if there were partial characters
+ * at the end of the input. If the error
+ * G_CONVERT_ERROR_ILLEGAL_SEQUENCE occurs, the value
+ * stored will the byte fofset after the last valid
+ * input sequence.
+ * @bytes_written: the stored in the output buffer (not including the
+ * terminating nul.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError may occur.
+ *
+ * Converts a string from UTF-8 to the encoding used for filenames.
+ *
+ * Return value: The converted string, or %NULL on an error.
+ **/
gchar*
-g_filename_from_utf8 (const gchar *string, GError **error)
+g_filename_from_utf8 (const gchar *utf8string,
+ gint len,
+ gint *bytes_read,
+ gint *bytes_written,
+ GError **error)
{
#ifdef G_OS_WIN32
- return g_locale_from_utf8 (string, error);
+ return g_locale_from_utf8 (utf8string, len,
+ bytes_read, bytes_written,
+ error);
#else
if (getenv ("G_BROKEN_FILENAMES"))
- return g_locale_from_utf8 (string, error);
+ return g_locale_from_utf8 (utf8string, len,
+ bytes_read, bytes_written,
+ error);
+
+ if (bytes_read || bytes_written)
+ {
+ gint len = strlen (utf8string);
+
+ if (bytes_read)
+ *bytes_read = len;
+ if (bytes_written)
+ *bytes_written = len;
+ }
- return g_strdup (string);
+ if (len < 0)
+ return g_strdup (utf8string);
+ else
+ return g_strndup (utf8string, len);
#endif
}
/* Convert between libc's idea of strings and UTF-8.
*/
-gchar* g_locale_to_utf8 (const gchar *opsysstring, GError **error);
-gchar* g_locale_from_utf8 (const gchar *utf8string, GError **error);
+gchar* g_locale_to_utf8 (const gchar *opsysstring,
+ gint len,
+ gint *bytes_read,
+ gint *bytes_written,
+ GError **error);
+gchar* g_locale_from_utf8 (const gchar *utf8string,
+ gint len,
+ gint *bytes_read,
+ gint *bytes_written,
+ GError **error);
/* Convert between the operating system (or C runtime)
* representation of file names and UTF-8.
*/
-gchar* g_filename_to_utf8 (const gchar *opsysstring, GError **error);
-gchar* g_filename_from_utf8 (const gchar *utf8string, GError **error);
+gchar* g_filename_to_utf8 (const gchar *opsysstring,
+ gint len,
+ gint *bytes_read,
+ gint *bytes_written,
+ GError **error);
+gchar* g_filename_from_utf8 (const gchar *utf8string,
+ gint len,
+ gint *bytes_read,
+ gint *bytes_written,
+ GError **error);
G_END_DECLS
gchar *g_utf8_strrchr (const gchar *p,
gunichar c);
-gunichar2 *g_utf8_to_utf16 (const gchar *str,
- gint len);
-gunichar * g_utf8_to_ucs4 (const gchar *str,
- gint len);
-gunichar * g_utf16_to_ucs4 (const gunichar2 *str,
- gint len);
-gchar * g_utf16_to_utf8 (const gunichar2 *str,
- gint len);
-gunichar * g_ucs4_to_utf16 (const gunichar *str,
- gint len);
-gchar * g_ucs4_to_utf8 (const gunichar *str,
- gint len);
+gunichar2 *g_utf8_to_utf16 (const gchar *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error);
+gunichar * g_utf8_to_ucs4 (const gchar *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error);
+gunichar * g_utf8_to_ucs4_fast (const gchar *str,
+ gint len,
+ gint *items_written);
+gunichar * g_utf16_to_ucs4 (const gunichar2 *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error);
+gchar * g_utf16_to_utf8 (const gunichar2 *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error);
+gunichar2 *g_ucs4_to_utf16 (const gunichar *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error);
+gchar * g_ucs4_to_utf8 (const gunichar *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error);
/* Convert a single character into UTF-8. outbuf must have at
* least 6 bytes of space. Returns the number of bytes in the
#include <windows.h>
#endif
+#define _(s) (s)
+
#define UTF8_COMPUTE(Char, Mask, Len) \
if (Char < 128) \
{ \
else \
Len = -1;
+#define UTF8_LENGTH(Char) \
+ ((Char) < 0x80 ? 1 : \
+ ((Char) < 0x800 ? 2 : \
+ ((Char) < 0x10000 ? 3 : \
+ ((Char) < 0x200000 ? 4 : \
+ ((Char) < 0x4000000 ? 5 : 6)))))
+
+
#define UTF8_GET(Result, Chars, Count, Mask, Len) \
(Result) = (Chars)[0] & (Mask); \
for ((Count) = 1; (Count) < (Len); ++(Count)) \
(Result) <<= 6; \
(Result) |= ((Chars)[(Count)] & 0x3f); \
}
+
+#define UNICODE_VALID(Char) \
+ ((Char) < 0x110000 && \
+ ((Char) < 0xD800 || (Char) >= 0xE000) && \
+ (Char) != 0xFFFE && (Char) != 0xFFFF)
+
+
gchar g_utf8_skip[256] = {
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
#endif
+/* Like g_utf8_get_char, but take a maximum length
+ * and return (gunichar)-2 on incomplete trailing character
+ */
+static inline gunichar
+g_utf8_get_char_extended (const gchar *p, int max_len)
+{
+ gint i, len;
+ gunichar wc = (guchar) *p;
+
+ if (wc < 0x80)
+ {
+ return wc;
+ }
+ else if (wc < 0xc0)
+ {
+ return (gunichar)-1;
+ }
+ else if (wc < 0xe0)
+ {
+ len = 2;
+ wc &= 0x1f;
+ }
+ else if (wc < 0xf0)
+ {
+ len = 3;
+ wc &= 0x0f;
+ }
+ else if (wc < 0xf8)
+ {
+ len = 4;
+ wc &= 0x07;
+ }
+ else if (wc < 0xfc)
+ {
+ len = 5;
+ wc &= 0x03;
+ }
+ else if (wc < 0xfe)
+ {
+ len = 6;
+ wc &= 0x01;
+ }
+ else
+ {
+ return (gunichar)-1;
+ }
+
+ if (len == -1)
+ return (gunichar)-1;
+ if (max_len >= 0 && len > max_len)
+ {
+ for (i = 1; i < max_len; i++)
+ {
+ if ((((guchar *)p)[i] & 0xc0) != 0x80)
+ return (gunichar)-1;
+ }
+ return (gunichar)-2;
+ }
+
+ for (i = 1; i < len; ++i)
+ {
+ gunichar ch = ((guchar *)p)[i];
+
+ if ((ch & 0xc0) != 0x80)
+ {
+ if (ch)
+ return (gunichar)-1;
+ else
+ return (gunichar)-2;
+ }
+
+ wc <<= 6;
+ wc |= (ch & 0x3f);
+ }
+
+ if (UTF8_LENGTH(wc) != len)
+ return (gunichar)-1;
+
+ return wc;
+}
+
/**
- * g_utf8_to_ucs4:
- * @str: a UTF-8 encoded strnig
- * @len: the length of @
- *
+ * g_utf8_to_ucs4_fast:
+ * @str: a UTF-8 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ * the string is %NULL terminated.
+ * @items_written: location to store the number of characters in the
+ * result, or %NULL.
+ *
* Convert a string from UTF-8 to a 32-bit fixed width
- * representation as UCS-4.
+ * representation as UCS-4, assuming valid UTF-8 input.
+ * This function is roughly twice as fast as g_utf8_to_ucs4()
+ * but does no error checking on the input.
*
* Return value: a pointer to a newly allocated UCS-4 string.
* This value must be freed with g_free()
**/
gunichar *
-g_utf8_to_ucs4 (const char *str, int len)
+g_utf8_to_ucs4_fast (const gchar *str,
+ gint len,
+ gint *items_written)
{
+ gint j, charlen;
gunichar *result;
gint n_chars, i;
const gchar *p;
+
+ g_return_val_if_fail (str != NULL, NULL);
+
+ p = str;
+ n_chars = 0;
+ if (len < 0)
+ {
+ while (*p)
+ {
+ p = g_utf8_next_char (p);
+ ++n_chars;
+ }
+ }
+ else
+ {
+ while (*p && p < str + len)
+ {
+ p = g_utf8_next_char (p);
+ ++n_chars;
+ }
+ }
- n_chars = g_utf8_strlen (str, len);
- result = g_new (gunichar, n_chars);
+ result = g_new (gunichar, n_chars + 1);
p = str;
for (i=0; i < n_chars; i++)
{
- result[i] = g_utf8_get_char (p);
- p = g_utf8_next_char (p);
+ gunichar wc = ((unsigned char *)p)[0];
+
+ if (wc < 0x80)
+ {
+ result[i] = wc;
+ p++;
+ }
+ else
+ {
+ if (wc < 0xe0)
+ {
+ charlen = 2;
+ wc &= 0x1f;
+ }
+ else if (wc < 0xf0)
+ {
+ charlen = 3;
+ wc &= 0x0f;
+ }
+ else if (wc < 0xf8)
+ {
+ charlen = 4;
+ wc &= 0x07;
+ }
+ else if (wc < 0xfc)
+ {
+ charlen = 5;
+ wc &= 0x03;
+ }
+ else
+ {
+ charlen = 6;
+ wc &= 0x01;
+ }
+
+ for (j = 1; j < charlen; j++)
+ {
+ wc <<= 6;
+ wc |= ((unsigned char *)p)[j] & 0x3f;
+ }
+
+ result[i] = wc;
+ p += charlen;
+ }
}
+ result[i] = 0;
+
+ if (items_written)
+ *items_written = i;
+
+ return result;
+}
+
+/**
+ * g_utf8_to_ucs4:
+ * @str: a UTF-8 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ * the string is %NULL terminated.
+ * @items_read: location to store number of bytes read, or %NULL.
+ * If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be
+ * returned in case @str contains a trailing partial
+ * character. If an error occurs then the index of the
+ * invalid input is stored here.
+ * @items_written: location to store number of characters written or %NULL.
+ * The value here stored does not include the trailing 0
+ * character.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError other than
+ * %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UTF-8 to a 32-bit fixed width
+ * representation as UCS-4. A trailing 0 will be added to the
+ * string after the converted text.
+ *
+ * Return value: a pointer to a newly allocated UCS-4 string.
+ * This value must be freed with g_free(). If an
+ * error occurs, %NULL will be returned and
+ * @error set.
+ **/
+gunichar *
+g_utf8_to_ucs4 (const gchar *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error)
+{
+ gunichar *result = NULL;
+ gint n_chars, i;
+ const gchar *in;
+
+ in = str;
+ n_chars = 0;
+ while ((len < 0 || str + len - in > 0) && *in)
+ {
+ gunichar wc = g_utf8_get_char_extended (in, str + len - in);
+ if (wc & 0x80000000)
+ {
+ if (wc == (gunichar)-2)
+ {
+ if (items_read)
+ break;
+ else
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT,
+ _("Partial character sequence at end of input"));
+ }
+ else
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Invalid byte sequence in conversion input"));
+
+ goto err_out;
+ }
+
+ n_chars++;
+
+ in = g_utf8_next_char (in);
+ }
+
+ result = g_new (gunichar, n_chars + 1);
+
+ in = str;
+ for (i=0; i < n_chars; i++)
+ {
+ result[i] = g_utf8_get_char (in);
+ in = g_utf8_next_char (in);
+ }
+ result[i] = 0;
+
+ if (items_written)
+ *items_written = n_chars;
+
+ err_out:
+ if (items_read)
+ *items_read = in - str;
return result;
}
/**
* g_ucs4_to_utf8:
* @str: a UCS-4 encoded string
- * @len: the length of @
- *
+ * @len: the maximum length of @str to use. If < 0, then
+ * the string is %NULL terminated.
+ * @items_read: location to store number of characters read read, or %NULL.
+ * @items_written: location to store number of bytes written or %NULL.
+ * The value here stored does not include the trailing 0
+ * byte.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError other than
+ * %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
* Convert a string from a 32-bit fixed width representation as UCS-4.
- * to UTF-8.
+ * to UTF-8. The result will be terminated with a 0 byte.
*
* Return value: a pointer to a newly allocated UTF-8 string.
- * This value must be freed with g_free()
+ * This value must be freed with g_free(). If an
+ * error occurs, %NULL will be returned and
+ * @error set.
**/
gchar *
-g_ucs4_to_utf8 (const gunichar *str, int len)
+g_ucs4_to_utf8 (const gunichar *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error)
{
gint result_length;
- gchar *result, *p;
+ gchar *result = NULL;
+ gchar *p;
gint i;
result_length = 0;
- for (i = 0; i < len ; i++)
- result_length += g_unichar_to_utf8 (str[i], NULL);
+ for (i = 0; len < 0 || i < len ; i++)
+ {
+ if (!str[i])
+ break;
- result_length++;
+ if (str[i] >= 0x80000000)
+ {
+ if (items_read)
+ *items_read = i;
+
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Character out of range for UTF-8"));
+ goto err_out;
+ }
+
+ result_length += UTF8_LENGTH (str[i]);
+ }
result = g_malloc (result_length + 1);
p = result;
- for (i = 0; i < len ; i++)
- p += g_unichar_to_utf8 (str[i], p);
+ i = 0;
+ while (p < result + result_length)
+ p += g_unichar_to_utf8 (str[i++], p);
*p = '\0';
+ if (items_written)
+ *items_written = p - result;
+
+ err_out:
+ if (items_read)
+ *items_read = i;
+
+ return result;
+}
+
+#define SURROGATE_VALUE(h,l) (((h) - 0xd800) * 0x400 + (l) - 0xdc00 + 0x10000)
+
+/**
+ * g_utf16_to_utf8:
+ * @str: a UTF-16 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ * the string is terminated with a 0 character.
+ * @items_read: location to store number of words read, or %NULL.
+ * If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be
+ * returned in case @str contains a trailing partial
+ * character. If an error occurs then the index of the
+ * invalid input is stored here.
+ * @items_written: location to store number of bytes written, or %NULL.
+ * The value stored here does not include the trailing
+ * 0 byte.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError other than
+ * %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UTF-16 to UTF-8. The result will be
+ * terminated with a 0 byte.
+ *
+ * Return value: a pointer to a newly allocated UTF-8 string.
+ * This value must be freed with g_free(). If an
+ * error occurs, %NULL will be returned and
+ * @error set.
+ **/
+gchar *
+g_utf16_to_utf8 (const gunichar2 *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error)
+{
+ /* This function and g_utf16_to_ucs4 are almost exactly identical - The lines that differ
+ * are marked.
+ */
+ const gunichar2 *in;
+ gchar *out;
+ gchar *result = NULL;
+ gint n_bytes;
+ gunichar high_surrogate;
+
+ g_return_val_if_fail (str != 0, NULL);
+
+ n_bytes = 0;
+ in = str;
+ high_surrogate = 0;
+ while ((len < 0 || in - str < len) && *in)
+ {
+ gunichar2 c = *in;
+ gunichar wc;
+
+ if (c >= 0xdc00 && c < 0xe000) /* low surrogate */
+ {
+ if (high_surrogate)
+ {
+ wc = SURROGATE_VALUE (high_surrogate, c);
+ high_surrogate = 0;
+ }
+ else
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Invalid sequence in conversion input"));
+ goto err_out;
+ }
+ }
+ else
+ {
+ if (high_surrogate)
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Invalid sequence in conversion input"));
+ goto err_out;
+ }
+
+ if (c >= 0xd800 && c < 0xdc00) /* high surrogate */
+ {
+ high_surrogate = c;
+ goto next1;
+ }
+ else
+ wc = c;
+ }
+
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ n_bytes += UTF8_LENGTH (wc);
+
+ next1:
+ in++;
+ }
+
+ if (high_surrogate && !items_read)
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT,
+ _("Partial character sequence at end of input"));
+ goto err_out;
+ }
+
+ /* At this point, everything is valid, and we just need to convert
+ */
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ result = g_malloc (n_bytes + 1);
+
+ high_surrogate = 0;
+ out = result;
+ in = str;
+ while (out < result + n_bytes)
+ {
+ gunichar2 c = *in;
+ gunichar wc;
+
+ if (c >= 0xdc00 && c < 0xe000) /* low surrogate */
+ {
+ wc = SURROGATE_VALUE (high_surrogate, c);
+ high_surrogate = 0;
+ }
+ else if (c >= 0xd800 && c < 0xdc00) /* high surrogate */
+ {
+ high_surrogate = c;
+ goto next2;
+ }
+ else
+ wc = c;
+
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ out += g_unichar_to_utf8 (wc, out);
+
+ next2:
+ in++;
+ }
+
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ *out = '\0';
+
+ if (items_written)
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ *items_written = out - result;
+
+ err_out:
+ if (items_read)
+ *items_read = in - str;
+
+ return result;
+}
+
+/**
+ * g_utf16_to_ucs4:
+ * @str: a UTF-16 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ * the string is terminated with a 0 character.
+ * @items_read: location to store number of words read, or %NULL.
+ * If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be
+ * returned in case @str contains a trailing partial
+ * character. If an error occurs then the index of the
+ * invalid input is stored here.
+ * @items_written: location to store number of characters written, or %NULL.
+ * The value stored here does not include the trailing
+ * 0 character.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError other than
+ * %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UTF-16 to UCS-4. The result will be
+ * terminated with a 0 character.
+ *
+ * Return value: a pointer to a newly allocated UCS-4 string.
+ * This value must be freed with g_free(). If an
+ * error occurs, %NULL will be returned and
+ * @error set.
+ **/
+gunichar *
+g_utf16_to_ucs4 (const gunichar2 *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error)
+{
+ const gunichar2 *in;
+ gchar *out;
+ gchar *result = NULL;
+ gint n_bytes;
+ gunichar high_surrogate;
+
+ g_return_val_if_fail (str != 0, NULL);
+
+ n_bytes = 0;
+ in = str;
+ high_surrogate = 0;
+ while ((len < 0 || in - str < len) && *in)
+ {
+ gunichar2 c = *in;
+ gunichar wc;
+
+ if (c >= 0xdc00 && c < 0xe000) /* low surrogate */
+ {
+ if (high_surrogate)
+ {
+ wc = SURROGATE_VALUE (high_surrogate, c);
+ high_surrogate = 0;
+ }
+ else
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Invalid sequence in conversion input"));
+ goto err_out;
+ }
+ }
+ else
+ {
+ if (high_surrogate)
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Invalid sequence in conversion input"));
+ goto err_out;
+ }
+
+ if (c >= 0xd800 && c < 0xdc00) /* high surrogate */
+ {
+ high_surrogate = c;
+ goto next1;
+ }
+ else
+ wc = c;
+ }
+
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ n_bytes += sizeof (gunichar);
+
+ next1:
+ in++;
+ }
+
+ if (high_surrogate && !items_read)
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT,
+ _("Partial character sequence at end of input"));
+ goto err_out;
+ }
+
+ /* At this point, everything is valid, and we just need to convert
+ */
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ result = g_malloc (n_bytes + 4);
+
+ high_surrogate = 0;
+ out = result;
+ in = str;
+ while (out < result + n_bytes)
+ {
+ gunichar2 c = *in;
+ gunichar wc;
+
+ if (c >= 0xdc00 && c < 0xe000) /* low surrogate */
+ {
+ wc = SURROGATE_VALUE (high_surrogate, c);
+ high_surrogate = 0;
+ }
+ else if (c >= 0xd800 && c < 0xdc00) /* high surrogate */
+ {
+ high_surrogate = c;
+ goto next2;
+ }
+ else
+ wc = c;
+
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ *(gunichar *)out = wc;
+ out += sizeof (gunichar);
+
+ next2:
+ in++;
+ }
+
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ *(gunichar *)out = 0;
+
+ if (items_written)
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ *items_written = (out - result) / sizeof (gunichar);
+
+ err_out:
+ if (items_read)
+ *items_read = in - str;
+
+ return (gunichar *)result;
+}
+
+/**
+ * g_utf8_to_utf16:
+ * @str: a UTF-8 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ * the string is %NULL terminated.
+
+ * @items_read: location to store number of bytes read, or %NULL.
+ * If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be
+ * returned in case @str contains a trailing partial
+ * character. If an error occurs then the index of the
+ * invalid input is stored here.
+ * @items_written: location to store number of words written, or %NULL.
+ * The value stored here does not include the trailing
+ * 0 word.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError other than
+ * %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UTF-8 to UTF-16. A 0 word will be
+ * added to the result after the converted text.
+ *
+ * Return value: a pointer to a newly allocated UTF-16 string.
+ * This value must be freed with g_free(). If an
+ * error occurs, %NULL will be returned and
+ * @error set.
+ **/
+gunichar2 *
+g_utf8_to_utf16 (const gchar *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error)
+{
+ gunichar2 *result = NULL;
+ gint n16;
+ const gchar *in;
+ gint i;
+
+ g_return_val_if_fail (str != NULL, NULL);
+
+ in = str;
+ n16 = 0;
+ while ((len < 0 || str + len - in > 0) && *in)
+ {
+ gunichar wc = g_utf8_get_char_extended (in, str + len - in);
+ if (wc & 0x80000000)
+ {
+ if (wc == (gunichar)-2)
+ {
+ if (items_read)
+ break;
+ else
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT,
+ _("Partial character sequence at end of input"));
+ }
+ else
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Invalid byte sequence in conversion input"));
+
+ goto err_out;
+ }
+
+ if (wc < 0xd800)
+ n16 += 1;
+ else if (wc < 0xe000)
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Invalid sequence in conversion input"));
+
+ goto err_out;
+ }
+ else if (wc < 0x10000)
+ n16 += 1;
+ else if (wc < 0x110000)
+ n16 += 2;
+ else
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Character out of range for UTF-16"));
+
+ goto err_out;
+ }
+
+ in = g_utf8_next_char (in);
+ }
+
+ result = g_new (gunichar2, n16 + 1);
+
+ in = str;
+ for (i = 0; i < n16;)
+ {
+ gunichar wc = g_utf8_get_char (in);
+
+ if (wc < 0x10000)
+ {
+ result[i++] = wc;
+ }
+ else
+ {
+ result[i++] = (wc - 0x10000) / 0x400 + 0xd800;
+ result[i++] = (wc - 0x10000) % 0x400 + 0xdc00;
+ }
+
+ in = g_utf8_next_char (in);
+ }
+
+ result[i] = 0;
+
+ if (items_written)
+ *items_written = n16;
+
+ err_out:
+ if (items_read)
+ *items_read = in - str;
+
+ return result;
+}
+
+/**
+ * g_ucs4_to_utf16:
+ * @str: a UCS-4 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ * the string is terminated with a zero character.
+ * @items_read: location to store number of bytes read, or %NULL.
+ * If an error occurs then the index of the invalid input
+ * is stored here.
+ * @items_written: location to store number of words written, or %NULL.
+ * The value stored here does not include the trailing
+ * 0 word.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError other than
+ * %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UCS-4 to UTF-16. A 0 word will be
+ * added to the result after the converted text.
+ *
+ * Return value: a pointer to a newly allocated UTF-16 string.
+ * This value must be freed with g_free(). If an
+ * error occurs, %NULL will be returned and
+ * @error set.
+ **/
+gunichar2 *
+g_ucs4_to_utf16 (const gunichar *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error)
+{
+ gunichar2 *result = NULL;
+ gint n16;
+ gint i, j;
+
+ n16 = 0;
+ i = 0;
+ while ((len < 0 || i < len) && str[i])
+ {
+ gunichar wc = str[i];
+
+ if (wc < 0xd800)
+ n16 += 1;
+ else if (wc < 0xe000)
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Invalid sequence in conversion input"));
+
+ goto err_out;
+ }
+ else if (wc < 0x10000)
+ n16 += 1;
+ else if (wc < 0x110000)
+ n16 += 2;
+ else
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Character out of range for UTF-16"));
+
+ goto err_out;
+ }
+
+ i++;
+ }
+
+ result = g_new (gunichar2, n16 + 1);
+
+ for (i = 0, j = 0; j < n16; i++)
+ {
+ gunichar wc = str[i];
+
+ if (wc < 0x10000)
+ {
+ result[j++] = wc;
+ }
+ else
+ {
+ result[j++] = (wc - 0x10000) / 0x400 + 0xd800;
+ result[j++] = (wc - 0x10000) % 0x400 + 0xdc00;
+ }
+ }
+ result[j] = 0;
+
+ if (items_written)
+ *items_written = n16;
+
+ err_out:
+ if (items_read)
+ *items_read = i;
+
return result;
}
{
const gchar *p;
+
+ g_return_val_if_fail (str != NULL, FALSE);
if (end)
*end = str;
UTF8_GET (result, p, i, mask, len);
+ if (UTF8_LENGTH (result) != len) /* Check for overlong UTF-8 */
+ break;
+
if (result == (gunichar)-1)
break;
+
+ if (!UNICODE_VALID (result))
+ break;
p += len;
}
gchar *g_utf8_strrchr (const gchar *p,
gunichar c);
-gunichar2 *g_utf8_to_utf16 (const gchar *str,
- gint len);
-gunichar * g_utf8_to_ucs4 (const gchar *str,
- gint len);
-gunichar * g_utf16_to_ucs4 (const gunichar2 *str,
- gint len);
-gchar * g_utf16_to_utf8 (const gunichar2 *str,
- gint len);
-gunichar * g_ucs4_to_utf16 (const gunichar *str,
- gint len);
-gchar * g_ucs4_to_utf8 (const gunichar *str,
- gint len);
+gunichar2 *g_utf8_to_utf16 (const gchar *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error);
+gunichar * g_utf8_to_ucs4 (const gchar *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error);
+gunichar * g_utf8_to_ucs4_fast (const gchar *str,
+ gint len,
+ gint *items_written);
+gunichar * g_utf16_to_ucs4 (const gunichar2 *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error);
+gchar * g_utf16_to_utf8 (const gunichar2 *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error);
+gunichar2 *g_ucs4_to_utf16 (const gunichar *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error);
+gchar * g_ucs4_to_utf8 (const gunichar *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error);
/* Convert a single character into UTF-8. outbuf must have at
* least 6 bytes of space. Returns the number of bytes in the
#include <windows.h>
#endif
+#define _(s) (s)
+
#define UTF8_COMPUTE(Char, Mask, Len) \
if (Char < 128) \
{ \
else \
Len = -1;
+#define UTF8_LENGTH(Char) \
+ ((Char) < 0x80 ? 1 : \
+ ((Char) < 0x800 ? 2 : \
+ ((Char) < 0x10000 ? 3 : \
+ ((Char) < 0x200000 ? 4 : \
+ ((Char) < 0x4000000 ? 5 : 6)))))
+
+
#define UTF8_GET(Result, Chars, Count, Mask, Len) \
(Result) = (Chars)[0] & (Mask); \
for ((Count) = 1; (Count) < (Len); ++(Count)) \
(Result) <<= 6; \
(Result) |= ((Chars)[(Count)] & 0x3f); \
}
+
+#define UNICODE_VALID(Char) \
+ ((Char) < 0x110000 && \
+ ((Char) < 0xD800 || (Char) >= 0xE000) && \
+ (Char) != 0xFFFE && (Char) != 0xFFFF)
+
+
gchar g_utf8_skip[256] = {
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
#endif
+/* Like g_utf8_get_char, but take a maximum length
+ * and return (gunichar)-2 on incomplete trailing character
+ */
+static inline gunichar
+g_utf8_get_char_extended (const gchar *p, int max_len)
+{
+ gint i, len;
+ gunichar wc = (guchar) *p;
+
+ if (wc < 0x80)
+ {
+ return wc;
+ }
+ else if (wc < 0xc0)
+ {
+ return (gunichar)-1;
+ }
+ else if (wc < 0xe0)
+ {
+ len = 2;
+ wc &= 0x1f;
+ }
+ else if (wc < 0xf0)
+ {
+ len = 3;
+ wc &= 0x0f;
+ }
+ else if (wc < 0xf8)
+ {
+ len = 4;
+ wc &= 0x07;
+ }
+ else if (wc < 0xfc)
+ {
+ len = 5;
+ wc &= 0x03;
+ }
+ else if (wc < 0xfe)
+ {
+ len = 6;
+ wc &= 0x01;
+ }
+ else
+ {
+ return (gunichar)-1;
+ }
+
+ if (len == -1)
+ return (gunichar)-1;
+ if (max_len >= 0 && len > max_len)
+ {
+ for (i = 1; i < max_len; i++)
+ {
+ if ((((guchar *)p)[i] & 0xc0) != 0x80)
+ return (gunichar)-1;
+ }
+ return (gunichar)-2;
+ }
+
+ for (i = 1; i < len; ++i)
+ {
+ gunichar ch = ((guchar *)p)[i];
+
+ if ((ch & 0xc0) != 0x80)
+ {
+ if (ch)
+ return (gunichar)-1;
+ else
+ return (gunichar)-2;
+ }
+
+ wc <<= 6;
+ wc |= (ch & 0x3f);
+ }
+
+ if (UTF8_LENGTH(wc) != len)
+ return (gunichar)-1;
+
+ return wc;
+}
+
/**
- * g_utf8_to_ucs4:
- * @str: a UTF-8 encoded strnig
- * @len: the length of @
- *
+ * g_utf8_to_ucs4_fast:
+ * @str: a UTF-8 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ * the string is %NULL terminated.
+ * @items_written: location to store the number of characters in the
+ * result, or %NULL.
+ *
* Convert a string from UTF-8 to a 32-bit fixed width
- * representation as UCS-4.
+ * representation as UCS-4, assuming valid UTF-8 input.
+ * This function is roughly twice as fast as g_utf8_to_ucs4()
+ * but does no error checking on the input.
*
* Return value: a pointer to a newly allocated UCS-4 string.
* This value must be freed with g_free()
**/
gunichar *
-g_utf8_to_ucs4 (const char *str, int len)
+g_utf8_to_ucs4_fast (const gchar *str,
+ gint len,
+ gint *items_written)
{
+ gint j, charlen;
gunichar *result;
gint n_chars, i;
const gchar *p;
+
+ g_return_val_if_fail (str != NULL, NULL);
+
+ p = str;
+ n_chars = 0;
+ if (len < 0)
+ {
+ while (*p)
+ {
+ p = g_utf8_next_char (p);
+ ++n_chars;
+ }
+ }
+ else
+ {
+ while (*p && p < str + len)
+ {
+ p = g_utf8_next_char (p);
+ ++n_chars;
+ }
+ }
- n_chars = g_utf8_strlen (str, len);
- result = g_new (gunichar, n_chars);
+ result = g_new (gunichar, n_chars + 1);
p = str;
for (i=0; i < n_chars; i++)
{
- result[i] = g_utf8_get_char (p);
- p = g_utf8_next_char (p);
+ gunichar wc = ((unsigned char *)p)[0];
+
+ if (wc < 0x80)
+ {
+ result[i] = wc;
+ p++;
+ }
+ else
+ {
+ if (wc < 0xe0)
+ {
+ charlen = 2;
+ wc &= 0x1f;
+ }
+ else if (wc < 0xf0)
+ {
+ charlen = 3;
+ wc &= 0x0f;
+ }
+ else if (wc < 0xf8)
+ {
+ charlen = 4;
+ wc &= 0x07;
+ }
+ else if (wc < 0xfc)
+ {
+ charlen = 5;
+ wc &= 0x03;
+ }
+ else
+ {
+ charlen = 6;
+ wc &= 0x01;
+ }
+
+ for (j = 1; j < charlen; j++)
+ {
+ wc <<= 6;
+ wc |= ((unsigned char *)p)[j] & 0x3f;
+ }
+
+ result[i] = wc;
+ p += charlen;
+ }
}
+ result[i] = 0;
+
+ if (items_written)
+ *items_written = i;
+
+ return result;
+}
+
+/**
+ * g_utf8_to_ucs4:
+ * @str: a UTF-8 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ * the string is %NULL terminated.
+ * @items_read: location to store number of bytes read, or %NULL.
+ * If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be
+ * returned in case @str contains a trailing partial
+ * character. If an error occurs then the index of the
+ * invalid input is stored here.
+ * @items_written: location to store number of characters written or %NULL.
+ * The value here stored does not include the trailing 0
+ * character.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError other than
+ * %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UTF-8 to a 32-bit fixed width
+ * representation as UCS-4. A trailing 0 will be added to the
+ * string after the converted text.
+ *
+ * Return value: a pointer to a newly allocated UCS-4 string.
+ * This value must be freed with g_free(). If an
+ * error occurs, %NULL will be returned and
+ * @error set.
+ **/
+gunichar *
+g_utf8_to_ucs4 (const gchar *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error)
+{
+ gunichar *result = NULL;
+ gint n_chars, i;
+ const gchar *in;
+
+ in = str;
+ n_chars = 0;
+ while ((len < 0 || str + len - in > 0) && *in)
+ {
+ gunichar wc = g_utf8_get_char_extended (in, str + len - in);
+ if (wc & 0x80000000)
+ {
+ if (wc == (gunichar)-2)
+ {
+ if (items_read)
+ break;
+ else
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT,
+ _("Partial character sequence at end of input"));
+ }
+ else
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Invalid byte sequence in conversion input"));
+
+ goto err_out;
+ }
+
+ n_chars++;
+
+ in = g_utf8_next_char (in);
+ }
+
+ result = g_new (gunichar, n_chars + 1);
+
+ in = str;
+ for (i=0; i < n_chars; i++)
+ {
+ result[i] = g_utf8_get_char (in);
+ in = g_utf8_next_char (in);
+ }
+ result[i] = 0;
+
+ if (items_written)
+ *items_written = n_chars;
+
+ err_out:
+ if (items_read)
+ *items_read = in - str;
return result;
}
/**
* g_ucs4_to_utf8:
* @str: a UCS-4 encoded string
- * @len: the length of @
- *
+ * @len: the maximum length of @str to use. If < 0, then
+ * the string is %NULL terminated.
+ * @items_read: location to store number of characters read read, or %NULL.
+ * @items_written: location to store number of bytes written or %NULL.
+ * The value here stored does not include the trailing 0
+ * byte.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError other than
+ * %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
* Convert a string from a 32-bit fixed width representation as UCS-4.
- * to UTF-8.
+ * to UTF-8. The result will be terminated with a 0 byte.
*
* Return value: a pointer to a newly allocated UTF-8 string.
- * This value must be freed with g_free()
+ * This value must be freed with g_free(). If an
+ * error occurs, %NULL will be returned and
+ * @error set.
**/
gchar *
-g_ucs4_to_utf8 (const gunichar *str, int len)
+g_ucs4_to_utf8 (const gunichar *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error)
{
gint result_length;
- gchar *result, *p;
+ gchar *result = NULL;
+ gchar *p;
gint i;
result_length = 0;
- for (i = 0; i < len ; i++)
- result_length += g_unichar_to_utf8 (str[i], NULL);
+ for (i = 0; len < 0 || i < len ; i++)
+ {
+ if (!str[i])
+ break;
- result_length++;
+ if (str[i] >= 0x80000000)
+ {
+ if (items_read)
+ *items_read = i;
+
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Character out of range for UTF-8"));
+ goto err_out;
+ }
+
+ result_length += UTF8_LENGTH (str[i]);
+ }
result = g_malloc (result_length + 1);
p = result;
- for (i = 0; i < len ; i++)
- p += g_unichar_to_utf8 (str[i], p);
+ i = 0;
+ while (p < result + result_length)
+ p += g_unichar_to_utf8 (str[i++], p);
*p = '\0';
+ if (items_written)
+ *items_written = p - result;
+
+ err_out:
+ if (items_read)
+ *items_read = i;
+
+ return result;
+}
+
+#define SURROGATE_VALUE(h,l) (((h) - 0xd800) * 0x400 + (l) - 0xdc00 + 0x10000)
+
+/**
+ * g_utf16_to_utf8:
+ * @str: a UTF-16 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ * the string is terminated with a 0 character.
+ * @items_read: location to store number of words read, or %NULL.
+ * If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be
+ * returned in case @str contains a trailing partial
+ * character. If an error occurs then the index of the
+ * invalid input is stored here.
+ * @items_written: location to store number of bytes written, or %NULL.
+ * The value stored here does not include the trailing
+ * 0 byte.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError other than
+ * %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UTF-16 to UTF-8. The result will be
+ * terminated with a 0 byte.
+ *
+ * Return value: a pointer to a newly allocated UTF-8 string.
+ * This value must be freed with g_free(). If an
+ * error occurs, %NULL will be returned and
+ * @error set.
+ **/
+gchar *
+g_utf16_to_utf8 (const gunichar2 *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error)
+{
+ /* This function and g_utf16_to_ucs4 are almost exactly identical - The lines that differ
+ * are marked.
+ */
+ const gunichar2 *in;
+ gchar *out;
+ gchar *result = NULL;
+ gint n_bytes;
+ gunichar high_surrogate;
+
+ g_return_val_if_fail (str != 0, NULL);
+
+ n_bytes = 0;
+ in = str;
+ high_surrogate = 0;
+ while ((len < 0 || in - str < len) && *in)
+ {
+ gunichar2 c = *in;
+ gunichar wc;
+
+ if (c >= 0xdc00 && c < 0xe000) /* low surrogate */
+ {
+ if (high_surrogate)
+ {
+ wc = SURROGATE_VALUE (high_surrogate, c);
+ high_surrogate = 0;
+ }
+ else
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Invalid sequence in conversion input"));
+ goto err_out;
+ }
+ }
+ else
+ {
+ if (high_surrogate)
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Invalid sequence in conversion input"));
+ goto err_out;
+ }
+
+ if (c >= 0xd800 && c < 0xdc00) /* high surrogate */
+ {
+ high_surrogate = c;
+ goto next1;
+ }
+ else
+ wc = c;
+ }
+
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ n_bytes += UTF8_LENGTH (wc);
+
+ next1:
+ in++;
+ }
+
+ if (high_surrogate && !items_read)
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT,
+ _("Partial character sequence at end of input"));
+ goto err_out;
+ }
+
+ /* At this point, everything is valid, and we just need to convert
+ */
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ result = g_malloc (n_bytes + 1);
+
+ high_surrogate = 0;
+ out = result;
+ in = str;
+ while (out < result + n_bytes)
+ {
+ gunichar2 c = *in;
+ gunichar wc;
+
+ if (c >= 0xdc00 && c < 0xe000) /* low surrogate */
+ {
+ wc = SURROGATE_VALUE (high_surrogate, c);
+ high_surrogate = 0;
+ }
+ else if (c >= 0xd800 && c < 0xdc00) /* high surrogate */
+ {
+ high_surrogate = c;
+ goto next2;
+ }
+ else
+ wc = c;
+
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ out += g_unichar_to_utf8 (wc, out);
+
+ next2:
+ in++;
+ }
+
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ *out = '\0';
+
+ if (items_written)
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ *items_written = out - result;
+
+ err_out:
+ if (items_read)
+ *items_read = in - str;
+
+ return result;
+}
+
+/**
+ * g_utf16_to_ucs4:
+ * @str: a UTF-16 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ * the string is terminated with a 0 character.
+ * @items_read: location to store number of words read, or %NULL.
+ * If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be
+ * returned in case @str contains a trailing partial
+ * character. If an error occurs then the index of the
+ * invalid input is stored here.
+ * @items_written: location to store number of characters written, or %NULL.
+ * The value stored here does not include the trailing
+ * 0 character.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError other than
+ * %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UTF-16 to UCS-4. The result will be
+ * terminated with a 0 character.
+ *
+ * Return value: a pointer to a newly allocated UCS-4 string.
+ * This value must be freed with g_free(). If an
+ * error occurs, %NULL will be returned and
+ * @error set.
+ **/
+gunichar *
+g_utf16_to_ucs4 (const gunichar2 *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error)
+{
+ const gunichar2 *in;
+ gchar *out;
+ gchar *result = NULL;
+ gint n_bytes;
+ gunichar high_surrogate;
+
+ g_return_val_if_fail (str != 0, NULL);
+
+ n_bytes = 0;
+ in = str;
+ high_surrogate = 0;
+ while ((len < 0 || in - str < len) && *in)
+ {
+ gunichar2 c = *in;
+ gunichar wc;
+
+ if (c >= 0xdc00 && c < 0xe000) /* low surrogate */
+ {
+ if (high_surrogate)
+ {
+ wc = SURROGATE_VALUE (high_surrogate, c);
+ high_surrogate = 0;
+ }
+ else
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Invalid sequence in conversion input"));
+ goto err_out;
+ }
+ }
+ else
+ {
+ if (high_surrogate)
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Invalid sequence in conversion input"));
+ goto err_out;
+ }
+
+ if (c >= 0xd800 && c < 0xdc00) /* high surrogate */
+ {
+ high_surrogate = c;
+ goto next1;
+ }
+ else
+ wc = c;
+ }
+
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ n_bytes += sizeof (gunichar);
+
+ next1:
+ in++;
+ }
+
+ if (high_surrogate && !items_read)
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT,
+ _("Partial character sequence at end of input"));
+ goto err_out;
+ }
+
+ /* At this point, everything is valid, and we just need to convert
+ */
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ result = g_malloc (n_bytes + 4);
+
+ high_surrogate = 0;
+ out = result;
+ in = str;
+ while (out < result + n_bytes)
+ {
+ gunichar2 c = *in;
+ gunichar wc;
+
+ if (c >= 0xdc00 && c < 0xe000) /* low surrogate */
+ {
+ wc = SURROGATE_VALUE (high_surrogate, c);
+ high_surrogate = 0;
+ }
+ else if (c >= 0xd800 && c < 0xdc00) /* high surrogate */
+ {
+ high_surrogate = c;
+ goto next2;
+ }
+ else
+ wc = c;
+
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ *(gunichar *)out = wc;
+ out += sizeof (gunichar);
+
+ next2:
+ in++;
+ }
+
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ *(gunichar *)out = 0;
+
+ if (items_written)
+ /********** DIFFERENT for UTF8/UCS4 **********/
+ *items_written = (out - result) / sizeof (gunichar);
+
+ err_out:
+ if (items_read)
+ *items_read = in - str;
+
+ return (gunichar *)result;
+}
+
+/**
+ * g_utf8_to_utf16:
+ * @str: a UTF-8 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ * the string is %NULL terminated.
+
+ * @items_read: location to store number of bytes read, or %NULL.
+ * If %NULL, then %G_CONVERT_ERROR_PARTIAL_INPUT will be
+ * returned in case @str contains a trailing partial
+ * character. If an error occurs then the index of the
+ * invalid input is stored here.
+ * @items_written: location to store number of words written, or %NULL.
+ * The value stored here does not include the trailing
+ * 0 word.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError other than
+ * %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UTF-8 to UTF-16. A 0 word will be
+ * added to the result after the converted text.
+ *
+ * Return value: a pointer to a newly allocated UTF-16 string.
+ * This value must be freed with g_free(). If an
+ * error occurs, %NULL will be returned and
+ * @error set.
+ **/
+gunichar2 *
+g_utf8_to_utf16 (const gchar *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error)
+{
+ gunichar2 *result = NULL;
+ gint n16;
+ const gchar *in;
+ gint i;
+
+ g_return_val_if_fail (str != NULL, NULL);
+
+ in = str;
+ n16 = 0;
+ while ((len < 0 || str + len - in > 0) && *in)
+ {
+ gunichar wc = g_utf8_get_char_extended (in, str + len - in);
+ if (wc & 0x80000000)
+ {
+ if (wc == (gunichar)-2)
+ {
+ if (items_read)
+ break;
+ else
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT,
+ _("Partial character sequence at end of input"));
+ }
+ else
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Invalid byte sequence in conversion input"));
+
+ goto err_out;
+ }
+
+ if (wc < 0xd800)
+ n16 += 1;
+ else if (wc < 0xe000)
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Invalid sequence in conversion input"));
+
+ goto err_out;
+ }
+ else if (wc < 0x10000)
+ n16 += 1;
+ else if (wc < 0x110000)
+ n16 += 2;
+ else
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Character out of range for UTF-16"));
+
+ goto err_out;
+ }
+
+ in = g_utf8_next_char (in);
+ }
+
+ result = g_new (gunichar2, n16 + 1);
+
+ in = str;
+ for (i = 0; i < n16;)
+ {
+ gunichar wc = g_utf8_get_char (in);
+
+ if (wc < 0x10000)
+ {
+ result[i++] = wc;
+ }
+ else
+ {
+ result[i++] = (wc - 0x10000) / 0x400 + 0xd800;
+ result[i++] = (wc - 0x10000) % 0x400 + 0xdc00;
+ }
+
+ in = g_utf8_next_char (in);
+ }
+
+ result[i] = 0;
+
+ if (items_written)
+ *items_written = n16;
+
+ err_out:
+ if (items_read)
+ *items_read = in - str;
+
+ return result;
+}
+
+/**
+ * g_ucs4_to_utf16:
+ * @str: a UCS-4 encoded string
+ * @len: the maximum length of @str to use. If < 0, then
+ * the string is terminated with a zero character.
+ * @items_read: location to store number of bytes read, or %NULL.
+ * If an error occurs then the index of the invalid input
+ * is stored here.
+ * @items_written: location to store number of words written, or %NULL.
+ * The value stored here does not include the trailing
+ * 0 word.
+ * @error: location to store the error occuring, or %NULL to ignore
+ * errors. Any of the errors in #GConvertError other than
+ * %G_CONVERT_ERROR_NO_CONVERSION may occur.
+ *
+ * Convert a string from UCS-4 to UTF-16. A 0 word will be
+ * added to the result after the converted text.
+ *
+ * Return value: a pointer to a newly allocated UTF-16 string.
+ * This value must be freed with g_free(). If an
+ * error occurs, %NULL will be returned and
+ * @error set.
+ **/
+gunichar2 *
+g_ucs4_to_utf16 (const gunichar *str,
+ gint len,
+ gint *items_read,
+ gint *items_written,
+ GError **error)
+{
+ gunichar2 *result = NULL;
+ gint n16;
+ gint i, j;
+
+ n16 = 0;
+ i = 0;
+ while ((len < 0 || i < len) && str[i])
+ {
+ gunichar wc = str[i];
+
+ if (wc < 0xd800)
+ n16 += 1;
+ else if (wc < 0xe000)
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Invalid sequence in conversion input"));
+
+ goto err_out;
+ }
+ else if (wc < 0x10000)
+ n16 += 1;
+ else if (wc < 0x110000)
+ n16 += 2;
+ else
+ {
+ g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
+ _("Character out of range for UTF-16"));
+
+ goto err_out;
+ }
+
+ i++;
+ }
+
+ result = g_new (gunichar2, n16 + 1);
+
+ for (i = 0, j = 0; j < n16; i++)
+ {
+ gunichar wc = str[i];
+
+ if (wc < 0x10000)
+ {
+ result[j++] = wc;
+ }
+ else
+ {
+ result[j++] = (wc - 0x10000) / 0x400 + 0xd800;
+ result[j++] = (wc - 0x10000) % 0x400 + 0xdc00;
+ }
+ }
+ result[j] = 0;
+
+ if (items_written)
+ *items_written = n16;
+
+ err_out:
+ if (items_read)
+ *items_read = i;
+
return result;
}
{
const gchar *p;
+
+ g_return_val_if_fail (str != NULL, FALSE);
if (end)
*end = str;
UTF8_GET (result, p, i, mask, len);
+ if (UTF8_LENGTH (result) != len) /* Check for overlong UTF-8 */
+ break;
+
if (result == (gunichar)-1)
break;
+
+ if (!UNICODE_VALID (result))
+ break;
p += len;
}
thread-test \
threadpool-test \
tree-test \
- type-test
+ type-test \
+ unicode-encoding
test_scripts = run-markup-tests.sh
threadpool_test_LDADD = $(thread_LDADD)
tree_test_LDADD = $(progs_LDADD)
type_test_LDADD = $(progs_LDADD)
+unicode_encoding_LDADD = $(progs_LDADD)
lib_LTLIBRARIES = libmoduletestplugin_a.la libmoduletestplugin_b.la
g_free (channels);
- g_main_loop_destroy (addr_data.loop);
+ g_main_loop_unref (addr_data.loop);
g_print ("Timeout run %d times\n", addr_data.count);
g_timeout_add (RECURSER_TIMEOUT, recurser_start, NULL);
g_main_loop_run (main_loop);
- g_main_loop_destroy (main_loop);
+ g_main_loop_unref (main_loop);
#endif
return 0;
--- /dev/null
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <glib.h>
+
+static gint exit_status = 0;
+
+void
+croak (char *format, ...)
+{
+ va_list va;
+
+ va_start (va, format);
+ vfprintf (stderr, format, va);
+ va_end (va);
+
+ exit (1);
+}
+
+void
+fail (char *format, ...)
+{
+ va_list va;
+
+ va_start (va, format);
+ vfprintf (stderr, format, va);
+ va_end (va);
+
+ exit_status |= 1;
+}
+
+typedef enum
+{
+ VALID,
+ INCOMPLETE,
+ NOTUNICODE,
+ OVERLONG,
+ MALFORMED
+} Status;
+
+static gboolean
+ucs4_equal (gunichar *a, gunichar *b)
+{
+ while (*a && *b && (*a == *b))
+ {
+ a++;
+ b++;
+ }
+
+ return (*a == *b);
+}
+
+static gboolean
+utf16_equal (gunichar2 *a, gunichar2 *b)
+{
+ while (*a && *b && (*a == *b))
+ {
+ a++;
+ b++;
+ }
+
+ return (*a == *b);
+}
+
+static gint
+utf16_count (gunichar2 *a)
+{
+ gint result = 0;
+
+ while (a[result])
+ result++;
+
+ return result;
+}
+
+static void
+process (gint line,
+ gchar *utf8,
+ Status status,
+ gunichar *ucs4,
+ gint ucs4_len)
+{
+ const gchar *end;
+ gboolean is_valid = g_utf8_validate (utf8, -1, &end);
+ GError *error = NULL;
+ gint items_read, items_written;
+
+ switch (status)
+ {
+ case VALID:
+ if (!is_valid)
+ {
+ fail ("line %d: valid but g_utf8_validate returned FALSE\n", line);
+ return;
+ }
+ break;
+ case NOTUNICODE:
+ case INCOMPLETE:
+ case OVERLONG:
+ case MALFORMED:
+ if (is_valid)
+ {
+ fail ("line %d: invalid but g_utf8_validate returned TRUE\n", line);
+ return;
+ }
+ break;
+ }
+
+ if (status == INCOMPLETE)
+ {
+ gunichar *ucs4_result;
+
+ ucs4_result = g_utf8_to_ucs4 (utf8, -1, NULL, NULL, &error);
+
+ if (!error || !g_error_matches (error, G_CONVERT_ERROR, G_CONVERT_ERROR_PARTIAL_INPUT))
+ {
+ fail ("line %d: incomplete input not properly detected\n", line);
+ return;
+ }
+ g_clear_error (&error);
+
+ ucs4_result = g_utf8_to_ucs4 (utf8, -1, &items_read, NULL, &error);
+
+ if (!ucs4_result || items_read == strlen (utf8))
+ {
+ fail ("line %d: incomplete input not properly detected\n", line);
+ return;
+ }
+
+ g_free (ucs4_result);
+ }
+
+ if (status == VALID || status == NOTUNICODE)
+ {
+ gunichar *ucs4_result;
+ gchar *utf8_result;
+
+ ucs4_result = g_utf8_to_ucs4 (utf8, -1, &items_read, &items_written, &error);
+ if (!ucs4_result)
+ {
+ fail ("line %d: conversion to ucs4 failed: %s\n", line, error->message);
+ return;
+ }
+
+ if (!ucs4_equal (ucs4_result, ucs4) ||
+ items_read != strlen (utf8) ||
+ items_written != ucs4_len)
+ {
+ fail ("line %d: results of conversion to ucs4 do not match expected.\n", line);
+ return;
+ }
+
+ g_free (ucs4_result);
+
+ ucs4_result = g_utf8_to_ucs4_fast (utf8, -1, &items_written);
+
+ if (!ucs4_equal (ucs4_result, ucs4) ||
+ items_written != ucs4_len)
+ {
+ fail ("line %d: results of conversion to ucs4 do not match expected.\n", line);
+ return;
+ }
+
+ utf8_result = g_ucs4_to_utf8 (ucs4_result, -1, &items_read, &items_written, &error);
+ if (!utf8_result)
+ {
+ fail ("line %d: conversion back to utf8 failed: %s", line, error->message);
+ return;
+ }
+
+ if (strcmp (utf8_result, utf8) != 0 ||
+ items_read != ucs4_len ||
+ items_written != strlen (utf8))
+ {
+ fail ("line %d: conversion back to utf8 did not match original\n", line);
+ return;
+ }
+
+ g_free (utf8_result);
+ g_free (ucs4_result);
+ }
+
+ if (status == VALID)
+ {
+ gunichar2 *utf16_expected_tmp;
+ gunichar2 *utf16_expected;
+ gunichar2 *utf16_from_utf8;
+ gunichar2 *utf16_from_ucs4;
+ gunichar *ucs4_result;
+ gint bytes_written;
+ gint n_chars;
+ gchar *utf8_result;
+
+ if (!(utf16_expected_tmp = (gunichar2 *)g_convert (utf8, -1, "UTF-16", "UTF-8",
+ NULL, &bytes_written, NULL)))
+ {
+ fail ("line %d: could not convert to UTF-16 via g_convert\n", line);
+ return;
+ }
+
+ /* zero-terminate and remove BOM
+ */
+ n_chars = bytes_written / 2;
+ if (utf16_expected_tmp[0] == 0xfeff) /* BOM */
+ {
+ n_chars--;
+ utf16_expected = g_new (gunichar2, n_chars + 1);
+ memcpy (utf16_expected, utf16_expected_tmp + 1, sizeof(gunichar2) * n_chars);
+ }
+ else if (utf16_expected_tmp[0] == 0xfffe) /* ANTI-BOM */
+ {
+ fail ("line %d: conversion via iconv to \"UTF-16\" is not native-endian\n");
+ return;
+ }
+ else
+ {
+ utf16_expected = g_new (gunichar2, n_chars + 1);
+ memcpy (utf16_expected, utf16_expected_tmp, sizeof(gunichar2) * n_chars);
+ }
+
+ utf16_expected[n_chars] = '\0';
+
+ if (!(utf16_from_utf8 = g_utf8_to_utf16 (utf8, -1, &items_read, &items_written, &error)))
+ {
+ fail ("line %d: conversion to ucs16 failed: %s\n", line, error->message);
+ return;
+ }
+
+ if (items_read != strlen (utf8) ||
+ utf16_count (utf16_from_utf8) != items_written)
+ {
+ fail ("line %d: length error in conversion to ucs16\n", line);
+ return;
+ }
+
+ if (!(utf16_from_ucs4 = g_ucs4_to_utf16 (ucs4, -1, &items_read, &items_written, &error)))
+ {
+ fail ("line %d: conversion to ucs16 failed: %s\n", line, error->message);
+ return;
+ }
+
+ if (items_read != ucs4_len ||
+ utf16_count (utf16_from_ucs4) != items_written)
+ {
+ fail ("line %d: length error in conversion to ucs16\n", line);
+ return;
+ }
+
+ if (!utf16_equal (utf16_from_utf8, utf16_expected) ||
+ !utf16_equal (utf16_from_ucs4, utf16_expected))
+ {
+ fail ("line %d: results of conversion to ucs16 do not match\n", line);
+ return;
+ }
+
+ if (!(utf8_result = g_utf16_to_utf8 (utf16_from_utf8, -1, &items_read, &items_written, &error)))
+ {
+ fail ("line %d: conversion back to utf8 failed: %s\n", line, error->message);
+ return;
+ }
+
+ if (items_read != utf16_count (utf16_from_utf8) ||
+ items_written != strlen (utf8))
+ {
+ fail ("line %d: length error in conversion from ucs16 to utf8\n", line);
+ return;
+ }
+
+ if (!(ucs4_result = g_utf16_to_ucs4 (utf16_from_ucs4, -1, &items_read, &items_written, &error)))
+ {
+ fail ("line %d: conversion back to utf8/ucs4 failed\n", line);
+ return;
+ }
+
+ if (items_read != utf16_count (utf16_from_utf8) ||
+ items_written != ucs4_len)
+ {
+ fail ("line %d: length error in conversion from ucs16 to ucs4\n", line);
+ return;
+ }
+
+ if (strcmp (utf8, utf8_result) != 0 ||
+ !ucs4_equal (ucs4, ucs4_result))
+ {
+ fail ("line %d: conversion back to utf8/ucs4 did not match original\n", line);
+ return;
+ }
+
+ g_free (utf16_expected_tmp);
+ g_free (utf16_expected);
+ g_free (utf16_from_utf8);
+ g_free (utf16_from_ucs4);
+ g_free (utf8_result);
+ g_free (ucs4_result);
+ }
+}
+
+int
+main (int argc, char **argv)
+{
+ gchar *srcdir = getenv ("srcdir");
+ gchar *testfile;
+ gchar *contents;
+ GError *error = NULL;
+ gchar *p, *end;
+ char *tmp;
+ gint state = 0;
+ gint line = 1;
+ gint start_line = 0; /* Quiet GCC */
+ gchar *utf8 = NULL; /* Quiet GCC */
+ GArray *ucs4;
+ Status status = VALID; /* Quiet GCC */
+
+ if (!srcdir)
+ srcdir = ".";
+
+ testfile = g_strconcat (srcdir, "/", "utf8.txt", NULL);
+
+ g_file_get_contents (testfile, &contents, NULL, &error);
+ if (error)
+ croak ("Cannot open utf8.txt: %s", error->message);
+
+ ucs4 = g_array_new (TRUE, FALSE, sizeof(gunichar));
+
+ p = contents;
+
+ /* Loop over lines */
+ while (*p)
+ {
+ while (*p && (*p == ' ' || *p == '\t'))
+ p++;
+
+ end = p;
+ while (*end && *end != '\n')
+ end++;
+
+ if (!*p || *p == '#' || *p == '\n')
+ goto next_line;
+
+ tmp = g_strstrip (g_strndup (p, end - p));
+
+ switch (state)
+ {
+ case 0:
+ /* UTF-8 string */
+ start_line = line;
+ utf8 = tmp;
+ tmp = NULL;
+ break;
+
+ case 1:
+ /* Status */
+ if (!strcmp (tmp, "VALID"))
+ status = VALID;
+ else if (!strcmp (tmp, "INCOMPLETE"))
+ status = INCOMPLETE;
+ else if (!strcmp (tmp, "NOTUNICODE"))
+ status = NOTUNICODE;
+ else if (!strcmp (tmp, "OVERLONG"))
+ status = OVERLONG;
+ else if (!strcmp (tmp, "MALFORMED"))
+ status = MALFORMED;
+ else
+ croak ("Invalid status on line %d\n", line);
+
+ if (status != VALID && status != NOTUNICODE)
+ state++; /* No UCS-4 data */
+
+ break;
+
+ case 2:
+ /* UCS-4 version */
+
+ p = strtok (tmp, " \t");
+ while (p)
+ {
+ gchar *endptr;
+
+ gunichar ch = strtoul (p, &endptr, 16);
+ if (*endptr != '\0')
+ croak ("Invalid UCS-4 character on line %d\n", line);
+
+ g_array_append_val (ucs4, ch);
+
+ p = strtok (NULL, " \t");
+ }
+
+ break;
+ }
+
+ g_free (tmp);
+ state = (state + 1) % 3;
+
+ if (state == 0)
+ {
+ process (start_line, utf8, status, (gunichar *)ucs4->data, ucs4->len);
+ g_array_set_size (ucs4, 0);
+ g_free (utf8);
+ }
+
+ next_line:
+ p = end;
+ if (*p && *p == '\n')
+ p++;
+
+ line++;
+ }
+
+ return 0;
+}
--- /dev/null
+# This file is derived from
+#
+# http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
+#
+# Which was created by Markus Kuhn <mkuhn@acm.org> - 2000-09-02
+#
+# lines begining with # and blank lines are ignored
+#
+# Beyond that, this file consists of a series of test cases. Each test case consists of
+# 2 or 3 lines:
+#
+# 1. A UTF-8 string
+# 2. A status
+# VALID : The string is a valid UTF-8 representation of valid Unicode
+# INCOMPLETE : The string has a partial character at the end
+# NOTUNICODE : The string is valid UTF-8, but the characters represented
+# are not valid unicode (
+# OVERLONG : The string includes overlong sequences
+# MALFORMED : The string is not valid UTF-8
+# 3. If the status is VALID or NOTUNICODE, the UCS-4 representation of the string,
+# as a series of hex numbers.
+
+# 1 Some correct UTF-8 text
+κόσμε
+VALID
+03ba 1f79 03c3 03bc 03b5
+
+# 2.1 First possible sequence of a certain length
+#
+# FIXME - handle NULLS?
+#
+# [ NULL BYTE ]
+#VALID
+#0000
+
+\80
+VALID
+0080
+
+ࠀ
+VALID
+0800
+
+𐀀
+VALID
+00010000
+
+
+NOTUNICODE
+00200000
+
+
+NOTUNICODE
+04000000
+
+\7f
+VALID
+0000007f
+
+߿
+VALID
+000007ff
+
+
+NOTUNICODE
+0000ffff
+
+
+NOTUNICODE
+001fffff
+
+
+NOTUNICODE
+03ffffff
+
+
+NOTUNICODE
+7fffffff
+
+# 2.3 Other boundary conditions
+
+
+VALID
+d7ff
+
+
+VALID
+e000
+
+�
+VALID
+fffd
+
+
+VALID
+0010ffff
+
+
+NOTUNICODE
+00110000
+
+# 3.1 Unexpected continuation bytes
+
+\80
+MALFORMED
+¿
+MALFORMED
+\80¿
+MALFORMED
+\80¿\80
+MALFORMED
+\80¿\80¿
+MALFORMED
+\80¿\80¿\80
+MALFORMED
+\80¿\80¿\80¿
+MALFORMED
+\80¿\80¿\80¿\80
+MALFORMED
+\80\81\82\83\84\85\86\87\88\89\8a\8b\8c\8d\8e\8f\90\91\92\93\94\95\96\97\98\99\9a\9b\9c\9d\9e\9f¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿
+MALFORMED
+
+# 3.2 Lonely start characters
+
+À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
+MALFORMED
+à á â ã ä å æ ç è é ê ë ì í î ï
+MALFORMED
+ð ñ ò ó ô õ ö ÷
+MALFORMED
+ø ù ú û
+MALFORMED
+ü ý
+MALFORMED
+
+# 3.3 Sequences with last continuation byte missing
+
+À
+INCOMPLETE
+à\80
+INCOMPLETE
+ð\80\80
+INCOMPLETE
+ø\80\80\80
+INCOMPLETE
+ü\80\80\80\80
+INCOMPLETE
+ß
+INCOMPLETE
+ï¿
+INCOMPLETE
+÷¿¿
+INCOMPLETE
+û¿¿¿
+INCOMPLETE
+ý¿¿¿¿
+INCOMPLETE
+
+# 3.4 Concatenation of incomplete sequences
+
+Àà\80ð\80\80ø\80\80\80ü\80\80\80\80ßï¿÷¿¿û¿¿¿ý¿¿¿¿
+MALFORMED
+
+# 3.5 Impossible bytes
+
+þ
+MALFORMED
+ÿ
+MALFORMED
+þþÿÿ
+MALFORMED
+
+# Examples of an overlong ASCII character
+
+À¯
+OVERLONG
+à\80¯
+OVERLONG
+ð\80\80¯
+OVERLONG
+ø\80\80\80¯
+OVERLONG
+ü\80\80\80\80¯
+OVERLONG
+
+# Maximum overlong sequences
+
+Á¿
+OVERLONG
+à\9f¿
+OVERLONG
+ð\8f¿¿
+OVERLONG
+ø\87¿¿¿
+OVERLONG
+ü\83¿¿¿¿
+OVERLONG
+
+# Overlong representation of the NUL character
+
+À\80
+OVERLONG
+à\80\80
+OVERLONG
+ð\80\80\80
+OVERLONG
+ø\80\80\80\80
+OVERLONG
+ü\80\80\80\80\80
+OVERLONG
+
+# Illegal code positions
+
+# Single UTF-16 surrogates
+
+
+NOTUNICODE
+d800
+
+
+NOTUNICODE
+db7f
+
+
+NOTUNICODE
+db80
+
+
+NOTUNICODE
+dbff
+
+
+NOTUNICODE
+dc00
+
+
+NOTUNICODE
+df80
+
+
+NOTUNICODE
+dfff
+
+# Paired UTF-16 surrogates
+
+
+NOTUNICODE
+d800 dc00
+
+
+NOTUNICODE
+d800 dfff
+
+
+NOTUNICODE
+db7f dc00
+
+
+NOTUNICODE
+db7f dfff
+
+
+NOTUNICODE
+db80 dc00
+
+
+NOTUNICODE
+db80 dfff
+
+
+NOTUNICODE
+dbff dc00
+
+
+NOTUNICODE
+dbff dfff
+
+# Other illegal code positions
+
+
+NOTUNICODE
+fffe
+
+
+NOTUNICODE
+ffff
+
+################
+#
+# Some more tests, not from Markus Kuhn's file
+#
+
+# Mixed plane 0 and higher planes
+
+A𐀀BC
+VALID
+41 00010000 42 10ffff 43