Update.

author Ulrich Drepper <drepper@redhat.com>

Mon, 5 Nov 2001 08:11:26 +0000 (08:11 +0000)

committer Ulrich Drepper <drepper@redhat.com>

Mon, 5 Nov 2001 08:11:26 +0000 (08:11 +0000)
diff --git a/ChangeLog b/ChangeLog

index a5a312d..249089e 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,9 @@
+2001-11-05  Ulrich Drepper  <drepper@redhat.com>
+
+       * manual/charset.texi: Extensive editing work.
+       * manual/nss.texi: Likewise.
+       Changes by Dennis Grace <dgrace@us.ibm.com>.
+
  2001-11-04  Roland McGrath  <roland@frob.com>
  
         * hurd/set-host.c (_hurd_set_host_config): Use O_WRONLY in flags
diff --git a/manual/charset.texi b/manual/charset.texi

index b7b2f73..bae2910 100644
--- a/manual/charset.texi
+++ b/manual/charset.texi
@@ -1,2892 +1,2896 @@
-@node Character Set Handling, Locales, String and Array Utilities, Top\r
-@c %MENU% Support for extended character sets\r
-@chapter Character Set Handling\r
-\r
-@ifnottex\r
-@macro cal{text}\r
-\text\\r
-@end macro\r
-@end ifnottex\r
-\r
-Character sets used in the early days of computing had only six, seven,\r
-or eight bits for each character: there was never a case where more than\r
-eight bits (one byte) were used to represent a single character.  The\r
-limitations of this approach became more apparent as more people\r
-grappled with non-Roman character sets, where not all the characters\r
-that make up a language's character set can be represented by @math{2^8}\r
-choices.  This chapter shows the functionality that was added to the C\r
-library to support multiple character sets.\r
-\r
-@menu\r
-* Extended Char Intro::              Introduction to Extended Characters.\r
-* Charset Function Overview::        Overview about Character Handling\r
-                                      Functions.\r
-* Restartable multibyte conversion:: Restartable multibyte conversion\r
-                                      Functions.\r
-* Non-reentrant Conversion::         Non-reentrant Conversion Function.\r
-* Generic Charset Conversion::       Generic Charset Conversion.\r
-@end menu\r
-\r
-\r
-@node Extended Char Intro\r
-@section Introduction to Extended Characters\r
-\r
-A variety of solutions is available to overcome the differences between\r
-character sets with a 1:1 relation between bytes and characters and\r
-character sets with ratios of 2:1 or 4:1. The remainder of this\r
-section gives a few examples to help understand the design decisions\r
-made while developing the functionality of the @w{C library}.\r
-\r
-@cindex internal representation\r
-A distinction we have to make right away is between internal and\r
-external representation.  @dfn{Internal representation} means the\r
-representation used by a program while keeping the text in memory.\r
-External representations are used when text is stored or transmitted\r
-through some communication channel.  Examples of external\r
-representations include files waiting in a directory to be\r
-read and parsed.\r
-\r
-Traditionally there has been no difference between the two representations.\r
-It was equally comfortable and useful to use the same single-byte\r
-representation internally and externally.  This comfort level decreases\r
-with more and larger character sets.\r
-\r
-One of the problems to overcome with the internal representation is\r
-handling text that is externally encoded using different character\r
-sets.  Assume a program that reads two texts and compares them using\r
-some metric.  The comparison can be usefully done only if the texts are\r
-internally kept in a common format.\r
-\r
-@cindex wide character\r
-For such a common format (@math{=} character set) eight bits are certainly\r
-no longer enough.  So the smallest entity will have to grow: @dfn{wide\r
-characters} will now be used.  Instead of one byte per character, two or\r
-four will be used instead.  (Three are not good to address in memory and\r
-more than four bytes seem not to be necessary).\r
-\r
-@cindex Unicode\r
-@cindex ISO 10646\r
-As shown in some other part of this manual,\r
-@c !!! Ahem, wide char string functions are not yet covered -- drepper\r
-a completely new family has been created of functions that can handle wide\r
-character texts in memory. The most commonly used character sets for such\r
-internal wide character representations are Unicode and @w{ISO 10646}\r
-(also known as UCS for Universal Character Set). Unicode was originally\r
-planned as a 16-bit character set; whereas, @w{ISO 10646} was designed to\r
-be a 31-bit large code space. The two standards are practically identical.\r
-They have the same character repertoire and code table, but Unicode specifies\r
-added semantics.  At the moment, only characters in the first @code{0x10000}\r
-code positions (the so-called Basic Multilingual Plane, BMP) have been\r
-assigned, but the assignment of more specialized characters outside this\r
-16-bit space is already in progress. A number of encodings have been\r
-defined for Unicode and @w{ISO 10646} characters:\r
-@cindex UCS-2\r
-@cindex UCS-4\r
-@cindex UTF-8\r
-@cindex UTF-16\r
-UCS-2 is a 16-bit word that can only represent characters\r
-from the BMP, UCS-4 is a 32-bit word than can represent any Unicode\r
-and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where\r
-ASCII characters are represented by ASCII bytes and non-ASCII characters\r
-by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension\r
-of UCS-2 in which pairs of certain UCS-2 words can be used to encode\r
-non-BMP characters up to @code{0x10ffff}.\r
-\r
-To represent wide characters the @code{char} type is not suitable.  For\r
-this reason the @w{ISO C} standard introduces a new type that is\r
-designed to keep one character of a wide character string.  To maintain\r
-the similarity there is also a type corresponding to @code{int} for\r
-those functions that take a single wide character.\r
-\r
-@comment stddef.h\r
-@comment ISO\r
-@deftp {Data type} wchar_t\r
-This data type is used as the base type for wide character strings.\r
-I.e., arrays of objects of this type are the equivalent of @code{char[]}\r
-for multibyte character strings.  The type is defined in @file{stddef.h}.\r
-\r
-The @w{ISO C90} standard, where @code{wchar_t} was introduced, does not\r
-say anything specific about the representation.  It only requires that\r
-this type is capable of storing all elements of the basic character set.\r
-Therefore it would be legitimate to define @code{wchar_t} as @code{char},\r
-which might make sense for embedded systems.\r
-\r
-But for GNU systems @code{wchar_t} is always 32 bits wide and, therefore,\r
-capable of representing all UCS-4 values and, therefore, covering all of\r
-@w{ISO 10646}.  Some Unix systems define @code{wchar_t} as a 16-bit type\r
-and thereby follow Unicode very strictly. This definition is perfectly\r
-fine with the standard, but it also means that to represent all\r
-characters from Unicode and @w{ISO 10646} one has to use UTF-16 surrogate\r
-characters, which is in fact a multi-wide-character encoding. But\r
-resorting to multi-wide-character encoding contradicts the purpose of the\r
-@code{wchar_t} type.\r
-@end deftp\r
-\r
-@comment wchar.h\r
-@comment ISO\r
-@deftp {Data type} wint_t\r
-@code{wint_t} is a data type used for parameters and variables that\r
-contain a single wide character. As the name suggests this type is the\r
-equivalent of @code{int} when using the normal @code{char} strings.  The\r
-types @code{wchar_t} and @code{wint_t} often have the same\r
-representation if their size is 32 bits wide but if @code{wchar_t} is\r
-defined as @code{char} the type @code{wint_t} must be defined as\r
-@code{int} due to the parameter promotion.\r
-\r
-@pindex wchar.h\r
-This type is defined in @file{wchar.h} and was introduced in\r
-@w{Amendment 1} to @w{ISO C90}.\r
-@end deftp\r
-\r
-As there are for the @code{char} data type macros are available for\r
-specifying the minimum and maximum value representable in an object of\r
-type @code{wchar_t}.\r
-\r
-@comment wchar.h\r
-@comment ISO\r
-@deftypevr Macro wint_t WCHAR_MIN\r
-The macro @code{WCHAR_MIN} evaluates to the minimum value representable\r
-by an object of type @code{wint_t}.\r
-\r
-This macro was introduced in @w{Amendment 1} to @w{ISO C90}.\r
-@end deftypevr\r
-\r
-@comment wchar.h\r
-@comment ISO\r
-@deftypevr Macro wint_t WCHAR_MAX\r
-The macro @code{WCHAR_MAX} evaluates to the maximum value representable\r
-by an object of type @code{wint_t}.\r
-\r
-This macro was introduced in @w{Amendment 1} to @w{ISO C90}.\r
-@end deftypevr\r
-\r
-Another special wide character value is the equivalent to @code{EOF}.\r
-\r
-@comment wchar.h\r
-@comment ISO\r
-@deftypevr Macro wint_t WEOF\r
-The macro @code{WEOF} evaluates to a constant expression of type\r
-@code{wint_t} whose value is different from any member of the extended\r
-character set.\r
-\r
-@code{WEOF} need not be the same value as @code{EOF} and unlike\r
-@code{EOF} it also need @emph{not} be negative.  I.e., sloppy code like\r
-\r
-@smallexample\r
-@{\r
-  int c;\r
-  ...\r
-  while ((c = getc (fp)) < 0)\r
-    ...\r
-@}\r
-@end smallexample\r
-\r
-@noindent\r
-has to be rewritten to use @code{WEOF} explicitly when wide characters\r
-are used:\r
-\r
-@smallexample\r
-@{\r
-  wint_t c;\r
-  ...\r
-  while ((c = wgetc (fp)) != WEOF)\r
-    ...\r
-@}\r
-@end smallexample\r
-\r
-@pindex wchar.h\r
-This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is\r
-defined in @file{wchar.h}.\r
-@end deftypevr\r
-\r
-\r
-These internal representations present problems when it comes to storing\r
-and transmittal. Because each single wide character consists of more\r
-than one byte, they are effected by byte-ordering.  Thus, machines with\r
-different endianesses would see different values when accessing the same\r
-data. This byte ordering concern also applies for communication protocols\r
-that are all byte-based and, thereforet require that the sender has to\r
-decide about splitting the wide character in bytes. A last (but not least\r
-important) point is that wide characters often require more storage space\r
-than a customized byte-oriented character set.\r
-\r
-@cindex multibyte character\r
-@cindex EBCDIC\r
-   For all the above reasons, an external encoding that is different\r
-from the internal encoding is often used if the latter is UCS-2 or UCS-4.\r
-The external encoding is byte-based and can be chosen appropriately for\r
-the environment and for the texts to be handled. A variety of different\r
-character sets can be used for this external encoding (information that\r
-will not be exhaustively presented here--instead, a description of the\r
-major groups will suffice). All of the ASCII-based character sets\r
-[_bkoz_: do you mean Roman character sets? If not, what do you mean\r
-here?] fulfill one requirement: they are "filesystem safe."  This means\r
-that the character @code{'/'} is used in the encoding @emph{only} to\r
-represent itself.  Things are a bit different for character sets like\r
-EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set\r
-family used by IBM), but if the operation system does not understand\r
-EBCDIC directly the parameters-to-system calls have to be converted first\r
-anyhow.\r
-\r
-@itemize @bullet\r
-@item \r
-The simplest character sets are single-byte character sets.  There can \r
-be only up to 256 characters (for @w{8 bit} character sets), which is \r
-not sufficient to cover all languages but might be sufficient to handle \r
-a specific text. Handling of a @w{8 bit} character sets is simple. This \r
-is not true for other kinds presented later, and therefore, the \r
-application one uses might require the use of @w{8 bit} character sets.\r
-\r
-@cindex ISO 2022\r
-@item\r
-The @w{ISO 2022} standard defines a mechanism for extended character\r
-sets where one character @emph{can} be represented by more than one\r
-byte.  This is achieved by associating a state with the text.\r
-Characters that can be used to change the state can be embedded in the\r
-text. Each byte in the text might have a different interpretation in each\r
-state.  The state might even influence whether a given byte stands for a\r
-character on its own or whether it has to be combined with some more\r
-bytes.\r
-\r
-@cindex EUC\r
-@cindex Shift_JIS\r
-@cindex SJIS\r
-In most uses of @w{ISO 2022} the defined character sets do not allow\r
-state changes which cover more than the next character.  This has the\r
-big advantage that whenever one can identify the beginning of the byte\r
-sequence of a character one can interpret a text correctly.  Examples of\r
-character sets using this policy are the various EUC character sets\r
-(used by Sun's operations systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)\r
-or Shift_JIS (SJIS, a Japanese encoding).\r
-\r
-But there are also character sets using a state which is valid for more\r
-than one character and has to be changed by another byte sequence.\r
-Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.\r
-\r
-@item\r
-@cindex ISO 6937\r
-Early attempts to fix 8 bit character sets for other languages using the\r
-Roman alphabet lead to character sets like @w{ISO 6937}.  Here bytes\r
-representing characters like the acute accent do not produce output\r
-themselves: one has to combine them with other characters to get the\r
-desired result.  For example, the byte sequence @code{0xc2 0x61}\r
-(non-spacing acute accent, followed by lower-case `a') to get the ``small\r
-a with  acute'' character.  To get the acute accent character on its own,\r
-one has to write @code{0xc2 0x20} (the non-spacing acute followed by a\r
-space).\r
-\r
-Character sets like @w[ISO 6937] are used in some embedded systems such\r
-as teletex.\r
-\r
-@item\r
-@cindex UTF-8\r
-Instead of converting the Unicode or @w{ISO 10646} text used internally,\r
-it is often also sufficient to simply use an encoding different than\r
-UCS-2/UCS-4.  The Unicode and @w{ISO 10646} standards even specify such an\r
-encoding: UTF-8.  This encoding is able to represent all of @w{ISO\r
-10646} 31 bits in a byte string of length one to six.\r
-\r
-@cindex UTF-7\r
-There were a few other attempts to encode @w{ISO 10646} such as UTF-7,\r
-but UTF-8 is today the only encoding which should be used.  In fact, with\r
-any luck UTF-8 will soon be the only external encoding that has to be\r
-supported.  It proves to be universally usable and its only disadvantage\r
-is that it favors Roman languages by making the byte string\r
-representation of other scripts (Cyrillic, Greek, Asian scripts) longer\r
-than necessary if using a specific character set for these scripts.\r
-Methods like the Unicode compression scheme can alleviate these\r
-problems.\r
-@end itemize\r
-\r
-The question remaining is: how to select the character set or encoding\r
-to use.  The answer: you cannot decide about it yourself, it is decided\r
-by the developers of the system or the majority of the users.  Since the\r
-goal is interoperability one has to use whatever the other people one\r
-works with use.  If there are no constraints, the selection is based on\r
-the requirements the expected circle of users will have.  In other words,\r
-if a project is expected to be used in only, say, Russia it is fine to use\r
-KOI8-R or a similar character set.  But if at the same time people from,\r
-say, Greece are participating one should use a character set which allows\r
-all people to collaborate.\r
-\r
-The most widely useful solution seems to be: go with the most general\r
-character set, namely @w{ISO 10646}.  Use UTF-8 as the external encoding\r
-and problems about users not being able to use their own language\r
-adequately are a thing of the past.\r
-\r
-One final comment about the choice of the wide character representation\r
-is necessary at this point.  We have said above that the natural choice\r
-is using Unicode or @w{ISO 10646}.  This is not required, but at least\r
-encouraged, by the @w{ISO C} standard.  The standard defines at least a\r
-macro @code{__STDC_ISO_10646__} that is only defined on systems where\r
-the @code{wchar_t} type encodes @w{ISO 10646} characters.  If this\r
-symbol is not defined one should avoid making assumptions about the wide\r
-character representation. If the programmer uses only the functions\r
-provided by the C library to handle wide character strings there should\r
-be no compatibility problems with other systems.\r
-\r
-@node Charset Function Overview\r
-@section Overview about Character Handling Functions\r
-\r
-A Unix @w{C library} contains three different sets of functions in two \r
-families to handle character set conversion. One of the function families \r
-(the most commonly used) is specified in the @w{ISO C90} standard and, \r
-therefore, is portable even beyond the Unix world. Unfortunately this \r
-family is the least useful one. These functions should be avoided \r
-whenever possible, especially when developing libraries (as opposed to \r
-applications). \r
-\r
-The second family of functions got introduced in the early Unix standards\r
-(XPG2) and is still part of the latest and greatest Unix standard:\r
-@w{Unix 98}.  It is also the most powerful and useful set of functions.\r
-But we will start with the functions defined in @w{Amendment 1} to\r
-@w{ISO C90}.\r
-\r
-@node Restartable multibyte conversion\r
-@section Restartable Multibyte Conversion Functions\r
-\r
-The @w{ISO C} standard defines functions to convert strings from a\r
-multibyte representation to wide character strings.  There are a number\r
-of peculiarities:\r
-\r
-@itemize @bullet\r
-@item\r
-The character set assumed for the multibyte encoding is not specified\r
-as an argument to the functions.  Instead the character set specified by\r
-the @code{LC_CTYPE} category of the current locale is used; see\r
-@ref{Locale Categories}.\r
-\r
-@item\r
-The functions handling more than one character at a time require NUL\r
-terminated strings as the argument.  I.e., converting blocks of text\r
-does not work unless one can add a NUL byte at an appropriate place.\r
-The GNU C library contains some extensions to the standard that allow\r
-specifying a size, but basically they also expect terminated strings.\r
-@end itemize\r
-\r
-Despite these limitations the @w{ISO C} functions can be used in many\r
-contexts.  In graphical user interfaces, for instance, it is not\r
-uncommon to have functions that require text to be displayed in a wide\r
-character string if the text is not simple ASCII.  The text itself might come\r
-from a file with translations and the user should decide about the\r
-current locale which determines the translation and therefore also the\r
-external encoding used. In such a situation (and many others) the\r
-functions described here are perfect.  If more freedom while performing\r
-the conversion is necessary take a look at the @code{iconv} functions\r
-(@pxref{Generic Charset Conversion}).\r
-\r
-@menu\r
-* Selecting the Conversion::     Selecting the conversion and its properties.\r
-* Keeping the state::            Representing the state of the conversion.\r
-* Converting a Character::       Converting Single Characters.\r
-* Converting Strings::           Converting Multibyte and Wide Character\r
-                                  Strings.\r
-* Multibyte Conversion Example:: A Complete Multibyte Conversion Example.\r
-@end menu\r
-\r
-@node Selecting the Conversion\r
-@subsection Selecting the conversion and its properties\r
-\r
-We already said above that the currently selected locale for the\r
-@code{LC_CTYPE} category decides about the conversion which is performed\r
-by the functions we are about to describe.  Each locale uses its own\r
-character set (given as an argument to @code{localedef}) and this is the\r
-one assumed as the external multibyte encoding.  The wide character\r
-character set always is UCS-4, at least on GNU systems.\r
-\r
-A characteristic of each multibyte character set is the maximum number\r
-of bytes that can be necessary to represent one character.  This\r
-information is quite important when writing code that uses the\r
-conversion functions (as shown in the examples below).\r
-The @w{ISO C} standard defines two macros which provide this information.\r
-\r
-\r
-@comment limits.h\r
-@comment ISO\r
-@deftypevr Macro int MB_LEN_MAX\r
-@code{MB_LEN_MAX} specifies the maximum number of bytes in the multibyte\r
-sequence for a single character in any of the supported locales.  It is\r
-a compile-time constant and is defined in @file{limits.h}.\r
-@pindex limits.h\r
-@end deftypevr\r
-\r
-@comment stdlib.h\r
-@comment ISO\r
-@deftypevr Macro int MB_CUR_MAX\r
-@code{MB_CUR_MAX} expands into a positive integer expression that is the\r
-maximum number of bytes in a multibyte character in the current locale.\r
-The value is never greater than @code{MB_LEN_MAX}.  Unlike\r
-@code{MB_LEN_MAX} this macro need not be a compile-time constant, and in \r
-the GNU C library it is not.\r
-\r
-@pindex stdlib.h\r
-@code{MB_CUR_MAX} is defined in @file{stdlib.h}.\r
-@end deftypevr\r
-\r
-Two different macros are necessary since strictly @w{ISO C90} compilers\r
-do not allow variable length array definitions, but still it is desirable\r
-to avoid dynamic allocation.  This incomplete piece of code shows the\r
-problem:\r
-\r
-@smallexample\r
-@{\r
-  char buf[MB_LEN_MAX];\r
-  ssize_t len = 0;\r
-\r
-  while (! feof (fp))\r
-    @{\r
-      fread (&buf[len], 1, MB_CUR_MAX - len, fp);\r
-      /* @r{... process} buf */\r
-      len -= used;\r
-    @}\r
-@}\r
-@end smallexample\r
-\r
-The code in the inner loop is expected to have always enough bytes in\r
-the array @var{buf} to convert one multibyte character.  The array\r
-@var{buf} has to be sized statically since many compilers do not allow a\r
-variable size.  The @code{fread} call makes sure that @code{MB_CUR_MAX} \r
-bytes are always available in @var{buf}.  Note that it isn't\r
-a problem if @code{MB_CUR_MAX} is not a compile-time constant.\r
-\r
-\r
-@node Keeping the state\r
-@subsection Representing the state of the conversion\r
-\r
-@cindex stateful\r
-In the introduction of this chapter it was said that certain character\r
-sets use a @dfn{stateful} encoding.  That is, the encoded values depend \r
-in some way on the previous bytes in the text.\r
-\r
-Since the conversion functions allow converting a text in more than one\r
-step we must have a way to pass this information from one call of the\r
-functions to another.\r
-\r
-@comment wchar.h\r
-@comment ISO\r
-@deftp {Data type} mbstate_t\r
-@cindex shift state\r
-A variable of type @code{mbstate_t} can contain all the information\r
-about the @dfn{shift state} needed from one call to a conversion\r
-function to another.\r
-\r
-@pindex wchar.h\r
-@code{mbstate_t} is defined in @file{wchar.h}. It was introduced in\r
-@w{Amendment 1} to @w{ISO C90}.\r
-@end deftp\r
-\r
-To use objects of type @code{mbstate_t} the programmer has to define such \r
-objects (normally as local variables on the stack) and pass a pointer to \r
-the object to the conversion functions.  This way the conversion function\r
-can update the object if the current multibyte character set is stateful.\r
-\r
-There is no specific function or initializer to put the state object in\r
-any specific state.  The rules are that the object should always\r
-represent the initial state before the first use, and this is achieved by\r
-clearing the whole variable with code such as follows:\r
-\r
-@smallexample\r
-@{\r
-  mbstate_t state;\r
-  memset (&state, '\0', sizeof (state));\r
-  /* @r{from now on @var{state} can be used.}  */\r
-  ...\r
-@}\r
-@end smallexample\r
-\r
-When using the conversion functions to generate output it is often\r
-necessary to test whether the current state corresponds to the initial\r
-state.  This is necessary, for example, to decide whether to emit\r
-escape sequences to set the state to the initial state at certain\r
-sequence points.  Communication protocols often require this.\r
-\r
-@comment wchar.h\r
-@comment ISO\r
-@deftypefun int mbsinit (const mbstate_t *@var{ps})\r
-The @code {mbsinit} function determines whether the state object pointed \r
-to by @var{ps} is in the initial state. If @var{ps} is a null pointer or \r
-the object is in the initial state the return value is nonzero. Otherwise \r
-it is zero.\r
-\r
-@pindex wchar.h\r
-@code {mbsinit} was introduced in @w{Amendment 1} to @w{ISO C90} and is \r
-declared in @file{wchar.h}.\r
-@end deftypefun\r
-\r
-Code using @code {mbsinit} often looks similar to this:\r
-\r
-@c Fix the example to explicitly say how to generate the escape sequence\r
-@c to restore the initial state.\r
-@smallexample\r
-@{\r
-  mbstate_t state;\r
-  memset (&state, '\0', sizeof (state));\r
-  /* @r{Use @var{state}.}  */\r
-  ...\r
-  if (! mbsinit (&state))\r
-    @{\r
-      /* @r{Emit code to return to initial state.}  */\r
-      const wchar_t empty[] = L"";\r
-      const wchar_t *srcp = empty;\r
-      wcsrtombs (outbuf, &srcp, outbuflen, &state);\r
-    @}\r
-  ...\r
-@}\r
-@end smallexample\r
-\r
-The code to emit the escape sequence to get back to the initial state is\r
-interesting. The @code{wcsrtombs} function can be used to determine the\r
-necessary output code (@pxref{Converting Strings}).  Please note that on\r
-GNU systems it is not necessary to perform this extra action for the\r
-conversion from multibyte text to wide character text since the wide\r
-character encoding is not stateful.  But there is nothing mentioned in\r
-any standard which prohibits making @code{wchar_t} using a stateful\r
-encoding.\r
-\r
-@node Converting a Character\r
-@subsection Converting Single Characters\r
-\r
-The most fundamental of the conversion functions are those dealing with\r
-single characters.  Please note that this does not always mean single\r
-bytes.  But since there is very often a subset of the multibyte\r
-character set which consists of single byte sequences there are\r
-functions to help with converting bytes.  Frequently, ASCII is a subpart \r
-of the multibyte character set.  In such a scenario, each ASCII character \r
-stands for itself, and all other characters have at least a first byte \r
-that is beyond the range @math{0} to @math{127}.\r
-\r
-@comment wchar.h\r
-@comment ISO\r
-@deftypefun wint_t btowc (int @var{c})\r
-The @code{btowc} function (``byte to wide character'') converts a valid\r
-single byte character @var{c} in the initial shift state into the wide\r
-character equivalent using the conversion rules from the currently\r
-selected locale of the @code{LC_CTYPE} category.\r
-\r
-If @code{(unsigned char) @var{c}} is no valid single byte multibyte\r
-character or if @var{c} is @code{EOF}, the function returns @code{WEOF}.\r
-\r
-Please note the restriction of @var{c} being tested for validity only in\r
-the initial shift state.  No @code{mbstate_t} object is used from\r
-which the state information is taken, and the function also does not use\r
-any static state.\r
-\r
-@pindex wchar.h\r
-The @code{btowc} function was introduced in @w{Amendment 1} to @w{ISO C90} \r
-and is declared in @file{wchar.h}.\r
-@end deftypefun\r
-\r
-Despite the limitation that the single byte value always is interpreted\r
-in the initial state this function is actually useful most of the time.\r
-Most characters are either entirely single-byte character sets or they\r
-are extension to ASCII.  But then it is possible to write code like this\r
-(not that this specific example is very useful):\r
-\r
-@smallexample\r
-wchar_t *\r
-itow (unsigned long int val)\r
-@{\r
-  static wchar_t buf[30];\r
-  wchar_t *wcp = &buf[29];\r
-  *wcp = L'\0';\r
-  while (val != 0)\r
-    @{\r
-      *--wcp = btowc ('0' + val % 10);\r
-      val /= 10;\r
-    @}\r
-  if (wcp == &buf[29])\r
-    *--wcp = L'0';\r
-  return wcp;\r
-@}\r
-@end smallexample\r
-\r
-Why is it necessary to use such a complicated implementation and not\r
-simply cast @code{'0' + val % 10} to a wide character?  The answer is\r
-that there is no guarantee that one can perform this kind of arithmetic\r
-on the character of the character set used for @code{wchar_t}\r
-representation.  In other situations the bytes are not constant at\r
-compile time and so the compiler cannot do the work.  In situations like\r
-this it is necessary @code{btowc}.\r
-\r
-@noindent\r
-There also is a function for the conversion in the other direction.\r
-\r
-@comment wchar.h\r
-@comment ISO\r
-@deftypefun int wctob (wint_t @var{c})\r
-The @code{wctob} function (``wide character to byte'') takes as the\r
-parameter a valid wide character.  If the multibyte representation for\r
-this character in the initial state is exactly one byte long the return\r
-value of this function is this character.  Otherwise the return value is\r
-@code{EOF}.\r
-\r
-@pindex wchar.h\r
-@code{wctob} was introduced in @w{Amendment 1} to @w{ISO C90} and\r
-is declared in @file{wchar.h}.\r
-@end deftypefun\r
-\r
-There are more general functions to convert single character from\r
-multibyte representation to wide characters and vice versa.  These\r
-functions pose no limit on the length of the multibyte representation\r
-and they also do not require it to be in the initial state.\r
-\r
-@comment wchar.h\r
-@comment ISO\r
-@deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps})\r
-@cindex stateful\r
-The @code{mbrtowc} function (``multibyte restartable to wide\r
-character'') converts the next multibyte character in the string pointed\r
-to by @var{s} into a wide character and stores it in the wide character\r
-string pointed to by @var{pwc}. The conversion is performed according\r
-to the locale currently selected for the @code{LC_CTYPE} category.  If\r
-the conversion for the character set used in the locale requires a state,\r
-the multibyte string is interpreted in the state represented by the\r
-object pointed to by @var{ps}. If @var{ps} is a null pointer, a static,\r
-internal state variable used only by the @code{mbrtowc} function is\r
-used.\r
-\r
-If the next multibyte character corresponds to the NUL wide character,\r
-the return value of the function is @math{0} and the state object is\r
-afterwards in the initial state. If the next @var{n} or fewer bytes\r
-form a correct multibyte character, the return value is the number of\r
-bytes starting from @var{s} that form the multibyte character.  The\r
-conversion state is updated according to the bytes consumed in the\r
-conversion. In both cases the wide character (either the @code{L'\0'}\r
-or the one found in the conversion) is stored in the string pointed to\r
-by @var{pwc} if @var{pwc} is not null.\r
-\r
-If the first @var{n} bytes of the multibyte string possibly form a valid\r
-multibyte character but there are more than @var{n} bytes needed to\r
-complete it, the return value of the function is @code{(size_t) -2} and\r
-no value is stored.  Please note that this can happen even if @var{n}\r
-has a value greater than or equal to @code{MB_CUR_MAX} since the input \r
-might contain redundant shift sequences.\r
-\r
-If the first @code{n} bytes of the multibyte string cannot possibly form\r
-a valid multibyte character, no value is stored, the global variable\r
-@code{errno} is set to the value @code{EILSEQ}, and the function returns\r
-@code{(size_t) -1}. The conversion state is afterwards undefined.\r
-\r
-@pindex wchar.h\r
-@code{mbrtowc} was introduced in @w{Amendment 1} to @w{ISO C90} and\r
-is declared in @file{wchar.h}.\r
-@end deftypefun\r
-\r
-Use of @code{mbrtowc} is straightforward.  A function which copies a\r
-multibyte string into a wide character string while at the same time\r
-converting all lowercase characters into uppercase could look like this\r
-(this is not the final version, just an example; it has no error\r
-checking, and sometimes leaks memory):\r
-\r
-@smallexample\r
-wchar_t *\r
-mbstouwcs (const char *s)\r
-@{\r
-  size_t len = strlen (s);\r
-  wchar_t *result = malloc ((len + 1) * sizeof (wchar_t));\r
-  wchar_t *wcp = result;\r
-  wchar_t tmp[1];\r
-  mbstate_t state;\r
-  size_t nbytes;\r
-\r
-  memset (&state, '\0', sizeof (state));\r
-  while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0)\r
-    @{\r
-      if (nbytes >= (size_t) -2)\r
-        /* Invalid input string.  */\r
-        return NULL;\r
-      *wcp++ = towupper (tmp[0]);\r
-      len -= nbytes;\r
-      s += nbytes;\r
-    @}\r
-  *wcp = L'\0';\r
-  return result;\r
-@}\r
-@end smallexample\r
-\r
-The use of @code{mbrtowc} should be clear. A single wide character is\r
-stored in @code{@var{tmp}[0]}, and the number of consumed bytes is stored\r
-in the variable @var{nbytes}. If the conversion is successful, the \r
-uppercase variant of the wide character is stored in the @var{result} \r
-array and the pointer to the input string and the number of available \r
-bytes are adjusted.\r
-\r
-The only non-obvious thing about @code{mbrtowc} might be the way memory \r
-is allocated for the result. The above code uses the fact that there \r
-can never be more wide characters in the converted results than there are\r
-bytes in the multibyte input string. This method yields a pessimistic \r
-guess about the size of the result, and if many wide character strings \r
-have to be constructed this way or if the strings are long, the extra \r
-memory required to be allocated because the input string contains \r
-multibyte characters might be significant. The allocated memory block can \r
-be resized to the correct size before returning it, but a better solution \r
-might be to allocate just the right amount of space for the result right \r
-away. Unfortunately there is no function to compute the length of the wide \r
-character string directly from the multibyte string. There is, however, a \r
-function which does part of the work.\r
-\r
-@comment wchar.h\r
-@comment ISO\r
-@deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps})\r
-The @code{mbrlen} function (``multibyte restartable length'') computes\r
-the number of at most @var{n} bytes starting at @var{s} which form the\r
-next valid and complete multibyte character.\r
-\r
-If the next multibyte character corresponds to the NUL wide character,\r
-the return value is @math{0}.  If the next @var{n} bytes form a valid\r
-multibyte character, the number of bytes belonging to this multibyte\r
-character byte sequence is returned.\r
-\r
-If the first @var{n} bytes possibly form a valid multibyte\r
-character but the character is incomplete, the return value is \r
-@code{(size_t) -2}. Otherwise the multibyte character sequence is invalid \r
-and the return value is @code{(size_t) -1}.\r
-\r
-The multibyte sequence is interpreted in the state represented by the\r
-object pointed to by @var{ps}.  If @var{ps} is a null pointer, a state\r
-object local to @code{mbrlen} is used.\r
-\r
-@pindex wchar.h\r
-@code{mbrlen} was introduced in @w{Amendment 1} to @w{ISO C90} and\r
-is declared in @file{wchar.h}.\r
-@end deftypefun\r
-\r
-The attentive reader now will note that @code{mbrlen} can be implemented \r
-as\r
-\r
-@smallexample\r
-mbrtowc (NULL, s, n, ps != NULL ? ps : &internal)\r
-@end smallexample\r
-\r
-This is true and in fact is mentioned in the official specification.\r
-How can this function be used to determine the length of the wide\r
-character string created from a multibyte character string?  It is not\r
-directly usable, but we can define a function @code{mbslen} using it:\r
-\r
-@smallexample\r
-size_t\r
-mbslen (const char *s)\r
-@{\r
-  mbstate_t state;\r
-  size_t result = 0;\r
-  size_t nbytes;\r
-  memset (&state, '\0', sizeof (state));\r
-  while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)\r
-    @{\r
-      if (nbytes >= (size_t) -2)\r
-        /* @r{Something is wrong.}  */\r
-        return (size_t) -1;\r
-      s += nbytes;\r
-      ++result;\r
-    @}\r
-  return result;\r
-@}\r
-@end smallexample\r
-\r
-This function simply calls @code{mbrlen} for each multibyte character\r
-in the string and counts the number of function calls.  Please note that\r
-we here use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen}\r
-call. This is acceptable since a) this value is larger than the length of\r
-the longest multibyte character sequence and b) we know that the string \r
-@var{s} ends with a NUL byte, which cannot be part of any other multibyte \r
-character sequence but the one representing the NUL wide character.  \r
-Therefore, the @code{mbrlen} function will never read invalid memory.\r
-\r
-Now that this function is available (just to make this clear, this\r
-function is @emph{not} part of the GNU C library) we can compute the\r
-number of bytes required to store the wide character string converted\r
-from the multibyte character string @var{s} using\r
-\r
-@smallexample\r
-wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t);\r
-@end smallexample\r
-\r
-Please note that the @code{mbslen} function is quite inefficient. The\r
-implementation of @code{mbstouwcs} with @code{mbslen} would have to \r
-perform the conversion of the multibyte character input string twice, and \r
-this conversion might be quite expensive. So it is necessary to think \r
-about the consequences of using the easier but imprecise method before \r
-doing the work twice.\r
-\r
-@comment wchar.h\r
-@comment ISO\r
-@deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps})\r
-The @code{wcrtomb} function (``wide character restartable to\r
-multibyte'') converts a single wide character into a multibyte string\r
-corresponding to that wide character.\r
-\r
-If @var{s} is a null pointer, the function resets the state stored in\r
-the object pointed to by @var{ps} (or the internal @code{mbstate_t}\r
-object) to the initial state.  This can also be achieved by a call like\r
-this:\r
-\r
-@smallexample\r
-wcrtomb (temp_buf, L'\0', ps)\r
-@end smallexample\r
-\r
-@noindent\r
-since, if @var{s} is a null pointer, @code{wcrtomb} performs as if it\r
-writes into an internal buffer, which is guaranteed to be large enough.\r
-\r
-If @var{wc} is the NUL wide character, @code{wcrtomb} emits, if\r
-necessary, a shift sequence to get the state @var{ps} into the initial\r
-state followed by a single NUL byte, which is stored in the string \r
-@var{s}.\r
-\r
-Otherwise a byte sequence (possibly including shift sequences) is written \r
-into the string @var{s}.  This only happens if @var{wc} is a valid wide \r
-character (i.e., it has a multibyte representation in the character set \r
-selected by the locale of the @code{LC_CTYPE} category).  If @var{wc} is\r
-not a valid wide character, nothing is stored in the string @var{s},\r
-@code{errno} is set to @code{EILSEQ}, the conversion state in @var{ps} \r
-is undefined and the return value is @code{(size_t) -1}.\r
-\r
-If no error occurred the function returns the number of bytes stored in\r
-the string @var{s}.  This includes all bytes representing shift\r
-sequences.\r
-\r
-One word about the interface of the function: there is no parameter\r
-specifying the length of the array @var{s}.  Instead the function\r
-assumes that there are at least @code{MB_CUR_MAX} bytes available since\r
-this is the maximum length of any byte sequence representing a single\r
-character.  So the caller has to make sure that there is enough space\r
-available, otherwise buffer overruns can occur.\r
-\r
-@pindex wchar.h\r
-@code{wcrtomb} was introduced in @w{Amendment 1} to @w{ISO C90} and is\r
-declared in @file{wchar.h}.\r
-@end deftypefun\r
-\r
-Using @code{wcrtomb} is as easy as using @code{mbrtowc}.  The following\r
-example appends a wide character string to a multibyte character string.\r
-Again, the code is not really useful (or correct), it is simply here to\r
-demonstrate the use and some problems.\r
-\r
-@smallexample\r
-char *\r
-mbscatwcs (char *s, size_t len, const wchar_t *ws)\r
-@{\r
-  mbstate_t state;\r
-  /* @r{Find the end of the existing string.}  */\r
-  char *wp = strchr (s, '\0');\r
-  len -= wp - s;\r
-  memset (&state, '\0', sizeof (state));\r
-  do\r
-    @{\r
-      size_t nbytes;\r
-      if (len < MB_CUR_MAX)\r
-        @{\r
-          /* @r{We cannot guarantee that the next}\r
-             @r{character fits into the buffer, so}\r
-             @r{return an error.}  */\r
-          errno = E2BIG;\r
-          return NULL;\r
-        @}\r
-      nbytes = wcrtomb (wp, *ws, &state);\r
-      if (nbytes == (size_t) -1)\r
-        /* @r{Error in the conversion.}  */\r
-        return NULL;\r
-      len -= nbytes;\r
-      wp += nbytes;\r
-    @}\r
-  while (*ws++ != L'\0');\r
-  return s;\r
-@}\r
-@end smallexample\r
-\r
-First the function has to find the end of the string currently in the\r
-array @var{s}.  The @code{strchr} call does this very efficiently since a\r
-requirement for multibyte character representations is that the NUL byte\r
-is never used except to represent itself (and in this context, the end\r
-of the string).\r
-\r
-After initializing the state object the loop is entered where the first\r
-task is to make sure there is enough room in the array @var{s}.  We\r
-abort if there are not at least @code{MB_CUR_MAX} bytes available.  This\r
-is not always optimal but we have no other choice.  We might have fewer\r
-than @code{MB_CUR_MAX} bytes available but the next multibyte character\r
-might also be only one byte long.  At the time the @code{wcrtomb} call\r
-returns it is too late to decide whether the buffer was large enough. If \r
-this solution is unsuitable, there is a very slow but more accurate \r
-solution.\r
-\r
-@smallexample\r
-  ...\r
-  if (len < MB_CUR_MAX)\r
-    @{\r
-      mbstate_t temp_state;\r
-      memcpy (&temp_state, &state, sizeof (state));\r
-      if (wcrtomb (NULL, *ws, &temp_state) > len)\r
-        @{\r
-          /* @r{We cannot guarantee that the next}\r
-             @r{character fits into the buffer, so}\r
-             @r{return an error.}  */\r
-          errno = E2BIG;\r
-          return NULL;\r
-        @}\r
-    @}\r
-  ...\r
-@end smallexample\r
-\r
-Here we perform the conversion that might overflow the buffer so that \r
-we are afterwards in the position to make an exact decision about the \r
-buffer size. Please note the @code{NULL} argument for the destination \r
-buffer in the new @code{wcrtomb} call; since we are not interested in the \r
-converted text at this point, this is a nice way to express this. The \r
-most unusual thing about this piece of code certainly is the duplication \r
-of the conversion state object, but if a change of the state is necessary \r
-to emit the next multibyte character, we want to have the same shift state \r
-change performed in the real conversion. Therefore, we have to preserve \r
-the initial shift state information.\r
-\r
-There are certainly many more and even better solutions to this problem.\r
-This example is only provided for educational purposes.\r
-\r
-@node Converting Strings\r
-@subsection Converting Multibyte and Wide Character Strings\r
-\r
-The functions described in the previous section only convert a single\r
-character at a time.  Most operations to be performed in real-world\r
-programs include strings and therefore the @w{ISO C} standard also\r
-defines conversions on entire strings.  However, the defined set of\r
-functions is quite limited; therefore, the GNU C library contains a few\r
-extensions which can help in some important situations.\r
-\r
-@comment wchar.h\r
-@comment ISO\r
-@deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})\r
-The @code{mbsrtowcs} function (``multibyte string restartable to wide\r
-character string'') converts a NUL-terminated multibyte character\r
-string at @code{*@var{src}} into an equivalent wide character string,\r
-including the NUL wide character at the end.  The conversion is started\r
-using the state information from the object pointed to by @var{ps} or\r
-from an internal object of @code{mbsrtowcs} if @var{ps} is a null\r
-pointer. Before returning, the state object is updated to match the state \r
-after the last converted character. The state is the initial state if the\r
-terminating NUL byte is reached and converted.\r
-\r
-If @var{dst} is not a null pointer, the result is stored in the array\r
-pointed to by @var{dst}; otherwise, the conversion result is not\r
-available since it is stored in an internal buffer.\r
-\r
-If @var{len} wide characters are stored in the array @var{dst} before\r
-reaching the end of the input string, the conversion stops and @var{len}\r
-is returned. If @var{dst} is a null pointer, @var{len} is never checked.\r
-\r
-Another reason for a premature return from the function call is if the\r
-input string contains an invalid multibyte sequence.  In this case the\r
-global variable @code{errno} is set to @code{EILSEQ} and the function\r
-returns @code{(size_t) -1}.\r
-\r
-@c XXX The ISO C9x draft seems to have a problem here.  It says that PS\r
-@c is not updated if DST is NULL.  This is not said straightforward and\r
-@c none of the other functions is described like this.  It would make sense\r
-@c to define the function this way but I don't think it is meant like this.\r
-\r
-In all other cases the function returns the number of wide characters\r
-converted during this call. If @var{dst} is not null, @code{mbsrtowcs}\r
-stores in the pointer pointed to by @var{src} either a null pointer (if \r
-the NUL byte in the input string was reached) or the address of the byte\r
-following the last converted multibyte character.\r
-\r
-@pindex wchar.h\r
-@code{mbsrtowcs} was introduced in @w{Amendment 1} to @w{ISO C90} and is\r
-declared in @file{wchar.h}.\r
-@end deftypefun\r
-\r
-The definition of the @code{mbsrtowcs} function has one important\r
-limitation. The requirement that the input at @code{*@var{src}} be a\r
-NUL-terminated string causes problems if one wants to convert buffers\r
-with text. A buffer is normally not a collection of NUL-terminated\r
-strings but instead a continuous collection of lines, separated by\r
-newline characters.  Now\r
-assume that a function to convert one line from a buffer is needed. Since\r
-the line is not NUL-terminated the source pointer cannot directly point\r
-into the unmodified text buffer. This means, either one inserts the NUL\r
-byte at the appropriate place for the time of the @code{mbsrtowcs}\r
-function call (which is not doable for a read-only buffer or in a\r
-multi-threaded application) or one copies the line in an extra buffer\r
-where it can be terminated by a NUL byte. Note that it is not in general \r
-possible to limit the number of characters to convert by setting the \r
-parameter @var{len} to any specific value.  Since it is not known how \r
-many bytes each multibyte character sequence is in length, one can only \r
-guess.\r
-\r
-@cindex stateful\r
-There is still a problem with the method of NUL-terminating a line right\r
-after the newline character which could lead to very strange results.\r
-As said in the description of the @code{mbsrtowcs} function above the\r
-conversion state is guaranteed to be in the initial shift state after\r
-processing the NUL byte at the end of the input string.  But this NUL\r
-byte is not really part of the text.  I.e., the conversion state after\r
-the newline in the original text could be something different than the\r
-initial shift state and therefore the first character of the next line\r
-is encoded using this state.  But the state in question is never\r
-accessible to the user since the conversion stops after the NUL byte\r
-(which resets the state).  Most stateful character sets in use today\r
-require that the shift state after a newline be the initial state--but\r
-this is not a strict guarantee.  Therefore, simply NUL-terminating a\r
-piece of a running text is not always an adequate solution and\r
-should never be used in general-purpose code.\r
-\r
-The generic conversion interface (@pxref{Generic Charset Conversion})\r
-does not have this limitation (it simply works on buffers, not\r
-strings), and the GNU C library contains a set of functions which take\r
-additional parameters specifying the maximal number of bytes which are\r
-consumed from the input string.  This way the problem of\r
-@code{mbsrtowcs}'s example above could be solved by determining the line\r
-length and passing this length to the function.\r
-\r
-@comment wchar.h\r
-@comment ISO\r
-@deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})\r
-The @code{wcsrtombs} function (``wide character string restartable to\r
-multibyte string'') converts the NUL-terminated wide character string at\r
-@code{*@var{src}} into an equivalent multibyte character string and \r
-stores the result in the array pointed to by @var{dst}. The NUL wide\r
-character is also converted. The conversion starts in the state\r
-described in the object pointed to by @var{ps} or by a state object\r
-local to @code{wcsrtombs} in case @var{ps} is a null pointer. If\r
-@var{dst} is a null pointer, the conversion is performed as usual but the\r
-result is not available. If all characters of the input string were\r
-successfully converted and if @var{dst} is not a null pointer, the \r
-pointer pointed to by @var{src} gets assigned a null pointer.\r
-\r
-If one of the wide characters in the input string has no valid multibyte\r
-character equivalent, the conversion stops early, sets the global\r
-variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}.\r
-\r
-Another reason for a premature stop is if @var{dst} is not a null\r
-pointer and the next converted character would bring the total to more\r
-than @var{len} bytes stored in the array @var{dst}. In this case\r
-the pointer pointed to by @var{src} is\r
-assigned a value pointing to the wide character right after the last one\r
-successfully converted.\r
-\r
-Except in the case of an encoding error the return value of the \r
-@code{wcsrtombs} function is the number of bytes in all the multibyte \r
-character sequences stored in @var{dst}. Before returning the state in \r
-the object pointed to by @var{ps} (or the internal object in case \r
-@var{ps} is a null pointer) is updated to reflect the state after the \r
-last conversion. The state is the initial shift state in case the \r
-terminating NUL wide character was converted.\r
-\r
-@pindex wchar.h\r
-The @code{wcsrtombs} function was introduced in @w{Amendment 1} to \r
-@w{ISO C90} and is declared in @file{wchar.h}.\r
-@end deftypefun\r
-\r
-The restriction mentioned above for the @code{mbsrtowcs} function applies\r
-here also. There is no possibility of directly controlling the number of\r
-input characters. One has to place the NUL wide character at the correct \r
-place or control the consumed input indirectly via the available output \r
-array size (the @var{len} parameter).\r
-\r
-@comment wchar.h\r
-@comment GNU\r
-@deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps})\r
-The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs}\r
-function. All the parameters are the same except for @var{nmc} which is\r
-new. The return value is the same as for @code{mbsrtowcs}.\r
-\r
-This new parameter specifies how many bytes at most can be used from the\r
-multibyte character string.  In other words, the multibyte character \r
-string @code{*@var{src}} need not be NUL-terminated. But if a NUL byte is\r
-found within the @var{nmc} first bytes of the string, the conversion \r
-stops here.\r
-\r
-This function is a GNU extension. It is meant to work around the\r
-problems mentioned above. Now it is possible to convert a buffer with\r
-multibyte character text piece for piece without having to care about\r
-inserting NUL bytes and the effect of NUL bytes on the conversion state.\r
-@end deftypefun\r
-\r
-A function to convert a multibyte string into a wide character string\r
-and display it could be written like this (this is not a really useful\r
-example):\r
-\r
-@smallexample\r
-void\r
-showmbs (const char *src, FILE *fp)\r
-@{\r
-  mbstate_t state;\r
-  int cnt = 0;\r
-  memset (&state, '\0', sizeof (state));\r
-  while (1)\r
-    @{\r
-      wchar_t linebuf[100];\r
-      const char *endp = strchr (src, '\n');\r
-      size_t n;\r
-\r
-      /* @r{Exit if there is no more line.}  */\r
-      if (endp == NULL)\r
-        break;\r
-\r
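-      n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state);\r
-      if (n == (size_t) -1)\r
-        /* @r{Invalid multibyte sequence.}  */\r
-        break;\r
-      linebuf[n] = L'\0';\r
-      fprintf (fp, "line %d: \"%S\"\n", ++cnt, linebuf);\r
-      /* @r{Skip the newline once the whole line is converted.}  */\r
-      if (src == endp)\r
-        ++src;\r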
-    @}\r
-@}\r
-@end smallexample\r
-\r
-There is no problem with the state after a call to @code{mbsnrtowcs}.\r
-Since we don't insert characters in the strings which were not in there\r
-right from the beginning and we use @var{state} only for the conversion\r
-of the given buffer, there is no problem with altering the state.\r
-\r
-@comment wchar.h\r
-@comment GNU\r
-@deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps})\r
-The @code{wcsnrtombs} function implements the conversion from wide\r
-character strings to multibyte character strings. It is similar to\r
-@code{wcsrtombs} but, just like @code{mbsnrtowcs}, it takes an extra\r
-parameter, which specifies the length of the input string.\r
-\r
-No more than @var{nwc} wide characters from the input string\r
-@code{*@var{src}} are converted.  If the input string contains a NUL\r
-wide character in the first @var{nwc} characters, the conversion stops at\r
-this place.\r
-\r
-The @code{wcsnrtombs} function is a GNU extension and just like \r
-@code{mbsnrtowcs} helps in situations where no NUL-terminated input \r
-strings are available.\r
-@end deftypefun\r
-\r
-\r
-@node Multibyte Conversion Example\r
-@subsection A Complete Multibyte Conversion Example\r
-\r
-The example programs given in the last sections are only brief and do\r
-not contain all the error checking etc.  Presented here is a complete\r
-and documented example.  It features the @code{mbrtowc} function but it\r
-should be easy to derive versions using the other functions.\r
-\r
-@smallexample\r
-int\r
-file_mbsrtowcs (int input, int output)\r
-@{\r
-  /* @r{Note the use of @code{MB_LEN_MAX}.}\r
-     @r{@code{MB_CUR_MAX} cannot portably be used here.}  */\r
-  char buffer[BUFSIZ + MB_LEN_MAX];\r
-  mbstate_t state;\r
-  int filled = 0;\r
-  int eof = 0;\r
-\r
-  /* @r{Initialize the state.}  */\r
-  memset (&state, '\0', sizeof (state));\r
-\r
-  while (!eof)\r
-    @{\r
-      ssize_t nread;\r
-      ssize_t nwrite;\r
-      char *inp = buffer;\r
-      wchar_t outbuf[BUFSIZ];\r
-      wchar_t *outp = outbuf;\r
-\r
-      /* @r{Fill up the buffer from the input file.}  */\r
-      nread = read (input, buffer + filled, BUFSIZ);\r
-      if (nread < 0)\r
-        @{\r
-          perror ("read");\r
-          return 0;\r
-        @}\r
-      /* @r{If we reach end of file, make a note to read no more.} */\r
-      if (nread == 0)\r
-        eof = 1;\r
-\r
-      /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */\r
-      filled += nread;\r
-\r
-      /* @r{Convert those bytes to wide characters--as many as we can.} */\r
-      while (1)\r
-        @{\r
-          size_t thislen = mbrtowc (outp, inp, filled, &state);\r
-          /* @r{Stop converting at invalid character;}\r
-             @r{this can mean we have read just the first part}\r
-             @r{of a valid character.}  */\r
-          if (thislen == (size_t) -1)\r
-            break;\r
-          /* @r{We want to handle embedded NUL bytes}\r
-             @r{but the return value is 0.  Correct this.}  */\r
-          if (thislen == 0)\r
-            thislen = 1;\r
-          /* @r{Advance past this character.} */\r
-          inp += thislen;\r
-          filled -= thislen;\r
-          ++outp;\r
-        @}\r
-\r
-      /* @r{Write the wide characters we just made.}  */\r
-      nwrite = write (output, outbuf,\r
-                      (outp - outbuf) * sizeof (wchar_t));\r
-      if (nwrite < 0)\r
-        @{\r
-          perror ("write");\r
-          return 0;\r
-        @}\r
-\r
-      /* @r{See if we have a @emph{real} invalid character.} */\r
-      if ((eof && filled > 0) || filled >= MB_CUR_MAX)\r
-        @{\r
-          error (0, 0, "invalid multibyte character");\r
-          return 0;\r
-        @}\r
-\r
-      /* @r{If any characters must be carried forward,}\r
-         @r{put them at the beginning of @code{buffer}.} */\r
-      if (filled > 0)\r
-        memmove (buffer, inp, filled);\r
-    @}\r
-\r
-  return 1;\r
-@}\r
-@end smallexample\r
-\r
-\r
-@node Non-reentrant Conversion\r
-@section Non-reentrant Conversion Function\r
-\r
-The functions described in the previous section are defined in\r
-@w{Amendment 1} to @w{ISO C90}, but the original @w{ISO C90} standard \r
-also contained functions for character set conversion. The reason that \r
-these original functions are not described first is that they are almost \r
-entirely useless.\r
-\r
-The problem is that all the conversion functions described in the \r
-original @w{ISO C90} use a local state. Using a local state implies that \r
-multiple conversions at the same time (not only when using threads) \r
-cannot be done, and that you cannot first convert single characters and \r
-then strings since you cannot tell the conversion functions which state \r
-to use.\r
-\r
-These original functions are therefore usable only in a very limited set \r
-of situations. One must complete converting the entire string before\r
-starting a new one, and each string/text must be converted with the same\r
-function (there is no problem with the library itself; it is guaranteed\r
-that no library function changes the state of any of these functions).\r
-@strong{For the above reasons it is strongly recommended that the functions\r
-described in the previous section be used in place of non-reentrant\r
-conversion functions.}\r
-\r
-@menu\r
-* Non-reentrant Character Conversion::  Non-reentrant Conversion of Single\r
-                                         Characters.\r
-* Non-reentrant String Conversion::     Non-reentrant Conversion of Strings.\r
-* Shift State::                         States in Non-reentrant Functions.\r
-@end menu\r
-\r
-@node Non-reentrant Character Conversion\r
-@subsection Non-reentrant Conversion of Single Characters\r
-\r
-@comment stdlib.h\r
-@comment ISO\r
-@deftypefun int mbtowc (wchar_t *restrict @var{result}, const char *restrict @var{string}, size_t @var{size})\r
-The @code{mbtowc} (``multibyte to wide character'') function when called\r
-with non-null @var{string} converts the first multibyte character\r
-beginning at @var{string} to its corresponding wide character code.  It\r
-stores the result in @code{*@var{result}}.\r
-\r
-@code{mbtowc} never examines more than @var{size} bytes.  (The idea is\r
-to supply for @var{size} the number of bytes of data you have in hand.)\r
-\r
-@code{mbtowc} with non-null @var{string} distinguishes three\r
-possibilities: the first @var{size} bytes at @var{string} start with\r
-valid multibyte characters, they start with an invalid byte sequence or\r
-just part of a character, or @var{string} points to an empty string (a\r
-null character).\r
-\r
-For a valid multibyte character, @code{mbtowc} converts it to a wide\r
-character and stores that in @code{*@var{result}}, and returns the\r
-number of bytes in that character (always at least @math{1} and never\r
-more than @var{size}).\r
-\r
-For an invalid byte sequence, @code{mbtowc} returns @math{-1}.  For an\r
-empty string, it returns @math{0}, also storing @code{'\0'} in\r
-@code{*@var{result}}.\r
-\r
-If the multibyte character code uses shift characters, then\r
-@code{mbtowc} maintains and updates a shift state as it scans.  If you\r
-call @code{mbtowc} with a null pointer for @var{string}, that\r
-initializes the shift state to its standard initial value.  It also\r
-returns nonzero if the multibyte character code in use actually has a\r
-shift state.  @xref{Shift State}.\r
-@end deftypefun\r
-\r
-@comment stdlib.h\r
-@comment ISO\r
-@deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar})\r
-The @code{wctomb} (``wide character to multibyte'') function converts\r
-the wide character code @var{wchar} to its corresponding multibyte\r
-character sequence, and stores the result in bytes starting at\r
-@var{string}.  At most @code{MB_CUR_MAX} characters are stored.\r
-\r
-@code{wctomb} with non-null @var{string} distinguishes three\r
-possibilities for @var{wchar}: a valid wide character code (one that can\r
-be translated to a multibyte character), an invalid code, and @code{L'\0'}.\r
-\r
-Given a valid code, @code{wctomb} converts it to a multibyte character,\r
-storing the bytes starting at @var{string}.  Then it returns the number\r
-of bytes in that character (always at least @math{1} and never more\r
-than @code{MB_CUR_MAX}).\r
-\r
-If @var{wchar} is an invalid wide character code, @code{wctomb} returns\r
-@math{-1}.  If @var{wchar} is @code{L'\0'}, it returns @code{0}, also\r
-storing @code{'\0'} in @code{*@var{string}}.\r
-\r
-If the multibyte character code uses shift characters, then\r
-@code{wctomb} maintains and updates a shift state as it scans. If you\r
-call @code{wctomb} with a null pointer for @var{string}, that\r
-initializes the shift state to its standard initial value.  It also\r
-returns nonzero if the multibyte character code in use actually has a\r
-shift state.  @xref{Shift State}.\r
-\r
-Calling this function with a @var{wchar} argument of zero when\r
-@var{string} is not null has the side-effect of reinitializing the\r
-stored shift state @emph{as well as} storing the multibyte character\r
-@code{'\0'} and returning @math{0}.\r
-@end deftypefun\r
-\r
-Similar to @code{mbrlen} there is also a non-reentrant function which\r
-computes the length of a multibyte character.  It can be defined in\r
-terms of @code{mbtowc}.\r
-\r
-@comment stdlib.h\r
-@comment ISO\r
-@deftypefun int mblen (const char *@var{string}, size_t @var{size})\r
-The @code{mblen} function with a non-null @var{string} argument returns\r
-the number of bytes that make up the multibyte character beginning at\r
-@var{string}, never examining more than @var{size} bytes.  (The idea is\r
-to supply for @var{size} the number of bytes of data you have in hand.)\r
-\r
-The return value of @code{mblen} distinguishes three possibilities: the\r
-first @var{size} bytes at @var{string} start with valid multibyte\r
-characters, they start with an invalid byte sequence or just part of a\r
-character, or @var{string} points to an empty string (a null character).\r
-\r
-For a valid multibyte character, @code{mblen} returns the number of\r
-bytes in that character (always at least @code{1} and never more than\r
-@var{size}). For an invalid byte sequence, @code{mblen} returns \r
-@math{-1}. For an empty string, it returns @math{0}.\r
-\r
-If the multibyte character code uses shift characters, then @code{mblen}\r
-maintains and updates a shift state as it scans.  If you call\r
-@code{mblen} with a null pointer for @var{string}, that initializes the\r
-shift state to its standard initial value.  It also returns a nonzero\r
-value if the multibyte character code in use actually has a shift state.\r
-@xref{Shift State}.\r
-\r
-@pindex stdlib.h\r
-The function @code{mblen} is declared in @file{stdlib.h}.\r
-@end deftypefun\r
-\r
-\r
-@node Non-reentrant String Conversion\r
-@subsection Non-reentrant Conversion of Strings\r
-\r
-For convenience the @w{ISO C90} standard also defines functions to \r
-convert entire strings instead of single characters. These functions\r
-suffer from the same problems as their reentrant counterparts from\r
-@w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}.\r
-\r
-@comment stdlib.h\r
-@comment ISO\r
-@deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size})\r
-The @code{mbstowcs} (``multibyte string to wide character string'')\r
-function converts the null-terminated string of multibyte characters\r
-@var{string} to an array of wide character codes, storing not more than\r
-@var{size} wide characters into the array beginning at @var{wstring}.\r
-The terminating null character counts towards the size, so if @var{size}\r
-is less than the actual number of wide characters resulting from\r
-@var{string}, no terminating null character is stored.\r
-\r
-The conversion of characters from @var{string} begins in the initial\r
-shift state.\r
-\r
-If an invalid multibyte character sequence is found, the @code{mbstowcs} \r
-function returns a value of @math{-1}. Otherwise, it returns the number \r
-of wide characters stored in the array @var{wstring}. This number does \r
-not include the terminating null character, which is present if the \r
-number is less than @var{size}.\r
-\r
-Here is an example showing how to convert a string of multibyte\r
-characters, allocating enough space for the result.\r
-\r
-@smallexample\r
-wchar_t *\r
-mbstowcs_alloc (const char *string)\r
-@{\r
-  size_t size = strlen (string) + 1;\r
-  wchar_t *buf = xmalloc (size * sizeof (wchar_t));\r
-\r
-  size = mbstowcs (buf, string, size);\r
-  if (size == (size_t) -1)\r
-    @{\r
-      /* @r{Do not leak the buffer on a conversion error.}  */\r
-      free (buf);\r
-      return NULL;\r
-    @}\r
-  buf = xrealloc (buf, (size + 1) * sizeof (wchar_t));\r
-  return buf;\r
-@}\r
-@end smallexample\r
-\r
-@end deftypefun\r
-\r
-@comment stdlib.h\r
-@comment ISO\r
-@deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size})\r
-The @code{wcstombs} (``wide character string to multibyte string'')\r
-function converts the null-terminated wide character array @var{wstring}\r
-into a string containing multibyte characters, storing not more than\r
-@var{size} bytes starting at @var{string}, followed by a terminating\r
-null character if there is room. The conversion of characters begins in\r
-the initial shift state.\r
-\r
-The terminating null character counts towards the size, so if @var{size}\r
-is less than or equal to the number of bytes needed for @var{wstring}, no\r
-terminating null character is stored.\r
-\r
-If a code that does not correspond to a valid multibyte character is\r
-found, the @code{wcstombs} function returns a value of @math{-1}. \r
-Otherwise, the return value is the number of bytes stored in the array \r
-@var{string}. This number does not include the terminating null character, \r
-which is present if the number is less than @var{size}.\r
-@end deftypefun\r
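The reverse direction can be handled in the same way as the @code{mbstowcs} example above. The following is only a minimal sketch: the name @code{wcstombs_alloc} is hypothetical, plain @code{malloc} is used instead of @code{xmalloc}, and the buffer bound assumes that @code{MB_CUR_MAX} bytes per wide character are enough in the current locale.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert WSTRING to a newly allocated multibyte
   string in the encoding of the current LC_CTYPE locale.  Returns NULL
   if a character cannot be converted or if allocation fails.  */
char *
wcstombs_alloc (const wchar_t *wstring)
{
  /* Assumption: no character needs more than MB_CUR_MAX bytes, so
     this bound leaves room for the terminating null byte.  */
  size_t size = wcslen (wstring) * MB_CUR_MAX + 1;
  char *buf = malloc (size);
  size_t len;

  if (buf == NULL)
    return NULL;

  len = wcstombs (buf, wstring, size);
  if (len == (size_t) -1)
    {
      /* Invalid wide character: release the buffer before failing.  */
      free (buf);
      return NULL;
    }

  /* LEN is less than SIZE here, so the terminator was stored.  */
  return buf;
}
```

The caller owns the returned string and releases it with @code{free}.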
-\r
-@node Shift State\r
-@subsection States in Non-reentrant Functions\r
-\r
-In some multibyte character codes, the @emph{meaning} of any particular\r
-byte sequence is not fixed; it depends on what other sequences have come\r
-earlier in the same string. Typically there are just a few sequences that \r
-can change the meaning of other sequences; these few are called \r
-@dfn{shift sequences} and we say that they set the @dfn{shift state} for\r
-other sequences that follow.\r
-\r
-To illustrate shift state and shift sequences, suppose we decide that\r
-the sequence @code{0200} (just one byte) enters Japanese mode, in which\r
-pairs of bytes in the range from @code{0240} to @code{0377} are single\r
-characters, while @code{0201} enters Latin-1 mode, in which single bytes\r
-in the range from @code{0240} to @code{0377} are characters, and\r
-interpreted according to the ISO Latin-1 character set.  This is a\r
-multibyte code which has two alternative shift states (``Japanese mode''\r
-and ``Latin-1 mode''), and two shift sequences that specify particular\r
-shift states.\r
-\r
-When the multibyte character code in use has shift states, then\r
-@code{mblen}, @code{mbtowc}, and @code{wctomb} must maintain and update\r
-the current shift state as they scan the string. To make this work\r
-properly, you must follow these rules:\r
-\r
-@itemize @bullet\r
-@item\r
-Before starting to scan a string, call the function with a null pointer\r
-for the multibyte character address---for example, @code{mblen (NULL,\r
-0)}. This initializes the shift state to its standard initial value.\r
-\r
-@item\r
-Scan the string one character at a time, in order. Do not ``back up''\r
-and rescan characters already scanned, and do not intersperse the\r
-processing of different strings.\r
-@end itemize\r
-\r
-Here is an example of using @code{mblen} following these rules:\r
-\r
-@smallexample\r
-void\r
-scan_string (char *s)\r
-@{\r
-  int length = strlen (s);\r
-\r
-  /* @r{Initialize shift state.}  */\r
-  mblen (NULL, 0);\r
-\r
-  while (1)\r
-    @{\r
-      int thischar = mblen (s, length);\r
-      /* @r{Deal with end of string and invalid characters.}  */\r
-      if (thischar == 0)\r
-        break;\r
-      if (thischar == -1)\r
-        @{\r
-          error ("invalid multibyte character");\r
-          break;\r
-        @}\r
-      /* @r{Advance past this character.}  */\r
-      s += thischar;\r
-      length -= thischar;\r
-    @}\r
-@}\r
-@end smallexample\r
-\r
-The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not\r
-reentrant when using a multibyte code that uses a shift state.  However,\r
-no other library functions call these functions, so you don't have to\r
-worry that the shift state will be changed mysteriously.\r
-\r
-\r
-@node Generic Charset Conversion\r
-@section Generic Charset Conversion\r
-\r
-The conversion functions mentioned so far in this chapter all had in\r
-common that they operate on character sets that are not directly\r
-specified by the functions. The multibyte encoding used is specified by\r
-the currently selected locale for the @code{LC_CTYPE} category. The\r
-wide character set is fixed by the implementation (in the case of the GNU C\r
-library it is always UCS-4 encoded @w{ISO 10646}).\r
-\r
-This of course causes several problems when it comes to general character\r
-conversion:\r
-\r
-@itemize @bullet\r
-@item\r
-For every conversion where neither the source nor the destination \r
-character set is the character set of the locale for the @code{LC_CTYPE} \r
-category, one has to change the @code{LC_CTYPE} locale using \r
-@code{setlocale}.\r
-\r
-Changing the @code{LC_CTYPE} locale introduces major problems for the rest \r
-of the program since several more functions (e.g., the character \r
-classification functions, @pxref{Classification of Characters}) use the \r
-@code{LC_CTYPE} category.\r
-\r
-@item\r
-Parallel conversions to and from different character sets are not\r
-possible since the @code{LC_CTYPE} selection is global and shared by all\r
-threads.\r
-\r
-@item\r
-If neither the source nor the destination character set is the character\r
-set used for @code{wchar_t} representation, there is at least a two-step\r
-process necessary to convert a text using the functions above. One would \r
-have to select the source character set as the multibyte encoding, \r
-convert the text into a @code{wchar_t} text, select the destination\r
-character set as the multibyte encoding, and convert the wide character\r
-text to the multibyte (@math{=} destination) character set.\r
-\r
-Even if this is possible (which is not guaranteed) it is very tedious\r
-work.  It also suffers even more from the other two problems raised above,\r
-because the locale has to be changed constantly.\r
-@end itemize\r
-\r
-The XPG2 standard defines a completely new set of functions which has\r
-none of these limitations. They are not at all coupled to the selected\r
-locales, and they have no constraints on the character sets selected for\r
-source and destination. Only the set of available conversions limits \r
-them. The standard does not specify that any conversion at all must be \r
-available. Such availability is a measure of the quality of the \r
-implementation.\r
-\r
-In the following text, first the interface to @code{iconv} and then the\r
-conversion function will be described. Comparisons with other\r
-implementations will show what obstacles stand in the way of portable\r
-applications. Finally, the implementation is described in so far as might \r
-interest the advanced user who wants to extend conversion capabilities.\r
-\r
-@menu\r
-* Generic Conversion Interface::    Generic Character Set Conversion Interface.\r
-* iconv Examples::                  A complete @code{iconv} example.\r
-* Other iconv Implementations::     Some Details about other @code{iconv}\r
-                                     Implementations.\r
-* glibc iconv Implementation::      The @code{iconv} Implementation in the GNU C\r
-                                     library.\r
-@end menu\r
-\r
-@node Generic Conversion Interface\r
-@subsection Generic Character Set Conversion Interface\r
-\r
-This set of functions follows the traditional cycle of using a resource:\r
-open--use--close.  The interface consists of three functions, each of\r
-which implements one step.\r
-\r
-Before the interfaces are described it is necessary to introduce a\r
-data type.  Just like other open--use--close interfaces the functions\r
-introduced here work using handles and the @file{iconv.h} header\r
-defines a special type for the handles used.\r
-\r
-@comment iconv.h\r
-@comment XPG2\r
-@deftp {Data Type} iconv_t\r
-This data type is an abstract type defined in @file{iconv.h}.  The user\r
-must not assume anything about the definition of this type; it must be\r
-completely opaque.\r
-\r
-Objects of this type can get assigned handles for the conversions using\r
-the @code{iconv} functions. The objects themselves need not be freed, but\r
-the conversions for which the handles stand have to be.\r
-@end deftp\r
-\r
-@noindent\r
-The first step is the function to create a handle.\r
-\r
-@comment iconv.h\r
-@comment XPG2\r
-@deftypefun iconv_t iconv_open (const char *@var{tocode}, const char *@var{fromcode})\r
-The @code{iconv_open} function has to be used before starting a\r
-conversion.  The two parameters this function takes determine the\r
-source and destination character set for the conversion, and if the\r
-implementation has the possibility to perform such a conversion, the\r
-function returns a handle.\r
-\r
-If the wanted conversion is not available, the @code{iconv_open} function \r
-returns @code{(iconv_t) -1}. In this case the global variable \r
-@code{errno} can have the following values:\r
-\r
-@table @code\r
-@item EMFILE\r
-The process already has @code{OPEN_MAX} file descriptors open.\r
-@item ENFILE\r
-The system limit of open files is reached.\r
-@item ENOMEM\r
-Not enough memory to carry out the operation.\r
-@item EINVAL\r
-The conversion from @var{fromcode} to @var{tocode} is not supported.\r
-@end table\r
-\r
-It is not possible to use the same descriptor in different threads to\r
-perform independent conversions. The data structures associated\r
-with the descriptor include information about the conversion state.\r
-This must not be messed up by using it in different conversions.\r
-\r
-An @code{iconv} descriptor is like a file descriptor in that a new\r
-descriptor must be created for every use. The descriptor does not stand\r
-for all of the conversions from @var{fromcode} to @var{tocode}.\r
-\r
-The GNU C library implementation of @code{iconv_open} has one\r
-significant extension to other implementations. To ease the extension\r
-of the set of available conversions, the implementation allows storing\r
-the necessary files with data and code in an arbitrary number of \r
-directories. How this extension must be written will be explained below\r
-(@pxref{glibc iconv Implementation}). Here it is only important to say\r
-that all directories mentioned in the @code{GCONV_PATH} environment\r
-variable are considered only if they contain a file @file{gconv-modules}.\r
-These directories need not necessarily be created by the system\r
-administrator. In fact, this extension is introduced to help users\r
-write and use their own new conversions. Of course, this does not \r
-work for security reasons in SUID binaries; in this case only the system\r
-directory is considered and this normally is \r
-@file{@var{prefix}/lib/gconv}. The @code{GCONV_PATH} environment variable \r
-is examined exactly once at the first call of the @code{iconv_open} \r
-function. Later modifications of the variable have no effect.\r
-\r
-@pindex iconv.h\r
-The @code{iconv_open} function was introduced early in the X/Open \r
-Portability Guide, @w{version 2}. It is supported by all commercial \r
-Unices as it is required for the Unix branding. However, the quality and \r
-completeness of the implementation varies widely. The @code{iconv_open} \r
-function is declared in @file{iconv.h}.\r
-@end deftypefun\r
-\r
-The @code{iconv} implementation can associate large data structures with\r
-the handle returned by @code{iconv_open}. Therefore, it is crucial to \r
-free all the resources once all conversions are carried out and the \r
-conversion is not needed anymore.\r
-\r
-@comment iconv.h\r
-@comment XPG2\r
-@deftypefun int iconv_close (iconv_t @var{cd})\r
-The @code{iconv_close} function frees all resources associated with the\r
-handle @var{cd}, which must have been returned by a successful call to\r
-the @code{iconv_open} function.\r
-\r
-If the function call was successful the return value is @math{0}.\r
-Otherwise it is @math{-1} and @code{errno} is set appropriately.\r
-Defined errors are:\r
-\r
-@table @code\r
-@item EBADF\r
-The conversion descriptor is invalid.\r
-@end table\r
-\r
-@pindex iconv.h\r
-The @code{iconv_close} function was introduced together with the rest \r
-of the @code{iconv} functions in XPG2 and is declared in @file{iconv.h}.\r
-@end deftypefun\r
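The open and close steps can be sketched as follows. This is an illustration only: the helper name @code{conversion_available} is hypothetical, and the set of charset names accepted by @code{iconv_open} depends on the implementation.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>

/* Hypothetical helper: report whether a conversion from FROMCODE to
   TOCODE can be opened.  Returns 1 if it is available, 0 if it is not
   supported, and -1 on any other failure.  */
int
conversion_available (const char *tocode, const char *fromcode)
{
  iconv_t cd = iconv_open (tocode, fromcode);

  if (cd == (iconv_t) -1)
    {
      /* EINVAL means the conversion itself is not supported.  */
      if (errno == EINVAL)
        return 0;
      perror ("iconv_open");
      return -1;
    }

  /* Release the resources tied to the handle once it is unused.  */
  if (iconv_close (cd) != 0)
    {
      perror ("iconv_close");
      return -1;
    }
  return 1;
}
```

A call such as @code{conversion_available ("UTF-8", "ASCII")} opens the handle, checks it against @code{(iconv_t) -1}, and closes it again, so every successful @code{iconv_open} is paired with an @code{iconv_close}.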
-\r
-The standard defines only one actual conversion function.  This has,\r
-therefore, the most general interface: it allows conversion from one\r
-buffer to another.  Conversion from a file to a buffer, vice versa, or\r
-even file to file can be implemented on top of it.\r
-\r
-@comment iconv.h\r
-@comment XPG2\r
-@deftypefun size_t iconv (iconv_t @var{cd}, char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})\r
-@cindex stateful\r
-The @code{iconv} function converts the text in the input buffer\r
-according to the rules associated with the descriptor @var{cd} and\r
-stores the result in the output buffer. It is possible to call the\r
-function for the same text several times in a row since for stateful\r
-character sets the necessary state information is kept in the data\r
-structures associated with the descriptor.\r
-\r
-The input buffer is specified by @code{*@var{inbuf}} and it contains\r
-@code{*@var{inbytesleft}} bytes.  The extra indirection is necessary for\r
-communicating the used input back to the caller (see below).  It is\r
-important to note that the buffer pointer is of type @code{char} and the\r
-length is measured in bytes even if the input text is encoded in wide\r
-characters.\r
-\r
-The output buffer is specified in a similar way.  @code{*@var{outbuf}}\r
-points to the beginning of the buffer with at least\r
-@code{*@var{outbytesleft}} bytes room for the result.  The buffer\r
-pointer again is of type @code{char} and the length is measured in\r
-bytes.  If @var{outbuf} or @code{*@var{outbuf}} is a null pointer, the\r
-conversion is performed but no output is available.\r
-\r
-If @var{inbuf} is a null pointer, the @code{iconv} function performs the\r
-necessary action to put the state of the conversion into the initial\r
-state. This is obviously a no-op for non-stateful encodings, but if the\r
-encoding has a state, such a function call might put some byte sequences\r
-in the output buffer, which perform the necessary state changes. The\r
-next call with @var{inbuf} not being a null pointer then simply goes on\r
-from the initial state. It is important that the programmer never makes\r
-any assumption as to whether the conversion has to deal with states. Even \r
-if the input and output character sets are not stateful, the \r
-implementation might still have to keep states. This is due to the\r
-implementation chosen for the GNU C library as it is described below.\r
-Therefore an @code{iconv} call to reset the state should always be\r
-performed if some protocol requires this for the output text.\r
-\r
-The conversion stops for one of three reasons. The first is that all\r
-characters from the input buffer are converted. This actually can mean\r
-two things: either all bytes from the input buffer are consumed or\r
-there are some bytes at the end of the buffer that possibly can form a\r
-complete character but the input is incomplete. The second reason for a\r
-stop is that the output buffer is full. And the third reason is that\r
-the input contains invalid characters.\r
-\r
-In all of these cases the buffer pointers after the last successful\r
-conversion, for input and output buffer, are stored in @var{inbuf} and\r
-@var{outbuf}, and the available room in each buffer is stored in\r
-@var{inbytesleft} and @var{outbytesleft}.\r
-\r
-Since the character sets selected in the @code{iconv_open} call can be\r
-almost arbitrary, there can be situations where the input buffer contains\r
-valid characters, which have no identical representation in the output\r
-character set. The behavior in this situation is undefined. The\r
-@emph{current} behavior of the GNU C library in this situation is to\r
-return with an error immediately. This certainly is not the most\r
-desirable solution; therefore, future versions will provide better ones,\r
-but they are not yet finished.\r
-\r
-If all input from the input buffer is successfully converted and stored\r
-in the output buffer, the function returns the number of non-reversible\r
-conversions performed. In all other cases the return value is\r
-@code{(size_t) -1} and @code{errno} is set appropriately. In such cases\r
-the value pointed to by @var{inbytesleft} is nonzero.\r
-\r
-@table @code\r
-@item EILSEQ\r
-The conversion stopped because of an invalid byte sequence in the input.\r
-After the call, @code{*@var{inbuf}} points at the first byte of the\r
-invalid byte sequence.\r
-\r
-@item E2BIG\r
-The conversion stopped because it ran out of space in the output buffer.\r
-\r
-@item EINVAL\r
-The conversion stopped because of an incomplete byte sequence at the end\r
-of the input buffer.\r
-\r
-@item EBADF\r
-The @var{cd} argument is invalid.\r
-@end table\r
-\r
-@pindex iconv.h\r
-The @code{iconv} function was introduced in the XPG2 standard and is \r
-declared in the @file{iconv.h} header.\r
-@end deftypefun\r
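The pointer and length bookkeeping described above might be used as in the following sketch, which converts one complete in-memory buffer. The helper name @code{convert_buffer} and its return convention are assumptions made for illustration; it does not retry on @code{E2BIG} and treats every error as fatal.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert INLEN bytes at IN with the open
   descriptor CD into OUT, which has room for OUTSIZE bytes.  Returns
   the number of bytes written, or (size_t) -1 with errno set by
   iconv (EILSEQ, EINVAL, or E2BIG).  */
size_t
convert_buffer (iconv_t cd, const char *in, size_t inlen,
                char *out, size_t outsize)
{
  /* iconv takes char **; the input text itself is not modified.  */
  char *inptr = (char *) in;
  char *outptr = out;
  size_t inleft = inlen;
  size_t outleft = outsize;

  /* Start from the initial state in case CD was used before.  */
  iconv (cd, NULL, NULL, NULL, NULL);

  /* Convert; iconv advances the pointers and decrements the counts.  */
  if (iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1)
    return (size_t) -1;

  /* Emit any byte sequence needed to return to the initial state.  */
  if (iconv (cd, NULL, NULL, &outptr, &outleft) == (size_t) -1)
    return (size_t) -1;

  return outsize - outleft;
}
```

Because the counts are updated in place, the number of bytes produced is simply the difference between the original and the remaining output space.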
-\r
-The definition of the @code{iconv} function is quite good overall. It\r
-provides quite flexible functionality. The only problems lie in the\r
-boundary cases, which are incomplete byte sequences at the end of the\r
-input buffer and invalid input. A third problem, which is not really\r
-a design problem, is the way conversions are selected. The standard\r
-does not say anything about the legitimate names or a minimal set of\r
-available conversions. We will see how this negatively impacts other\r
-implementations, as demonstrated below.\r
-\r
-@node iconv Examples\r
-@subsection A complete @code{iconv} example\r
-\r
-The example below features a solution for a common problem.  Given that\r
-one knows the internal encoding used by the system for @code{wchar_t}\r
-strings, one often is in the position to read text from a file and store\r
-it in wide character buffers. One can do this using @code{mbsrtowcs},\r
-but then we run into the problems discussed above.\r
-\r
-@smallexample\r
-int\r
-file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail)\r
-@{\r
-  char inbuf[BUFSIZ];\r
-  size_t insize = 0;\r
-  char *wrptr = (char *) outbuf;\r
-  int result = 0;\r
-  iconv_t cd;\r
-\r
-  cd = iconv_open ("WCHAR_T", charset);\r
-  if (cd == (iconv_t) -1)\r
-    @{\r
-      /* @r{Something went wrong.}  */\r
-      if (errno == EINVAL)\r
-        error (0, 0, "conversion from '%s' to wchar_t not available",\r
-               charset);\r
-      else\r
-        perror ("iconv_open");\r
-\r
-      /* @r{Terminate the output string.}  */\r
-      *outbuf = L'\0';\r
-\r
-      return -1;\r
-    @}\r
-\r
-  while (avail > 0)\r
-    @{\r
-      size_t nread;\r
-      size_t nconv;\r
-      char *inptr = inbuf;\r
-\r
-      /* @r{Read more input.}  */\r
-      nread = read (fd, inbuf + insize, sizeof (inbuf) - insize);\r
-      if (nread == 0)\r
-        @{\r
-          /* @r{When we come here the file is completely read.}\r
-             @r{This still could mean there are some unused}\r
-             @r{characters in the @code{inbuf}.  Put them back.}  */\r
-          if (lseek (fd, -(off_t) insize, SEEK_CUR) == -1)\r
-            result = -1;\r
-\r
-          /* @r{Now write out the byte sequence to get into the}\r
-             @r{initial state if this is necessary.}  */\r
-          iconv (cd, NULL, NULL, &wrptr, &avail);\r
-\r
-          break;\r
-        @}\r
-      insize += nread;\r
-\r
-      /* @r{Do the conversion.}  */\r
-      nconv = iconv (cd, &inptr, &insize, &wrptr, &avail);\r
-      if (nconv == (size_t) -1)\r
-        @{\r
-          /* @r{Not everything went right.  It might only be}\r
-             @r{an unfinished byte sequence at the end of the}\r
-             @r{buffer.  Or it is a real problem.}  */\r
-          if (errno == EINVAL)\r
-            /* @r{This is harmless.  Simply move the unused}\r
-               @r{bytes to the beginning of the buffer so that}\r
-               @r{they can be used in the next round.}  */\r
-            memmove (inbuf, inptr, insize);\r
-          else\r
-            @{\r
-              /* @r{It is a real problem.  Maybe we ran out of}\r
-                 @r{space in the output buffer or we have invalid}\r
-                 @r{input.  In any case back the file pointer to}\r
-                 @r{the position of the last processed byte.}  */\r
-              lseek (fd, -(off_t) insize, SEEK_CUR);\r
-              result = -1;\r
-              break;\r
-            @}\r
-        @}\r
-    @}\r
-\r
-  /* @r{Terminate the output string.}  */\r
-  if (avail >= sizeof (wchar_t))\r
-    *((wchar_t *) wrptr) = L'\0';\r
-\r
-  if (iconv_close (cd) != 0)\r
-    perror ("iconv_close");\r
-\r
-  return (wchar_t *) wrptr - outbuf;\r
-@}\r
-@end smallexample\r
-\r
-@cindex stateful\r
-This example shows the most important aspects of using the @code{iconv}\r
-functions.  It shows how successive calls to @code{iconv} can be used to\r
-convert large amounts of text.  The user does not have to care about\r
-stateful encodings as the functions take care of everything.\r
-\r
-An interesting point is the case where @code{iconv} returns an error and\r
-@code{errno} is set to @code{EINVAL}. This is not really an error in the \r
-transformation. It can happen whenever the input character set contains \r
-byte sequences of more than one byte for some character and texts are not \r
-processed in one piece. In this case there is a chance that a multibyte \r
-sequence is cut. The caller can then simply read the remainder of the \r
-text and feed the offending bytes together with new characters from the \r
-input to @code{iconv} and continue the work. The internal state kept in \r
-the descriptor is @emph{not} unspecified after such an event as is the \r
-case with the conversion functions from the @w{ISO C} standard.\r
-\r
-The example also shows the problem of using wide character strings with\r
-@code{iconv}. As explained in the description of the @code{iconv}\r
-function above, the function always takes a pointer to a @code{char}\r
-array and the available space is measured in bytes. In the example, the\r
-output buffer is a wide character buffer; therefore, we use a local\r
-variable @var{wrptr} of type @code{char *}, which is used in the\r
-@code{iconv} calls.\r
-\r
-This looks rather innocent but can lead to problems on platforms that\r
-have tight restrictions on alignment. Therefore the caller of @code{iconv} \r
-has to make sure that the pointers passed are suitable for access of \r
-characters from the appropriate character set. Since, in the\r
-above case, the input parameter to the function is a @code{wchar_t}\r
-pointer, this is the case (unless the user violates alignment when\r
-computing the parameter). But in other situations, especially when\r
-writing generic functions where one does not know what type of character\r
-set one uses and, therefore, treats text as a sequence of bytes, it might\r
-become tricky.\r
-\r
-@node Other iconv Implementations\r
-@subsection Some Details about other @code{iconv} Implementations\r
-\r
-This is not really the place to discuss the @code{iconv} implementation\r
-of other systems but it is necessary to know a bit about them to write\r
-portable programs.  The above mentioned problems with the specification\r
-of the @code{iconv} functions can lead to portability issues.\r
-\r
-The first thing to notice is that, due to the large number of character\r
-sets in use, it is certainly not practical to encode the conversions\r
-directly in the C library. Therefore, the conversion information must\r
-come from files outside the C library. This is usually done in one or\r
-both of the following ways:\r
-\r
-@itemize @bullet\r
-@item\r
-The C library contains a set of generic conversion functions which can\r
-read the needed conversion tables and other information from data files.\r
-These files get loaded when necessary.\r
-\r
-This solution is problematic as it requires a great deal of effort to\r
-apply to all character sets (potentially an infinite set). The \r
-differences in the structure of the different character sets are so large\r
-that many different variants of the table-processing functions must be\r
-developed. In addition, the generic nature of these functions makes them \r
-slower than specifically implemented functions.\r
-\r
-@item\r
-The C library only contains a framework which can dynamically load\r
-object files and execute the conversion functions contained therein.\r
-\r
-This solution provides much more flexibility. The C library itself\r
-contains only very little code and therefore reduces the general memory\r
-footprint. Also, with a documented interface between the C library and\r
-the loadable modules it is possible for third parties to extend the set\r
-of available conversion modules. A drawback of this solution is that\r
-dynamic loading must be available.\r
-@end itemize\r
-\r
-Some implementations in commercial Unices implement a mixture of these \r
-possibilities; the majority implement only the second solution. Using \r
-loadable modules moves the code out of the library itself and keeps \r
-the door open for extensions and improvements, but this design is also\r
-limiting on some platforms since not many platforms support dynamic\r
-loading in statically linked programs. On platforms without this\r
-capability it is therefore not possible to use this interface in\r
-statically linked programs. The GNU C library has, on ELF platforms, no\r
-problems with dynamic loading in these situations; therefore, this\r
-point is moot. The danger is that one gets accustomed to this situation \r
-and forgets about the restrictions on other systems.\r
-\r
-A second thing to know about other @code{iconv} implementations is that\r
-the number of available conversions is often very limited. Some\r
-implementations provide, in the standard release (not special\r
-international or developer releases), at most 100 to 200 conversion\r
-possibilities. This does not mean 200 different character sets are\r
-supported; for example, conversions from one character set to a set of 10 \r
-others might count as 10 conversions. Together with the other direction\r
-this makes 20 conversion possibilities used up by one character set. One \r
-can imagine the thin coverage these platforms provide. Some Unix vendors \r
-even provide only a handful of conversions which renders them useless for \r
-almost all uses.\r
-\r
-This directly leads to a third and probably the most problematic point.\r
-With the way the @code{iconv} conversion functions are implemented on all\r
-known Unix systems, the availability of the conversion functions from\r
-character set @math{@cal{A}} to @math{@cal{B}} and the conversion from\r
-@math{@cal{B}} to @math{@cal{C}} does @emph{not} imply that the\r
-conversion from @math{@cal{A}} to @math{@cal{C}} is available.\r
-\r
-This might not seem unreasonable or problematic at first, but it is\r
-quite a big problem, as one will notice shortly after hitting it.  To show\r
-the problem, assume we are writing a program which has to convert from\r
-@math{@cal{A}} to @math{@cal{C}}. A call like\r
-\r
-@smallexample\r
-cd = iconv_open ("@math{@cal{C}}", "@math{@cal{A}}");\r
-@end smallexample\r
-\r
-@noindent\r
-fails according to the assumption above. But what does the program\r
-do now?  The conversion is necessary; therefore, simply giving up is not\r
-an option.\r
-\r
-This is a nuisance.  The @code{iconv} function should take care of this.\r
-But how should the program proceed from here on?  If it tries to convert \r
-to character set @math{@cal{B}} first, the two @code{iconv_open}\r
-calls\r
-\r
-@smallexample\r
-cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}");\r
-@end smallexample\r
-\r
-@noindent\r
-and\r
-\r
-@smallexample\r
-cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}");\r
-@end smallexample\r
-\r
-@noindent\r
-will succeed, but how to find @math{@cal{B}}?\r
-\r
-Unfortunately, the answer is: there is no general solution.  On some\r
-systems guessing might help. On those systems most character sets can\r
-convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Besides \r
-this, only some very system-specific methods can help. Since the \r
-conversion functions come from loadable modules and these modules must\r
-be stored somewhere in the filesystem, one @emph{could} try to find them\r
-and determine from the available files which conversions are available\r
-and whether there is an indirect route from @math{@cal{A}} to\r
-@math{@cal{C}}.\r
-\r
-This example shows one of the design errors of @code{iconv} mentioned \r
-above. It should at least be possible to determine the list of available\r
-conversions programmatically so that if @code{iconv_open} says there is no \r
-such conversion, one could make sure this also is true for indirect\r
-routes.\r
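The guessing strategy described above can be sketched as follows. The helper name @code{open_route} and the idea of passing a single candidate intermediate set (for example @code{"UTF-8"}) are assumptions made for illustration; nothing guarantees that a given system provides conversions to and from that set.

```c
#include <iconv.h>

/* Hypothetical helper: try to open a direct conversion.  If that
   fails, try a two-step route through PIVOT.  Returns the number of
   descriptors opened (1 or 2), or 0 if no route was found.  With a
   result of 2 the caller converts with CD[0] first, then CD[1].  */
int
open_route (const char *tocode, const char *fromcode,
            const char *pivot, iconv_t cd[2])
{
  /* First choice: the direct conversion.  */
  cd[0] = iconv_open (tocode, fromcode);
  if (cd[0] != (iconv_t) -1)
    return 1;

  /* Fallback: source to the intermediate set ...  */
  cd[0] = iconv_open (pivot, fromcode);
  if (cd[0] == (iconv_t) -1)
    return 0;

  /* ... and the intermediate set to the destination.  */
  cd[1] = iconv_open (tocode, pivot);
  if (cd[1] == (iconv_t) -1)
    {
      iconv_close (cd[0]);
      return 0;
    }
  return 2;
}
```

A result of 0 only means that this particular guess failed. As the text notes, there is no portable way to enumerate the available conversions, so some other intermediate set might still work.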
-\r
-@node glibc iconv Implementation\r
-@subsection The @code{iconv} Implementation in the GNU C library\r
-\r
-After reading about the problems of @code{iconv} implementations in the\r
-last section it is certainly good to note that the implementation in\r
-the GNU C library has none of the problems mentioned above.  What\r
-follows is a step-by-step analysis of the points raised above.  The\r
-evaluation is based on the current state of the development (as of\r
-January 1999).  The development of the @code{iconv} functions is not\r
-complete, but basic functionality has solidified.\r
-\r
-The GNU C library's @code{iconv} implementation uses shared loadable\r
-modules to implement the conversions.  A very small number of\r
-conversions are built into the library itself but these are only rather\r
-trivial conversions.\r
-\r
-All the benefits of loadable modules are available in the GNU C library\r
-implementation.  This is especially appealing since the interface is\r
-well documented (see below), and it, therefore, is easy to write new\r
-conversion modules.  The drawback of using loadable objects is not a\r
-problem in the GNU C library, at least on ELF systems.  Since the\r
-library is able to load shared objects even in statically linked\r
-binaries, static linking need not be forbidden in case one wants to use \r
-@code{iconv}.\r
-\r
-The second mentioned problem is the number of supported conversions.\r
-Currently, the GNU C library supports more than 150 character sets.  The\r
-way the implementation is designed the number of supported conversions\r
-is greater than 22350 (@math{150} times @math{149}).  If any conversion\r
-from or to a character set is missing, it can be added easily.\r
-\r
-Particularly impressive as it may be, this high number is due to the\r
-fact that the GNU C library implementation of @code{iconv} does not have\r
-the third problem mentioned above (i.e., whenever there is a conversion\r
-from a character set @math{@cal{A}} to @math{@cal{B}} and from\r
-@math{@cal{B}} to @math{@cal{C}} it is always possible to convert from\r
-@math{@cal{A}} to @math{@cal{C}} directly).  If @code{iconv_open}\r
-returns an error and sets @code{errno} to @code{EINVAL}, there is no\r
-known way, directly or indirectly, to perform the wanted conversion.\r
-\r
-@cindex triangulation\r
-Triangulation is achieved by providing for each character set a \r
-conversion from and to UCS-4 encoded @w{ISO 10646}.  Using @w{ISO 10646} \r
-as an intermediate representation it is possible to @dfn{triangulate}\r
-(i.e., convert with an intermediate representation).\r
-\r
-There is no inherent requirement to provide a conversion to @w{ISO\r
-10646} for a new character set, and it is also possible to provide other\r
-conversions where neither source nor destination character set is @w{ISO\r
-10646}.  The existing set of conversions is simply meant to cover all \r
-conversions that might be of interest.\r
-\r
-@cindex ISO-2022-JP\r
-@cindex EUC-JP\r
-All currently available conversions use the triangulation method above,\r
-making conversions run unnecessarily slowly.  If, for example, somebody\r
-often needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution\r
-would involve direct conversion between the two character sets, skipping\r
-the intermediate conversion to @w{ISO 10646}.  The two character sets of\r
-interest are much more similar to each other than to @w{ISO 10646}.\r
-\r
-In such a situation one can easily write a new conversion and provide it\r
-as a better alternative. The GNU C library @code{iconv} implementation\r
-would automatically use the module implementing the conversion if it is\r
-specified to be more efficient.\r
-\r
-@subsubsection Format of @file{gconv-modules} files\r
-\r
-All information about the available conversions comes from a file named\r
-@file{gconv-modules} which can be found in any of the directories along\r
-the @code{GCONV_PATH}. The @file{gconv-modules} files are line-oriented\r
-text files, where each of the lines has one of the following formats:\r
-\r
-@itemize @bullet\r
-@item\r
-If the first non-whitespace character is a @kbd{#} the line contains only \r
-comments and is ignored.\r
-\r
-@item\r
-Lines starting with @code{alias} define an alias name for a character \r
-set. Two more words are expected on the line.  The first word \r
-defines the alias name, and the second defines the original name of the\r
-character set. The effect is that it is possible to use the alias name\r
-in the @var{fromset} or @var{toset} parameters of @code{iconv_open} and\r
-achieve the same result as when using the real character set name.\r
-\r
-This is quite important as a character set often has many different\r
-names.  There is normally an official name but this need not correspond to\r
-the most popular name.  Besides this, many character sets have special\r
-names that are somehow constructed.  For example, all character sets\r
-specified by the ISO have an alias of the form @code{ISO-IR-@var{nnn}} \r
-where @var{nnn} is the registration number. This allows programs which \r
-know about the registration number to construct character set names and \r
-use them in @code{iconv_open} calls. More on the available names and \r
-aliases follows below.\r
-\r
-@item\r
-Lines starting with @code{module} introduce an available conversion\r
-module. These lines must contain three or four more words.\r
-\r
-The first word specifies the source character set, the second word the\r
-destination character set of the conversion implemented in this module, and\r
-the third word is the name of the loadable module.  The filename is\r
-constructed by appending the usual shared object suffix (normally\r
-@file{.so}) and this file is then supposed to be found in the same\r
-directory the @file{gconv-modules} file is in. The last word on the line, \r
-which is optional, is a numeric value representing the cost of the\r
-conversion. If this word is missing, a cost of @math{1} is assumed. The\r
-numeric value itself does not matter that much; what counts are the\r
-relative values of the sums of costs for all possible conversion paths.\r
-Below is a more precise description of the use of the cost value.\r
-@end itemize\r
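-\r
-As a rough illustration of the three line formats just described, a\r
-@file{gconv-modules} file might contain lines such as the following.\r
-The names @code{MY-CHARSET}, @code{MYSET}, and @code{MYCHARSET} are\r
-hypothetical and chosen only for this sketch; they do not refer to any\r
-module shipped with the library.\r
-\r
-@smallexample\r
-# Hypothetical entries, for illustration only.\r
-alias   MYSET//         MY-CHARSET//\r
-module  MY-CHARSET//    INTERNAL        MYCHARSET    1\r
-module  INTERNAL        MY-CHARSET//    MYCHARSET    1\r
-@end smallexample\r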
-\r
-Return to the example above, where one has written a module to convert\r
-directly from ISO-2022-JP to EUC-JP and back.  All that has to be done is\r
-to put the new module, let its name be @file{ISO2022JP-EUCJP.so}, in a\r
-directory and add a file @file{gconv-modules} with the following content\r
-in the same directory:\r
-\r
-@smallexample\r
-module  ISO-2022-JP//   EUC-JP//        ISO2022JP-EUCJP    1\r
-module  EUC-JP//        ISO-2022-JP//   ISO2022JP-EUCJP    1\r
-@end smallexample\r
-\r
-To see why this is sufficient, it is necessary to understand how the\r
-conversion used by @code{iconv} (and described in the descriptor) is\r
-selected. The approach to this problem is quite simple.\r
-\r
-At the first call of the @code{iconv_open} function the program reads\r
-all available @file{gconv-modules} files and builds up two tables: one\r
-containing all the known aliases and another that contains the\r
-information about the conversions and which shared object implements\r
-them.\r
-\r
-@subsubsection Finding the conversion path in @code{iconv}\r
-\r
-The set of available conversions forms a directed graph with weighted\r
-edges.  The weights on the edges are the costs specified in the\r
-@file{gconv-modules} files.  The @code{iconv_open} function uses an\r
-algorithm suitable for finding the best path in such a graph and so\r
-constructs a list of conversions which must be performed in succession\r
-to get the transformation from the source to the destination character\r
-set.\r
-\r
-Explaining why the above @file{gconv-modules} file allows the\r
-@code{iconv} implementation to select the specific ISO-2022-JP to\r
-EUC-JP conversion module instead of the conversion coming with the\r
-library itself is straightforward.  Since the latter conversion takes two\r
-steps (from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to\r
-EUC-JP), the cost is @math{1+1 = 2}.  The above @file{gconv-modules}\r
-file, however, specifies that the new conversion module can perform this\r
-conversion at a cost of only @math{1}.\r
-\r
-One mysterious aspect of the @file{gconv-modules} file above (and also\r
-the file coming with the GNU C library) is the naming of the character\r
-sets specified in the @code{module} lines.  Why do almost all the names\r
-end in @code{//}?  And this is not all: the names can actually be\r
-regular expressions.  At this point in time this mystery should not be\r
-revealed, unless you have the relevant spell-casting materials: ashes\r
-from an original @w{DOS 6.2} boot disk burnt in effigy, a crucifix\r
-blessed by St.@: Emacs, assorted herbal roots from Central America, sand\r
-from Cebu, etc.  Sorry!  @strong{The part of the implementation where\r
-this is used is not yet finished.  For now please simply follow the\r
-existing examples.  It'll become clearer once it is. --drepper}\r
-\r
-A last remark about the @file{gconv-modules} file concerns the names not\r
-ending with @code{//}.  A character set named @code{INTERNAL} is often\r
-mentioned.  From the discussion above and the chosen name it should have\r
-become clear that this is the name for the representation used in the\r
-intermediate step of the triangulation.  We have said that this is UCS-4\r
-but actually that is not quite right.  The UCS-4 specification also\r
-includes the specification of the byte ordering used.  Since a UCS-4 value\r
-consists of four bytes, a stored value is affected by byte ordering.  The\r
-internal representation is @emph{not} the same as UCS-4 in case the byte\r
-ordering of the processor (or at least the running process) is not the\r
-same as the one required for UCS-4.  This is done for performance reasons\r
-as one does not want to perform unnecessary byte-swapping operations if\r
-one is not interested in actually seeing the result in UCS-4.  To avoid\r
-trouble with endianness, the internal representation consistently is named\r
-@code{INTERNAL} even on big-endian systems where the representations are\r
-identical.\r
-\r
-@subsubsection @code{iconv} module data structures\r
-\r
-So far this section has described how modules are located and considered \r
-to be used. What remains to be described is the interface of the modules\r
-so that one can write new ones. This section describes the interface as\r
-it is in use in January 1999. The interface will change a bit in the \r
-future but, with luck, only in an upwardly compatible way.\r
-\r
-The definitions necessary to write new modules are publicly available\r
-in the non-standard header @file{gconv.h}.  The following text,\r
-therefore, describes the definitions from this header file.  First, \r
-however, it is necessary to get an overview.\r
-\r
-From the perspective of the user of @code{iconv} the interface is quite\r
-simple: the @code{iconv_open} function returns a handle that can be used \r
-in calls to @code{iconv}, and finally the handle is freed with a call to \r
-@code{iconv_close}. The problem is that the handle has to be able to\r
-represent the possibly long sequences of conversion steps and also the\r
-state of each conversion since the handle is all that is passed to the\r
-@code{iconv} function.  Therefore, the data structures are really the\r
-key to understanding the implementation.\r
-\r
-We need two different kinds of data structures. The first describes the\r
-conversion and the second describes the state etc. There are really two\r
-type definitions like this in @file{gconv.h}.\r
-@pindex gconv.h\r
-\r
-@comment gconv.h\r
-@comment GNU\r
-@deftp {Data type} {struct __gconv_step}\r
-This data structure describes one conversion a module can perform.  For\r
-each function in a loaded module with conversion functions there is\r
-exactly one object of this type.  This object is shared by all users of\r
-the conversion (i.e., this object does not contain any information\r
-corresponding to an actual conversion; it only describes the conversion\r
-itself).\r
-\r
-@table @code\r
-@item struct __gconv_loaded_object *__shlib_handle\r
-@itemx const char *__modname\r
-@itemx int __counter\r
-All these elements of the structure are used internally in the C library\r
-to coordinate loading and unloading the shared object.  One must not\r
-expect any of the other elements to be available or initialized.\r
-\r
-@item const char *__from_name\r
-@itemx const char *__to_name\r
-@code{__from_name} and @code{__to_name} contain the names of the source and\r
-destination character sets. They can be used to identify the actual\r
-conversion to be carried out since one module might implement conversions \r
-for more than one character set and/or direction.\r
-\r
-@item gconv_fct __fct\r
-@itemx gconv_init_fct __init_fct\r
-@itemx gconv_end_fct __end_fct\r
-These elements contain pointers to the functions in the loadable module.\r
-The interface will be explained below.\r
-\r
-@item int __min_needed_from\r
-@itemx int __max_needed_from\r
-@itemx int __min_needed_to\r
-@itemx int __max_needed_to\r
-These values have to be supplied in the init function of the module.  The\r
-@code{__min_needed_from} value specifies the minimum number of bytes a\r
-character of the source character set needs.  The @code{__max_needed_from}\r
-value specifies the maximum, which also includes possible shift sequences.\r
-\r
-The @code{__min_needed_to} and @code{__max_needed_to} values serve the\r
-same purpose as @code{__min_needed_from} and @code{__max_needed_from} but \r
-this time for the destination character set.\r
-\r
-It is crucial that these values be accurate since otherwise the\r
-conversion functions will have problems or not work at all.\r
-\r
-@item int __stateful\r
-This element must also be initialized by the init function.\r
-@code{__stateful} is nonzero if the source character set is stateful.\r
-Otherwise it is zero.\r
-\r
-@item void *__data\r
-This element can be used freely by the conversion functions in the\r
-module.  @code{__data} can be used to communicate extra information\r
-from one call to another.  It need not be initialized if it is not\r
-needed at all.  If the @code{__data} element is assigned a pointer to\r
-dynamically allocated memory (presumably in the init function), the end\r
-function must deallocate that memory.  Otherwise the application will\r
-leak memory.\r
-\r
-It is important to be aware that this data structure is shared by all\r
-users of this specific conversion and therefore the @code{__data}\r
-element must not contain data specific to one particular use of the\r
-conversion function.\r
-@end table\r
-@end deftp\r
-\r
-@comment gconv.h\r
-@comment GNU\r
-@deftp {Data type} {struct __gconv_step_data}\r
-This is the data structure that contains the information specific to\r
-each use of the conversion functions.\r
-\r
-\r
-@table @code\r
-@item char *__outbuf\r
-@itemx char *__outbufend\r
-These elements specify the output buffer for the conversion step. The\r
-@code{__outbuf} element points to the beginning of the buffer, and\r
-@code{__outbufend} points to the byte following the last byte in the\r
-buffer. The conversion function must not assume anything about the size\r
-of the buffer but it can be safely assumed that there is room for at\r
-least one complete character in the output buffer.\r
-\r
-Once the conversion is finished, if the conversion is the last step, the\r
-@code{__outbuf} element must be modified to point after the last byte\r
-written into the buffer to signal how much output is available. If this\r
-conversion step is not the last one, the element must not be modified.\r
-The @code{__outbufend} element must not be modified.\r
-\r
-@item int __is_last\r
-This element is nonzero if this conversion step is the last one. This\r
-information is necessary for the recursion.  See the description of the\r
-conversion function internals below.  This element must never be\r
-modified.\r
-\r
-@item int __invocation_counter\r
-The conversion function can use this element to see how many calls of \r
-the conversion function already happened. Some character sets require a \r
-certain prolog when generating output, and by comparing this value with\r
-zero, one can find out whether it is the first call and whether, \r
-therefore, the prolog should be emitted. This element must never be \r
-modified.\r
-\r
-@item int __internal_use\r
-This element is another one rarely used but needed in certain\r
-situations. It is assigned a nonzero value in case the conversion\r
-functions are used to implement @code{mbsrtowcs} et al.@: (i.e., the\r
-function is not used directly through the @code{iconv} interface).\r
-\r
-This sometimes makes a difference as it is expected that the\r
-@code{iconv} functions are used to translate entire texts while the\r
-@code{mbsrtowcs} functions are normally used only to convert single\r
-strings and might be used multiple times to convert entire texts.\r
-\r
-But in this situation we would have a problem complying with some rules of\r
-the character set specification.  Some character sets require a prolog\r
-which must appear exactly once for an entire text.  If a number of\r
-@code{mbsrtowcs} calls are used to convert the text, only the first call\r
-must add the prolog.  However, because there is no communication between the\r
-different calls of @code{mbsrtowcs}, the conversion functions have no\r
-way to find this out.  The situation is different for sequences\r
-of @code{iconv} calls since the handle allows access to the needed\r
-information.\r
-\r
-The @code{__internal_use} element is mostly used together with\r
-@code{__invocation_counter} as follows:\r
-\r
-@smallexample\r
-if (!data->__internal_use\r
-     && data->__invocation_counter == 0)\r
-  /* @r{Emit prolog.}  */\r
-  ...\r
-@end smallexample\r
-\r
-This element must never be modified.\r
-\r
-@item mbstate_t *__statep\r
-The @code{__statep} element points to an object of type @code{mbstate_t}\r
-(@pxref{Keeping the state}). The conversion of a stateful character\r
-set must use the object pointed to by @code{__statep} to store \r
-information about the conversion state. The @code{__statep} element \r
-itself must never be modified.\r
-\r
-@item mbstate_t __state\r
-This element must @emph{never} be used directly.  It is only part of\r
-this structure to have the needed space allocated.\r
-@end table\r
-@end deftp\r
-\r
-@subsubsection @code{iconv} module interfaces\r
-\r
-With the knowledge about the data structures we now can describe the\r
-conversion function itself. To understand the interface a bit of\r
-knowledge is necessary about the functionality in the C library that \r
-loads the objects with the conversions.\r
-\r
-It is often the case that one conversion is used more than once (i.e.,\r
-there are several @code{iconv_open} calls for the same set of character\r
-sets during one program run).  The @code{mbsrtowcs} et al.@: functions in\r
-the GNU C library also use the @code{iconv} functionality, which \r
-increases the number of uses of the same functions even more.\r
-\r
-Because of this multiple use of conversions, the modules do not get \r
-loaded exclusively for one conversion. Instead a module once loaded can \r
-be used by an arbitrary number of @code{iconv} or @code{mbsrtowcs} calls \r
-at the same time. The splitting of the information between conversion-\r
-function-specific information and conversion data makes this possible. \r
-The last section showed the two data structures used to do this.\r
-\r
-This is of course also reflected in the interface and semantics of the\r
-functions that the modules must provide. There are three functions that\r
-must have the following names:\r
-\r
-@table @code\r
-@item gconv_init\r
-The @code{gconv_init} function initializes the conversion-function-specific\r
-data structure.  This very same object is shared by all\r
-conversions that use this conversion and, therefore, no state information\r
-about the conversion itself must be stored in here. If a module \r
-implements more than one conversion, the @code{gconv_init} function will \r
-be called multiple times.\r
-\r
-@item gconv_end\r
-The @code{gconv_end} function is responsible for freeing all resources\r
-allocated by the @code{gconv_init} function. If there is nothing to do,\r
-this function can be missing. Special care must be taken if the module\r
-implements more than one conversion and the @code{gconv_init} function\r
-does not allocate the same resources for all conversions.\r
-\r
-@item gconv\r
-This is the actual conversion function. It is called to convert one\r
-block of text. It gets passed the conversion step information\r
-initialized by @code{gconv_init} and the conversion data, specific to\r
-this use of the conversion functions.\r
-@end table\r
-\r
-There are three data types defined for the three module interface\r
-functions and these define the interface.\r
-\r
-@comment gconv.h\r
-@comment GNU\r
-@deftypevr {Data type} int {(*__gconv_init_fct)} (struct __gconv_step *)\r
-This specifies the interface of the initialization function of the\r
-module. It is called exactly once for each conversion the module\r
-implements.\r
-\r
-As explained in the description of the @code{struct __gconv_step} data\r
-structure above the initialization function has to initialize parts of\r
-it.\r
-\r
-@table @code\r
-@item __min_needed_from\r
-@itemx __max_needed_from\r
-@itemx __min_needed_to\r
-@itemx __max_needed_to\r
-These elements must be initialized to the exact numbers of the minimum\r
-and maximum number of bytes used by one character in the source and\r
-destination character sets, respectively. If the characters all have the\r
-same size, the minimum and maximum values are the same.\r
-\r
-@item __stateful\r
-This element must be initialized to a nonzero value if the source\r
-character set is stateful. Otherwise it must be zero.\r
-@end table\r
-\r
-If the initialization function needs to communicate some information\r
-to the conversion function, this communication can happen using the \r
-@code{__data} element of the @code{__gconv_step} structure. But since \r
-this data is shared by all the conversions, it must not be modified by \r
-the conversion function. The example below shows how this can be used.\r
-\r
-@smallexample\r
-#define MIN_NEEDED_FROM         1\r
-#define MAX_NEEDED_FROM         4\r
-#define MIN_NEEDED_TO           4\r
-#define MAX_NEEDED_TO           4\r
-\r
-int\r
-gconv_init (struct __gconv_step *step)\r
-@{\r
-  /* @r{Determine which direction.}  */\r
-  struct iso2022jp_data *new_data;\r
-  enum direction dir = illegal_dir;\r
-  enum variant var = illegal_var;\r
-  int result;\r
-\r
-  if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0)\r
-    @{\r
-      dir = from_iso2022jp;\r
-      var = iso2022jp;\r
-    @}\r
-  else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0)\r
-    @{\r
-      dir = to_iso2022jp;\r
-      var = iso2022jp;\r
-    @}\r
-  else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0)\r
-    @{\r
-      dir = from_iso2022jp;\r
-      var = iso2022jp2;\r
-    @}\r
-  else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0)\r
-    @{\r
-      dir = to_iso2022jp;\r
-      var = iso2022jp2;\r
-    @}\r
-\r
-  result = __GCONV_NOCONV;\r
-  if (dir != illegal_dir)\r
-    @{\r
-      new_data = (struct iso2022jp_data *)\r
-        malloc (sizeof (struct iso2022jp_data));\r
-\r
-      result = __GCONV_NOMEM;\r
-      if (new_data != NULL)\r
-        @{\r
-          new_data->dir = dir;\r
-          new_data->var = var;\r
-          step->__data = new_data;\r
-\r
-          if (dir == from_iso2022jp)\r
-            @{\r
-              step->__min_needed_from = MIN_NEEDED_FROM;\r
-              step->__max_needed_from = MAX_NEEDED_FROM;\r
-              step->__min_needed_to = MIN_NEEDED_TO;\r
-              step->__max_needed_to = MAX_NEEDED_TO;\r
-            @}\r
-          else\r
-            @{\r
-              step->__min_needed_from = MIN_NEEDED_TO;\r
-              step->__max_needed_from = MAX_NEEDED_TO;\r
-              step->__min_needed_to = MIN_NEEDED_FROM;\r
-              step->__max_needed_to = MAX_NEEDED_FROM + 2;\r
-            @}\r
-\r
-          /* @r{Yes, this is a stateful encoding.}  */\r
-          step->__stateful = 1;\r
-\r
-          result = __GCONV_OK;\r
-        @}\r
-    @}\r
-\r
-  return result;\r
-@}\r
-@end smallexample\r
-\r
-The function first checks which conversion is wanted. The module from\r
-which this function is taken implements four different conversions; \r
-which one is selected can be determined by comparing the names. The\r
-comparison should always be done without paying attention to the case.\r
-\r
-Next, a data structure, which contains the necessary information about \r
-which conversion is selected, is allocated. The data structure\r
-@code{struct iso2022jp_data} is locally defined since, outside the \r
-module, this data is not used at all.  Please note that if all four\r
-conversions this module supports are requested there are four data\r
-blocks.\r
-\r
-One interesting thing is the initialization of the @code{__min_} and\r
-@code{__max_} elements of the step data object. A single ISO-2022-JP\r
-character can consist of one to four bytes. Therefore the\r
-@code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined\r
-this way. The output is always the @code{INTERNAL} character set (aka\r
-UCS-4) and therefore each character consists of exactly four bytes. For\r
-the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into\r
-account that escape sequences might be necessary to switch the character\r
-sets.  Therefore the @code{__max_needed_to} element for this direction\r
-gets assigned @code{MAX_NEEDED_FROM + 2}.  This takes into account the\r
-two bytes needed for the escape sequences to signal the switching.  The\r
-asymmetry in the maximum values for the two directions can be explained\r
-easily: when reading ISO-2022-JP text, escape sequences can be handled\r
-alone (i.e., it is not necessary to process a real character since the\r
-effect of the escape sequence can be recorded in the state information).\r
-The situation is different for the other direction. Since it is in\r
-general not known which character comes next, one cannot emit escape\r
-sequences to change the state in advance.  This means the escape\r
-sequences have to be emitted together with the next character.\r
-Therefore one needs more room than only for the character itself.\r
-\r
-The possible return values of the initialization function are:\r
-\r
-@table @code\r
-@item __GCONV_OK\r
-The initialization succeeded.\r
-@item __GCONV_NOCONV\r
-The requested conversion is not supported in the module.  This can\r
-happen if the @file{gconv-modules} file has errors.\r
-@item __GCONV_NOMEM\r
-Memory required to store additional information could not be allocated.\r
-@end table\r
-@end deftypevr\r
-\r
-The function called before the module is unloaded is significantly\r
-simpler.  It often has nothing at all to do, in which case it can be left\r
-out completely.\r
-\r
-@comment gconv.h\r
-@comment GNU\r
-@deftypevr {Data type} void {(*__gconv_end_fct)} (struct __gconv_step *)\r
-The task of this function is to free all resources allocated in the\r
-initialization function. Therefore only the @code{__data} element of\r
-the object pointed to by the argument is of interest. Continuing the\r
-example from the initialization function, the finalization function\r
-looks like this:\r
-\r
-@smallexample\r
-void\r
-gconv_end (struct __gconv_step *data)\r
-@{\r
-  free (data->__data);\r
-@}\r
-@end smallexample\r
-@end deftypevr\r
-\r
-The most important function is the conversion function itself, which can\r
-get quite complicated for complex character sets. But since this is not\r
-of interest here, we will only describe a possible skeleton for the\r
-conversion function.\r
-\r
-@comment gconv.h\r
-@comment GNU\r
-@deftypevr {Data type} int {(*__gconv_fct)} (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int)\r
-The conversion function can be called for two basic reasons: to convert\r
-text or to reset the state. From the description of the @code{iconv}\r
-function it can be seen why the flushing mode is necessary. What mode\r
-is selected is determined by the sixth argument, an integer.  This \r
-argument being nonzero means that flushing is selected.\r
-\r
-Common to both modes is where the output buffer can be found. The\r
-information about this buffer is stored in the conversion step data. A\r
-pointer to this information is passed as the second argument to this \r
-function. The description of the @code{struct __gconv_step_data} \r
-structure has more information on the conversion step data.\r
-\r
-@cindex stateful\r
-What has to be done for flushing depends on the source character set.\r
-If the source character set is not stateful, nothing has to be done. \r
-Otherwise the function has to emit a byte sequence to bring the state \r
-object into the initial state.  Once this has all happened the other\r
-conversion modules in the chain of conversions have to get the same\r
-chance.  Whether another step follows can be determined from the\r
-@code{__is_last} element of the step data structure to which the second\r
-parameter points.\r
-\r
-The more interesting mode is when actual text has to be converted. The \r
-first step in this case is to convert as much text as possible from the \r
-input buffer and store the result in the output buffer. The start of the \r
-input buffer is determined by the third argument which is a pointer to a \r
-pointer variable referencing the beginning of the buffer. The fourth \r
-argument is a pointer to the byte right after the last byte in the buffer.\r
-\r
-The conversion has to be performed according to the current state if the\r
-character set is stateful. The state is stored in an object pointed to\r
-by the @code{__statep} element of the step data (second argument). Once\r
-either the input buffer is empty or the output buffer is full the\r
-conversion stops. At this point, the pointer variable referenced by the\r
-third parameter must point to the byte following the last processed\r
-byte (i.e., if all of the input is consumed, this pointer and the fourth\r
-parameter have the same value).\r
-\r
-What now happens depends on whether this step is the last one. If it is \r
-the last step, the only thing that has to be done is to update the \r
-@code{__outbuf} element of the step data structure to point after the\r
-last written byte. This update gives the caller the information on how \r
-much text is available in the output buffer. In addition, the variable\r
-pointed to by the fifth parameter, which is of type @code{size_t}, must\r
-be incremented by the number of characters (@emph{not bytes}) that were\r
-converted in a non-reversible way. Then, the function can return.\r
-\r
-In case the step is not the last one, the later conversion functions have\r
-to get a chance to do their work. Therefore, the appropriate conversion\r
-function has to be called. The information about the functions is\r
-stored in the conversion data structures, passed as the first parameter.\r
-This information and the step data are stored in arrays, so the next\r
-element in both cases can be found by simple pointer arithmetic:\r
-\r
-@smallexample\r
-int\r
-gconv (struct __gconv_step *step, struct __gconv_step_data *data,\r
-       const char **inbuf, const char *inbufend, size_t *written,\r
-       int do_flush)\r
-@{\r
-  struct __gconv_step *next_step = step + 1;\r
-  struct __gconv_step_data *next_data = data + 1;\r
-  ...\r
-@end smallexample\r
-\r
-The @code{next_step} pointer references the next step information and\r
-@code{next_data} the next data record.  The call of the next function\r
-therefore will look similar to this:\r
-\r
-@smallexample\r
-  next_step->__fct (next_step, next_data, &outerr, outbuf,\r
-                    written, 0)\r
-@end smallexample\r
-\r
-But this is not yet all. Once the function call returns the conversion\r
-function might have some more to do. If the return value of the function \r
-is @code{__GCONV_EMPTY_INPUT}, more room is available in the output \r
-buffer.  Unless the input buffer is empty, the conversion functions start\r
-all over again and process the rest of the input buffer.  If the return\r
-value is not @code{__GCONV_EMPTY_INPUT}, something went wrong and we have \r
-to recover from this.\r
-\r
-A requirement for the conversion function is that the input buffer\r
-pointer (the third argument) always point to the last character that\r
-was put in converted form into the output buffer. This is trivially\r
-true after the conversion performed in the current step, but if the\r
-conversion functions deeper downstream stop prematurely, not all\r
-characters from the output buffer are consumed and, therefore, the input\r
-buffer pointers must be backed off to the right position.\r
-\r
-Correcting the input buffers is easy to do if the input and output \r
-character sets have a fixed width for all characters. In this situation \r
-we can compute how many characters are left in the output buffer and, \r
-therefore, can correct the input buffer pointer appropriately with a \r
-similar computation. Things are getting tricky if either character set \r
-has characters represented with variable length byte sequences, and it \r
-gets even more complicated if the conversion has to take care of the \r
-state. In these cases the conversion has to be performed once again, from \r
-the known state before the initial conversion (i.e., if necessary the \r
-state of the conversion has to be reset and the conversion loop has to be \r
-executed again). The difference now is that it is known how much input \r
-must be created, and the conversion can stop before converting the first \r
-unused character. Once this is done the input buffer pointers must be \r
-updated again and the function can return.\r
-\r
-One final thing should be mentioned. If it is necessary for the\r
-conversion to know whether it is the first invocation (in case a prolog\r
-has to be emitted), the conversion function should increment the \r
-@code{__invocation_counter} element of the step data structure just \r
-before returning to the caller. See the description of the @code{struct\r
-__gconv_step_data} structure above for more information on how this can\r
-be used.\r
-\r
-The return value must be one of the following values:\r
-\r
-@table @code\r
-@item __GCONV_EMPTY_INPUT\r
-All input was consumed and there is room left in the output buffer.\r
-@item __GCONV_FULL_OUTPUT\r
-No more room in the output buffer. In case this is not the last step\r
-this value is propagated down from the call of the next conversion\r
-function in the chain.\r
-@item __GCONV_INCOMPLETE_INPUT\r
-The input buffer is not entirely empty since it contains an incomplete\r
-character sequence.\r
-@end table\r
-\r
-The following example provides a framework for a conversion function.\r
-In case a new conversion has to be written the holes in this\r
-implementation have to be filled and that is it.\r
-\r
-@smallexample\r
-int\r
-gconv (struct __gconv_step *step, struct __gconv_step_data *data,\r
-       const char **inbuf, const char *inbufend, size_t *written,\r
-       int do_flush)\r
-@{\r
-  struct __gconv_step *next_step = step + 1;\r
-  struct __gconv_step_data *next_data = data + 1;\r
-  gconv_fct fct = next_step->__fct;\r
-  int status;\r
-\r
-  /* @r{If the function is called with no input this means we have}\r
-     @r{to reset to the initial state.  The possibly partly}\r
-     @r{converted input is dropped.}  */\r
-  if (do_flush)\r
-    @{\r
-      status = __GCONV_OK;\r
-\r
-      /* @r{Possible emit a byte sequence which put the state object}\r
-         @r{into the initial state.}  */\r
-\r
-      /* @r{Call the steps down the chain if there are any but only}\r
-         @r{if we successfully emitted the escape sequence.}  */\r
-      if (status == __GCONV_OK && ! data->__is_last)\r
-        status = fct (next_step, next_data, NULL, NULL,\r
-                      written, 1);\r
-    @}\r
-  else\r
-    @{\r
-      /* @r{We preserve the initial values of the pointer variables.}  */\r
-      const char *inptr = *inbuf;\r
-      char *outbuf = data->__outbuf;\r
-      char *outend = data->__outbufend;\r
-      char *outptr;\r
-\r
-      do\r
-        @{\r
-          /* @r{Remember the start value for this round.}  */\r
-          inptr = *inbuf;\r
-          /* @r{The outbuf buffer is empty.}  */\r
-          outptr = outbuf;\r
-\r
-          /* @r{For stateful encodings the state must be safe here.}  */\r
-\r
-          /* @r{Run the conversion loop.  @code{status} is set}\r
-             @r{appropriately afterwards.}  */\r
-\r
-          /* @r{If this is the last step, leave the loop. There is}\r
-             @r{nothing we can do.}  */\r
-          if (data->__is_last)\r
-            @{\r
-              /* @r{Store information about how many bytes are}\r
-                 @r{available.}  */\r
-              data->__outbuf = outbuf;\r
-\r
-             /* @r{If any non-reversible conversions were performed,}\r
-                @r{add the number to @code{*written}.}  */\r
-\r
-             break;\r
-           @}\r
-\r
-          /* @r{Write out all output which was produced.}  */\r
-          if (outbuf > outptr)\r
-            @{\r
-              const char *outerr = data->__outbuf;\r
-              int result;\r
-\r
-              result = fct (next_step, next_data, &outerr,\r
-                            outbuf, written, 0);\r
-\r
-              if (result != __GCONV_EMPTY_INPUT)\r
-                @{\r
-                  if (outerr != outbuf)\r
-                    @{\r
-                      /* @r{Reset the input buffer pointer.  We}\r
-                         @r{document here the complex case.}  */\r
-                      size_t nstatus;\r
-\r
-                      /* @r{Reload the pointers.}  */\r
-                      *inbuf = inptr;\r
-                      outbuf = outptr;\r
-\r
-                      /* @r{Possibly reset the state.}  */\r
-\r
-                      /* @r{Redo the conversion, but this time}\r
-                         @r{the end of the output buffer is at}\r
-                         @r{@code{outerr}.}  */\r
-                    @}\r
-\r
-                  /* @r{Change the status.}  */\r
-                  status = result;\r
-                @}\r
-              else\r
-                /* @r{All the output is consumed, we can make}\r
-                   @r{ another run if everything was ok.}  */\r
-                if (status == __GCONV_FULL_OUTPUT)\r
-                  status = __GCONV_OK;\r
-           @}\r
-        @}\r
-      while (status == __GCONV_OK);\r
-\r
-      /* @r{We finished one use of this step.}  */\r
-      ++data->__invocation_counter;\r
-    @}\r
-\r
-  return status;\r
-@}\r
-@end smallexample\r
-@end deftypevr\r
-\r
-This information should be sufficient to write new modules.  Anybody\r
-doing so should also take a look at the available source code in the GNU\r
-C library sources.  It contains many examples of working and optimized\r
-modules.\r
-\r
+@node Character Set Handling, Locales, String and Array Utilities, Top
+@c %MENU% Support for extended character sets
+@chapter Character Set Handling
+
+@ifnottex
+@macro cal{text}
+\text\
+@end macro
+@end ifnottex
+
+Character sets used in the early days of computing had only six, seven,
+or eight bits for each character: there was never a case where more than
+eight bits (one byte) were used to represent a single character.  The
+limitations of this approach became more apparent as more people
+grappled with non-Roman character sets, where not all the characters
+that make up a language's character set can be represented by @math{2^8}
+choices.  This chapter shows the functionality that was added to the C
+library to support multiple character sets.
+
+@menu
+* Extended Char Intro::              Introduction to Extended Characters.
+* Charset Function Overview::        Overview about Character Handling
+                                      Functions.
+* Restartable multibyte conversion:: Restartable multibyte conversion
+                                      Functions.
+* Non-reentrant Conversion::         Non-reentrant Conversion Function.
+* Generic Charset Conversion::       Generic Charset Conversion.
+@end menu
+
+
+@node Extended Char Intro
+@section Introduction to Extended Characters
+
+A variety of solutions is available to overcome the differences between
+character sets with a 1:1 relation between bytes and characters and
+character sets with ratios of 2:1 or 4:1.  The remainder of this
+section gives a few examples to help understand the design decisions
+made while developing the functionality of the @w{C library}.
+
+@cindex internal representation
+A distinction we have to make right away is between internal and
+external representation.  @dfn{Internal representation} means the
+representation used by a program while keeping the text in memory.
+External representations are used when text is stored or transmitted
+through some communication channel.  Examples of external
+representations include files waiting in a directory to be
+read and parsed.
+
+Traditionally there has been no difference between the two representations.
+It was equally comfortable and useful to use the same single-byte
+representation internally and externally.  This comfort level decreases
+with more and larger character sets.
+
+One of the problems to overcome with the internal representation is
+handling text that is externally encoded using different character
+sets.  Assume a program that reads two texts and compares them using
+some metric.  The comparison can be usefully done only if the texts are
+internally kept in a common format.
+
+@cindex wide character
+For such a common format (@math{=} character set) eight bits are certainly
+no longer enough.  So the smallest entity will have to grow: @dfn{wide
+characters} will now be used.  Instead of one byte per character, two or
+four will be used.  (Three are not good to address in memory and
+more than four bytes seem not to be necessary.)
+
+@cindex Unicode
+@cindex ISO 10646
+As shown in some other part of this manual,
+@c !!! Ahem, wide char string functions are not yet covered -- drepper
+a completely new family of functions has been created that can handle wide
+character texts in memory.  The most commonly used character sets for such
+internal wide character representations are Unicode and @w{ISO 10646}
+(also known as UCS for Universal Character Set).  Unicode was originally
+planned as a 16-bit character set, whereas @w{ISO 10646} was designed to
+be a 31-bit large code space.  The two standards are practically identical.
+They have the same character repertoire and code table, but Unicode specifies
+added semantics.  At the moment, only characters in the first @code{0x10000}
+code positions (the so-called Basic Multilingual Plane, BMP) have been
+assigned, but the assignment of more specialized characters outside this
+16-bit space is already in progress.  A number of encodings have been
+defined for Unicode and @w{ISO 10646} characters:
+@cindex UCS-2
+@cindex UCS-4
+@cindex UTF-8
+@cindex UTF-16
+UCS-2 is a 16-bit word that can only represent characters
+from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
+and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where
+ASCII characters are represented by ASCII bytes and non-ASCII characters
+by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
+of UCS-2 in which pairs of certain UCS-2 words can be used to encode
+non-BMP characters up to @code{0x10ffff}.
+
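+The surrogate mechanism of UTF-16 mentioned above can be sketched in a
+few lines.  This is only an illustration: the helper name
+@code{utf16_encode} is hypothetical and is not part of the C library.
+It splits a code point above the BMP into two 16-bit units and rejects
+values that cannot be encoded.
+
```c
#include <stdint.h>

/* Hypothetical helper, not a library function: encode the code point
   CP as UTF-16.  Returns the number of 16-bit units stored in OUT
   (1 or 2), or 0 if CP is a surrogate value or lies above 0x10ffff.  */
int
utf16_encode (uint32_t cp, uint16_t out[2])
{
  if (cp >= 0xd800 && cp <= 0xdfff)
    return 0;                   /* Reserved for surrogates.  */
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;   /* BMP character: one unit.  */
      return 1;
    }
  if (cp > 0x10ffff)
    return 0;                   /* Not reachable with UTF-16.  */

  cp -= 0x10000;                /* 20 bits remain.  */
  out[0] = (uint16_t) (0xd800 + (cp >> 10));    /* High surrogate.  */
  out[1] = (uint16_t) (0xdc00 + (cp & 0x3ff));  /* Low surrogate.  */
  return 2;
}
```
+
+A caller would pass an array of two @code{uint16_t} elements and use the
+return value to learn how many of them were filled in.
+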
+To represent wide characters the @code{char} type is not suitable.  For
+this reason the @w{ISO C} standard introduces a new type that is
+designed to keep one character of a wide character string.  To maintain
+the similarity there is also a type corresponding to @code{int} for
+those functions that take a single wide character.
+
+@comment stddef.h
+@comment ISO
+@deftp {Data type} wchar_t
+This data type is used as the base type for wide character strings.
+In other words, arrays of objects of this type are the equivalent of 
+@code{char[]} for multibyte character strings.  The type is defined in 
+@file{stddef.h}.
+
+The @w{ISO C90} standard, where @code{wchar_t} was introduced, does not
+say anything specific about the representation.  It only requires that
+this type is capable of storing all elements of the basic character set.
+Therefore it would be legitimate to define @code{wchar_t} as @code{char},
+which might make sense for embedded systems.
+
+But for GNU systems @code{wchar_t} is always 32 bits wide and is therefore
+capable of representing all UCS-4 values, thus covering all of
+@w{ISO 10646}.  Some Unix systems define @code{wchar_t} as a 16-bit type
+and thereby follow Unicode very strictly.  This definition is perfectly
+fine with the standard, but it also means that to represent all
+characters from Unicode and @w{ISO 10646} one has to use UTF-16 surrogate
+characters, which is in fact a multi-wide-character encoding.  But
+resorting to multi-wide-character encoding contradicts the purpose of the
+@code{wchar_t} type.
+@end deftp
+
+@comment wchar.h
+@comment ISO
+@deftp {Data type} wint_t
+@code{wint_t} is a data type used for parameters and variables that
+contain a single wide character.  As the name suggests this type is the
+equivalent of @code{int} when using the normal @code{char} strings.  The
+types @code{wchar_t} and @code{wint_t} often have the same
+representation if their size is 32 bits wide but if @code{wchar_t} is
+defined as @code{char} the type @code{wint_t} must be defined as
+@code{int} due to the parameter promotion.
+
+@pindex wchar.h
+This type is defined in @file{wchar.h} and was introduced in
+@w{Amendment 1} to @w{ISO C90}.
+@end deftp
+
+As with the @code{char} data type, macros are available for
+specifying the minimum and maximum value representable in an object of
+type @code{wchar_t}.
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WCHAR_MIN
+The macro @code{WCHAR_MIN} evaluates to the minimum value representable
+by an object of type @code{wchar_t}.
+
+This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
+@end deftypevr
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WCHAR_MAX
+The macro @code{WCHAR_MAX} evaluates to the maximum value representable
+by an object of type @code{wchar_t}.
+
+This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
+@end deftypevr
+
+Another special wide character value is the equivalent to @code{EOF}.
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WEOF
+The macro @code{WEOF} evaluates to a constant expression of type
+@code{wint_t} whose value is different from any member of the extended
+character set.
+
+@code{WEOF} need not be the same value as @code{EOF} and unlike
+@code{EOF} it also need @emph{not} be negative.  In other words, sloppy 
+code like
+
+@smallexample
+@{
+  int c;
+  ...
+  while ((c = getc (fp)) >= 0)
+    ...
+@}
+@end smallexample
+
+@noindent
+has to be rewritten to use @code{WEOF} explicitly when wide characters
+are used:
+
+@smallexample
+@{
+  wint_t c;
+  ...
+  while ((c = getwc (fp)) != WEOF)
+    ...
+@}
+@end smallexample
+
+@pindex wchar.h
+This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is
+defined in @file{wchar.h}.
+@end deftypevr
+
+
+These internal representations present problems when it comes to storage
+and transmission.  Because each single wide character consists of more
+than one byte, they are affected by byte ordering.  Thus, machines with
+different endianness would see different values when accessing the same
+data.  This byte ordering concern also applies to communication protocols
+that are all byte-based and therefore require the sender to
+decide how to split the wide character into bytes.  A last (but not least
+important) point is that wide characters often require more storage space
+than a customized byte-oriented character set.
+
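+One way to sidestep the byte-ordering problem is to fix the order
+explicitly when writing and reading.  The following sketch assumes a
+32-bit wide character value and a most-significant-byte-first order; the
+helper names @code{put_be32} and @code{get_be32} are hypothetical and
+serve only as illustration.
+
```c
#include <stdint.h>

/* Hypothetical helpers, not library functions.  Store the 32-bit
   value WC in OUT, most significant byte first, so that the result
   does not depend on the endianness of the machine.  */
void
put_be32 (uint32_t wc, unsigned char out[4])
{
  out[0] = (unsigned char) (wc >> 24);
  out[1] = (unsigned char) (wc >> 16);
  out[2] = (unsigned char) (wc >> 8);
  out[3] = (unsigned char) wc;
}

/* Reassemble a value written by put_be32.  */
uint32_t
get_be32 (const unsigned char in[4])
{
  return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16)
         | ((uint32_t) in[2] << 8) | (uint32_t) in[3];
}
```
+
+Both sides of a protocol have to agree on the order chosen here; the
+code itself behaves the same on little-endian and big-endian machines.
+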
+@cindex multibyte character
+@cindex EBCDIC
+For all the above reasons, an external encoding that is different
+from the internal encoding is often used if the latter is UCS-2 or UCS-4.
+The external encoding is byte-based and can be chosen appropriately for
+the environment and for the texts to be handled.  A variety of different
+character sets can be used for this external encoding (information that
+will not be exhaustively presented here; instead, a description of the
+major groups will suffice).  All of the ASCII-based character sets
+fulfill one requirement: they are ``filesystem safe.''  This means
+that the character @code{'/'} is used in the encoding @emph{only} to
+represent itself.  Things are a bit different for character sets like
+EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set
+family used by IBM), but if the operating system does not understand
+EBCDIC directly, the parameters to system calls have to be converted first
+anyhow.
+
+@itemize @bullet
+@item 
+The simplest character sets are single-byte character sets.  There can 
+be only up to 256 characters (for @w{8 bit} character sets), which is 
+not sufficient to cover all languages but might be sufficient to handle 
+a specific text.  Handling of @w{8 bit} character sets is simple.  This
+is not true for other kinds presented later, and therefore the
+application one uses might require the use of @w{8 bit} character sets.
+
+@cindex ISO 2022
+@item
+The @w{ISO 2022} standard defines a mechanism for extended character
+sets where one character @emph{can} be represented by more than one
+byte.  This is achieved by associating a state with the text.
+Characters that can be used to change the state can be embedded in the
+text.  Each byte in the text might have a different interpretation in each
+state.  The state might even influence whether a given byte stands for a
+character on its own or whether it has to be combined with some more
+bytes.
+
+@cindex EUC
+@cindex Shift_JIS
+@cindex SJIS
+In most uses of @w{ISO 2022} the defined character sets do not allow
+state changes that cover more than the next character.  This has the
+big advantage that whenever one can identify the beginning of the byte
+sequence of a character one can interpret a text correctly.  Examples of
+character sets using this policy are the various EUC character sets
+(EUC-JP, EUC-KR, EUC-TW, and EUC-CN, used by Sun's operating systems)
+or Shift_JIS (SJIS, a Japanese encoding).
+
+But there are also character sets using a state that is valid for more
+than one character and has to be changed by another byte sequence.
+Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
+
+@item
+@cindex ISO 6937
+Early attempts to fix 8 bit character sets for other languages using the
+Roman alphabet led to character sets like @w{ISO 6937}.  Here bytes
+representing characters like the acute accent do not produce output
+themselves: one has to combine them with other characters to get the
+desired result.  For example, the byte sequence @code{0xc2 0x61}
+(non-spacing acute accent, followed by lower-case `a') is needed to get
+the ``small a with acute'' character.  To get the acute accent character
+on its own, one has to write @code{0xc2 0x20} (the non-spacing acute
+followed by a space).
+
+Character sets like @w{ISO 6937} are used in some embedded systems such
+as teletex.
+
+@item
+@cindex UTF-8
+Instead of converting the Unicode or @w{ISO 10646} text used internally,
+it is often also sufficient to simply use an encoding different from
+UCS-2/UCS-4.  The Unicode and @w{ISO 10646} standards even specify such an
+encoding: UTF-8.  This encoding is able to represent all 31 bits of
+@w{ISO 10646} in a byte string of length one to six.
+
+@cindex UTF-7
+There were a few other attempts to encode @w{ISO 10646} such as UTF-7,
+but UTF-8 is today the only encoding that should be used.  In fact, with
+any luck UTF-8 will soon be the only external encoding that has to be
+supported.  It proves to be universally usable and its only disadvantage
+is that it favors Roman languages by making the byte string
+representation of other scripts (Cyrillic, Greek, Asian scripts) longer
+than necessary if using a specific character set for these scripts.
+Methods like the Unicode compression scheme can alleviate these
+problems.
+@end itemize
+
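+To make the ``one to six bytes'' property of UTF-8 concrete, here is a
+minimal sketch of an encoder for the 31-bit scheme described above.  The
+names @code{utf8_length} and @code{utf8_encode} are hypothetical and not
+part of the C library.  Note that later revisions of the standards
+restrict UTF-8 to code points up to @code{0x10ffff} and thus to at most
+four bytes; the sketch deliberately follows the older, wider definition.
+
```c
#include <stdint.h>

/* Hypothetical helper: number of bytes needed for CP in the 31-bit
   UTF-8 scheme, or 0 if CP does not fit in 31 bits.  */
int
utf8_length (uint32_t cp)
{
  if (cp < 0x80)
    return 1;
  if (cp < 0x800)
    return 2;
  if (cp < 0x10000)
    return 3;
  if (cp < 0x200000)
    return 4;
  if (cp < 0x4000000)
    return 5;
  if (cp < 0x80000000u)
    return 6;
  return 0;
}

/* Hypothetical helper: store the UTF-8 form of CP in OUT, which must
   have room for six bytes.  Returns the number of bytes stored, or 0
   if CP cannot be encoded.  */
int
utf8_encode (uint32_t cp, unsigned char *out)
{
  int len = utf8_length (cp);
  int i;

  if (len <= 1)
    {
      if (len == 1)
        out[0] = (unsigned char) cp;    /* ASCII stands for itself.  */
      return len;
    }

  /* Fill the continuation bytes (10xxxxxx) from the end.  */
  for (i = len - 1; i > 0; --i)
    {
      out[i] = (unsigned char) (0x80 | (cp & 0x3f));
      cp >>= 6;
    }
  /* The lead byte has LEN leading one bits followed by a zero bit.  */
  out[0] = (unsigned char) (((0xff00 >> len) & 0xff) | cp);
  return len;
}
```
+
+Because every byte of a multi-byte sequence has its high bit set, ASCII
+bytes such as @code{'/'} never occur as part of another character.
+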
+The question remaining is: how to select the character set or encoding
+to use.  The answer: you cannot decide about it yourself, it is decided
+by the developers of the system or the majority of the users.  Since the
+goal is interoperability one has to use whatever the other people one
+works with use.  If there are no constraints, the selection is based on
+the requirements the expected circle of users will have.  In other words,
+if a project is expected to be used in only, say, Russia it is fine to use
+KOI8-R or a similar character set.  But if at the same time people from,
+say, Greece are participating one should use a character set that allows
+all people to collaborate.
+
+The most widely useful solution seems to be: go with the most general
+character set, namely @w{ISO 10646}.  Use UTF-8 as the external encoding
+and problems about users not being able to use their own language
+adequately are a thing of the past.
+
+One final comment about the choice of the wide character representation
+is necessary at this point.  We have said above that the natural choice
+is using Unicode or @w{ISO 10646}.  This is not required, but at least
+encouraged, by the @w{ISO C} standard.  The standard defines at least a
+macro @code{__STDC_ISO_10646__} that is only defined on systems where
+the @code{wchar_t} type encodes @w{ISO 10646} characters.  If this
+symbol is not defined one should avoid making assumptions about the wide
+character representation.  If the programmer uses only the functions
+provided by the C library to handle wide character strings there should
+be no compatibility problems with other systems.
+
+@node Charset Function Overview
+@section Overview about Character Handling Functions
+
+A Unix @w{C library} contains three different sets of functions in two 
+families to handle character set conversion.  One of the function families 
+(the most commonly used) is specified in the @w{ISO C90} standard and, 
+therefore, is portable even beyond the Unix world.  Unfortunately this 
+family is the least useful one.  These functions should be avoided 
+whenever possible, especially when developing libraries (as opposed to 
+applications). 
+
+The second family of functions was introduced in the early Unix standards
+(XPG2) and is still part of the latest and greatest Unix standard:
+@w{Unix 98}.  It is also the most powerful and useful set of functions.
+But we will start with the functions defined in @w{Amendment 1} to
+@w{ISO C90}.
+
+@node Restartable multibyte conversion
+@section Restartable Multibyte Conversion Functions
+
+The @w{ISO C} standard defines functions to convert strings from a
+multibyte representation to wide character strings.  There are a number
+of peculiarities:
+
+@itemize @bullet
+@item
+The character set assumed for the multibyte encoding is not specified
+as an argument to the functions.  Instead the character set specified by
+the @code{LC_CTYPE} category of the current locale is used; see
+@ref{Locale Categories}.
+
+@item
+The functions handling more than one character at a time require NUL
+terminated strings as the argument (i.e., converting blocks of text
+does not work unless one can add a NUL byte at an appropriate place). 
+The GNU C library contains some extensions to the standard that allow
+specifying a size, but basically they also expect terminated strings.
+@end itemize
+
+Despite these limitations the @w{ISO C} functions can be used in many
+contexts.  In graphical user interfaces, for instance, it is not
+uncommon to have functions that require text to be displayed in a wide
+character string if the text is not simple ASCII.  The text itself might 
+come from a file with translations and the user should decide about the
+current locale, which determines the translation and therefore also the
+external encoding used.  In such a situation (and many others) the
+functions described here are perfect.  If more freedom while performing
+the conversion is necessary take a look at the @code{iconv} functions
+(@pxref{Generic Charset Conversion}).
+
+@menu
+* Selecting the Conversion::     Selecting the conversion and its properties.
+* Keeping the state::            Representing the state of the conversion.
+* Converting a Character::       Converting Single Characters.
+* Converting Strings::           Converting Multibyte and Wide Character
+                                  Strings.
+* Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
+@end menu
+
+@node Selecting the Conversion
+@subsection Selecting the conversion and its properties
+
+We already said above that the currently selected locale for the
+@code{LC_CTYPE} category decides about the conversion that is performed
+by the functions we are about to describe.  Each locale uses its own
+character set (given as an argument to @code{localedef}) and this is the
+one assumed as the external multibyte encoding.  The wide character
+set is always UCS-4, at least on GNU systems.
+
+A characteristic of each multibyte character set is the maximum number
+of bytes that can be necessary to represent one character.  This
+information is quite important when writing code that uses the
+conversion functions (as shown in the examples below).
+The @w{ISO C} standard defines two macros that provide this information.
+
+
+@comment limits.h
+@comment ISO
+@deftypevr Macro int MB_LEN_MAX
+@code{MB_LEN_MAX} specifies the maximum number of bytes in the multibyte
+sequence for a single character in any of the supported locales.  It is
+a compile-time constant and is defined in @file{limits.h}.
+@pindex limits.h
+@end deftypevr
+
+@comment stdlib.h
+@comment ISO
+@deftypevr Macro int MB_CUR_MAX
+@code{MB_CUR_MAX} expands into a positive integer expression that is the
+maximum number of bytes in a multibyte character in the current locale.
+The value is never greater than @code{MB_LEN_MAX}.  Unlike
+@code{MB_LEN_MAX} this macro need not be a compile-time constant, and in 
+the GNU C library it is not.
+
+@pindex stdlib.h
+@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
+@end deftypevr
+
+Two different macros are necessary since strict @w{ISO C90} compilers
+do not allow variable length array definitions, but still it is desirable
+to avoid dynamic allocation.  This incomplete piece of code shows the
+problem:
+
+@smallexample
+@{
+  char buf[MB_LEN_MAX];
+  ssize_t len = 0;
+
+  while (! feof (fp))
+    @{
+      fread (&buf[len], 1, MB_CUR_MAX - len, fp);
+      /* @r{... process} buf */
+      len -= used;
+    @}
+@}
+@end smallexample
+
+The code in the inner loop is expected to have always enough bytes in
+the array @var{buf} to convert one multibyte character.  The array
+@var{buf} has to be sized statically since many compilers do not allow a
+variable size.  The @code{fread} call makes sure that @code{MB_CUR_MAX} 
+bytes are always available in @var{buf}.  Note that it isn't
+a problem if @code{MB_CUR_MAX} is not a compile-time constant.
+
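+The division of labor between the two macros can also be shown with a
+small, self-contained sketch.  It uses the @w{ISO C90} function
+@code{wctomb}; the helper name @code{encode_one} is hypothetical.  The
+buffer is sized with the compile-time constant @code{MB_LEN_MAX}, which
+is always large enough for whatever @code{MB_CUR_MAX} is at run time.
+
```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not a library function: convert the wide
   character WC to its multibyte form in the current locale and copy
   it to DST, which must have room for MB_LEN_MAX bytes.  Returns the
   number of bytes stored, or -1 if WC cannot be converted.  */
int
encode_one (wchar_t wc, char *dst)
{
  /* MB_LEN_MAX is a compile-time constant, so this array is valid
     even for strict ISO C90 compilers.  */
  char buf[MB_LEN_MAX];
  int n = wctomb (buf, wc);

  if (n > 0)
    memcpy (dst, buf, (size_t) n);
  return n;
}
```
+
+In the default @code{"C"} locale an ASCII character converts to a single
+byte; in a locale with a multibyte character set the same call may store
+up to @code{MB_CUR_MAX} bytes.
+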
+
+@node Keeping the state
+@subsection Representing the state of the conversion
+
+@cindex stateful
+In the introduction of this chapter it was said that certain character
+sets use a @dfn{stateful} encoding.  That is, the encoded values depend 
+in some way on the previous bytes in the text.
+
+Since the conversion functions allow converting a text in more than one
+step we must have a way to pass this information from one call of the
+functions to another.
+
+@comment wchar.h
+@comment ISO
+@deftp {Data type} mbstate_t
+@cindex shift state
+A variable of type @code{mbstate_t} can contain all the information
+about the @dfn{shift state} needed from one call to a conversion
+function to another.
+
+@pindex wchar.h
+@code{mbstate_t} is defined in @file{wchar.h}.  It was introduced in
+@w{Amendment 1} to @w{ISO C90}.
+@end deftp
+
+To use objects of type @code{mbstate_t} the programmer has to define such 
+objects (normally as local variables on the stack) and pass a pointer to 
+the object to the conversion functions.  This way the conversion function
+can update the object if the current multibyte character set is stateful.
+
+There is no specific function or initializer to put the state object in
+any specific state.  The rules are that the object should always
+represent the initial state before the first use, and this is achieved by
+clearing the whole variable with code such as follows:
+
+@smallexample
+@{
+  mbstate_t state;
+  memset (&state, '\0', sizeof (state));
+  /* @r{from now on @var{state} can be used.}  */
+  ...
+@}
+@end smallexample
+
+When using the conversion functions to generate output it is often
+necessary to test whether the current state corresponds to the initial
+state.  This is necessary, for example, to decide whether to emit
+escape sequences to set the state to the initial state at certain
+sequence points.  Communication protocols often require this.
+
+@comment wchar.h
+@comment ISO
+@deftypefun int mbsinit (const mbstate_t *@var{ps})
+The @code{mbsinit} function determines whether the state object pointed
+to by @var{ps} is in the initial state.  If @var{ps} is a null pointer or 
+the object is in the initial state the return value is nonzero.  Otherwise 
+it is zero.
+
+@pindex wchar.h
+@code{mbsinit} was introduced in @w{Amendment 1} to @w{ISO C90} and is
+declared in @file{wchar.h}.
+@end deftypefun
+
+Code using @code{mbsinit} often looks similar to this:
+
+@c Fix the example to explicitly say how to generate the escape sequence
+@c to restore the initial state.
+@smallexample
+@{
+  mbstate_t state;
+  memset (&state, '\0', sizeof (state));
+  /* @r{Use @var{state}.}  */
+  ...
+  if (! mbsinit (&state))
+    @{
+      /* @r{Emit code to return to initial state.}  */
+      const wchar_t empty[] = L"";
+      const wchar_t *srcp = empty;
+      wcsrtombs (outbuf, &srcp, outbuflen, &state);
+    @}
+  ...
+@}
+@end smallexample
+
+The code to emit the escape sequence to get back to the initial state is
+interesting.  The @code{wcsrtombs} function can be used to determine the
+necessary output code (@pxref{Converting Strings}).  Please note that on
+GNU systems it is not necessary to perform this extra action for the
+conversion from multibyte text to wide character text since the wide
+character encoding is not stateful.  But there is nothing mentioned in
+any standard that prohibits @code{wchar_t} from using a stateful
+encoding.
+
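+As an alternative sketch of the same idea, the reset sequence can be
+obtained by converting a single null wide character with the
+@w{ISO C} function @code{wcrtomb}, which stores any shift sequence needed
+to reach the initial state followed by a null byte.  The helper name
+@code{emit_reset} is hypothetical and only illustrates the pattern.
+
```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not a library function: store in OUT (room for
   MB_LEN_MAX bytes) the bytes needed to bring *PS back to the initial
   shift state.  Returns the number of bytes stored, which is zero for
   stateless encodings, or (size_t) -1 on error.  */
size_t
emit_reset (char *out, mbstate_t *ps)
{
  char buf[MB_LEN_MAX];
  /* Converting L'\0' yields the shift sequence plus a null byte and
     leaves *PS in the initial state.  */
  size_t n = wcrtomb (buf, L'\0', ps);

  if (n == (size_t) -1)
    return n;
  /* Drop the terminating null byte; keep only the shift sequence.  */
  memcpy (out, buf, n - 1);
  return n - 1;
}
```
+
+After a successful call @code{mbsinit} reports the initial state again,
+so the test shown above can be used to decide whether calling such a
+helper is necessary at all.
+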
+@node Converting a Character
+@subsection Converting Single Characters
+
+The most fundamental of the conversion functions are those dealing with
+single characters.  Please note that this does not always mean single
+bytes.  But since there is very often a subset of the multibyte
+character set that consists of single byte sequences, there are
+functions to help with converting bytes.  Frequently, ASCII is a subpart 
+of the multibyte character set.  In such a scenario, each ASCII character 
+stands for itself, and all other characters have at least a first byte 
+that is beyond the range @math{0} to @math{127}.
+
+@comment wchar.h
+@comment ISO
+@deftypefun wint_t btowc (int @var{c})
+The @code{btowc} function (``byte to wide character'') converts a valid
+single byte character @var{c} in the initial shift state into the wide
+character equivalent using the conversion rules from the currently
+selected locale of the @code{LC_CTYPE} category.
+
+If @code{(unsigned char) @var{c}} is not a valid single-byte multibyte
+character or if @var{c} is @code{EOF}, the function returns @code{WEOF}.
+
+Please note the restriction of @var{c} being tested for validity only in
+the initial shift state.  No @code{mbstate_t} object is used from
+which the state information is taken, and the function also does not use
+any static state.
+
+@pindex wchar.h
+The @code{btowc} function was introduced in @w{Amendment 1} to @w{ISO C90} 
+and is declared in @file{wchar.h}.
+@end deftypefun
+
+Despite the limitation that the single byte value is always interpreted
+in the initial state, this function is actually useful most of the time.
+Most character sets are either entirely single-byte character sets or
+they are extensions of ASCII.  It is then possible to write code like
+this (not that this specific example is very useful):
+
+@smallexample
+wchar_t *
+itow (unsigned long int val)
+@{
+  static wchar_t buf[30];
+  wchar_t *wcp = &buf[29];
+  *wcp = L'\0';
+  while (val != 0)
+    @{
+      *--wcp = btowc ('0' + val % 10);
+      val /= 10;
+    @}
+  if (wcp == &buf[29])
+    *--wcp = L'0';
+  return wcp;
+@}
+@end smallexample
+
+Why is it necessary to use such a complicated implementation and not
+simply cast @code{'0' + val % 10} to a wide character?  The answer is
+that there is no guarantee that one can perform this kind of arithmetic
+on the characters of the character set used for @code{wchar_t}
+representation.  In other situations the bytes are not constant at
+compile time and so the compiler cannot do the work.  In situations like
+this it is necessary to use @code{btowc}.
+
+@noindent
+There also is a function for the conversion in the other direction.
+
+@comment wchar.h
+@comment ISO
+@deftypefun int wctob (wint_t @var{c})
+The @code{wctob} function (``wide character to byte'') takes as the
+parameter a valid wide character.  If the multibyte representation for
+this character in the initial state is exactly one byte long, the return
+value of this function is this character.  Otherwise the return value is
+@code{EOF}.
+
+@pindex wchar.h
+@code{wctob} was introduced in @w{Amendment 1} to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
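+
+As a minimal sketch of how @code{btowc} and @code{wctob} relate, the
+following helper checks whether a byte survives a round trip through
+the wide character representation.  The name @code{byte_round_trips}
+is hypothetical and only used for illustration.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: return 1 if the byte C is a complete
   single-byte character in the initial shift state and converts
   back to the same byte, 0 otherwise.  */
int
byte_round_trips (int c)
{
  wint_t wc = btowc (c);

  /* btowc reports EOF and invalid bytes with WEOF.  */
  if (wc == WEOF)
    return 0;

  /* wctob returns the byte value, or EOF if the wide character
     needs more than one byte.  */
  return wctob (wc) == (int) (unsigned char) c;
}
```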
+
+There are more general functions to convert single characters from
+multibyte representation to wide characters and vice versa.  These
+functions pose no limit on the length of the multibyte representation
+and they also do not require it to be in the initial state.
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps})
+@cindex stateful
+The @code{mbrtowc} function (``multibyte restartable to wide
+character'') converts the next multibyte character in the string pointed
+to by @var{s} into a wide character and stores it in the wide character
+string pointed to by @var{pwc}.  The conversion is performed according
+to the locale currently selected for the @code{LC_CTYPE} category.  If
+the conversion for the character set used in the locale requires a state,
+the multibyte string is interpreted in the state represented by the
+object pointed to by @var{ps}.  If @var{ps} is a null pointer, a static,
+internal state variable used only by the @code{mbrtowc} function is
+used.
+
+If the next multibyte character corresponds to the NUL wide character,
+the return value of the function is @math{0} and the state object is
+afterwards in the initial state.  If the next @var{n} or fewer bytes
+form a correct multibyte character, the return value is the number of
+bytes starting from @var{s} that form the multibyte character.  The
+conversion state is updated according to the bytes consumed in the
+conversion.  In both cases the wide character (either the @code{L'\0'}
+or the one found in the conversion) is stored in the string pointed to
+by @var{pwc} if @var{pwc} is not null.
+
+If the first @var{n} bytes of the multibyte string possibly form a valid
+multibyte character but there are more than @var{n} bytes needed to
+complete it, the return value of the function is @code{(size_t) -2} and
+no value is stored.  Please note that this can happen even if @var{n}
+has a value greater than or equal to @code{MB_CUR_MAX} since the input 
+might contain redundant shift sequences.
+
+If the first @var{n} bytes of the multibyte string cannot possibly form
+a valid multibyte character, no value is stored, the global variable
+@code{errno} is set to the value @code{EILSEQ}, and the function returns
+@code{(size_t) -1}.  The conversion state is afterwards undefined.
+
+@pindex wchar.h
+@code{mbrtowc} was introduced in @w{Amendment 1} to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+Use of @code{mbrtowc} is straightforward.  A function that copies a
+multibyte string into a wide character string while at the same time
+converting all lowercase characters into uppercase could look like this
+(this is not the final version, just an example; it has no error
+checking, and sometimes leaks memory):
+
+@smallexample
+wchar_t *
+mbstouwcs (const char *s)
+@{
+  /* Include the terminating NUL byte in the conversion.  */
+  size_t len = strlen (s) + 1;
+  wchar_t *result = malloc (len * sizeof (wchar_t));
+  wchar_t *wcp = result;
+  wchar_t tmp[1];
+  mbstate_t state;
+  size_t nbytes;
+
+  memset (&state, '\0', sizeof (state));
+  while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0)
+    @{
+      if (nbytes >= (size_t) -2)
+        /* Invalid input string.  */
+        return NULL;
+      /* Store through the moving pointer, not through result.  */
+      *wcp++ = towupper (tmp[0]);
+      len -= nbytes;
+      s += nbytes;
+    @}
+  /* Terminate the wide character string.  */
+  *wcp = L'\0';
+  return result;
+@}
+@end smallexample
+
+The use of @code{mbrtowc} should be clear.  A single wide character is
+stored in @code{@var{tmp}[0]}, and the number of consumed bytes is stored
+in the variable @var{nbytes}.  If the conversion is successful, the 
+uppercase variant of the wide character is stored in the @var{result} 
+array and the pointer to the input string and the number of available 
+bytes are adjusted.
+
+The only non-obvious thing about @code{mbstouwcs} might be the way memory 
+is allocated for the result.  The above code uses the fact that there 
+can never be more wide characters in the converted results than there are
+bytes in the multibyte input string.  This method yields a pessimistic 
+guess about the size of the result, and if many wide character strings 
+have to be constructed this way or if the strings are long, the extra 
+memory required to be allocated because the input string contains 
+multibyte characters might be significant.  The allocated memory block can 
+be resized to the correct size before returning it, but a better solution 
+might be to allocate just the right amount of space for the result right 
+away.  Unfortunately there is no function to compute the length of the wide 
+character string directly from the multibyte string.  There is, however, a 
+function that does part of the work.
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps})
+The @code{mbrlen} function (``multibyte restartable length'') computes
+the number of bytes, at most @var{n}, starting at @var{s} that form the
+next valid and complete multibyte character.
+
+If the next multibyte character corresponds to the NUL wide character,
+the return value is @math{0}.  If the next @var{n} bytes form a valid
+multibyte character, the number of bytes belonging to this multibyte
+character byte sequence is returned.
+
+If the first @var{n} bytes possibly form a valid multibyte
+character but the character is incomplete, the return value is 
+@code{(size_t) -2}.  Otherwise the multibyte character sequence is invalid 
+and the return value is @code{(size_t) -1}.
+
+The multibyte sequence is interpreted in the state represented by the
+object pointed to by @var{ps}.  If @var{ps} is a null pointer, a state
+object local to @code{mbrlen} is used.
+
+@pindex wchar.h
+@code{mbrlen} was introduced in @w{Amendment 1} to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+The attentive reader now will note that @code{mbrlen} can be implemented 
+as
+
+@smallexample
+mbrtowc (NULL, s, n, ps != NULL ? ps : &internal)
+@end smallexample
+
+This is true and in fact is mentioned in the official specification.
+How can this function be used to determine the length of the wide
+character string created from a multibyte character string?  It is not
+directly usable, but we can define a function @code{mbslen} using it:
+
+@smallexample
+size_t
+mbslen (const char *s)
+@{
+  mbstate_t state;
+  size_t result = 0;
+  size_t nbytes;
+  memset (&state, '\0', sizeof (state));
+  while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
+    @{
+      if (nbytes >= (size_t) -2)
+        /* @r{Something is wrong.}  */
+        return (size_t) -1;
+      s += nbytes;
+      ++result;
+    @}
+  return result;
+@}
+@end smallexample
+
+This function simply calls @code{mbrlen} for each multibyte character
+in the string and counts the number of function calls.  Please note that
+we here use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen}
+call.  This is acceptable since a) this value is larger than the length of 
+the longest multibyte character sequence and b) we know that the string 
+@var{s} ends with a NUL byte, which cannot be part of any other multibyte 
+character sequence but the one representing the NUL wide character.  
+Therefore, the @code{mbrlen} function will never read invalid memory.
+
+Now that this function is available (just to make this clear, this
+function is @emph{not} part of the GNU C library) we can compute the
+number of bytes required to store the wide character string converted
+from the multibyte character string @var{s} using
+
+@smallexample
+wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t);
+@end smallexample
+
+Please note that the @code{mbslen} function is quite inefficient.  The
+implementation of @code{mbstouwcs} with @code{mbslen} would have to 
+perform the conversion of the multibyte character input string twice, and 
+this conversion might be quite expensive.  So it is necessary to think 
+about the consequences of using the easier but imprecise method before 
+doing the work twice.
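+
+One way to avoid both the second conversion pass and the wasted memory
+is the resizing approach mentioned earlier: allocate pessimistically,
+convert once, then shrink the block.  The following is only a sketch
+under that assumption; the name @code{mbstouwcs_shrunk} is hypothetical
+and, unlike the example above, it checks for allocation failure.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical variant of mbstouwcs: one conversion pass, then the
   result is shrunk to the size actually needed.  Returns NULL on
   allocation failure or invalid input.  */
wchar_t *
mbstouwcs_shrunk (const char *s)
{
  /* Include the terminating NUL byte in the conversion.  */
  size_t len = strlen (s) + 1;
  wchar_t *result = malloc (len * sizeof (wchar_t));
  wchar_t *wcp = result;
  wchar_t *shrunk;
  wchar_t wc;
  mbstate_t state;
  size_t nbytes;

  if (result == NULL)
    return NULL;

  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrtowc (&wc, s, len, &state)) > 0)
    {
      if (nbytes >= (size_t) -2)
        {
          /* Invalid or incomplete input: do not leak the buffer.  */
          free (result);
          return NULL;
        }
      *wcp++ = towupper (wc);
      len -= nbytes;
      s += nbytes;
    }
  *wcp = L'\0';

  /* Give back the unused tail.  If shrinking fails the original
     block is still valid, so return that instead.  */
  shrunk = realloc (result, (wcp - result + 1) * sizeof (wchar_t));
  return shrunk != NULL ? shrunk : result;
}
```
+
+The caller owns the returned block and releases it with @code{free}.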
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps})
+The @code{wcrtomb} function (``wide character restartable to
+multibyte'') converts a single wide character into a multibyte string
+corresponding to that wide character.
+
+If @var{s} is a null pointer, the function resets the state stored in
+the object pointed to by @var{ps} (or the internal @code{mbstate_t}
+object) to the initial state.  This can also be achieved by a call like
+this:
+
+@smallexample
+wcrtomb (temp_buf, L'\0', ps)
+@end smallexample
+
+@noindent
+since, if @var{s} is a null pointer, @code{wcrtomb} performs as if it
+writes into an internal buffer, which is guaranteed to be large enough.
+
+If @var{wc} is the NUL wide character, @code{wcrtomb} emits, if
+necessary, a shift sequence to get the state @var{ps} into the initial
+state followed by a single NUL byte, which is stored in the string 
+@var{s}.
+
+Otherwise a byte sequence (possibly including shift sequences) is written 
+into the string @var{s}.  This only happens if @var{wc} is a valid wide 
+character (i.e., it has a multibyte representation in the character set 
+selected by the locale of the @code{LC_CTYPE} category).  If @var{wc} is 
+not a valid wide character, nothing is stored in the string @var{s}, 
+@code{errno} is set to @code{EILSEQ}, the conversion state in @var{ps} 
+is undefined and the return value is @code{(size_t) -1}.
+
+If no error occurred the function returns the number of bytes stored in
+the string @var{s}.  This includes all bytes representing shift
+sequences.
+
+One word about the interface of the function: there is no parameter
+specifying the length of the array @var{s}.  Instead the function
+assumes that there are at least @code{MB_CUR_MAX} bytes available since
+this is the maximum length of any byte sequence representing a single
+character.  So the caller has to make sure that there is enough space
+available, otherwise buffer overruns can occur.
+
+@pindex wchar.h
+@code{wcrtomb} was introduced in @w{Amendment 1} to @w{ISO C90} and is
+declared in @file{wchar.h}.
+@end deftypefun
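+
+Because @code{wcrtomb} takes no buffer length, a simple way to stay
+safe is to hand it a buffer of @code{MB_LEN_MAX} bytes, which is never
+smaller than @code{MB_CUR_MAX}.  The helper below is a sketch of that
+idea; the name @code{wc_to_mbs} and its interface are assumptions made
+for illustration only.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert WC, starting from the initial shift
   state, into BUF and NUL-terminate the result.  Returns the number
   of bytes written by wcrtomb, or (size_t) -1 if WC has no multibyte
   representation in the current locale.  */
size_t
wc_to_mbs (wchar_t wc, char buf[MB_LEN_MAX + 1])
{
  mbstate_t state;
  size_t nbytes;

  /* Start in the initial shift state.  */
  memset (&state, '\0', sizeof (state));

  /* BUF has room for the longest possible sequence plus a NUL.  */
  nbytes = wcrtomb (buf, wc, &state);
  if (nbytes == (size_t) -1)
    return (size_t) -1;

  buf[nbytes] = '\0';
  return nbytes;
}
```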
+
+Using @code{wcrtomb} is as easy as using @code{mbrtowc}.  The following
+example appends a wide character string to a multibyte character string.
+Again, the code is not really useful (or correct), it is simply here to
+demonstrate the use and some problems.
+
+@smallexample
+char *
+mbscatwcs (char *s, size_t len, const wchar_t *ws)
+@{
+  mbstate_t state;
+  /* @r{Find the end of the existing string.}  */
+  char *wp = strchr (s, '\0');
+  len -= wp - s;
+  memset (&state, '\0', sizeof (state));
+  do
+    @{
+      size_t nbytes;
+      if (len < MB_CUR_MAX)
+        @{
+          /* @r{We cannot guarantee that the next}
+             @r{character fits into the buffer, so}
+             @r{return an error.}  */
+          errno = E2BIG;
+          return NULL;
+        @}
+      nbytes = wcrtomb (wp, *ws, &state);
+      if (nbytes == (size_t) -1)
+        /* @r{Error in the conversion.}  */
+        return NULL;
+      len -= nbytes;
+      wp += nbytes;
+    @}
+  while (*ws++ != L'\0');
+  return s;
+@}
+@end smallexample
+
+First the function has to find the end of the string currently in the
+array @var{s}.  The @code{strchr} call does this very efficiently since a
+requirement for multibyte character representations is that the NUL byte
+is never used except to represent itself (and in this context, the end
+of the string).
+
+After initializing the state object the loop is entered where the first
+task is to make sure there is enough room in the array @var{s}.  We
+abort if there are not at least @code{MB_CUR_MAX} bytes available.  This
+is not always optimal but we have no other choice.  We might have less
+than @code{MB_CUR_MAX} bytes available but the next multibyte character
+might also be only one byte long.  At the time the @code{wcrtomb} call
+returns it is too late to decide whether the buffer was large enough.  If 
+this solution is unsuitable, there is a very slow but more accurate 
+solution.
+
+@smallexample
+  ...
+  if (len < MB_CUR_MAX)
+    @{
+      mbstate_t temp_state;
+      /* @r{Scratch buffer for the trial conversion.}  */
+      char temp_buf[MB_LEN_MAX];
+      memcpy (&temp_state, &state, sizeof (state));
+      if (wcrtomb (temp_buf, *ws, &temp_state) > len)
+        @{
+          /* @r{We cannot guarantee that the next}
+             @r{character fits into the buffer, so}
+             @r{return an error.}  */
+          errno = E2BIG;
+          return NULL;
+        @}
+    @}
+  ...
+@end smallexample
+
+Here we perform the conversion that might overflow the buffer so that 
+we are afterwards in the position to make an exact decision about the 
+buffer size.  Please note that the trial conversion writes into the 
+scratch buffer @code{temp_buf} and not into a null pointer: as described 
+above, a null destination makes @code{wcrtomb} behave as if 
+@code{L'\0'} were converted, which would not yield the length needed for 
+@code{*@var{ws}}.  The most unusual thing about this piece of code 
+certainly is the duplication of the conversion state object, but if a 
+change of the state is necessary to emit the next multibyte character, 
+we want to have the same shift state change performed in the real 
+conversion.  Therefore, we have to preserve the initial shift state 
+information.
+
+There are certainly many more and even better solutions to this problem.
+This example is only provided for educational purposes.
+
+@node Converting Strings
+@subsection Converting Multibyte and Wide Character Strings
+
+The functions described in the previous section only convert a single
+character at a time.  Most operations to be performed in real-world
+programs include strings and therefore the @w{ISO C} standard also
+defines conversions on entire strings.  However, the defined set of
+functions is quite limited; therefore, the GNU C library contains a few
+extensions that can help in some important situations.
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
+The @code{mbsrtowcs} function (``multibyte string restartable to wide
+character string'') converts a NUL-terminated multibyte character
+string at @code{*@var{src}} into an equivalent wide character string,
+including the NUL wide character at the end.  The conversion is started
+using the state information from the object pointed to by @var{ps} or
+from an internal object of @code{mbsrtowcs} if @var{ps} is a null
+pointer.  Before returning, the state object is updated to match the state 
+after the last converted character.  The state is the initial state if the
+terminating NUL byte is reached and converted.
+
+If @var{dst} is not a null pointer, the result is stored in the array
+pointed to by @var{dst}; otherwise, the conversion result is not
+available since it is stored in an internal buffer.
+
+If @var{len} wide characters are stored in the array @var{dst} before
+reaching the end of the input string, the conversion stops and @var{len}
+is returned.  If @var{dst} is a null pointer, @var{len} is never checked.
+
+Another reason for a premature return from the function call is if the
+input string contains an invalid multibyte sequence.  In this case the
+global variable @code{errno} is set to @code{EILSEQ} and the function
+returns @code{(size_t) -1}.
+
+@c XXX The ISO C9x draft seems to have a problem here.  It says that PS
+@c is not updated if DST is NULL.  This is not said straightforward and
+@c none of the other functions is described like this.  It would make sense
+@c to define the function this way but I don't think it is meant like this.
+
+In all other cases the function returns the number of wide characters
+converted during this call.  If @var{dst} is not null, @code{mbsrtowcs}
+stores in the pointer pointed to by @var{src} either a null pointer (if 
+the NUL byte in the input string was reached) or the address of the byte
+following the last converted multibyte character.
+
+@pindex wchar.h
+@code{mbsrtowcs} was introduced in @w{Amendment 1} to @w{ISO C90} and is
+declared in @file{wchar.h}.
+@end deftypefun
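+
+Since @var{len} is not checked when @var{dst} is a null pointer, a call
+with a null destination can be used purely to count the wide characters
+a string will produce.  The following sketch uses this to allocate
+exactly the right amount of memory; the name @code{mbs_to_wcs_exact} is
+hypothetical, and the price is that the input is converted twice.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the NUL-terminated multibyte string S
   into a freshly allocated wide character string of exactly the
   required size.  Returns NULL on invalid input or if memory is
   exhausted.  */
wchar_t *
mbs_to_wcs_exact (const char *s)
{
  mbstate_t state;
  const char *src = s;
  size_t nwc;
  wchar_t *result;

  /* First pass: count only.  The result does not include the
     terminating NUL wide character.  */
  memset (&state, '\0', sizeof (state));
  nwc = mbsrtowcs (NULL, &src, 0, &state);
  if (nwc == (size_t) -1)
    return NULL;

  result = malloc ((nwc + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  /* Second pass: the real conversion, again from the initial state
     and from the start of the input.  */
  memset (&state, '\0', sizeof (state));
  src = s;
  if (mbsrtowcs (result, &src, nwc + 1, &state) == (size_t) -1)
    {
      free (result);
      return NULL;
    }
  return result;
}
```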
+
+The definition of the @code{mbsrtowcs} function has one important 
+limitation.  The requirement that the input string @code{*@var{src}} be 
+NUL-terminated creates problems if one wants to convert buffers with 
+text.  A buffer is normally not a collection of NUL-terminated strings 
+but instead a
+continuous collection of lines, separated by newline characters.  Now
+assume that a function to convert one line from a buffer is needed.  Since
+the line is not NUL-terminated, the source pointer cannot directly point
+into the unmodified text buffer.  This means, either one inserts the NUL
+byte at the appropriate place for the time of the @code{mbsrtowcs}
+function call (which is not doable for a read-only buffer or in a
+multi-threaded application) or one copies the line in an extra buffer
+where it can be terminated by a NUL byte.  Note that it is not in general 
+possible to limit the number of characters to convert by setting the 
+parameter @var{len} to any specific value.  Since it is not known how 
+many bytes each multibyte character sequence is in length, one can only 
+guess.
+
+@cindex stateful
+There is still a problem with the method of NUL-terminating a line right
+after the newline character, which could lead to very strange results.
+As said in the description of the @code{mbsrtowcs} function above the
+conversion state is guaranteed to be in the initial shift state after
+processing the NUL byte at the end of the input string.  But this NUL
+byte is not really part of the text (i.e., the conversion state after
+the newline in the original text could be something different than the
+initial shift state and therefore the first character of the next line
+is encoded using this state).  But the state in question is never
+accessible to the user since the conversion stops after the NUL byte
+(which resets the state).  Most stateful character sets in use today
+require that the shift state after a newline be the initial state--but
+this is not a strict guarantee.  Therefore, simply NUL-terminating a
+piece of a running text is not always an adequate solution and 
+should never be used in general-purpose code.
+
+The generic conversion interface (@pxref{Generic Charset Conversion})
+does not have this limitation (it simply works on buffers, not
+strings), and the GNU C library contains a set of functions that take
+additional parameters specifying the maximal number of bytes that are
+consumed from the input string.  This way the problem of
+@code{mbsrtowcs}'s example above could be solved by determining the line
+length and passing this length to the function.
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
+The @code{wcsrtombs} function (``wide character string restartable to
+multibyte string'') converts the NUL-terminated wide character string at
+@code{*@var{src}} into an equivalent multibyte character string and 
+stores the result in the array pointed to by @var{dst}.  The NUL wide
+character is also converted.  The conversion starts in the state
+described in the object pointed to by @var{ps} or by a state object
+local to @code{wcsrtombs} in case @var{ps} is a null pointer.  If
+@var{dst} is a null pointer, the conversion is performed as usual but the
+result is not available.  If all characters of the input string were
+successfully converted and if @var{dst} is not a null pointer, the 
+pointer pointed to by @var{src} gets assigned a null pointer.
+
+If one of the wide characters in the input string has no valid multibyte
+character equivalent, the conversion stops early, sets the global
+variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}.
+
+Another reason for a premature stop is if @var{dst} is not a null
+pointer and the next converted character would require more than
+@var{len} bytes in total in the array @var{dst}.  In this case (and if
+@var{dst} is not a null pointer) the pointer pointed to by @var{src} is
+assigned a value pointing to the wide character right after the last one
+successfully converted.
+
+Except in the case of an encoding error the return value of the 
+@code{wcsrtombs} function is the number of bytes in all the multibyte 
+character sequences stored in @var{dst}.  Before returning the state in 
+the object pointed to by @var{ps} (or the internal object in case 
+@var{ps} is a null pointer) is updated to reflect the state after the 
+last conversion.  The state is the initial shift state in case the 
+terminating NUL wide character was converted.
+
+@pindex wchar.h
+The @code{wcsrtombs} function was introduced in @w{Amendment 1} to 
+@w{ISO C90} and is declared in @file{wchar.h}.
+@end deftypefun
+
+The restriction mentioned above for the @code{mbsrtowcs} function applies
+here also.  There is no possibility of directly controlling the number of
+input characters.  One has to place the NUL wide character at the correct 
+place or control the consumed input indirectly via the available output 
+array size (the @var{len} parameter).
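+
+The indirect control via @var{len} can be sketched as follows.  The
+helper name @code{wcs_to_buf} and its return convention are assumptions
+for illustration; the point is that the pointer stored through
+@var{src} tells the caller whether the whole string was converted.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert WS into the DSTLEN-byte array DST.
   Returns 1 if the whole string including the terminating NUL was
   converted, 0 if the output array was too small, and -1 on an
   encoding error.  */
int
wcs_to_buf (char *dst, size_t dstlen, const wchar_t *ws)
{
  mbstate_t state;
  const wchar_t *src = ws;
  size_t n;

  memset (&state, '\0', sizeof (state));
  n = wcsrtombs (dst, &src, dstlen, &state);
  if (n == (size_t) -1)
    return -1;

  /* wcsrtombs stores a null pointer through src only when the
     terminating NUL wide character has been converted.  */
  return src == NULL ? 1 : 0;
}
```
+
+When the function returns @math{0} the output is not NUL-terminated,
+so the caller must not treat @var{dst} as a string in that case.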
+
+@comment wchar.h
+@comment GNU
+@deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps})
+The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs}
+function.  All the parameters are the same except for @var{nmc}, which is
+new.  The return value is the same as for @code{mbsrtowcs}.
+
+This new parameter specifies how many bytes at most can be used from the
+multibyte character string.  In other words, the multibyte character 
+string @code{*@var{src}} need not be NUL-terminated.  But if a NUL byte 
+is found within the @var{nmc} first bytes of the string, the conversion 
+stops here.
+
+This function is a GNU extension.  It is meant to work around the
+problems mentioned above.  Now it is possible to convert a buffer with
+multibyte character text piece for piece without having to care about
+inserting NUL bytes and the effect of NUL bytes on the conversion state.
+@end deftypefun
+
+A function to convert a multibyte string into a wide character string
+and display it could be written like this (this is not a really useful
+example):
+
+@smallexample
+void
+showmbs (const char *src, FILE *fp)
+@{
+  mbstate_t state;
+  int cnt = 0;
+  memset (&state, '\0', sizeof (state));
+  while (1)
+    @{
+      wchar_t linebuf[100];
+      const char *endp = strchr (src, '\n');
+      size_t n;
+
+      /* @r{Exit if there is no more line.}  */
+      if (endp == NULL)
+        break;
+
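+      n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state);
+      if (n == (size_t) -1)
+        /* @r{Invalid input; give up.}  */
+        break;
+      linebuf[n] = L'\0';
+      fprintf (fp, "line %d: \"%S\"\n", ++cnt, linebuf);
+      /* @r{Step over the newline once the line is consumed.}  */
+      if (src == endp)
+        ++src;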
+    @}
+@}
+@end smallexample
+
+There is no problem with the state after a call to @code{mbsnrtowcs}.
+Since we don't insert characters in the strings that were not in there
+right from the beginning and we use @var{state} only for the conversion
+of the given buffer, there is no problem with altering the state.
+
+@comment wchar.h
+@comment GNU
+@deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps})
+The @code{wcsnrtombs} function implements the conversion from wide
+character strings to multibyte character strings.  It is similar to
+@code{wcsrtombs} but, just like @code{mbsnrtowcs}, it takes an extra
+parameter, which specifies the length of the input string.
+
+No more than @var{nwc} wide characters from the input string
+@code{*@var{src}} are converted.  If the input string contains a NUL
+wide character in the first @var{nwc} characters, the conversion stops at
+this place.
+
+The @code{wcsnrtombs} function is a GNU extension and just like 
+@code{mbsnrtowcs} helps in situations where no NUL-terminated input 
+strings are available.
+@end deftypefun
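+
+As a sketch of how @code{wcsnrtombs} handles input that is not
+NUL-terminated, the following helper converts only the first @var{nwc}
+wide characters of a string.  The name @code{wcs_prefix_to_mbs} is
+hypothetical, and defining @code{_GNU_SOURCE} is assumed to be how the
+declaration is made visible.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE 1  /* Assumed to be needed for wcsnrtombs.  */
#endif
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert at most NWC wide characters from WS
   into the DSTLEN-byte array DST and NUL-terminate the result.
   Returns the number of bytes stored (not counting the NUL), or
   (size_t) -1 on error.  */
size_t
wcs_prefix_to_mbs (char *dst, size_t dstlen,
                   const wchar_t *ws, size_t nwc)
{
  mbstate_t state;
  const wchar_t *src = ws;
  size_t n;

  if (dstlen == 0)
    return (size_t) -1;

  memset (&state, '\0', sizeof (state));
  /* Reserve one byte of DST for the terminator added below.  */
  n = wcsnrtombs (dst, &src, nwc, dstlen - 1, &state);
  if (n == (size_t) -1)
    return (size_t) -1;

  dst[n] = '\0';
  return n;
}
```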
+
+
+@node Multibyte Conversion Example
+@subsection A Complete Multibyte Conversion Example
+
+The example programs given in the last sections are only brief and do
+not contain all the error checking, etc.  Presented here is a complete
+and documented example.  It features the @code{mbrtowc} function but it
+should be easy to derive versions using the other functions.
+
+@smallexample
+int
+file_mbsrtowcs (int input, int output)
+@{
+  /* @r{Note the use of @code{MB_LEN_MAX}.}
+     @r{@code{MB_CUR_MAX} cannot portably be used here.}  */
+  char buffer[BUFSIZ + MB_LEN_MAX];
+  mbstate_t state;
+  int filled = 0;
+  int eof = 0;
+
+  /* @r{Initialize the state.}  */
+  memset (&state, '\0', sizeof (state));
+
+  while (!eof)
+    @{
+      ssize_t nread;
+      ssize_t nwrite;
+      char *inp = buffer;
+      wchar_t outbuf[BUFSIZ];
+      wchar_t *outp = outbuf;
+
+      /* @r{Fill up the buffer from the input file.}  */
+      nread = read (input, buffer + filled, BUFSIZ);
+      if (nread < 0)
+        @{
+          perror ("read");
+          return 0;
+        @}
+      /* @r{If we reach end of file, make a note to read no more.} */
+      if (nread == 0)
+        eof = 1;
+
+      /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */
+      filled += nread;
+
+      /* @r{Convert those bytes to wide characters--as many as we can.} */
+      while (1)
+        @{
+          size_t thislen = mbrtowc (outp, inp, filled, &state);
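+          /* @r{Stop converting at an invalid character.}  */
+          if (thislen == (size_t) -1)
+            break;
+          /* @r{An incomplete character (or an empty buffer) yields}
+             @r{@code{(size_t) -2}; all bytes given have then been}
+             @r{absorbed into @code{state}.}  */
+          if (thislen == (size_t) -2)
+            @{
+              filled = 0;
+              break;
+            @}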
+          /* @r{We want to handle embedded NUL bytes}
+             @r{but the return value is 0.  Correct this.}  */
+          if (thislen == 0)
+            thislen = 1;
+          /* @r{Advance past this character.} */
+          inp += thislen;
+          filled -= thislen;
+          ++outp;
+        @}
+
+      /* @r{Write the wide characters we just made.}  */
+      nwrite = write (output, outbuf,
+                      (outp - outbuf) * sizeof (wchar_t));
+      if (nwrite < 0)
+        @{
+          perror ("write");
+          return 0;
+        @}
+
+      /* @r{See if we have a @emph{real} invalid character.} */
+      if ((eof && filled > 0) || filled >= MB_CUR_MAX)
+        @{
+          error (0, 0, "invalid multibyte character");
+          return 0;
+        @}
+
+      /* @r{If any characters must be carried forward,}
+         @r{put them at the beginning of @code{buffer}.} */
+      if (filled > 0)
+        memmove (buffer, inp, filled);
+    @}
+
+  return 1;
+@}
+@end smallexample
+
+
+@node Non-reentrant Conversion
+@section Non-reentrant Conversion Function
+
+The functions described in the previous section are defined in
+@w{Amendment 1} to @w{ISO C90}, but the original @w{ISO C90} standard 
+also contained functions for character set conversion.  The reason that 
+these original functions are not described first is that they are almost 
+entirely useless.
+
+The problem is that all the conversion functions described in the 
+original @w{ISO C90} use a local state.  Using a local state implies that 
+multiple conversions at the same time (not only when using threads) 
+cannot be done, and that you cannot first convert single characters and 
+then strings since you cannot tell the conversion functions which state 
+to use.
+
+These original functions are therefore usable only in a very limited set 
+of situations.  One must complete converting the entire string before
+starting a new one, and each string/text must be converted with the same
+function (there is no problem with the library itself; it is guaranteed
+that no library function changes the state of any of these functions).
+@strong{For the above reasons it is strongly recommended that the 
+functions described in the previous section be used in place of 
+non-reentrant conversion functions.}
+
+@menu
+* Non-reentrant Character Conversion::  Non-reentrant Conversion of Single
+                                         Characters.
+* Non-reentrant String Conversion::     Non-reentrant Conversion of Strings.
+* Shift State::                         States in Non-reentrant Functions.
+@end menu
+
+@node Non-reentrant Character Conversion
+@subsection Non-reentrant Conversion of Single Characters
+
+@comment stdlib.h
+@comment ISO
+@deftypefun int mbtowc (wchar_t *restrict @var{result}, const char *restrict @var{string}, size_t @var{size})
+The @code{mbtowc} (``multibyte to wide character'') function when called
+with non-null @var{string} converts the first multibyte character
+beginning at @var{string} to its corresponding wide character code.  It
+stores the result in @code{*@var{result}}.
+
+@code{mbtowc} never examines more than @var{size} bytes.  (The idea is
+to supply for @var{size} the number of bytes of data you have in hand.)
+
+@code{mbtowc} with non-null @var{string} distinguishes three
+possibilities: the first @var{size} bytes at @var{string} start with
+valid multibyte characters, they start with an invalid byte sequence or
+just part of a character, or @var{string} points to an empty string (a
+null character).
+
+For a valid multibyte character, @code{mbtowc} converts it to a wide
+character and stores that in @code{*@var{result}}, and returns the
+number of bytes in that character (always at least @math{1} and never
+more than @var{size}).
+
+For an invalid byte sequence, @code{mbtowc} returns @math{-1}.  For an
+empty string, it returns @math{0}, also storing @code{'\0'} in
+@code{*@var{result}}.
+
+If the multibyte character code uses shift characters, then
+@code{mbtowc} maintains and updates a shift state as it scans.  If you
+call @code{mbtowc} with a null pointer for @var{string}, that
+initializes the shift state to its standard initial value.  It also
+returns nonzero if the multibyte character code in use actually has a
+shift state.  @xref{Shift State}.
+@end deftypefun
+
+@comment stdlib.h
+@comment ISO
+@deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar})
+The @code{wctomb} (``wide character to multibyte'') function converts
+the wide character code @var{wchar} to its corresponding multibyte
+character sequence, and stores the result in bytes starting at
+@var{string}.  At most @code{MB_CUR_MAX} bytes are stored.
+
+@code{wctomb} with non-null @var{string} distinguishes three
+possibilities for @var{wchar}: a valid wide character code (one that can
+be translated to a multibyte character), an invalid code, and 
+@code{L'\0'}.
+
+Given a valid code, @code{wctomb} converts it to a multibyte character,
+storing the bytes starting at @var{string}.  Then it returns the number
+of bytes in that character (always at least @math{1} and never more
+than @code{MB_CUR_MAX}).
+
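+If @var{wchar} is an invalid wide character code, @code{wctomb} returns
+@math{-1}.  If @var{wchar} is @code{L'\0'}, it stores @code{'\0'} in
+@code{*@var{string}} (preceded by any shift sequence needed to return to
+the initial shift state) and returns the number of bytes stored, which
+is @math{1} for encodings without shift states.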
+
+If the multibyte character code uses shift characters, then
+@code{wctomb} maintains and updates a shift state as it scans.  If you
+call @code{wctomb} with a null pointer for @var{string}, that
+initializes the shift state to its standard initial value.  It also
+returns nonzero if the multibyte character code in use actually has a
+shift state.  @xref{Shift State}.
+
+Calling this function with a @var{wchar} argument of zero when
+@var{string} is not null has the side-effect of reinitializing the
+stored shift state @emph{as well as} storing the multibyte character
+@code{'\0'} and returning the number of bytes stored.
+@end deftypefun
+
+Similar to @code{mbrlen} there is also a non-reentrant function that
+computes the length of a multibyte character.  It can be defined in
+terms of @code{mbtowc}.
+
+@comment stdlib.h
+@comment ISO
+@deftypefun int mblen (const char *@var{string}, size_t @var{size})
+The @code{mblen} function with a non-null @var{string} argument returns
+the number of bytes that make up the multibyte character beginning at
+@var{string}, never examining more than @var{size} bytes.  (The idea is
+to supply for @var{size} the number of bytes of data you have in hand.)
+
+The return value of @code{mblen} distinguishes three possibilities: the
+first @var{size} bytes at @var{string} start with valid multibyte
+characters, they start with an invalid byte sequence or just part of a
+character, or @var{string} points to an empty string (a null character).
+
+For a valid multibyte character, @code{mblen} returns the number of
+bytes in that character (always at least @math{1} and never more than
+@var{size}).  For an invalid byte sequence, @code{mblen} returns 
+@math{-1}.  For an empty string, it returns @math{0}.
+
+If the multibyte character code uses shift characters, then @code{mblen}
+maintains and updates a shift state as it scans.  If you call
+@code{mblen} with a null pointer for @var{string}, that initializes the
+shift state to its standard initial value.  It also returns a nonzero
+value if the multibyte character code in use actually has a shift state.
+@xref{Shift State}.
+
+@pindex stdlib.h
+The function @code{mblen} is declared in @file{stdlib.h}.
+@end deftypefun
+
+
+@node Non-reentrant String Conversion
+@subsection Non-reentrant Conversion of Strings
+
+For convenience the @w{ISO C90} standard also defines functions to 
+convert entire strings instead of single characters.  These functions
+suffer from the same problems as their reentrant counterparts from
+@w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}.
+
+@comment stdlib.h
+@comment ISO
+@deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size})
+The @code{mbstowcs} (``multibyte string to wide character string'')
+function converts the null-terminated string of multibyte characters
+@var{string} to an array of wide character codes, storing not more than
+@var{size} wide characters into the array beginning at @var{wstring}.
+The terminating null character counts towards the size, so if @var{size}
+is less than the actual number of wide characters resulting from
+@var{string}, no terminating null character is stored.
+
+The conversion of characters from @var{string} begins in the initial
+shift state.
+
+If an invalid multibyte character sequence is found, the @code{mbstowcs} 
+function returns a value of @math{-1}.  Otherwise, it returns the number 
+of wide characters stored in the array @var{wstring}.  This number does 
+not include the terminating null character, which is present if the 
+number is less than @var{size}.
+
+Here is an example showing how to convert a string of multibyte
+characters, allocating enough space for the result.
+
+@smallexample
+wchar_t *
+mbstowcs_alloc (const char *string)
+@{
+  size_t size = strlen (string) + 1;
+  wchar_t *buf = xmalloc (size * sizeof (wchar_t));
+
+  size = mbstowcs (buf, string, size);
+  if (size == (size_t) -1)
+    @{
+      /* @r{Do not leak the buffer on a conversion error.}  */
+      free (buf);
+      return NULL;
+    @}
+  buf = xrealloc (buf, (size + 1) * sizeof (wchar_t));
+  return buf;
+@}
+@end smallexample
+
+@end deftypefun
+
+@comment stdlib.h
+@comment ISO
+@deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size})
+The @code{wcstombs} (``wide character string to multibyte string'')
+function converts the null-terminated wide character array @var{wstring}
+into a string containing multibyte characters, storing not more than
+@var{size} bytes starting at @var{string}, followed by a terminating
+null character if there is room.  The conversion of characters begins in
+the initial shift state.
+
+The terminating null character counts towards the size, so if @var{size}
+is less than or equal to the number of bytes needed to represent
+@var{wstring}, no terminating null character is stored.
+
+If a code that does not correspond to a valid multibyte character is
+found, the @code{wcstombs} function returns a value of @math{-1}. 
+Otherwise, the return value is the number of bytes stored in the array 
+@var{string}.  This number does not include the terminating null character, 
+which is present if the number is less than @var{size}.
+@end deftypefun
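+
+The sizing rule for the output buffer can be sketched as follows.  The
+helper name @code{wcstombs_alloc} is hypothetical, and the bound of
+@code{MB_CUR_MAX} bytes per wide character is an assumption that holds
+for stateless encodings; a stateful encoding may need extra room for
+shift sequences.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert WS to a newly allocated multibyte
   string.  Returns NULL on allocation or conversion failure; the
   caller frees the result.  */
char *
wcstombs_alloc (const wchar_t *ws)
{
  /* Assume at most MB_CUR_MAX bytes per wide character, plus one
     byte for the terminating null character.  */
  size_t size = wcslen (ws) * MB_CUR_MAX + 1;
  char *buf = malloc (size);

  if (buf == NULL)
    return NULL;
  if (wcstombs (buf, ws, size) == (size_t) -1)
    {
      free (buf);
      return NULL;
    }
  return buf;
}
```

+Because the buffer is one byte larger than the assumed maximum, the
+result is null-terminated whenever the conversion succeeds.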
+
+@node Shift State
+@subsection States in Non-reentrant Functions
+
+In some multibyte character codes, the @emph{meaning} of any particular
+byte sequence is not fixed; it depends on what other sequences have come
+earlier in the same string.  Typically there are just a few sequences that 
+can change the meaning of other sequences; these few are called 
+@dfn{shift sequences} and we say that they set the @dfn{shift state} for
+other sequences that follow.
+
+To illustrate shift state and shift sequences, suppose we decide that
+the sequence @code{0200} (just one byte) enters Japanese mode, in which
+pairs of bytes in the range from @code{0240} to @code{0377} are single
+characters, while @code{0201} enters Latin-1 mode, in which single bytes
+in the range from @code{0240} to @code{0377} are characters, and
+interpreted according to the ISO Latin-1 character set.  This is a
+multibyte code that has two alternative shift states (``Japanese mode''
+and ``Latin-1 mode''), and two shift sequences that specify particular
+shift states.
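+
+To make the effect of a shift state concrete, here is a toy decoder for
+the hypothetical code just described.  It is only a sketch: the function
+name, the choice of Latin-1 mode as the initial state, and the treatment
+of bytes below @code{0200} as single characters are all assumptions made
+for illustration.

```c
#include <stddef.h>

enum shift_state { LATIN1_MODE, JAPANESE_MODE };

/* Count the characters in the N bytes at S under the toy encoding.
   Returns -1 for malformed input.  */
long
count_chars_toy (const unsigned char *s, size_t n)
{
  enum shift_state state = LATIN1_MODE;   /* Assumed initial state.  */
  long count = 0;
  size_t i = 0;

  while (i < n)
    {
      unsigned char c = s[i];

      if (c == 0200)
        {
          state = JAPANESE_MODE;          /* Shift sequence, no character.  */
          i++;
        }
      else if (c == 0201)
        {
          state = LATIN1_MODE;            /* Shift sequence, no character.  */
          i++;
        }
      else if (c >= 0240 && state == JAPANESE_MODE)
        {
          /* A pair of bytes in 0240..0377 forms one character.  */
          if (i + 1 >= n || s[i + 1] < 0240)
            return -1;
          i += 2;
          count++;
        }
      else if (c >= 0240 || c < 0200)
        {
          /* One byte is one character.  */
          i++;
          count++;
        }
      else
        return -1;                        /* 0202..0237 are unassigned here.  */
    }
  return count;
}
```

+The same two bytes in the range @code{0240} to @code{0377} count as two
+characters in Latin-1 mode but as one character after the byte
+@code{0200}, which is exactly why a scanner must track the state.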
+
+When the multibyte character code in use has shift states, then
+@code{mblen}, @code{mbtowc}, and @code{wctomb} must maintain and update
+the current shift state as they scan the string.  To make this work
+properly, you must follow these rules:
+
+@itemize @bullet
+@item
+Before starting to scan a string, call the function with a null pointer
+for the multibyte character address---for example, @code{mblen (NULL,
+0)}.  This initializes the shift state to its standard initial value.
+
+@item
+Scan the string one character at a time, in order.  Do not ``back up''
+and rescan characters already scanned, and do not intersperse the
+processing of different strings.
+@end itemize
+
+Here is an example of using @code{mblen} following these rules:
+
+@smallexample
+void
+scan_string (char *s)
+@{
+  int length = strlen (s);
+
+  /* @r{Initialize shift state.}  */
+  mblen (NULL, 0);
+
+  while (1)
+    @{
+      int thischar = mblen (s, length);
+      /* @r{Deal with end of string and invalid characters.}  */
+      if (thischar == 0)
+        break;
+      if (thischar == -1)
+        @{
+          error ("invalid multibyte character");
+          break;
+        @}
+      /* @r{Advance past this character.}  */
+      s += thischar;
+      length -= thischar;
+    @}
+@}
+@end smallexample
+
+The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not
+reentrant when using a multibyte code that uses a shift state.  However,
+no other library functions call these functions, so you don't have to
+worry that the shift state will be changed mysteriously.
+
+
+@node Generic Charset Conversion
+@section Generic Charset Conversion
+
+The conversion functions mentioned so far in this chapter all had in
+common that they operate on character sets that are not directly
+specified by the functions.  The multibyte encoding used is specified by
+the currently selected locale for the @code{LC_CTYPE} category.  The
+wide character set is fixed by the implementation (in the case of the GNU C
+library it is always UCS-4 encoded @w{ISO 10646}).
+
+This of course causes several problems when it comes to general character
+conversion:
+
+@itemize @bullet
+@item
+For every conversion where neither the source nor the destination 
+character set is the character set of the locale for the @code{LC_CTYPE} 
+category, one has to change the @code{LC_CTYPE} locale using 
+@code{setlocale}.
+
+Changing the @code{LC_CTYPE} locale introduces major problems for the rest
+of the program since several more functions (e.g., the character
+classification functions, @pxref{Classification of Characters}) use the
+@code{LC_CTYPE} category.
+
+@item
+Parallel conversions to and from different character sets are not
+possible since the @code{LC_CTYPE} selection is global and shared by all
+threads.
+
+@item
+If neither the source nor the destination character set is the character
+set used for @code{wchar_t} representation, there is at least a two-step
+process necessary to convert a text using the functions above.  One would 
+have to select the source character set as the multibyte encoding, 
+convert the text into a @code{wchar_t} text, select the destination
+character set as the multibyte encoding, and convert the wide character
+text to the multibyte (@math{=} destination) character set.
+
+Even if this is possible (which is not guaranteed) it is very tiring
+work.  Plus it suffers from the other two raised points even more due to
+the constant changing of the locale.
+@end itemize
+
+The XPG2 standard defines a completely new set of functions, which has
+none of these limitations.  They are not at all coupled to the selected
+locales, and they have no constraints on the character sets selected for
+source and destination.  Only the set of available conversions limits 
+them.  The standard does not specify that any conversion at all must be 
+available.  Such availability is a measure of the quality of the 
+implementation.
+
+The following text first describes the interface to @code{iconv} and
+then the conversion function itself.  Comparisons with other
+implementations will show what obstacles stand in the way of portable
+applications.  Finally, the implementation is described insofar as it might
+interest the advanced user who wants to extend conversion capabilities.
+
+@menu
+* Generic Conversion Interface::    Generic Character Set Conversion Interface.
+* iconv Examples::                  A complete @code{iconv} example.
+* Other iconv Implementations::     Some Details about other @code{iconv}
+                                     Implementations.
+* glibc iconv Implementation::      The @code{iconv} Implementation in the GNU C
+                                     library.
+@end menu
+
+@node Generic Conversion Interface
+@subsection Generic Character Set Conversion Interface
+
+This set of functions follows the traditional cycle of using a resource:
+open--use--close.  The interface consists of three functions, each of
+which implements one step.
+
+Before the interfaces are described it is necessary to introduce a
+data type.  Just like other open--use--close interfaces the functions
+introduced here work using handles and the @file{iconv.h} header
+defines a special type for the handles used.
+
+@comment iconv.h
+@comment XPG2
+@deftp {Data Type} iconv_t
+This data type is an abstract type defined in @file{iconv.h}.  The user
+must not assume anything about the definition of this type; it must be
+completely opaque.
+
+Objects of this type can be assigned handles for the conversions using
+the @code{iconv} functions.  The objects themselves need not be freed, but
+the conversions for which the handles stand must be.
+@end deftp
+
+@noindent
+The first step is the function to create a handle.
+
+@comment iconv.h
+@comment XPG2
+@deftypefun iconv_t iconv_open (const char *@var{tocode}, const char *@var{fromcode})
+The @code{iconv_open} function has to be used before starting a
+conversion.  The two parameters this function takes determine the
+source and destination character set for the conversion, and if the
+implementation has the possibility to perform such a conversion, the
+function returns a handle.
+
+If the wanted conversion is not available, the @code{iconv_open} function
+returns @code{(iconv_t) -1}.  In this case the global variable
+@code{errno} can have the following values:
+
+@table @code
+@item EMFILE
+The process already has @code{OPEN_MAX} file descriptors open.
+@item ENFILE
+The system limit of open files is reached.
+@item ENOMEM
+Not enough memory to carry out the operation.
+@item EINVAL
+The conversion from @var{fromcode} to @var{tocode} is not supported.
+@end table
+
+It is not possible to use the same descriptor in different threads to
+perform independent conversions.  The data structures associated
+with the descriptor include information about the conversion state.
+This must not be messed up by using it in different conversions.
+
+An @code{iconv} descriptor is like a file descriptor in that a new
+descriptor must be created for every use.  The descriptor does not stand
+for all of the conversions from @var{fromcode} to @var{tocode}.
+
+The GNU C library implementation of @code{iconv_open} has one
+significant extension to other implementations.  To ease the extension
+of the set of available conversions, the implementation allows storing
+the necessary files with data and code in an arbitrary number of 
+directories.  How this extension must be written will be explained below
+(@pxref{glibc iconv Implementation}).  Here it is only important to say
+that all directories mentioned in the @code{GCONV_PATH} environment
+variable are considered only if they contain a file @file{gconv-modules}.
+These directories need not necessarily be created by the system
+administrator.  In fact, this extension is introduced to help users
+write and use their own, new conversions.  Of course, this does not
+work for security reasons in SUID binaries; in this case only the system
+directory is considered and this normally is 
+@file{@var{prefix}/lib/gconv}.  The @code{GCONV_PATH} environment 
+variable is examined exactly once at the first call of the 
+@code{iconv_open} function.  Later modifications of the variable have no 
+effect.
+
+@pindex iconv.h
+The @code{iconv_open} function was introduced early in the X/Open 
+Portability Guide, @w{version 2}.  It is supported by all commercial 
+Unices as it is required for the Unix branding.  However, the quality and
+completeness of the implementation vary widely.  The @code{iconv_open}
+function is declared in @file{iconv.h}.
+@end deftypefun
+
+The @code{iconv} implementation can associate large data structures with
+the handle returned by @code{iconv_open}.  Therefore, it is crucial to
+free all the resources once all conversions are carried out and the
+handle is no longer needed.
+
+@comment iconv.h
+@comment XPG2
+@deftypefun int iconv_close (iconv_t @var{cd})
+The @code{iconv_close} function frees all resources associated with the
+handle @var{cd}, which must have been returned by a successful call to
+the @code{iconv_open} function.
+
+If the function call was successful the return value is @math{0}.
+Otherwise it is @math{-1} and @code{errno} is set appropriately.
+Defined errors are:
+
+@table @code
+@item EBADF
+The conversion descriptor is invalid.
+@end table
+
+@pindex iconv.h
+The @code{iconv_close} function was introduced together with the rest 
+of the @code{iconv} functions in XPG2 and is declared in @file{iconv.h}.
+@end deftypefun
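+
+The open and close steps on their own are already enough to probe
+whether a conversion exists.  The following is a small sketch; the
+helper name @code{conversion_available} is hypothetical, and which
+charset names are accepted depends on the implementation.

```c
#include <iconv.h>

/* Hypothetical helper: return 1 if a conversion from FROMCODE to
   TOCODE can be opened, 0 otherwise.  */
int
conversion_available (const char *tocode, const char *fromcode)
{
  iconv_t cd = iconv_open (tocode, fromcode);

  if (cd == (iconv_t) -1)
    return 0;

  /* The handle is not needed any further, so release it at once.  */
  iconv_close (cd);
  return 1;
}
```

+Note the argument order, which mirrors @code{iconv_open}: the
+destination character set comes first.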
+
+The standard defines only one actual conversion function.  This has,
+therefore, the most general interface: it allows conversion from one
+buffer to another.  Conversion from a file to a buffer, vice versa, or
+even file to file can be implemented on top of it.
+
+@comment iconv.h
+@comment XPG2
+@deftypefun size_t iconv (iconv_t @var{cd}, char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
+@cindex stateful
+The @code{iconv} function converts the text in the input buffer
+according to the rules associated with the descriptor @var{cd} and
+stores the result in the output buffer.  It is possible to call the
+function for the same text several times in a row since for stateful
+character sets the necessary state information is kept in the data
+structures associated with the descriptor.
+
+The input buffer is specified by @code{*@var{inbuf}} and it contains
+@code{*@var{inbytesleft}} bytes.  The extra indirection is necessary for
+communicating the used input back to the caller (see below).  It is
+important to note that the buffer pointer is of type @code{char} and the
+length is measured in bytes even if the input text is encoded in wide
+characters.
+
+The output buffer is specified in a similar way.  @code{*@var{outbuf}}
+points to the beginning of the buffer with at least
+@code{*@var{outbytesleft}} bytes room for the result.  The buffer
+pointer again is of type @code{char} and the length is measured in
+bytes.  If @var{outbuf} or @code{*@var{outbuf}} is a null pointer, the
+conversion is performed but no output is available.
+
+If @var{inbuf} is a null pointer, the @code{iconv} function performs the
+necessary action to put the state of the conversion into the initial
+state.  This is obviously a no-op for non-stateful encodings, but if the
+encoding has a state, such a function call might put some byte sequences
+in the output buffer, which perform the necessary state changes.  The
+next call with @var{inbuf} not being a null pointer then simply goes on
+from the initial state.  It is important that the programmer never makes
+any assumption as to whether the conversion has to deal with states.  
+Even if the input and output character sets are not stateful, the 
+implementation might still have to keep states.  This is due to the
+implementation chosen for the GNU C library as it is described below.
+Therefore an @code{iconv} call to reset the state should always be
+performed if some protocol requires this for the output text.
+
+The conversion stops for one of three reasons. The first is that all
+characters from the input buffer are converted.  This actually can mean
+two things: either all bytes from the input buffer are consumed or
+there are some bytes at the end of the buffer that possibly can form a
+complete character but the input is incomplete.  The second reason for a
+stop is that the output buffer is full.  And the third reason is that
+the input contains invalid characters.
+
+In all of these cases the buffer pointers after the last successful
+conversion, for the input and output buffers, are stored in
+@code{*@var{inbuf}} and @code{*@var{outbuf}}, and the available room in
+each buffer is stored in @code{*@var{inbytesleft}} and
+@code{*@var{outbytesleft}}.
+
+Since the character sets selected in the @code{iconv_open} call can be
+almost arbitrary, there can be situations where the input buffer contains
+valid characters, which have no identical representation in the output
+character set.  The behavior in this situation is undefined.  The
+@emph{current} behavior of the GNU C library in this situation is to
+return with an error immediately.  This certainly is not the most
+desirable solution; therefore, future versions will provide better ones,
+but they are not yet finished.
+
+If all input from the input buffer is successfully converted and stored
+in the output buffer, the function returns the number of non-reversible
+conversions performed.  In all other cases the return value is
+@code{(size_t) -1} and @code{errno} is set appropriately.  In such cases
+the value pointed to by @var{inbytesleft} is nonzero.
+
+@table @code
+@item EILSEQ
+The conversion stopped because of an invalid byte sequence in the input.
+After the call, @code{*@var{inbuf}} points at the first byte of the
+invalid byte sequence.
+
+@item E2BIG
+The conversion stopped because it ran out of space in the output buffer.
+
+@item EINVAL
+The conversion stopped because of an incomplete byte sequence at the end
+of the input buffer.
+
+@item EBADF
+The @var{cd} argument is invalid.
+@end table
+
+@pindex iconv.h
+The @code{iconv} function was introduced in the XPG2 standard and is 
+declared in the @file{iconv.h} header.
+@end deftypefun
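+
+A single-buffer conversion following the description above might look
+like the sketch below.  The function name @code{utf8_to_wcs} is
+hypothetical, and the charset names @code{"UTF-8"} and @code{"WCHAR_T"}
+are assumed to be accepted by the @code{iconv_open} in use.

```c
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: convert the INLEN bytes at IN (assumed to be
   UTF-8) into at most OUTMAX - 1 wide characters at OUT and
   null-terminate the result.  Returns the number of wide characters
   stored, or (size_t) -1 on any error.  */
size_t
utf8_to_wcs (const char *in, size_t inlen, wchar_t *out, size_t outmax)
{
  iconv_t cd;
  char *inptr = (char *) in;    /* iconv wants char **; input is only read.  */
  char *outptr = (char *) out;
  size_t outleft;
  size_t rc;
  size_t n;

  if (outmax == 0)
    return (size_t) -1;
  /* Lengths are measured in bytes; keep room for the terminator.  */
  outleft = (outmax - 1) * sizeof (wchar_t);

  cd = iconv_open ("WCHAR_T", "UTF-8");
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  rc = iconv (cd, &inptr, &inlen, &outptr, &outleft);
  if (rc != (size_t) -1)
    /* Return to the initial state; a no-op for stateless encodings.  */
    rc = iconv (cd, NULL, NULL, &outptr, &outleft);
  iconv_close (cd);

  if (rc == (size_t) -1)
    return (size_t) -1;         /* EILSEQ, EINVAL or E2BIG.  */

  n = (wchar_t *) outptr - out;
  out[n] = L'\0';
  return n;
}
```

+This sketch treats a truncated trailing sequence (@code{EINVAL}) as a
+plain failure; code that reads input in chunks would instead keep the
+unconverted bytes and retry once more input is available.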
+
+The definition of the @code{iconv} function is quite good overall.  It
+provides quite flexible functionality.  The only problems lie in the
+boundary cases, which are incomplete byte sequences at the end of the
+input buffer and invalid input.  A third problem, which is not really
+a design problem, is the way conversions are selected.  The standard
+does not say anything about the legitimate names or about a minimal set
+of available conversions.  We will see how this negatively impacts other
+implementations, as demonstrated below.
+
+@node iconv Examples
+@subsection A complete @code{iconv} example
+
+The example below features a solution for a common problem.  Given that
+one knows the internal encoding used by the system for @code{wchar_t}
+strings, one often is in the position to read text from a file and store
+it in wide character buffers.  One can do this using @code{mbsrtowcs},
+but then we run into the problems discussed above.
+
+@smallexample
+int
+file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail)
+@{
+  char inbuf[BUFSIZ];
+  size_t insize = 0;
+  char *wrptr = (char *) outbuf;
+  int result = 0;
+  iconv_t cd;
+
+  cd = iconv_open ("WCHAR_T", charset);
+  if (cd == (iconv_t) -1)
+    @{
+      /* @r{Something went wrong.}  */
+      if (errno == EINVAL)
+        error (0, 0, "conversion from '%s' to wchar_t not available",
+               charset);
+      else
+        perror ("iconv_open");
+
+      /* @r{Terminate the output string.}  */
+      *outbuf = L'\0';
+
+      return -1;
+    @}
+
+  while (avail > 0)
+    @{
+      size_t nread;
+      size_t nconv;
+      char *inptr = inbuf;
+
+      /* @r{Read more input.}  */
+      nread = read (fd, inbuf + insize, sizeof (inbuf) - insize);
+      if (nread == 0)
+        @{
+          /* @r{When we come here the file is completely read.}
+             @r{This still could mean there are some unused}
+             @r{characters in the @code{inbuf}.  Put them back.}  */
+          if (lseek (fd, -(off_t) insize, SEEK_CUR) == -1)
+            result = -1;
+
+          /* @r{Now write out the byte sequence to get into the}
+             @r{initial state if this is necessary.}  */
+          iconv (cd, NULL, NULL, &wrptr, &avail);
+
+          break;
+        @}
+      insize += nread;
+
+      /* @r{Do the conversion.}  */
+      nconv = iconv (cd, &inptr, &insize, &wrptr, &avail);
+      if (nconv == (size_t) -1)
+        @{
+          /* @r{Not everything went right.  It might only be}
+             @r{an unfinished byte sequence at the end of the}
+             @r{buffer.  Or it is a real problem.}  */
+          if (errno == EINVAL)
+            /* @r{This is harmless.  Simply move the unused}
+               @r{bytes to the beginning of the buffer so that}
+               @r{they can be used in the next round.}  */
+            memmove (inbuf, inptr, insize);
+          else
+            @{
+              /* @r{It is a real problem.  Maybe we ran out of}
+                 @r{space in the output buffer or we have invalid}
+                 @r{input.  In any case back the file pointer to}
+                 @r{the position of the last processed byte.}  */
+              lseek (fd, -(off_t) insize, SEEK_CUR);
+              result = -1;
+              break;
+            @}
+        @}
+    @}
+
+  /* @r{Terminate the output string.}  */
+  if (avail >= sizeof (wchar_t))
+    *((wchar_t *) wrptr) = L'\0';
+
+  if (iconv_close (cd) != 0)
+    perror ("iconv_close");
+
+  return (wchar_t *) wrptr - outbuf;
+@}
+@end smallexample
+
+@cindex stateful
+This example shows the most important aspects of using the @code{iconv}
+functions.  It shows how successive calls to @code{iconv} can be used to
+convert large amounts of text.  The user does not have to care about
+stateful encodings as the functions take care of everything.
+
+An interesting point is the case where @code{iconv} returns an error and
+@code{errno} is set to @code{EINVAL}.  This is not really an error in the 
+transformation.  It can happen whenever the input character set contains 
+byte sequences of more than one byte for some character and texts are not 
+processed in one piece.  In this case there is a chance that a multibyte 
+sequence is cut.  The caller can then simply read the remainder of the
+text and feed the offending bytes together with new characters from the
+input to @code{iconv} and continue the work.  The internal state kept in
+the descriptor is @emph{not} unspecified after such an event as is the 
+case with the conversion functions from the @w{ISO C} standard.
+
+The example also shows the problem of using wide character strings with
+@code{iconv}.  As explained in the description of the @code{iconv}
+function above, the function always takes a pointer to a @code{char}
+array and the available space is measured in bytes.  In the example, the
+output buffer is a wide character buffer; therefore, we use a local
+variable @var{wrptr} of type @code{char *}, which is used in the
+@code{iconv} calls.
+
+This looks rather innocent but can lead to problems on platforms that
+have tight restrictions on alignment.  Therefore the caller of @code{iconv}
+has to make sure that the pointers passed are suitable for access of 
+characters from the appropriate character set.  Since, in the
+above case, the input parameter to the function is a @code{wchar_t}
+pointer, this is the case (unless the user violates alignment when
+computing the parameter).  But in other situations, especially when
+writing generic functions where one does not know what type of character
+set one uses and, therefore, treats text as a sequence of bytes, it might
+become tricky.
+
+@node Other iconv Implementations
+@subsection Some Details about other @code{iconv} Implementations
+
+This is not really the place to discuss the @code{iconv} implementation
+of other systems but it is necessary to know a bit about them to write
+portable programs.  The above mentioned problems with the specification
+of the @code{iconv} functions can lead to portability issues.
+
+The first thing to notice is that, due to the large number of character
+sets in use, it is certainly not practical to encode the conversions
+directly in the C library.  Therefore, the conversion information must
+come from files outside the C library.  This is usually done in one or
+both of the following ways:
+
+@itemize @bullet
+@item
+The C library contains a set of generic conversion functions that can
+read the needed conversion tables and other information from data files.
+These files get loaded when necessary.
+
+This solution is problematic as it requires a great deal of effort to
+apply to all character sets (potentially an infinite set).  The
+differences in the structure of the different character sets are so large
+that many different variants of the table-processing functions must be
+developed.  In addition, the generic nature of these functions makes them
+slower than specifically implemented functions.
+
+@item
+The C library only contains a framework that can dynamically load
+object files and execute the conversion functions contained therein.
+
+This solution provides much more flexibility.  The C library itself
+contains only very little code and therefore reduces the general memory
+footprint.  Also, with a documented interface between the C library and
+the loadable modules it is possible for third parties to extend the set
+of available conversion modules.  A drawback of this solution is that
+dynamic loading must be available.
+@end itemize
+
+Some implementations in commercial Unices implement a mixture of these 
+possibilities; the majority implement only the second solution.  Using 
+loadable modules moves the code out of the library itself and keeps 
+the door open for extensions and improvements, but this design is also
+limiting on some platforms since not many platforms support dynamic
+loading in statically linked programs.  On platforms without this
+capability it is therefore not possible to use this interface in
+statically linked programs.  The GNU C library has, on ELF platforms, no
+problems with dynamic loading in these situations; therefore, this
+point is moot.  The danger is that one gets accustomed to this
+situation and forgets about the restrictions on other systems.
+
+A second thing to know about other @code{iconv} implementations is that
+the number of available conversions is often very limited.  Some
+implementations provide, in the standard release (not special
+international or developer releases), at most 100 to 200 conversion
+possibilities.  This does not mean 200 different character sets are
+supported; for example, conversions from one character set to a set of 10 
+others might count as 10 conversions.  Together with the other direction
+this makes 20 conversion possibilities used up by one character set.  One 
+can imagine the thin coverage these platforms provide.  Some Unix vendors
+even provide only a handful of conversions, which renders them useless for 
+almost all uses.
+
+This directly leads to a third and probably the most problematic point.
+Because of the way the @code{iconv} conversion functions are implemented
+on all known Unix systems, the availability of the conversion from
+character set @math{@cal{A}} to @math{@cal{B}} and of the conversion from
+@math{@cal{B}} to @math{@cal{C}} does @emph{not} imply that the
+conversion from @math{@cal{A}} to @math{@cal{C}} is available.
+
+This might not seem unreasonable or problematic at first, but it is
+quite a big problem, as one will notice shortly after hitting it.  To show
+the problem, suppose we write a program that has to convert from
+@math{@cal{A}} to @math{@cal{C}}.  A call like
+
+@smallexample
+cd = iconv_open ("@math{@cal{C}}", "@math{@cal{A}}");
+@end smallexample
+
+@noindent
+fails according to the assumption above.  But what does the program
+do now?  The conversion is necessary; therefore, simply giving up is not
+an option.
+
+This is a nuisance.  The @code{iconv} function should take care of this.
+But how should the program proceed from here on?  If it tries to convert
+to character set @math{@cal{B}} first, the two @code{iconv_open}
+calls
+
+@smallexample
+cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}");
+@end smallexample
+
+@noindent
+and
+
+@smallexample
+cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}");
+@end smallexample
+
+@noindent
+will succeed, but how to find @math{@cal{B}}?
+
+Unfortunately, the answer is: there is no general solution.  On some
+systems guessing might help.  On those systems most character sets can
+convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text.  Besides
+this, only some very system-specific methods can help.  Since the
+conversion functions come from loadable modules and these modules must
+be stored somewhere in the filesystem, one @emph{could} try to find them
+and determine from the available files which conversions are available
+and whether there is an indirect route from @math{@cal{A}} to
+@math{@cal{C}}.
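+
+The guessing strategy can be sketched with the standard @code{iconv}
+interface alone.  This is only an illustration under the assumption that
+both character sets can be converted to and from UTF-8; the helper names
+@code{run_conversion} and @code{convert_with_fallback} are hypothetical
+and not part of any library.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: run one complete conversion with CD, close CD,
   and return the number of bytes written to OUT or (size_t) -1.  */
static size_t
run_conversion (iconv_t cd, const char *in, size_t inlen,
                char *out, size_t outsize)
{
  char *inptr = (char *) in;    /* iconv takes char **; IN is not modified */
  char *outptr = out;
  size_t inleft = inlen;
  size_t outleft = outsize;

  size_t res = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  if (res != (size_t) -1)
    /* Emit any pending shift sequence of a stateful target.  */
    res = iconv (cd, NULL, NULL, &outptr, &outleft);
  iconv_close (cd);
  return res == (size_t) -1 ? (size_t) -1 : outsize - outleft;
}

/* Try the direct conversion first.  Only if iconv_open reports that
   no such conversion exists (EINVAL), guess UTF-8 as the intermediate
   character set and use TMP as scratch space.  */
size_t
convert_with_fallback (const char *to, const char *from,
                       const char *in, size_t inlen,
                       char *out, size_t outsize,
                       char *tmp, size_t tmpsize)
{
  iconv_t cd = iconv_open (to, from);
  if (cd != (iconv_t) -1)
    return run_conversion (cd, in, inlen, out, outsize);
  if (errno != EINVAL)
    return (size_t) -1;

  cd = iconv_open ("UTF-8", from);
  if (cd == (iconv_t) -1)
    return (size_t) -1;
  size_t n = run_conversion (cd, in, inlen, tmp, tmpsize);
  if (n == (size_t) -1)
    return (size_t) -1;

  cd = iconv_open (to, "UTF-8");
  if (cd == (iconv_t) -1)
    return (size_t) -1;
  return run_conversion (cd, tmp, n, out, outsize);
}
```

+The fallback is still a guess: if either half of the detour is missing,
+the function fails just as the direct call did.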
+
+This example shows one of the design errors of @code{iconv} mentioned 
+above.  It should at least be possible to determine the list of available
+conversions programmatically so that if @code{iconv_open} says there is no
+such conversion, one could make sure this also is true for indirect
+routes.
+
+@node glibc iconv Implementation
+@subsection The @code{iconv} Implementation in the GNU C library
+
+After reading about the problems of @code{iconv} implementations in the
+last section it is certainly good to note that the implementation in
+the GNU C library has none of the problems mentioned above.  What
+follows is a step-by-step analysis of the points raised above.  The
+evaluation is based on the current state of the development (as of
+January 1999).  The development of the @code{iconv} functions is not
+complete, but basic functionality has solidified.
+
+The GNU C library's @code{iconv} implementation uses shared loadable
+modules to implement the conversions.  A very small number of
+conversions are built into the library itself but these are only rather
+trivial conversions.
+
+All the benefits of loadable modules are available in the GNU C library
+implementation.  This is especially appealing since the interface is
+well documented (see below), and it, therefore, is easy to write new
+conversion modules.  The drawback of using loadable objects is not a
+problem in the GNU C library, at least on ELF systems.  Since the
+library is able to load shared objects even in statically linked
+binaries, static linking need not be forbidden in case one wants to use 
+@code{iconv}.
+
+The second mentioned problem is the number of supported conversions.
+Currently, the GNU C library supports more than 150 character sets.  The
+way the implementation is designed, the number of supported conversions
+is greater than 22350 (@math{150} times @math{149}).  If any conversion
+from or to a character set is missing, it can be added easily.
+
+Particularly impressive as it may be, this high number is due to the
+fact that the GNU C library implementation of @code{iconv} does not have
+the third problem mentioned above (i.e., whenever there is a conversion
+from a character set @math{@cal{A}} to @math{@cal{B}} and from
+@math{@cal{B}} to @math{@cal{C}} it is always possible to convert from
+@math{@cal{A}} to @math{@cal{C}} directly).  If @code{iconv_open}
+returns an error and sets @code{errno} to @code{EINVAL}, there is no 
+known way, directly or indirectly, to perform the wanted conversion.
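+
+A program can rely on this behavior with a check like the following
+sketch.  The function name @code{conversion_available} is a hypothetical
+helper; only @code{iconv_open}, @code{iconv_close}, and @code{errno} come
+from the standard interface.

```c
#include <errno.h>
#include <iconv.h>

/* Return 1 if a conversion from FROM to TO can be opened, 0 if
   iconv_open reports EINVAL (no conversion is known), and -1 for any
   other error such as a lack of memory or file descriptors.  */
int
conversion_available (const char *to, const char *from)
{
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return errno == EINVAL ? 0 : -1;
  iconv_close (cd);
  return 1;
}
```

+On implementations without triangulation a result of 0 says nothing
+about indirect routes, which is exactly the problem described in the
+previous section.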
+
+@cindex triangulation
+Triangulation is achieved by providing for each character set a 
+conversion from and to UCS-4 encoded @w{ISO 10646}.  Using @w{ISO 10646} 
+as an intermediate representation it is possible to @dfn{triangulate}
+(i.e., convert with an intermediate representation).
+
+There is no inherent requirement to provide a conversion to @w{ISO
+10646} for a new character set, and it is also possible to provide other
+conversions where neither source nor destination character set is @w{ISO
+10646}.  The existing set of conversions is simply meant to cover all 
+conversions that might be of interest.
+
+@cindex ISO-2022-JP
+@cindex EUC-JP
+All currently available conversions use the triangulation method above,
+making conversions run unnecessarily slowly.  If, for example, somebody
+often needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution
+would involve direct conversion between the two character sets, skipping
+the intermediate conversion to @w{ISO 10646}.  The two character sets of
+interest are much more similar to each other than to @w{ISO 10646}.
+
+In such a situation one easily can write a new conversion and provide it
+as a better alternative.  The GNU C library @code{iconv} implementation
+would automatically use the module implementing the conversion if it is
+specified to be more efficient.
+
+@subsubsection Format of @file{gconv-modules} files
+
+All information about the available conversions comes from a file named
+@file{gconv-modules}, which can be found in any of the directories along
+the @code{GCONV_PATH}.  The @file{gconv-modules} files are line-oriented
+text files, where each of the lines has one of the following formats:
+
+@itemize @bullet
+@item
+If the first non-whitespace character is a @kbd{#} the line contains only 
+comments and is ignored.
+
+@item
+Lines starting with @code{alias} define an alias name for a character 
+set.  Two more words are expected on the line.  The first word 
+defines the alias name, and the second defines the original name of the
+character set.  The effect is that it is possible to use the alias name
+in the @var{fromset} or @var{toset} parameters of @code{iconv_open} and
+achieve the same result as when using the real character set name.
+
+This is quite important as a character set often has many different
+names.  There is normally an official name but this need not correspond to
+the most popular name.  Besides this, many character sets have special
+names that are somehow constructed.  For example, all character sets
+specified by the ISO have an alias of the form @code{ISO-IR-@var{nnn}} 
+where @var{nnn} is the registration number.  This allows programs that 
+know about the registration number to construct character set names and 
+use them in @code{iconv_open} calls.  More on the available names and 
+aliases follows below.
+
+@item
+Lines starting with @code{module} introduce an available conversion
+module.  These lines must contain three or four more words.
+
+The first word specifies the source character set, the second word the
+destination character set of conversion implemented in this module, and 
+the third word is the name of the loadable module.  The filename is
+constructed by appending the usual shared object suffix (normally
+@file{.so}) and this file is then supposed to be found in the same
+directory the @file{gconv-modules} file is in.  The last word on the line, 
+which is optional, is a numeric value representing the cost of the
+conversion.  If this word is missing, a cost of @math{1} is assumed.  The
+numeric value itself does not matter that much; what counts are the
+relative values of the sums of costs for all possible conversion paths.
+Below is a more precise description of the use of the cost value.
+@end itemize
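+
+The three line formats can be recognized with very little code.  The
+following sketch is not the parser used by the GNU C library; the types
+and the function @code{parse_gconv_line} are hypothetical, the field
+sizes are arbitrary, and treating blank lines like comments is an
+assumption.

```c
#include <stdio.h>
#include <string.h>

enum line_kind { LINE_IGNORED, LINE_ALIAS, LINE_MODULE, LINE_INVALID };

struct parsed_line
{
  enum line_kind kind;
  char from[64];    /* alias name, or source character set */
  char to[64];      /* original name, or destination character set */
  char module[64];  /* module name without the shared object suffix */
  int cost;         /* 1 unless the line gives a fourth word */
};

struct parsed_line
parse_gconv_line (const char *line)
{
  struct parsed_line p;
  char keyword[16];
  int n;

  memset (&p, 0, sizeof p);
  p.cost = 1;

  /* Skip leading white space; comments (and, by assumption, blank
     lines) carry no information.  */
  line += strspn (line, " \t");
  if (*line == '#' || *line == '\n' || *line == '\0')
    {
      p.kind = LINE_IGNORED;
      return p;
    }

  /* sscanf leaves p.cost untouched if the optional cost is missing.  */
  n = sscanf (line, "%15s %63s %63s %63s %d",
              keyword, p.from, p.to, p.module, &p.cost);
  if (n >= 3 && strcmp (keyword, "alias") == 0)
    p.kind = LINE_ALIAS;
  else if (n >= 4 && strcmp (keyword, "module") == 0)
    p.kind = LINE_MODULE;
  else
    p.kind = LINE_INVALID;
  return p;
}
```

+For an @code{alias} line, @code{from} holds the alias and @code{to} the
+original character set name; the @code{module} field stays empty.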
+
+Return to the example above, where one has written a module to directly
+convert from ISO-2022-JP to EUC-JP and back.  All that has to be done is
+to put the new module, let its name be @file{ISO2022JP-EUCJP.so}, in a
+directory and add a file @file{gconv-modules} with the following content
+in the same directory:
+
+@smallexample
+module  ISO-2022-JP//   EUC-JP//        ISO2022JP-EUCJP    1
+module  EUC-JP//        ISO-2022-JP//   ISO2022JP-EUCJP    1
+@end smallexample
+
+To see why this is sufficient, it is necessary to understand how the
+conversion used by @code{iconv} (and described in the descriptor) is
+selected.  The approach to this problem is quite simple.
+
+At the first call of the @code{iconv_open} function the program reads
+all available @file{gconv-modules} files and builds up two tables: one
+containing all the known aliases and another that contains the
+information about the conversions and which shared object implements
+them.
+
+@subsubsection Finding the conversion path in @code{iconv}
+
+The set of available conversions forms a directed graph with weighted
+edges.  The weights on the edges are the costs specified in the
+@file{gconv-modules} files.  The @code{iconv_open} function uses an
+algorithm suitable for searching for the best path in such a graph and so
+constructs a list of conversions that must be performed in succession
+to get the transformation from the source to the destination character
+set.
+
+Explaining why the above @file{gconv-modules} file allows the
+@code{iconv} implementation to resolve the specific ISO-2022-JP to
+EUC-JP conversion module instead of the conversion coming with the
+library itself is straightforward.  Since the latter conversion takes two
+steps (from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to
+EUC-JP), the cost is @math{1+1 = 2}.  The above @file{gconv-modules}
+file, however, specifies that the new conversion module can perform this
+conversion with only the cost of @math{1}.
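+
+The cost comparison can be reproduced with any shortest-path search.
+The sketch below uses plain Bellman-Ford relaxation over a small edge
+list; which algorithm @code{iconv_open} really uses is not stated here,
+and the names @code{conv_edge} and @code{cheapest_path_cost} as well as
+the limit @code{MAX_NODES} are assumptions made for illustration.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define MAX_NODES 32   /* arbitrary limit for this sketch */

struct conv_edge
{
  const char *from;
  const char *to;
  int cost;            /* assumed to be non-negative */
};

/* Map a character set name to a small integer, adding it if new.
   Returns -1 when the table is full.  */
static int
node_index (const char *names[], int *count, const char *name)
{
  for (int i = 0; i < *count; ++i)
    if (strcmp (names[i], name) == 0)
      return i;
  if (*count == MAX_NODES)
    return -1;
  names[*count] = name;
  return (*count)++;
}

/* Return the summed cost of the cheapest chain of conversions from
   FROM to TO, or -1 if there is none (or the table overflows).  */
int
cheapest_path_cost (const struct conv_edge *edges, size_t nedges,
                    const char *from, const char *to)
{
  const char *names[MAX_NODES];
  int dist[MAX_NODES];
  int count = 0;

  int src = node_index (names, &count, from);
  int dst = node_index (names, &count, to);
  for (size_t i = 0; i < nedges; ++i)
    if (node_index (names, &count, edges[i].from) < 0
        || node_index (names, &count, edges[i].to) < 0)
      return -1;

  for (int i = 0; i < count; ++i)
    dist[i] = INT_MAX;
  dist[src] = 0;

  /* Relax every edge COUNT - 1 times; that is enough for any path.  */
  for (int round = 1; round < count; ++round)
    for (size_t i = 0; i < nedges; ++i)
      {
        int u = node_index (names, &count, edges[i].from);
        int v = node_index (names, &count, edges[i].to);
        if (dist[u] != INT_MAX && dist[u] + edges[i].cost < dist[v])
          dist[v] = dist[u] + edges[i].cost;
      }

  return dist[dst] == INT_MAX ? -1 : dist[dst];
}
```

+With only the two triangulation edges through @code{INTERNAL} the result
+for ISO-2022-JP to EUC-JP is @math{2}; adding the direct edge of cost
+@math{1} lowers it to @math{1}, so the direct module is preferred.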
+
+A mysterious aspect of the @file{gconv-modules} file above (and also
+the file coming with the GNU C library) is the naming of the character
+sets specified in the @code{module} lines.  Why do almost all the names
+end in @code{//}?  And this is not all: the names can actually be
+regular expressions.  At this point in time this mystery should not be
+revealed, unless you have the relevant spell-casting materials: ashes
+from an original @w{DOS 6.2} boot disk burnt in effigy, a crucifix
+blessed by St.@: Emacs, assorted herbal roots from Central America, sand
+from Cebu, etc.  Sorry!  @strong{The part of the implementation where
+this is used is not yet finished.  For now please simply follow the
+existing examples.  It'll become clearer once it is. --drepper}
+
+A last remark about the @file{gconv-modules} file concerns the names not
+ending with @code{//}.  A character set named @code{INTERNAL} is often
+mentioned.  From the discussion above and the chosen name it should have 
+become clear that this is the name for the representation used in the 
+intermediate step of the triangulation.  We have said that this is UCS-4 
+but actually that is not quite right.  The UCS-4 specification also 
+includes the specification of the byte ordering used.  Since a UCS-4 value 
+consists of four bytes, a stored value is affected by byte ordering.  The
+internal representation is @emph{not} the same as UCS-4 in case the byte 
+ordering of the processor (or at least the running process) is not the 
+same as the one required for UCS-4.  This is done for performance reasons 
+as one does not want to perform unnecessary byte-swapping operations if 
+one is not interested in actually seeing the result in UCS-4.  To avoid 
+trouble with endianness, the internal representation consistently is named
+@code{INTERNAL} even on big-endian systems where the representations are 
+identical.
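+
+The difference between the two representations can be made concrete
+with a few lines of C.  This is an illustrative sketch, not library
+code: the function names are hypothetical, and the most-significant-byte-first
+order for UCS-4 follows the remark about big-endian systems above.

```c
#include <stdint.h>
#include <string.h>

/* Store WC most significant byte first, the order assumed for UCS-4.  */
void
store_ucs4 (uint32_t wc, unsigned char out[4])
{
  out[0] = (unsigned char) (wc >> 24);
  out[1] = (unsigned char) (wc >> 16);
  out[2] = (unsigned char) (wc >> 8);
  out[3] = (unsigned char) wc;
}

/* Store WC in the byte order of the running process, as the
   intermediate representation does: no byte swapping is needed.  */
void
store_internal (uint32_t wc, unsigned char out[4])
{
  memcpy (out, &wc, 4);
}

/* Nonzero if both forms coincide on this host, which is the case on
   big-endian systems.  */
int
internal_matches_ucs4 (void)
{
  unsigned char a[4];
  unsigned char b[4];
  store_ucs4 (0x01020304, a);
  store_internal (0x01020304, b);
  return memcmp (a, b, 4) == 0;
}
```

+On a little-endian host @code{internal_matches_ucs4} returns zero, which
+is why the two names are kept apart everywhere.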
+
+@subsubsection @code{iconv} module data structures
+
+So far this section has described how modules are located and selected
+for use.  What remains to be described is the interface of the modules
+so that one can write new ones. This section describes the interface as
+it is in use in January 1999.  The interface will change a bit in the 
+future but, with luck, only in an upwardly compatible way.
+
+The definitions necessary to write new modules are publicly available
+in the non-standard header @file{gconv.h}.  The following text,
+therefore, describes the definitions from this header file.  First, 
+however, it is necessary to get an overview.
+
+From the perspective of the user of @code{iconv} the interface is quite
+simple: the @code{iconv_open} function returns a handle that can be used 
+in calls to @code{iconv}, and finally the handle is freed with a call to 
+@code{iconv_close}.  The problem is that the handle has to be able to
+represent the possibly long sequences of conversion steps and also the
+state of each conversion since the handle is all that is passed to the
+@code{iconv} function.  Therefore, the data structures are really the
+elements necessary to understand the implementation.
+
+We need two different kinds of data structures.  The first describes the
+conversion and the second describes the state etc.  There are really two
+type definitions like this in @file{gconv.h}.
+@pindex gconv.h
+
+@comment gconv.h
+@comment GNU
+@deftp {Data type} {struct __gconv_step}
+This data structure describes one conversion a module can perform.  For
+each function in a loaded module with conversion functions there is
+exactly one object of this type.  This object is shared by all users of
+the conversion (i.e., this object does not contain any information
+corresponding to an actual conversion; it only describes the conversion
+itself).
+
+@table @code
+@item struct __gconv_loaded_object *__shlib_handle
+@itemx const char *__modname
+@itemx int __counter
+All these elements of the structure are used internally in the C library
+to coordinate loading and unloading the shared object.  One must not expect any
+of the other elements to be available or initialized.
+
+@item const char *__from_name
+@itemx const char *__to_name
+@code{__from_name} and @code{__to_name} contain the names of the source and
+destination character sets.  They can be used to identify the actual
+conversion to be carried out since one module might implement conversions 
+for more than one character set and/or direction.
+
+@item gconv_fct __fct
+@itemx gconv_init_fct __init_fct
+@itemx gconv_end_fct __end_fct
+These elements contain pointers to the functions in the loadable module.
+The interface will be explained below.
+
+@item int __min_needed_from
+@itemx int __max_needed_from
+@itemx int __min_needed_to
+@itemx int __max_needed_to
+These values have to be supplied in the init function of the module.  The
+@code{__min_needed_from} value specifies how many bytes a character of
+the source character set at least needs.  The @code{__max_needed_from}
+specifies the maximum value that also includes possible shift sequences.
+
+The @code{__min_needed_to} and @code{__max_needed_to} values serve the
+same purpose as @code{__min_needed_from} and @code{__max_needed_from} but 
+this time for the destination character set.
+
+It is crucial that these values be accurate since otherwise the
+conversion functions will have problems or not work at all.
+
+@item int __stateful
+This element must also be initialized by the init function.
+@code{__stateful} is nonzero if the source character set is stateful.
+Otherwise it is zero.
+
+@item void *__data
+This element can be used freely by the conversion functions in the
+module.  @code{__data} can be used to communicate extra information
+from one call to another.  @code{__data} need not be initialized if
+not needed at all.  If the @code{__data} element is assigned a pointer
+to dynamically allocated memory (presumably in the init function), the
+end function must deallocate the memory.  Otherwise
+the application will leak memory.
+
+It is important to be aware that this data structure is shared by all
+users of this specific conversion and therefore the @code{__data}
+element must not contain data specific to one specific use of the
+conversion function.
+@end table
+@end deftp
+
+@comment gconv.h
+@comment GNU
+@deftp {Data type} {struct __gconv_step_data}
+This is the data structure that contains the information specific to
+each use of the conversion functions.
+
+
+@table @code
+@item char *__outbuf
+@itemx char *__outbufend
+These elements specify the output buffer for the conversion step.  The
+@code{__outbuf} element points to the beginning of the buffer, and
+@code{__outbufend} points to the byte following the last byte in the
+buffer.  The conversion function must not assume anything about the size
+of the buffer but it can be safely assumed that there is room for at
+least one complete character in the output buffer.
+
+Once the conversion is finished, if the conversion is the last step, the
+@code{__outbuf} element must be modified to point after the last byte
+written into the buffer to signal how much output is available.  If this
+conversion step is not the last one, the element must not be modified.
+The @code{__outbufend} element must not be modified.
+
+@item int __is_last
+This element is nonzero if this conversion step is the last one.  This
+information is necessary for the recursion.  See the description of the
+conversion function internals below.  This element must never be
+modified.
+
+@item int __invocation_counter
+The conversion function can use this element to see how many calls of 
+the conversion function already happened.  Some character sets require a 
+certain prolog when generating output, and by comparing this value with
+zero, one can find out whether it is the first call and whether, 
+therefore, the prolog should be emitted.  This element must never be 
+modified.
+
+@item int __internal_use
+This element is another one rarely used but needed in certain
+situations.  It is assigned a nonzero value in case the conversion
+functions are used to implement @code{mbsrtowcs} et al.@: (i.e., the
+function is not used directly through the @code{iconv} interface).
+
+This sometimes makes a difference as it is expected that the
+@code{iconv} functions are used to translate entire texts while the
+@code{mbsrtowcs} functions are normally used only to convert single
+strings and might be used multiple times to convert entire texts.
+
+But in this situation we would have a problem complying with some rules of
+the character set specification.  Some character sets require a prolog,
+which must appear exactly once for an entire text.  If a number of
+@code{mbsrtowcs} calls are used to convert the text, only the first call
+must add the prolog.  However, because there is no communication between the
+different calls of @code{mbsrtowcs}, the conversion functions have no
+possibility to find this out.  The situation is different for sequences
+of @code{iconv} calls since the handle allows access to the needed
+information.
+
+The @code{__internal_use} element is mostly used together with
+@code{__invocation_counter} as follows:
+
+@smallexample
+if (!data->__internal_use
+     && data->__invocation_counter == 0)
+  /* @r{Emit prolog.}  */
+  ...
+@end smallexample
+
+This element must never be modified.
+
+@item mbstate_t *__statep
+The @code{__statep} element points to an object of type @code{mbstate_t}
+(@pxref{Keeping the state}).  The conversion of a stateful character
+set must use the object pointed to by @code{__statep} to store 
+information about the conversion state.  The @code{__statep} element 
+itself must never be modified.
+
+@item mbstate_t __state
+This element must @emph{never} be used directly.  It is only part of
+this structure to have the needed space allocated.
+@end table
+@end deftp
+
+@subsubsection @code{iconv} module interfaces
+
+With the knowledge about the data structures we now can describe the
+conversion function itself.  To understand the interface a bit of
+knowledge is necessary about the functionality in the C library that 
+loads the objects with the conversions.
+
+It is often the case that one conversion is used more than once (i.e.,
+there are several @code{iconv_open} calls for the same set of character
+sets during one program run).  The @code{mbsrtowcs} et al.@: functions in
+the GNU C library also use the @code{iconv} functionality, which 
+increases the number of uses of the same functions even more.
+
+Because of this multiple use of conversions, the modules do not get 
+loaded exclusively for one conversion.  Instead a module once loaded can 
+be used by an arbitrary number of @code{iconv} or @code{mbsrtowcs} calls 
+at the same time.  The splitting of the information between conversion-
+function-specific information and conversion data makes this possible. 
+The last section showed the two data structures used to do this.
+
+This is of course also reflected in the interface and semantics of the
+functions that the modules must provide.  There are three functions that
+must have the following names:
+
+@table @code
+@item gconv_init
+The @code{gconv_init} function initializes the conversion function
+specific data structure.  This very same object is shared by all
+conversions that use this conversion and, therefore, no state information
+about the conversion itself must be stored in here.  If a module 
+implements more than one conversion, the @code{gconv_init} function will 
+be called multiple times.
+
+@item gconv_end
+The @code{gconv_end} function is responsible for freeing all resources
+allocated by the @code{gconv_init} function.  If there is nothing to do,
+this function can be missing.  Special care must be taken if the module
+implements more than one conversion and the @code{gconv_init} function
+does not allocate the same resources for all conversions.
+
+@item gconv
+This is the actual conversion function.  It is called to convert one
+block of text.  It gets passed the conversion step information
+initialized by @code{gconv_init} and the conversion data, specific to
+this use of the conversion functions.
+@end table
+
+There are three data types defined for the three module interface
+functions and these define the interface.
+
+@comment gconv.h
+@comment GNU
+@deftypevr {Data type} int {(*__gconv_init_fct)} (struct __gconv_step *)
+This specifies the interface of the initialization function of the
+module.  It is called exactly once for each conversion the module
+implements.
+
+As explained in the description of the @code{struct __gconv_step} data
+structure above the initialization function has to initialize parts of
+it.
+
+@table @code
+@item __min_needed_from
+@itemx __max_needed_from
+@itemx __min_needed_to
+@itemx __max_needed_to
+These elements must be initialized to the exact numbers of the minimum
+and maximum number of bytes used by one character in the source and
+destination character sets, respectively.  If the characters all have the
+same size, the minimum and maximum values are the same.
+
+@item __stateful
+This element must be initialized to a nonzero value if the source
+character set is stateful.  Otherwise it must be zero.
+@end table
+
+If the initialization function needs to communicate some information
+to the conversion function, this communication can happen using the 
+@code{__data} element of the @code{__gconv_step} structure.  But since 
+this data is shared by all the conversions, it must not be modified by 
+the conversion function.  The example below shows how this can be used.
+
+@smallexample
+#define MIN_NEEDED_FROM         1
+#define MAX_NEEDED_FROM         4
+#define MIN_NEEDED_TO           4
+#define MAX_NEEDED_TO           4
+
+int
+gconv_init (struct __gconv_step *step)
+@{
+  /* @r{Determine which direction.}  */
+  struct iso2022jp_data *new_data;
+  enum direction dir = illegal_dir;
+  enum variant var = illegal_var;
+  int result;
+
+  if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0)
+    @{
+      dir = from_iso2022jp;
+      var = iso2022jp;
+    @}
+  else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0)
+    @{
+      dir = to_iso2022jp;
+      var = iso2022jp;
+    @}
+  else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0)
+    @{
+      dir = from_iso2022jp;
+      var = iso2022jp2;
+    @}
+  else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0)
+    @{
+      dir = to_iso2022jp;
+      var = iso2022jp2;
+    @}
+
+  result = __GCONV_NOCONV;
+  if (dir != illegal_dir)
+    @{
+      new_data = (struct iso2022jp_data *)
+        malloc (sizeof (struct iso2022jp_data));
+
+      result = __GCONV_NOMEM;
+      if (new_data != NULL)
+        @{
+          new_data->dir = dir;
+          new_data->var = var;
+          step->__data = new_data;
+
+          if (dir == from_iso2022jp)
+            @{
+              step->__min_needed_from = MIN_NEEDED_FROM;
+              step->__max_needed_from = MAX_NEEDED_FROM;
+              step->__min_needed_to = MIN_NEEDED_TO;
+              step->__max_needed_to = MAX_NEEDED_TO;
+            @}
+          else
+            @{
+              step->__min_needed_from = MIN_NEEDED_TO;
+              step->__max_needed_from = MAX_NEEDED_TO;
+              step->__min_needed_to = MIN_NEEDED_FROM;
+              step->__max_needed_to = MAX_NEEDED_FROM + 2;
+            @}
+
+          /* @r{Yes, this is a stateful encoding.}  */
+          step->__stateful = 1;
+
+          result = __GCONV_OK;
+        @}
+    @}
+
+  return result;
+@}
+@end smallexample
+
+The function first checks which conversion is wanted.  The module from
+which this function is taken implements four different conversions; 
+which one is selected can be determined by comparing the names.  The
+comparison should always be done without paying attention to the case.
+
+Next, a data structure, which contains the necessary information about 
+which conversion is selected, is allocated.  The data structure
+@code{struct iso2022jp_data} is locally defined since, outside the 
+module, this data is not used at all.  Please note that if all four 
+conversions this module supports are requested there are four data
+blocks.
+
+One interesting thing is the initialization of the @code{__min_} and
+@code{__max_} elements of the step data object.  A single ISO-2022-JP
+character can consist of one to four bytes.  Therefore the
+@code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined
+this way.  The output is always the @code{INTERNAL} character set (aka
+UCS-4) and therefore each character consists of exactly four bytes.  For
+the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into
+account that escape sequences might be necessary to switch the character
+sets.  Therefore the @code{__max_needed_to} element for this direction
+gets assigned @code{MAX_NEEDED_FROM + 2}.  This takes into account the
+two bytes needed for the escape sequences to signal the switching.  The
+asymmetry in the maximum values for the two directions can be explained
+easily: when reading ISO-2022-JP text, escape sequences can be handled
+alone (i.e., it is not necessary to process a real character since the
+effect of the escape sequence can be recorded in the state information).
+The situation is different for the other direction.  Since it is in
+general not known which character comes next, one cannot emit escape
+sequences to change the state in advance.  This means the escape
+sequences have to be emitted together with the next character.
+Therefore one needs more room than only for the character itself.
+
+The possible return values of the initialization function are:
+
+@table @code
+@item __GCONV_OK
+The initialization succeeded.
+@item __GCONV_NOCONV
+The requested conversion is not supported in the module.  This can
+happen if the @file{gconv-modules} file has errors.
+@item __GCONV_NOMEM
+Memory required to store additional information could not be allocated.
+@end table
+@end deftypevr
+
+The function called before the module is unloaded is significantly
+easier.  It often has nothing at all to do, in which case it can be left
+out completely.
+
+@comment gconv.h
+@comment GNU
+@deftypevr {Data type} void {(*__gconv_end_fct)} (struct __gconv_step *)
+The task of this function is to free all resources allocated in the
+initialization function.  Therefore only the @code{__data} element of
+the object pointed to by the argument is of interest.  Continuing the
+example from the initialization function, the finalization function
+looks like this:
+
+@smallexample
+void
+gconv_end (struct __gconv_step *data)
+@{
+  free (data->__data);
+@}
+@end smallexample
+@end deftypevr
+
+The most important function is the conversion function itself, which can
+get quite complicated for complex character sets.  But since this is not
+of interest here, we will only describe a possible skeleton for the
+conversion function.
+
+@comment gconv.h
+@comment GNU
+@deftypevr {Data type} int {(*__gconv_fct)} (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int)
+The conversion function can be called for two basic reasons: to convert
+text or to reset the state.  From the description of the @code{iconv}
+function it can be seen why the flushing mode is necessary.  What mode
+is selected is determined by the sixth argument, an integer.  This 
+argument being nonzero means that flushing is selected.
+
+Common to both modes is where the output buffer can be found.  The
+information about this buffer is stored in the conversion step data.  A
+pointer to this information is passed as the second argument to this 
+function.  The description of the @code{struct __gconv_step_data} 
+structure has more information on the conversion step data.
+
+@cindex stateful
+What has to be done for flushing depends on the source character set.
+If the source character set is not stateful, nothing has to be done. 
+Otherwise the function has to emit a byte sequence to bring the state 
+object into the initial state.  Once all this has happened the other
+conversion modules in the chain of conversions have to get the same
+chance.  Whether another step follows can be determined from the
+@code{__is_last} element of the step data structure to which the second
+parameter points.
+
+The more interesting mode is when actual text has to be converted.  The 
+first step in this case is to convert as much text as possible from the 
+input buffer and store the result in the output buffer.  The start of the 
+input buffer is determined by the third argument, which is a pointer to a 
+pointer variable referencing the beginning of the buffer.  The fourth 
+argument is a pointer to the byte right after the last byte in the buffer.
+
+The conversion has to be performed according to the current state if the
+character set is stateful.  The state is stored in an object pointed to
+by the @code{__statep} element of the step data (second argument).  Once
+either the input buffer is empty or the output buffer is full the
+conversion stops.  At this point, the pointer variable referenced by the
+third parameter must point to the byte following the last processed
+byte (i.e., if all of the input is consumed, this pointer and the fourth
+parameter have the same value).
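+
+The stopping rule can be seen in a deliberately simplified loop for a
+stateless, fixed-width conversion (ISO-8859-1 bytes to four-byte values
+in host byte order).  This sketch does not use the types from
+@file{gconv.h}; the function and the @code{TOY_} status values are
+hypothetical stand-ins.

```c
#include <stdint.h>
#include <string.h>

enum toy_status { TOY_EMPTY_INPUT, TOY_FULL_OUTPUT };

/* Convert as many characters as possible.  On return *INBUF points to
   the byte following the last processed byte and *OUTBUF points after
   the last byte written.  */
enum toy_status
toy_latin1_to_internal (const unsigned char **inbuf,
                        const unsigned char *inbufend,
                        unsigned char **outbuf,
                        const unsigned char *outbufend)
{
  const unsigned char *in = *inbuf;
  unsigned char *out = *outbuf;
  enum toy_status status = TOY_EMPTY_INPUT;

  while (in != inbufend)
    {
      /* Stop before a character that no longer fits completely.  */
      if (outbufend - out < 4)
        {
          status = TOY_FULL_OUTPUT;
          break;
        }
      uint32_t wc = *in++;      /* ISO-8859-1 maps 1:1 to ISO 10646 */
      memcpy (out, &wc, 4);     /* host byte order, no swapping */
      out += 4;
    }

  *inbuf = in;
  *outbuf = out;
  return status;
}
```

+If all input is consumed, @code{*inbuf} equals @code{inbufend} on
+return, which matches the requirement stated for the third and fourth
+parameters.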
+
+What now happens depends on whether this step is the last one.  If it is 
+the last step, the only thing that has to be done is to update the 
+@code{__outbuf} element of the step data structure to point after the
+last written byte.  This update gives the caller the information on how 
+much text is available in the output buffer.  In addition, the variable
+pointed to by the fifth parameter, which is of type @code{size_t}, must
+be incremented by the number of characters (@emph{not bytes}) that were
+converted in a non-reversible way.  Then, the function can return.
+
+In case the step is not the last one, the later conversion functions have
+to get a chance to do their work.  Therefore, the appropriate conversion
+function has to be called.  The information about the functions is
+stored in the conversion data structures, passed as the first parameter.
+This information and the step data are stored in arrays, so the next
+element in both cases can be found by simple pointer arithmetic:
+
+@smallexample
+int
+gconv (struct __gconv_step *step, struct __gconv_step_data *data,
+       const char **inbuf, const char *inbufend, size_t *written,
+       int do_flush)
+@{
+  struct __gconv_step *next_step = step + 1;
+  struct __gconv_step_data *next_data = data + 1;
+  ...
+@end smallexample
+
+The @code{next_step} pointer references the next step information and
+@code{next_data} the next data record.  The call of the next function
+therefore will look similar to this:
+
+@smallexample
+  next_step->__fct (next_step, next_data, &outerr, outbuf,
+                    written, 0)
+@end smallexample
+
+But this is not yet all.  Once the function call returns, the conversion
+function might have some more to do.  If the return value of the function
+is @code{__GCONV_EMPTY_INPUT}, more room is available in the output
+buffer.  Unless the input buffer is empty, the conversion function starts
+all over again and processes the rest of the input buffer.  If the return
+value is not @code{__GCONV_EMPTY_INPUT}, something went wrong and we have
+to recover from this.
+
+A requirement for the conversion function is that the input buffer
+pointer (the third argument) always point to the last character that
+was put in converted form into the output buffer.  This is trivially
+true after the conversion performed in the current step, but if the
+conversion functions deeper downstream stop prematurely, not all
+characters from the output buffer are consumed and, therefore, the input
+buffer pointers must be backed off to the right position.
+
+Correcting the input buffers is easy to do if the input and output
+character sets have a fixed width for all characters.  In this situation
+we can compute how many characters are left in the output buffer and,
+therefore, can correct the input buffer pointer appropriately with a
+similar computation.  Things get tricky if either character set
+has characters represented with variable length byte sequences, and
+even more complicated if the conversion has to take care of the
+state.  In these cases the conversion has to be performed once again, from
+the known state before the initial conversion (i.e., if necessary the
+state of the conversion has to be reset and the conversion loop has to be
+executed again).  The difference now is that it is known how much output
+must be created, and the conversion can stop before converting the first
+unused character.  Once this is done the input buffer pointers must be
+updated again and the function can return.
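For the simple fixed-width case, the pointer correction might be sketched as follows.  The helper `backoff_fixed_width` and its parameters are hypothetical and not part of the GNU C library; the sketch assumes both character sets use a constant number of bytes per character, and it borrows the names `outerr` and `outbuf` from the skeleton in this section to mean the first unconsumed output byte and the end of the produced output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of glibc: compute the position the
   input pointer must be reset to when the next step left part of our
   output unconsumed.  INPTR_AFTER is the input position reached by the
   conversion, OUTERR the first output byte the next step did not
   consume, and OUTBUF the end of the output we produced.  Both
   character sets are assumed to be fixed width.  */
static const char *
backoff_fixed_width (const char *inptr_after, const char *outerr,
                     const char *outbuf, size_t in_width, size_t out_width)
{
  assert (outerr <= outbuf);
  assert (in_width > 0 && out_width > 0);

  /* Bytes of our output that the next step did not consume.  */
  size_t unused_out = (size_t) (outbuf - outerr);

  /* Only whole characters can be left over.  */
  assert (unused_out % out_width == 0);

  /* The same number of characters, measured in input bytes.  */
  return inptr_after - (unused_out / out_width) * in_width;
}
```

With a 4-byte input encoding and a 2-byte output encoding, six unconsumed output bytes correspond to three characters, so the input pointer moves back by twelve bytes.  The variable-width and stateful cases cannot be handled this way and need the repeated conversion described above.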
+
+One final thing should be mentioned.  If it is necessary for the
+conversion to know whether it is the first invocation (in case a prolog
+has to be emitted), the conversion function should increment the 
+@code{__invocation_counter} element of the step data structure just 
+before returning to the caller.  See the description of the @code{struct
+__gconv_step_data} structure above for more information on how this can
+be used.
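How a counter of this kind can drive a one-time prolog is sketched below.  The structure `demo_data` and the function `demo_emit` are hypothetical stand-ins for the real step data and conversion function, and the two-byte prolog is an arbitrary choice made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced stand-in for the step data structure.  */
struct demo_data
{
  char *outbuf;              /* Next free byte in the output buffer.  */
  char *outbufend;           /* End of the output buffer.  */
  int invocation_counter;    /* Number of completed invocations.  */
};

/* Copy TEXT to the output, preceded by a prolog on the first call
   only.  Returns the number of bytes written by this call.  */
static size_t
demo_emit (struct demo_data *data, const char *text)
{
  static const char prolog[2] = { '\xFE', '\xFF' };
  char *start = data->outbuf;
  size_t len = strlen (text);
  int first = data->invocation_counter == 0;
  size_t need = len + (first ? sizeof prolog : 0);

  assert ((size_t) (data->outbufend - data->outbuf) >= need);

  if (first)
    {
      memcpy (data->outbuf, prolog, sizeof prolog);
      data->outbuf += sizeof prolog;
    }

  memcpy (data->outbuf, text, len);
  data->outbuf += len;

  /* Count this use just before returning to the caller.  */
  ++data->invocation_counter;

  return (size_t) (data->outbuf - start);
}
```

Because the counter is incremented only at the very end, every check made earlier in the same call still sees the value zero on the first invocation.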
+
+The return value must be one of the following values:
+
+@table @code
+@item __GCONV_EMPTY_INPUT
+All input was consumed and there is room left in the output buffer.
+@item __GCONV_FULL_OUTPUT
+No more room in the output buffer.  In case this is not the last step
+this value is propagated down from the call of the next conversion
+function in the chain. 
+@item __GCONV_INCOMPLETE_INPUT
+The input buffer is not entirely empty since it contains an incomplete
+character sequence.
+@end table
+
+The following example provides a framework for a conversion function.
+To write a new conversion, only the holes in this implementation have
+to be filled in.
+
+@smallexample
+int
+gconv (struct __gconv_step *step, struct __gconv_step_data *data,
+       const char **inbuf, const char *inbufend, size_t *written,
+       int do_flush)
+@{
+  struct __gconv_step *next_step = step + 1;
+  struct __gconv_step_data *next_data = data + 1;
+  gconv_fct fct = next_step->__fct;
+  int status;
+
+  /* @r{If the function is called with no input this means we have}
+     @r{to reset to the initial state.  The possibly partly}
+     @r{converted input is dropped.}  */
+  if (do_flush)
+    @{
+      status = __GCONV_OK;
+
+      /* @r{Possibly emit a byte sequence which puts the state object}
+         @r{into the initial state.}  */
+
+      /* @r{Call the steps down the chain if there are any but only}
+         @r{if we successfully emitted the escape sequence.}  */
+      if (status == __GCONV_OK && ! data->__is_last)
+        status = fct (next_step, next_data, NULL, NULL,
+                      written, 1);
+    @}
+  else
+    @{
+      /* @r{We preserve the initial values of the pointer variables.}  */
+      const char *inptr = *inbuf;
+      char *outbuf = data->__outbuf;
+      char *outend = data->__outbufend;
+      char *outptr;
+
+      do
+        @{
+          /* @r{Remember the start value for this round.}  */
+          inptr = *inbuf;
+          /* @r{The outbuf buffer is empty.}  */
+          outptr = outbuf;
+
+          /* @r{For stateful encodings the state must be saved here.}  */
+
+          /* @r{Run the conversion loop.  @code{status} is set}
+             @r{appropriately afterwards.}  */
+
+          /* @r{If this is the last step, leave the loop. There is}
+             @r{nothing we can do.}  */
+          if (data->__is_last)
+            @{
+              /* @r{Store information about how many bytes are}
+                 @r{available.}  */
+              data->__outbuf = outbuf;
+
+              /* @r{If any non-reversible conversions were performed,}
+                 @r{add the number to @code{*written}.}  */
+
+              break;
+            @}
+
+          /* @r{Write out all output that was produced.}  */
+          if (outbuf > outptr)
+            @{
+              const char *outerr = data->__outbuf;
+              int result;
+
+              result = fct (next_step, next_data, &outerr,
+                            outbuf, written, 0);
+
+              if (result != __GCONV_EMPTY_INPUT)
+                @{
+                  if (outerr != outbuf)
+                    @{
+                      /* @r{Reset the input buffer pointer.  We}
+                         @r{document here the complex case.}  */
+                      size_t nstatus;
+
+                      /* @r{Reload the pointers.}  */
+                      *inbuf = inptr;
+                      outbuf = outptr;
+
+                      /* @r{Possibly reset the state.}  */
+
+                      /* @r{Redo the conversion, but this time}
+                         @r{the end of the output buffer is at}
+                         @r{@code{outerr}.}  */
+                    @}
+
+                  /* @r{Change the status.}  */
+                  status = result;
+                @}
+              else
+                /* @r{All the output is consumed, we can make}
+                   @r{another run if everything was ok.}  */
+                if (status == __GCONV_FULL_OUTPUT)
+                  status = __GCONV_OK;
+            @}
+        @}
+      while (status == __GCONV_OK);
+
+      /* @r{We finished one use of this step.}  */
+      ++data->__invocation_counter;
+    @}
+
+  return status;
+@}
+@end smallexample
+@end deftypevr
+
+This information should be sufficient to write new modules.  Anybody
+doing so should also take a look at the available source code in the GNU
+C library sources.  It contains many examples of working and optimized
+modules.
+
  @c File charset.texi edited October 2001 by Dennis Grace, IBM Corporation
 \ No newline at end of file