Therefore it would be legitimate to define @code{wchar_t} as @code{char},
which might make sense for embedded systems.
-But for GNU systems @code{wchar_t} is always 32 bits wide and, therefore,
+But in @theglibc{} @code{wchar_t} is always 32 bits wide and, therefore,
capable of representing all UCS-4 values and, therefore, covering all of
@w{ISO 10646}. Some Unix systems define @code{wchar_t} as a 16-bit type
and thereby follow Unicode very strictly. This definition is perfectly
These internal representations present problems when it comes to storing
and transmittal. Because each single wide character consists of more
-than one byte, they are effected by byte-ordering. Thus, machines with
+than one byte, they are affected by byte-ordering. Thus, machines with
-different endianesses would see different values when accessing the same
+different endiannesses would see different values when accessing the same
data. This byte ordering concern also applies for communication protocols
-that are all byte-based and, thereforet require that the sender has to
+that are all byte-based and therefore require the sender to
decide about splitting the wide character in bytes. A last (but not least
important) point is that wide characters often require more storage space
than a customized byte-oriented character set.
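
As a rough illustration of the byte-ordering point, the sketch below splits a 32-bit wide character value into bytes in one fixed order, so that sender and receiver agree whatever their native endianness. Big-endian order is only an assumption made for the example, and the names @code{put_ucs4_be} and @code{get_ucs4_be} are hypothetical; they are not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store the 32-bit value C into OUT,
   most significant byte first (big-endian, assumed here).  */
void
put_ucs4_be (uint32_t c, unsigned char out[4])
{
  assert (out != NULL);
  out[0] = (unsigned char) ((c >> 24) & 0xff);
  out[1] = (unsigned char) ((c >> 16) & 0xff);
  out[2] = (unsigned char) ((c >> 8) & 0xff);
  out[3] = (unsigned char) (c & 0xff);
}

/* Hypothetical helper: rebuild the value from four bytes that
   were written by put_ucs4_be.  The result does not depend on
   the byte order of the machine running this code.  */
uint32_t
get_ucs4_be (const unsigned char in[4])
{
  assert (in != NULL);
  return ((uint32_t) in[0] << 24)
         | ((uint32_t) in[1] << 16)
         | ((uint32_t) in[2] << 8)
         | (uint32_t) in[3];
}
```

Because the shifts operate on values rather than on memory, neither function needs to know the endianness of the host.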
the character @code{'/'} is used in the encoding @emph{only} to
represent itself. Things are a bit different for character sets like
EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set
-family used by IBM), but if the operation system does not understand
+family used by IBM), but if the operating system does not understand
EBCDIC directly the parameters-to-system calls have to be converted
first anyhow.
big advantage that whenever one can identify the beginning of the byte
sequence of a character one can interpret a text correctly. Examples of
character sets using this policy are the various EUC character sets
-(used by Sun's operations systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
+(used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
or Shift_JIS (SJIS, a Japanese encoding).
But there are also character sets using a state that is valid for more
The functions handling more than one character at a time require NUL
terminated strings as the argument (i.e., converting blocks of text
does not work unless one can add a NUL byte at an appropriate place).
-The GNU C library contains some extensions to the standard that allow
+@Theglibc{} contains some extensions to the standard that allow
specifying a size, but basically they also expect terminated strings.
@end itemize
by the functions we are about to describe. Each locale uses its own
character set (given as an argument to @code{localedef}) and this is the
one assumed as the external multibyte encoding. The wide character
-character set always is UCS-4, at least on GNU systems.
+set is always UCS-4 in @theglibc{}.
A characteristic of each multibyte character set is the maximum number
of bytes that can be necessary to represent one character. This
maximum number of bytes in a multibyte character in the current locale.
The value is never greater than @code{MB_LEN_MAX}. Unlike
@code{MB_LEN_MAX} this macro need not be a compile-time constant, and in
-the GNU C library it is not.
+@theglibc{} it is not.
@pindex stdlib.h
@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
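
Since @code{MB_CUR_MAX} need not be a compile-time constant, a buffer sized by it is best allocated at run time. The following is a minimal sketch under that assumption; the helper name @code{wc_to_mb} is hypothetical. It converts a single wide character to its multibyte form in the current locale using @code{wcrtomb}.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: return a newly allocated, NUL-terminated
   multibyte representation of WC in the current locale, or NULL
   on allocation failure or if WC cannot be represented.  The
   caller frees the result.  */
char *
wc_to_mb (wchar_t wc)
{
  mbstate_t state;
  char *buf;
  size_t n;

  /* The text above states this relation; check it in debug builds.  */
  assert (MB_CUR_MAX <= MB_LEN_MAX);

  /* Start in the initial shift state.  */
  memset (&state, '\0', sizeof state);

  /* MB_CUR_MAX is evaluated at run time, so use malloc rather
     than a fixed-size array.  One extra byte holds the NUL.  */
  buf = malloc (MB_CUR_MAX + 1);
  if (buf == NULL)
    return NULL;

  n = wcrtomb (buf, wc, &state);
  if (n == (size_t) -1)
    {
      free (buf);
      return NULL;
    }
  buf[n] = '\0';
  return buf;
}
```

Where a fixed-size array is preferred, @code{MB_LEN_MAX} can serve as the dimension instead, at the cost of possibly reserving more space than the current locale needs.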
The code to emit the escape sequence to get back to the initial state is
interesting. The @code{wcsrtombs} function can be used to determine the
-necessary output code (@pxref{Converting Strings}). Please note that on
-GNU systems it is not necessary to perform this extra action for the
+necessary output code (@pxref{Converting Strings}). Please note that with
+@theglibc{} it is not necessary to perform this extra action for the
conversion from multibyte text to wide character text since the wide
character encoding is not stateful. But there is nothing mentioned in
any standard that prohibits making @code{wchar_t} using a stateful
and is declared in @file{wchar.h}.
@end deftypefun
-Despite the limitation that the single byte value always is interpreted
-in the initial state this function is actually useful most of the time.
+Despite the limitation that the single byte value is always interpreted
+in the initial state, this function is actually useful most of the time.
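-Most characters are either entirely single-byte character sets or they
-are extension to ASCII. But then it is possible to write code like this
+Most character sets are either entirely single-byte character sets or they
+are extensions to ASCII. But then it is possible to write code like this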
(not that this specific example is very useful):
on the character of the character set used for @code{wchar_t}
representation. In other situations the bytes are not constant at
compile time and so the compiler cannot do the work. In situations like
-this it is necessary @code{btowc}.
+this, using @code{btowc} is required.
@noindent
-There also is a function for the conversion in the other direction.
+There is also a function for the conversion in the other direction.
@comment wchar.h
@comment ISO
multibyte character, the number of bytes belonging to this multibyte
character byte sequence is returned.
-If the the first @var{n} bytes possibly form a valid multibyte
+If the first @var{n} bytes possibly form a valid multibyte
character but the character is incomplete, the return value is
@code{(size_t) -2}. Otherwise the multibyte character sequence is invalid
and the return value is @code{(size_t) -1}.
This function simply calls @code{mbrlen} for each multibyte character
in the string and counts the number of function calls. Please note that
we here use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen}
-call. This is acceptable since a) this value is larger then the length of
+call. This is acceptable since a) this value is larger than the length of
the longest multibyte character sequence and b) we know that the string
@var{s} ends with a NUL byte, which cannot be part of any other multibyte
character sequence but the one representing the NUL wide character.
Therefore, the @code{mbrlen} function will never read invalid memory.
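
The counting approach described above can be sketched roughly as follows. The name @code{mbs_count} is hypothetical, and the sketch assumes that @var{s} is a NUL-terminated multibyte string in the encoding of the current locale.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the multibyte characters in the
   NUL-terminated string S.  Returns (size_t) -1 if S contains
   an invalid or incomplete multibyte sequence.  */
size_t
mbs_count (const char *s)
{
  mbstate_t state;
  size_t result = 0;
  size_t nbytes;

  assert (s != NULL);

  /* Start in the initial shift state.  */
  memset (&state, '\0', sizeof state);

  /* mbrlen returns 0 once it reaches the terminating NUL.
     Passing MB_LEN_MAX is safe only because S is terminated.  */
  while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
    {
      /* (size_t) -2 and (size_t) -1 signal an incomplete or
         invalid sequence, respectively.  */
      if (nbytes >= (size_t) -2)
        return (size_t) -1;
      s += nbytes;
      ++result;
    }
  return result;
}
```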
Now that this function is available (just to make this clear, this
-function is @emph{not} part of the GNU C library) we can compute the
+function is @emph{not} part of @theglibc{}) we can compute the
-number of wide character required to store the converted multibyte
+number of wide characters required to store the converted multibyte
character string @var{s} using
character at a time. Most operations to be performed in real-world
programs include strings and therefore the @w{ISO C} standard also
defines conversions on entire strings. However, the defined set of
-functions is quite limited; therefore, the GNU C library contains a few
+functions is quite limited; therefore, @theglibc{} contains a few
extensions that can help in some important situations.
@comment wchar.h
The generic conversion interface (@pxref{Generic Charset Conversion})
does not have this limitation (it simply works on buffers, not
-strings), and the GNU C library contains a set of functions that take
+strings), and @theglibc{} contains a set of functions that take
additional parameters specifying the maximal number of bytes that are
consumed from the input string. This way the problem of
@code{mbsrtowcs}'s example above could be solved by determining the line
/* @r{If any characters must be carried forward,}
@r{put them at the beginning of @code{buffer}.} */
if (filled > 0)
- memmove (inp, buffer, filled);
+ memmove (buffer, inp, filled);
@}
return 1;
common that they operate on character sets that are not directly
specified by the functions. The multibyte encoding used is specified by
the currently selected locale for the @code{LC_CTYPE} category. The
-wide character set is fixed by the implementation (in the case of GNU C
-library it is always UCS-4 encoded @w{ISO 10646}.
+wide character set is fixed by the implementation (in the case of @theglibc{}
+it is always UCS-4 encoded @w{ISO 10646}).
This has of course several problems when it comes to general character
conversion:
category, one has to change the @code{LC_CTYPE} locale using
@code{setlocale}.
-Changing the @code{LC_TYPE} locale introduces major problems for the rest
+Changing the @code{LC_CTYPE} locale introduces major problems for the rest
of the programs since several more functions (e.g., the character
classification functions, @pxref{Classification of Characters}) use the
@code{LC_CTYPE} category.
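
To make the cost of such a switch concrete, here is a minimal sketch of changing the @code{LC_CTYPE} locale temporarily and restoring it afterwards. The helper names @code{switch_ctype_locale} and @code{restore_ctype_locale} are hypothetical. The sketch does not remove the problem: the locale is global to the process, so every function that depends on @code{LC_CTYPE} sees the new setting until it is restored, and the approach is not suitable for multi-threaded programs.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: select NAME for LC_CTYPE and return a
   newly allocated copy of the previous locale name, or NULL on
   failure (in which case the locale is left unchanged).  */
char *
switch_ctype_locale (const char *name)
{
  const char *current;
  char *saved;

  assert (name != NULL);

  /* Query the current setting.  The returned string may be
     overwritten by later setlocale calls, so copy it.  */
  current = setlocale (LC_CTYPE, NULL);
  if (current == NULL)
    return NULL;
  saved = malloc (strlen (current) + 1);
  if (saved == NULL)
    return NULL;
  strcpy (saved, current);

  if (setlocale (LC_CTYPE, name) == NULL)
    {
      free (saved);
      return NULL;
    }
  return saved;
}

/* Hypothetical helper: restore the locale saved by
   switch_ctype_locale and release the copy.  */
void
restore_ctype_locale (char *saved)
{
  if (saved != NULL)
    {
      setlocale (LC_CTYPE, saved);
      free (saved);
    }
}
```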
new descriptor must be created. The descriptor does not stand for all
of the conversions from @var{fromset} to @var{toset}.
-The GNU C library implementation of @code{iconv_open} has one
+The @glibcadj{} implementation of @code{iconv_open} has one
significant extension to other implementations. To ease the extension
of the set of available conversions, the implementation allows storing
the necessary files with data and code in an arbitrary number of
any assumption as to whether the conversion has to deal with states.
Even if the input and output character sets are not stateful, the
implementation might still have to keep states. This is due to the
-implementation chosen for the GNU C library as it is described below.
+implementation chosen for @theglibc{} as it is described below.
Therefore an @code{iconv} call to reset the state should always be
performed if some protocol requires this for the output text.
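
The reset call mentioned above can be sketched as follows: after the regular conversion, @code{iconv} is called once more with a null input buffer so that any bytes needed to return to the initial state are written to the output. The helper name @code{convert_string} is hypothetical, and for brevity the sketch assumes that the whole result fits into the caller's buffer.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string IN from
   charset FROM to charset TO, storing the NUL-terminated result
   in OUT, which has room for OUTSIZE bytes.  Returns 0 on
   success and -1 on any error.  */
int
convert_string (const char *from, const char *to,
                const char *in, char *out, size_t outsize)
{
  iconv_t cd;
  char *inp = (char *) in;      /* iconv does not modify the input data.  */
  size_t inleft;
  char *outp = out;
  size_t outleft;
  int status = 0;

  assert (in != NULL && out != NULL && outsize > 0);

  inleft = strlen (in);
  outleft = outsize - 1;        /* Reserve one byte for the NUL.  */

  /* Note the argument order: target charset first.  */
  cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return -1;

  if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
    status = -1;
  /* Reset the conversion state and emit whatever is required to
     bring the output back to its initial shift state.  */
  else if (iconv (cd, NULL, NULL, &outp, &outleft) == (size_t) -1)
    status = -1;

  *outp = '\0';
  iconv_close (cd);
  return status;
}
```

A production version would loop on @code{E2BIG} and grow the output buffer instead of failing.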
almost arbitrary, there can be situations where the input buffer contains
valid characters, which have no identical representation in the output
character set. The behavior in this situation is undefined. The
-@emph{current} behavior of the GNU C library in this situation is to
+@emph{current} behavior of @theglibc{} in this situation is to
return with an error immediately. This certainly is not the most
desirable solution; therefore, future versions will provide better ones,
but they are not yet finished.
limiting on some platforms since not many platforms support dynamic
loading in statically linked programs. On platforms without this
capability it is therefore not possible to use this interface in
-statically linked programs. The GNU C library has, on ELF platforms, no
+statically linked programs. @Theglibc{} has, on ELF platforms, no
problems with dynamic loading in these situations; therefore, this
point is moot. The danger is that one gets acquainted with this
situation and forgets about the restrictions on other systems.
routes.
@node glibc iconv Implementation
-@subsection The @code{iconv} Implementation in the GNU C library
+@subsection The @code{iconv} Implementation in @theglibc{}
After reading about the problems of @code{iconv} implementations in the
last section it is certainly good to note that the implementation in
-the GNU C library has none of the problems mentioned above. What
+@theglibc{} has none of the problems mentioned above. What
follows is a step-by-step analysis of the points raised above. The
evaluation is based on the current state of the development (as of
January 1999). The development of the @code{iconv} functions is not
complete, but basic functionality has solidified.
-The GNU C library's @code{iconv} implementation uses shared loadable
+@Theglibc{}'s @code{iconv} implementation uses shared loadable
modules to implement the conversions. A very small number of
conversions are built into the library itself but these are only rather
trivial conversions.
-All the benefits of loadable modules are available in the GNU C library
+All the benefits of loadable modules are available in the @glibcadj{}
implementation. This is especially appealing since the interface is
well documented (see below), and it, therefore, is easy to write new
conversion modules. The drawback of using loadable objects is not a
-problem in the GNU C library, at least on ELF systems. Since the
+problem in @theglibc{}, at least on ELF systems. Since the
library is able to load shared objects even in statically linked
binaries, static linking need not be forbidden in case one wants to use
@code{iconv}.
The second mentioned problem is the number of supported conversions.
-Currently, the GNU C library supports more than 150 character sets. The
+Currently, @theglibc{} supports more than 150 character sets. The
way the implementation is designed the number of supported conversions
is greater than 22350 (@math{150} times @math{149}). If any conversion
from or to a character set is missing, it can be added easily.
Particularly impressive as it may be, this high number is due to the
-fact that the GNU C library implementation of @code{iconv} does not have
+fact that the @glibcadj{} implementation of @code{iconv} does not have
the third problem mentioned above (i.e., whenever there is a conversion
from a character set @math{@cal{A}} to @math{@cal{B}} and from
@math{@cal{B}} to @math{@cal{C}} it is always possible to convert from
are much more similar to each other than to @w{ISO 10646}.
In such a situation one easily can write a new conversion and provide it
-as a better alternative. The GNU C library @code{iconv} implementation
+as a better alternative. The @glibcadj{} @code{iconv} implementation
would automatically use the module implementing the conversion if it is
specified to be more efficient.
conversion with only the cost of @math{1}.
A mysterious item about the @file{gconv-modules} file above (and also
-the file coming with the GNU C library) are the names of the character
+the file coming with @theglibc{}) are the names of the character
sets specified in the @code{module} lines. Why do almost all the names
end in @code{//}? And this is not all: the names can actually be
regular expressions. At this point in time this mystery should not be
intermediate step of the triangulation. We have said that this is UCS-4
but actually that is not quite right. The UCS-4 specification also
includes the specification of the byte ordering used. Since a UCS-4 value
-consists of four bytes, a stored value is effected by byte ordering. The
+consists of four bytes, a stored value is affected by byte ordering. The
internal representation is @emph{not} the same as UCS-4 in case the byte
ordering of the processor (or at least the running process) is not the
same as the one required for UCS-4. This is done for performance reasons
as one does not want to perform unnecessary byte-swapping operations if
one is not interested in actually seeing the result in UCS-4. To avoid
-trouble with endianess, the internal representation consistently is named
+trouble with endianness, the internal representation is consistently named
@code{INTERNAL} even on big-endian systems where the representations are
identical.
It is often the case that one conversion is used more than once (i.e.,
there are several @code{iconv_open} calls for the same set of character
sets during one program run). The @code{mbsrtowcs} et.al.@: functions in
-the GNU C library also use the @code{iconv} functionality, which
+@theglibc{} also use the @code{iconv} functionality, which
increases the number of uses of the same functions even more.
Because of this multiple use of conversions, the modules do not get
@end deftypevr
This information should be sufficient to write new modules. Anybody
-doing so should also take a look at the available source code in the GNU
-C library sources. It contains many examples of working and optimized
+doing so should also take a look at the available source code in the
+@glibcadj{} sources. It contains many examples of working and optimized
modules.
@c File charset.texi edited October 2001 by Dennis Grace, IBM Corporation