2 @chapter Normalization forms (composition and decomposition) @code{<uninorm.h>}
6 This include file defines functions for transforming Unicode strings to one
7 of the four normal forms, known as NFC, NFD, NFKC, and NFKD. These
8 transformations involve decomposition and, for NFC and NFKC, composition of Unicode characters.
12 * Decomposition of characters::
13 * Composition of characters::
14 * Normalization of strings::
15 * Normalizing comparisons::
16 * Normalization of streams::
19 @node Decomposition of characters
20 @section Decomposition of Unicode characters
23 The following enumerated values are the possible types of decomposition of a Unicode character.
26 @deftypevr Constant int UC_DECOMP_CANONICAL
27 Denotes canonical decomposition.
30 @deftypevr Constant int UC_DECOMP_FONT
31 UCD marker: @code{<font>}. Denotes a font variant (e.g@. a blackletter form).
34 @deftypevr Constant int UC_DECOMP_NOBREAK
35 UCD marker: @code{<noBreak>}.
36 Denotes a no-break version of a space or hyphen.
39 @deftypevr Constant int UC_DECOMP_INITIAL
40 UCD marker: @code{<initial>}.
41 Denotes an initial presentation form (Arabic).
44 @deftypevr Constant int UC_DECOMP_MEDIAL
45 UCD marker: @code{<medial>}.
46 Denotes a medial presentation form (Arabic).
49 @deftypevr Constant int UC_DECOMP_FINAL
50 UCD marker: @code{<final>}.
51 Denotes a final presentation form (Arabic).
54 @deftypevr Constant int UC_DECOMP_ISOLATED
55 UCD marker: @code{<isolated>}.
56 Denotes an isolated presentation form (Arabic).
59 @deftypevr Constant int UC_DECOMP_CIRCLE
60 UCD marker: @code{<circle>}.
61 Denotes an encircled form.
64 @deftypevr Constant int UC_DECOMP_SUPER
65 UCD marker: @code{<super>}.
66 Denotes a superscript form.
69 @deftypevr Constant int UC_DECOMP_SUB
70 UCD marker: @code{<sub>}.
71 Denotes a subscript form.
74 @deftypevr Constant int UC_DECOMP_VERTICAL
75 UCD marker: @code{<vertical>}.
76 Denotes a vertical layout presentation form.
79 @deftypevr Constant int UC_DECOMP_WIDE
80 UCD marker: @code{<wide>}.
81 Denotes a wide (or zenkaku) compatibility character.
84 @deftypevr Constant int UC_DECOMP_NARROW
85 UCD marker: @code{<narrow>}.
86 Denotes a narrow (or hankaku) compatibility character.
89 @deftypevr Constant int UC_DECOMP_SMALL
90 UCD marker: @code{<small>}.
91 Denotes a small variant form (CNS compatibility).
94 @deftypevr Constant int UC_DECOMP_SQUARE
95 UCD marker: @code{<square>}.
96 Denotes a CJK squared font variant.
99 @deftypevr Constant int UC_DECOMP_FRACTION
100 UCD marker: @code{<fraction>}.
101 Denotes a vulgar fraction form.
104 @deftypevr Constant int UC_DECOMP_COMPAT
105 UCD marker: @code{<compat>}.
106 Denotes an otherwise unspecified compatibility character.
109 The following constant denotes the maximum size of decomposition of a single Unicode character.
112 @deftypevr Macro {unsigned int} UC_DECOMPOSITION_MAX_LENGTH
113 This macro expands to a constant that is the required size of buffer passed to
114 the @code{uc_decomposition} and @code{uc_canonical_decomposition} functions.
117 The following functions decompose a Unicode character.
119 @deftypefun int uc_decomposition (ucs4_t@tie{}@var{uc}, int@tie{}*@var{decomp_tag}, ucs4_t@tie{}*@var{decomposition})
120 Returns the character decomposition mapping of the Unicode character @var{uc}.
121 @var{decomposition} must point to an array of at least
122 @code{UC_DECOMPOSITION_MAX_LENGTH} @code{ucs4_t} elements.
124 When a decomposition exists, @code{@var{decomposition}[0..@var{n}-1]} and
125 @code{*@var{decomp_tag}} are filled and @var{n} is returned. Otherwise -1 is returned.
129 @deftypefun int uc_canonical_decomposition (ucs4_t@tie{}@var{uc}, ucs4_t@tie{}*@var{decomposition})
130 Returns the canonical character decomposition mapping of the Unicode character
131 @var{uc}. @var{decomposition} must point to an array of at least
132 @code{UC_DECOMPOSITION_MAX_LENGTH} @code{ucs4_t} elements.
134 When a decomposition exists, @code{@var{decomposition}[0..@var{n}-1]} is filled
135 and @var{n} is returned. Otherwise -1 is returned.
137 Note: This function returns the (simple) ``canonical decomposition'' of
138 @var{uc}. If you want the ``full canonical decomposition'' of @var{uc},
139 that is, the recursive application of ``canonical decomposition'', use the
140 function @code{u*_normalize} with argument @code{UNINORM_NFD} instead.
143 @node Composition of characters
144 @section Composition of Unicode characters
146 @cindex composing, Unicode characters
147 @cindex combining, Unicode characters
148 The following function composes a Unicode character from two Unicode characters.
151 @deftypefun ucs4_t uc_composition (ucs4_t@tie{}@var{uc1}, ucs4_t@tie{}@var{uc2})
152 Attempts to combine the Unicode characters @var{uc1}, @var{uc2}.
153 @var{uc1} is known to have canonical combining class 0.
155 Returns the combination of @var{uc1} and @var{uc2}, if it exists. Returns 0 otherwise.
158 Not all decompositions can be recombined using this function. See the Unicode
159 file @file{CompositionExclusions.txt} for details.
162 @node Normalization of strings
163 @section Normalization of strings
165 The Unicode standard defines four normalization forms for Unicode strings.
166 The following type is used to denote a normalization form.
168 @deftp Type uninorm_t
169 An object of type @code{uninorm_t} denotes a Unicode normalization form.
170 This is a scalar type; its values can be compared with @code{==}.
173 The following constants denote the four normalization forms.
175 @deftypevr Macro uninorm_t UNINORM_NFD
176 Denotes Normalization form D: canonical decomposition.
179 @deftypevr Macro uninorm_t UNINORM_NFC
180 Normalization form C: canonical decomposition, then canonical composition.
183 @deftypevr Macro uninorm_t UNINORM_NFKD
184 Normalization form KD: compatibility decomposition.
187 @deftypevr Macro uninorm_t UNINORM_NFKC
188 Normalization form KC: compatibility decomposition, then canonical composition.
191 The following functions operate on @code{uninorm_t} objects.
193 @deftypefun bool uninorm_is_compat_decomposing (uninorm_t@tie{}@var{nf})
194 Tests whether the normalization form @var{nf} does compatibility decomposition.
197 @deftypefun bool uninorm_is_composing (uninorm_t@tie{}@var{nf})
198 Tests whether the normalization form @var{nf} includes canonical composition.
201 @deftypefun uninorm_t uninorm_decomposing_form (uninorm_t@tie{}@var{nf})
202 Returns the decomposing variant of the normalization form @var{nf}.
203 This maps NFC, NFD @arrow{} NFD and NFKC, NFKD @arrow{} NFKD.
206 The following functions apply a Unicode normalization form to a Unicode string.
208 @deftypefun {uint8_t *} u8_normalize (uninorm_t@tie{}@var{nf}, const@tie{}uint8_t@tie{}*@var{s}, size_t@tie{}@var{n}, uint8_t@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
209 @deftypefunx {uint16_t *} u16_normalize (uninorm_t@tie{}@var{nf}, const@tie{}uint16_t@tie{}*@var{s}, size_t@tie{}@var{n}, uint16_t@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
210 @deftypefunx {uint32_t *} u32_normalize (uninorm_t@tie{}@var{nf}, const@tie{}uint32_t@tie{}*@var{s}, size_t@tie{}@var{n}, uint32_t@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
211 Returns the specified normalization form of a string.
213 The @var{resultbuf} and @var{lengthp} arguments are as described in
214 chapter @ref{Conventions}.
217 @node Normalizing comparisons
218 @section Normalizing comparisons
220 @cindex comparing, ignoring normalization
221 The following functions compare Unicode strings, ignoring differences in normalization.
224 @deftypefun int u8_normcmp (const@tie{}uint8_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint8_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
225 @deftypefunx int u16_normcmp (const@tie{}uint16_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint16_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
226 @deftypefunx int u32_normcmp (const@tie{}uint32_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint32_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
227 Compares @var{s1} and @var{s2}, ignoring differences in normalization.
229 @var{nf} must be either @code{UNINORM_NFD} or @code{UNINORM_NFKD}.
231 If successful, sets @code{*@var{resultp}} to -1 if @var{s1} < @var{s2},
232 0 if @var{s1} = @var{s2}, 1 if @var{s1} > @var{s2}, and returns 0.
233 Upon failure, returns -1 with @code{errno} set.
236 @cindex comparing, ignoring normalization, with collation rules
237 @cindex comparing, with collation rules, ignoring normalization
238 @deftypefun {char *} u8_normxfrm (const@tie{}uint8_t@tie{}*@var{s}, size_t@tie{}@var{n}, uninorm_t@tie{}@var{nf}, char@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
239 @deftypefunx {char *} u16_normxfrm (const@tie{}uint16_t@tie{}*@var{s}, size_t@tie{}@var{n}, uninorm_t@tie{}@var{nf}, char@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
240 @deftypefunx {char *} u32_normxfrm (const@tie{}uint32_t@tie{}*@var{s}, size_t@tie{}@var{n}, uninorm_t@tie{}@var{nf}, char@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
241 Converts the string @var{s} of length @var{n} to a NUL-terminated byte
242 sequence, in such a way that comparing @code{u8_normxfrm (@var{s1})} and
243 @code{u8_normxfrm (@var{s2})} with the @code{u8_cmp2} function is equivalent to
244 comparing @var{s1} and @var{s2} with the @code{u8_normcoll} function.
246 @var{nf} must be either @code{UNINORM_NFC} or @code{UNINORM_NFKC}.
248 The @var{resultbuf} and @var{lengthp} arguments are as described in
249 chapter @ref{Conventions}.
252 @deftypefun int u8_normcoll (const@tie{}uint8_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint8_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
253 @deftypefunx int u16_normcoll (const@tie{}uint16_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint16_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
254 @deftypefunx int u32_normcoll (const@tie{}uint32_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint32_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
255 Compares @var{s1} and @var{s2}, ignoring differences in normalization, using
256 the collation rules of the current locale.
258 @var{nf} must be either @code{UNINORM_NFC} or @code{UNINORM_NFKC}.
260 If successful, sets @code{*@var{resultp}} to -1 if @var{s1} < @var{s2},
261 0 if @var{s1} = @var{s2}, 1 if @var{s1} > @var{s2}, and returns 0.
262 Upon failure, returns -1 with @code{errno} set.
265 @node Normalization of streams
266 @section Normalization of streams of Unicode characters
268 @cindex stream, normalizing a
269 A ``stream of Unicode characters'' is essentially a function that accepts a
270 @code{ucs4_t} argument repeatedly, optionally combined with a function that
271 ``flushes'' the stream.
273 @deftp Type {struct uninorm_filter}
274 This is the data type of a stream of Unicode characters that normalizes its
275 input according to a given normalization form and passes the normalized
276 character sequence to the encapsulated stream of Unicode characters.
279 @deftypefun {struct uninorm_filter *} uninorm_filter_create (uninorm_t@tie{}@var{nf}, int@tie{}(*@var{stream_func})@tie{}(void@tie{}*@var{stream_data}, ucs4_t@tie{}@var{uc}), void@tie{}*@var{stream_data})
280 Creates and returns a normalization filter for Unicode characters.
282 The pair (@var{stream_func}, @var{stream_data}) is the encapsulated stream.
283 @code{@var{stream_func} (@var{stream_data}, @var{uc})} receives the Unicode
284 character @var{uc} and returns 0 if successful, or -1 with @code{errno} set upon failure.
287 Returns the new filter, or NULL with @code{errno} set upon failure.
290 @deftypefun int uninorm_filter_write (struct@tie{}uninorm_filter@tie{}*@var{filter}, ucs4_t@tie{}@var{uc})
291 Stuffs a Unicode character into a normalizing filter.
292 Returns 0 if successful, or -1 with @code{errno} set upon failure.
295 @deftypefun int uninorm_filter_flush (struct@tie{}uninorm_filter@tie{}*@var{filter})
296 Brings data buffered in the filter to its destination, the encapsulated stream.
298 Returns 0 if successful, or -1 with @code{errno} set upon failure.
300 Note: If additional characters are written into the filter after this
301 function has been called, the resulting character sequence in the
302 encapsulated stream will not necessarily be normalized.
305 @deftypefun int uninorm_filter_free (struct@tie{}uninorm_filter@tie{}*@var{filter})
306 Brings data buffered in the filter to its destination, the encapsulated stream,
307 then closes and frees the filter.
309 Returns 0 if successful, or -1 with @code{errno} set upon failure.