1 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html401/loose.dtd">
3 <!-- Created on September, 1 2014 by texi2html 1.78a -->
<title>GNU libunistring: 13. Normalization forms (composition and decomposition) &lt;uninorm.h&gt;</title>
<meta name="description" content="GNU libunistring: 13. Normalization forms (composition and decomposition) &lt;uninorm.h&gt;">
<meta name="keywords" content="GNU libunistring: 13. Normalization forms (composition and decomposition) &lt;uninorm.h&gt;">
18 <meta name="resource-type" content="document">
19 <meta name="distribution" content="global">
20 <meta name="Generator" content="texi2html 1.78a">
21 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
22 <style type="text/css">
24 a.summary-letter {text-decoration: none}
25 pre.display {font-family: serif}
26 pre.format {font-family: serif}
27 pre.menu-comment {font-family: serif}
28 pre.menu-preformatted {font-family: serif}
29 pre.smalldisplay {font-family: serif; font-size: smaller}
30 pre.smallexample {font-size: smaller}
31 pre.smallformat {font-family: serif; font-size: smaller}
32 pre.smalllisp {font-size: smaller}
33 span.roman {font-family:serif; font-weight:normal;}
34 span.sansserif {font-family:sans-serif; font-weight:normal;}
35 ul.toc {list-style: none}
42 <body lang="en" bgcolor="#FFFFFF" text="#000000" link="#0000FF" vlink="#800080" alink="#FF0000">
44 <table cellpadding="1" cellspacing="1" border="0">
<tr><td valign="middle" align="left">[<a href="libunistring_12.html#SEC47" title="Beginning of this chapter or previous chapter"> &lt;&lt; </a>]</td>
<td valign="middle" align="left">[<a href="libunistring_14.html#SEC54" title="Next chapter"> &gt;&gt; </a>]</td>
47 <td valign="middle" align="left"> </td>
48 <td valign="middle" align="left"> </td>
49 <td valign="middle" align="left"> </td>
50 <td valign="middle" align="left"> </td>
51 <td valign="middle" align="left"> </td>
52 <td valign="middle" align="left">[<a href="libunistring.html#SEC_Top" title="Cover (top) of document">Top</a>]</td>
53 <td valign="middle" align="left">[<a href="libunistring.html#SEC_Contents" title="Table of contents">Contents</a>]</td>
54 <td valign="middle" align="left">[<a href="libunistring_19.html#SEC77" title="Index">Index</a>]</td>
55 <td valign="middle" align="left">[<a href="libunistring_abt.html#SEC_About" title="About (help)"> ? </a>]</td>
59 <a name="uninorm_002eh"></a>
<h1 class="chapter"> <a href="libunistring.html#TOC48">13. Normalization forms (composition and decomposition) <code>&lt;uninorm.h&gt;</code></a> </h1>
<p>This include file defines functions for transforming Unicode strings to one
of the four normal forms, known as NFC, NFD, NFKC, and NFKD. These
transformations involve decomposition and, for NFC and NFKC, composition
of Unicode characters.
70 <a name="Decomposition-of-characters"></a>
72 <h2 class="section"> <a href="libunistring.html#TOC49">13.1 Decomposition of Unicode characters</a> </h2>
<p>The following enumerated values are the possible types of decomposition of a
Unicode character.
78 <dt><u>Constant:</u> int <b>UC_DECOMP_CANONICAL</b>
81 <dd><p>Denotes canonical decomposition.
85 <dt><u>Constant:</u> int <b>UC_DECOMP_FONT</b>
<dd><p>UCD marker: <code>&lt;font&gt;</code>. Denotes a font variant (e.g. a blackletter form).
92 <dt><u>Constant:</u> int <b>UC_DECOMP_NOBREAK</b>
<dd><p>UCD marker: <code>&lt;noBreak&gt;</code>.
Denotes a no-break version of a space or hyphen.
100 <dt><u>Constant:</u> int <b>UC_DECOMP_INITIAL</b>
101 <a name="IDX770"></a>
<dd><p>UCD marker: <code>&lt;initial&gt;</code>.
Denotes an initial presentation form (Arabic).
108 <dt><u>Constant:</u> int <b>UC_DECOMP_MEDIAL</b>
109 <a name="IDX771"></a>
<dd><p>UCD marker: <code>&lt;medial&gt;</code>.
Denotes a medial presentation form (Arabic).
116 <dt><u>Constant:</u> int <b>UC_DECOMP_FINAL</b>
117 <a name="IDX772"></a>
<dd><p>UCD marker: <code>&lt;final&gt;</code>.
Denotes a final presentation form (Arabic).
124 <dt><u>Constant:</u> int <b>UC_DECOMP_ISOLATED</b>
125 <a name="IDX773"></a>
<dd><p>UCD marker: <code>&lt;isolated&gt;</code>.
Denotes an isolated presentation form (Arabic).
132 <dt><u>Constant:</u> int <b>UC_DECOMP_CIRCLE</b>
133 <a name="IDX774"></a>
<dd><p>UCD marker: <code>&lt;circle&gt;</code>.
Denotes an encircled form.
140 <dt><u>Constant:</u> int <b>UC_DECOMP_SUPER</b>
141 <a name="IDX775"></a>
<dd><p>UCD marker: <code>&lt;super&gt;</code>.
Denotes a superscript form.
148 <dt><u>Constant:</u> int <b>UC_DECOMP_SUB</b>
149 <a name="IDX776"></a>
<dd><p>UCD marker: <code>&lt;sub&gt;</code>.
Denotes a subscript form.
156 <dt><u>Constant:</u> int <b>UC_DECOMP_VERTICAL</b>
157 <a name="IDX777"></a>
<dd><p>UCD marker: <code>&lt;vertical&gt;</code>.
Denotes a vertical layout presentation form.
164 <dt><u>Constant:</u> int <b>UC_DECOMP_WIDE</b>
165 <a name="IDX778"></a>
<dd><p>UCD marker: <code>&lt;wide&gt;</code>.
Denotes a wide (or zenkaku) compatibility character.
172 <dt><u>Constant:</u> int <b>UC_DECOMP_NARROW</b>
173 <a name="IDX779"></a>
<dd><p>UCD marker: <code>&lt;narrow&gt;</code>.
Denotes a narrow (or hankaku) compatibility character.
180 <dt><u>Constant:</u> int <b>UC_DECOMP_SMALL</b>
181 <a name="IDX780"></a>
<dd><p>UCD marker: <code>&lt;small&gt;</code>.
Denotes a small variant form (CNS compatibility).
188 <dt><u>Constant:</u> int <b>UC_DECOMP_SQUARE</b>
189 <a name="IDX781"></a>
<dd><p>UCD marker: <code>&lt;square&gt;</code>.
Denotes a CJK squared font variant.
196 <dt><u>Constant:</u> int <b>UC_DECOMP_FRACTION</b>
197 <a name="IDX782"></a>
<dd><p>UCD marker: <code>&lt;fraction&gt;</code>.
Denotes a vulgar fraction form.
204 <dt><u>Constant:</u> int <b>UC_DECOMP_COMPAT</b>
205 <a name="IDX783"></a>
<dd><p>UCD marker: <code>&lt;compat&gt;</code>.
Denotes an otherwise unspecified compatibility character.
<p>The following constant denotes the maximum size of decomposition of a single
Unicode character.
215 <dt><u>Macro:</u> unsigned int <b>UC_DECOMPOSITION_MAX_LENGTH</b>
216 <a name="IDX784"></a>
218 <dd><p>This macro expands to a constant that is the required size of buffer passed to
219 the <code>uc_decomposition</code> and <code>uc_canonical_decomposition</code> functions.
222 <p>The following functions decompose a Unicode character.
225 <dt><u>Function:</u> int <b>uc_decomposition</b><i> (ucs4_t <var>uc</var>, int *<var>decomp_tag</var>, ucs4_t *<var>decomposition</var>)</i>
226 <a name="IDX785"></a>
228 <dd><p>Returns the character decomposition mapping of the Unicode character <var>uc</var>.
229 <var>decomposition</var> must point to an array of at least
<code>UC_DECOMPOSITION_MAX_LENGTH</code> <code>ucs4_t</code> elements.
232 <p>When a decomposition exists, <code><var>decomposition</var>[0..<var>n</var>-1]</code> and
<code>*<var>decomp_tag</var></code> are filled and <var>n</var> is returned. Otherwise -1 is
returned.
238 <dt><u>Function:</u> int <b>uc_canonical_decomposition</b><i> (ucs4_t <var>uc</var>, ucs4_t *<var>decomposition</var>)</i>
239 <a name="IDX786"></a>
241 <dd><p>Returns the canonical character decomposition mapping of the Unicode character
242 <var>uc</var>. <var>decomposition</var> must point to an array of at least
<code>UC_DECOMPOSITION_MAX_LENGTH</code> <code>ucs4_t</code> elements.
245 <p>When a decomposition exists, <code><var>decomposition</var>[0..<var>n</var>-1]</code> is filled
246 and <var>n</var> is returned. Otherwise -1 is returned.
250 <a name="Composition-of-characters"></a>
252 <h2 class="section"> <a href="libunistring.html#TOC50">13.2 Composition of Unicode characters</a> </h2>
<p>The following function composes a Unicode character from two Unicode
characters.
258 <dt><u>Function:</u> ucs4_t <b>uc_composition</b><i> (ucs4_t <var>uc1</var>, ucs4_t <var>uc2</var>)</i>
259 <a name="IDX787"></a>
261 <dd><p>Attempts to combine the Unicode characters <var>uc1</var>, <var>uc2</var>.
262 <var>uc1</var> is known to have canonical combining class 0.
<p>Returns the combination of <var>uc1</var> and <var>uc2</var>, if it exists.
Returns 0 otherwise.
267 <p>Not all decompositions can be recombined using this function. See the Unicode
268 file ‘<tt>CompositionExclusions.txt</tt>’ for details.
272 <a name="Normalization-of-strings"></a>
274 <h2 class="section"> <a href="libunistring.html#TOC51">13.3 Normalization of strings</a> </h2>
276 <p>The Unicode standard defines four normalization forms for Unicode strings.
277 The following type is used to denote a normalization form.
280 <dt><u>Type:</u> <b>uninorm_t</b>
281 <a name="IDX788"></a>
283 <dd><p>An object of type <code>uninorm_t</code> denotes a Unicode normalization form.
284 This is a scalar type; its values can be compared with <code>==</code>.
287 <p>The following constants denote the four normalization forms.
290 <dt><u>Macro:</u> uninorm_t <b>UNINORM_NFD</b>
291 <a name="IDX789"></a>
<dd><p>Normalization form D: canonical decomposition.
297 <dt><u>Macro:</u> uninorm_t <b>UNINORM_NFC</b>
298 <a name="IDX790"></a>
300 <dd><p>Normalization form C: canonical decomposition, then canonical composition.
304 <dt><u>Macro:</u> uninorm_t <b>UNINORM_NFKD</b>
305 <a name="IDX791"></a>
307 <dd><p>Normalization form KD: compatibility decomposition.
311 <dt><u>Macro:</u> uninorm_t <b>UNINORM_NFKC</b>
312 <a name="IDX792"></a>
314 <dd><p>Normalization form KC: compatibility decomposition, then canonical composition.
317 <p>The following functions operate on <code>uninorm_t</code> objects.
320 <dt><u>Function:</u> bool <b>uninorm_is_compat_decomposing</b><i> (uninorm_t <var>nf</var>)</i>
321 <a name="IDX793"></a>
323 <dd><p>Tests whether the normalization form <var>nf</var> does compatibility decomposition.
327 <dt><u>Function:</u> bool <b>uninorm_is_composing</b><i> (uninorm_t <var>nf</var>)</i>
328 <a name="IDX794"></a>
330 <dd><p>Tests whether the normalization form <var>nf</var> includes canonical composition.
334 <dt><u>Function:</u> uninorm_t <b>uninorm_decomposing_form</b><i> (uninorm_t <var>nf</var>)</i>
335 <a name="IDX795"></a>
337 <dd><p>Returns the decomposing variant of the normalization form <var>nf</var>.
This maps NFC and NFD to NFD, and NFKC and NFKD to NFKD.
341 <p>The following functions apply a Unicode normalization form to a Unicode string.
344 <dt><u>Function:</u> uint8_t * <b>u8_normalize</b><i> (uninorm_t <var>nf</var>, const uint8_t *<var>s</var>, size_t <var>n</var>, uint8_t *<var>resultbuf</var>, size_t *<var>lengthp</var>)</i>
345 <a name="IDX796"></a>
347 <dt><u>Function:</u> uint16_t * <b>u16_normalize</b><i> (uninorm_t <var>nf</var>, const uint16_t *<var>s</var>, size_t <var>n</var>, uint16_t *<var>resultbuf</var>, size_t *<var>lengthp</var>)</i>
348 <a name="IDX797"></a>
350 <dt><u>Function:</u> uint32_t * <b>u32_normalize</b><i> (uninorm_t <var>nf</var>, const uint32_t *<var>s</var>, size_t <var>n</var>, uint32_t *<var>resultbuf</var>, size_t *<var>lengthp</var>)</i>
351 <a name="IDX798"></a>
353 <dd><p>Returns the specified normalization form of a string.
357 <a name="Normalizing-comparisons"></a>
359 <h2 class="section"> <a href="libunistring.html#TOC52">13.4 Normalizing comparisons</a> </h2>
<p>The following functions compare Unicode strings, ignoring differences in
normalization.
365 <dt><u>Function:</u> int <b>u8_normcmp</b><i> (const uint8_t *<var>s1</var>, size_t <var>n1</var>, const uint8_t *<var>s2</var>, size_t <var>n2</var>, uninorm_t <var>nf</var>, int *<var>resultp</var>)</i>
366 <a name="IDX799"></a>
368 <dt><u>Function:</u> int <b>u16_normcmp</b><i> (const uint16_t *<var>s1</var>, size_t <var>n1</var>, const uint16_t *<var>s2</var>, size_t <var>n2</var>, uninorm_t <var>nf</var>, int *<var>resultp</var>)</i>
369 <a name="IDX800"></a>
371 <dt><u>Function:</u> int <b>u32_normcmp</b><i> (const uint32_t *<var>s1</var>, size_t <var>n1</var>, const uint32_t *<var>s2</var>, size_t <var>n2</var>, uninorm_t <var>nf</var>, int *<var>resultp</var>)</i>
372 <a name="IDX801"></a>
374 <dd><p>Compares <var>s1</var> and <var>s2</var>, ignoring differences in normalization.
376 <p><var>nf</var> must be either <code>UNINORM_NFD</code> or <code>UNINORM_NFKD</code>.
<p>If successful, sets <code>*<var>resultp</var></code> to -1 if <var>s1</var> &lt; <var>s2</var>,
0 if <var>s1</var> = <var>s2</var>, 1 if <var>s1</var> &gt; <var>s2</var>, and returns 0.
380 Upon failure, returns -1 with <code>errno</code> set.
383 <a name="IDX802"></a>
384 <a name="IDX803"></a>
386 <dt><u>Function:</u> char * <b>u8_normxfrm</b><i> (const uint8_t *<var>s</var>, size_t <var>n</var>, uninorm_t <var>nf</var>, char *<var>resultbuf</var>, size_t *<var>lengthp</var>)</i>
387 <a name="IDX804"></a>
389 <dt><u>Function:</u> char * <b>u16_normxfrm</b><i> (const uint16_t *<var>s</var>, size_t <var>n</var>, uninorm_t <var>nf</var>, char *<var>resultbuf</var>, size_t *<var>lengthp</var>)</i>
390 <a name="IDX805"></a>
392 <dt><u>Function:</u> char * <b>u32_normxfrm</b><i> (const uint32_t *<var>s</var>, size_t <var>n</var>, uninorm_t <var>nf</var>, char *<var>resultbuf</var>, size_t *<var>lengthp</var>)</i>
393 <a name="IDX806"></a>
395 <dd><p>Converts the string <var>s</var> of length <var>n</var> to a NUL-terminated byte
396 sequence, in such a way that comparing <code>u8_normxfrm (<var>s1</var>)</code> and
397 <code>u8_normxfrm (<var>s2</var>)</code> with the <code>u8_cmp2</code> function is equivalent to
398 comparing <var>s1</var> and <var>s2</var> with the <code>u8_normcoll</code> function.
400 <p><var>nf</var> must be either <code>UNINORM_NFC</code> or <code>UNINORM_NFKC</code>.
404 <dt><u>Function:</u> int <b>u8_normcoll</b><i> (const uint8_t *<var>s1</var>, size_t <var>n1</var>, const uint8_t *<var>s2</var>, size_t <var>n2</var>, uninorm_t <var>nf</var>, int *<var>resultp</var>)</i>
405 <a name="IDX807"></a>
407 <dt><u>Function:</u> int <b>u16_normcoll</b><i> (const uint16_t *<var>s1</var>, size_t <var>n1</var>, const uint16_t *<var>s2</var>, size_t <var>n2</var>, uninorm_t <var>nf</var>, int *<var>resultp</var>)</i>
408 <a name="IDX808"></a>
410 <dt><u>Function:</u> int <b>u32_normcoll</b><i> (const uint32_t *<var>s1</var>, size_t <var>n1</var>, const uint32_t *<var>s2</var>, size_t <var>n2</var>, uninorm_t <var>nf</var>, int *<var>resultp</var>)</i>
411 <a name="IDX809"></a>
413 <dd><p>Compares <var>s1</var> and <var>s2</var>, ignoring differences in normalization, using
414 the collation rules of the current locale.
416 <p><var>nf</var> must be either <code>UNINORM_NFC</code> or <code>UNINORM_NFKC</code>.
<p>If successful, sets <code>*<var>resultp</var></code> to -1 if <var>s1</var> &lt; <var>s2</var>,
0 if <var>s1</var> = <var>s2</var>, 1 if <var>s1</var> &gt; <var>s2</var>, and returns 0.
420 Upon failure, returns -1 with <code>errno</code> set.
424 <a name="Normalization-of-streams"></a>
426 <h2 class="section"> <a href="libunistring.html#TOC53">13.5 Normalization of streams of Unicode characters</a> </h2>
<p>A “stream of Unicode characters” is essentially a function that accepts a
<code>ucs4_t</code> argument repeatedly, optionally combined with a function that
430 “flushes” the stream.
433 <dt><u>Type:</u> <b>struct uninorm_filter</b>
434 <a name="IDX810"></a>
436 <dd><p>This is the data type of a stream of Unicode characters that normalizes its
437 input according to a given normalization form and passes the normalized
438 character sequence to the encapsulated stream of Unicode characters.
442 <dt><u>Function:</u> struct uninorm_filter * <b>uninorm_filter_create</b><i> (uninorm_t <var>nf</var>, int (*<var>stream_func</var>) (void *<var>stream_data</var>, ucs4_t <var>uc</var>), void *<var>stream_data</var>)</i>
443 <a name="IDX811"></a>
445 <dd><p>Creates and returns a normalization filter for Unicode characters.
447 <p>The pair (<var>stream_func</var>, <var>stream_data</var>) is the encapsulated stream.
448 <code><var>stream_func</var> (<var>stream_data</var>, <var>uc</var>)</code> receives the Unicode
character <var>uc</var> and returns 0 if successful, or -1 with <code>errno</code> set
upon failure.
452 <p>Returns the new filter, or NULL with <code>errno</code> set upon failure.
456 <dt><u>Function:</u> int <b>uninorm_filter_write</b><i> (struct uninorm_filter *<var>filter</var>, ucs4_t <var>uc</var>)</i>
457 <a name="IDX812"></a>
459 <dd><p>Stuffs a Unicode character into a normalizing filter.
460 Returns 0 if successful, or -1 with <code>errno</code> set upon failure.
464 <dt><u>Function:</u> int <b>uninorm_filter_flush</b><i> (struct uninorm_filter *<var>filter</var>)</i>
465 <a name="IDX813"></a>
467 <dd><p>Brings data buffered in the filter to its destination, the encapsulated stream.
469 <p>Returns 0 if successful, or -1 with <code>errno</code> set upon failure.
<p>Note: if additional characters are written into the filter after this
function has been called, the resulting character sequence in the encapsulated
stream will not necessarily be normalized.
477 <dt><u>Function:</u> int <b>uninorm_filter_free</b><i> (struct uninorm_filter *<var>filter</var>)</i>
478 <a name="IDX814"></a>
480 <dd><p>Brings data buffered in the filter to its destination, the encapsulated stream,
481 then closes and frees the filter.
483 <p>Returns 0 if successful, or -1 with <code>errno</code> set upon failure.
501 This document was generated by <em>Daiki Ueno</em> on <em>September, 1 2014</em> using <a href="http://www.nongnu.org/texi2html/"><em>texi2html 1.78a</em></a>.