2 @chapter Unicode character classification and properties @code{<unictype.h>}
4 This include file declares functions that classify Unicode characters
5 and that test whether Unicode characters have specific properties.
7 The classification assigns a ``general category'' to every Unicode
8 character. This is similar to the classification provided by ISO C in
11 Properties are the data that guides various text processing algorithms
12 in the presence of specific Unicode characters.
16 * Canonical combining class::
18 * Decimal digit value::
21 * Mirrored character::
26 * ISO C and Java syntax::
27 * Classifications like in ISO C::
30 @node General category
31 @section General category
33 @cindex general category
34 @cindex Unicode character, general category
35 @cindex Unicode character, classification
36 Every Unicode character or code point has a @emph{general category} assigned
37 to it. This classification is important for most algorithms that work on
40 The GNU libunistring library provides two kinds of API for working with
41 general categories. The object oriented API uses a variable to denote
42 every predefined general category value or combinations thereof. The
43 low-level API uses a bit mask instead. The advantage of the object oriented
44 API is that if only a few predefined general category values are used,
45 the data tables are relatively small. When you combine general category
46 values (using @code{uc_general_category_or}, @code{uc_general_category_and},
or @code{uc_general_category_and_not}), or when you use the low-level
bit masks, a big table is used that holds the complete general category
49 information for all Unicode characters.
52 * Object oriented API::
56 @node Object oriented API
57 @subsection The object oriented API for general category
59 @deftp Type uc_general_category_t
60 This data type denotes a general category value. It is an immediate type that
61 can be copied by simple assignment, without involving memory allocation. It is
The following are the predefined general category values. Additional general
66 categories may be added in the future.
The @code{UC_CATEGORY_*} constants reflect the systematic general category
values assigned by the Unicode Consortium. The other @code{UC_*}
macros are aliases, for use when readable code is preferred.
72 @deftypevr Constant uc_general_category_t UC_CATEGORY_L
73 @deftypevrx Macro uc_general_category_t UC_LETTER
74 This represents the general category ``Letter''.
77 @deftypevr Constant uc_general_category_t UC_CATEGORY_LC
78 @deftypevrx Macro uc_general_category_t UC_CASED_LETTER
81 @deftypevr Constant uc_general_category_t UC_CATEGORY_Lu
82 @deftypevrx Macro uc_general_category_t UC_UPPERCASE_LETTER
83 This represents the general category ``Letter, uppercase''.
86 @deftypevr Constant uc_general_category_t UC_CATEGORY_Ll
87 @deftypevrx Macro uc_general_category_t UC_LOWERCASE_LETTER
88 This represents the general category ``Letter, lowercase''.
91 @deftypevr Constant uc_general_category_t UC_CATEGORY_Lt
92 @deftypevrx Macro uc_general_category_t UC_TITLECASE_LETTER
93 This represents the general category ``Letter, titlecase''.
96 @deftypevr Constant uc_general_category_t UC_CATEGORY_Lm
97 @deftypevrx Macro uc_general_category_t UC_MODIFIER_LETTER
98 This represents the general category ``Letter, modifier''.
101 @deftypevr Constant uc_general_category_t UC_CATEGORY_Lo
102 @deftypevrx Macro uc_general_category_t UC_OTHER_LETTER
103 This represents the general category ``Letter, other''.
106 @deftypevr Constant uc_general_category_t UC_CATEGORY_M
107 @deftypevrx Macro uc_general_category_t UC_MARK
This represents the general category ``Mark''.
111 @deftypevr Constant uc_general_category_t UC_CATEGORY_Mn
112 @deftypevrx Macro uc_general_category_t UC_NON_SPACING_MARK
This represents the general category ``Mark, nonspacing''.
116 @deftypevr Constant uc_general_category_t UC_CATEGORY_Mc
117 @deftypevrx Macro uc_general_category_t UC_COMBINING_SPACING_MARK
This represents the general category ``Mark, spacing combining''.
121 @deftypevr Constant uc_general_category_t UC_CATEGORY_Me
122 @deftypevrx Macro uc_general_category_t UC_ENCLOSING_MARK
This represents the general category ``Mark, enclosing''.
126 @deftypevr Constant uc_general_category_t UC_CATEGORY_N
127 @deftypevrx Macro uc_general_category_t UC_NUMBER
128 This represents the general category ``Number''.
131 @deftypevr Constant uc_general_category_t UC_CATEGORY_Nd
132 @deftypevrx Macro uc_general_category_t UC_DECIMAL_DIGIT_NUMBER
133 This represents the general category ``Number, decimal digit''.
136 @deftypevr Constant uc_general_category_t UC_CATEGORY_Nl
137 @deftypevrx Macro uc_general_category_t UC_LETTER_NUMBER
138 This represents the general category ``Number, letter''.
141 @deftypevr Constant uc_general_category_t UC_CATEGORY_No
142 @deftypevrx Macro uc_general_category_t UC_OTHER_NUMBER
143 This represents the general category ``Number, other''.
146 @deftypevr Constant uc_general_category_t UC_CATEGORY_P
147 @deftypevrx Macro uc_general_category_t UC_PUNCTUATION
148 This represents the general category ``Punctuation''.
151 @deftypevr Constant uc_general_category_t UC_CATEGORY_Pc
152 @deftypevrx Macro uc_general_category_t UC_CONNECTOR_PUNCTUATION
153 This represents the general category ``Punctuation, connector''.
156 @deftypevr Constant uc_general_category_t UC_CATEGORY_Pd
157 @deftypevrx Macro uc_general_category_t UC_DASH_PUNCTUATION
158 This represents the general category ``Punctuation, dash''.
161 @deftypevr Constant uc_general_category_t UC_CATEGORY_Ps
162 @deftypevrx Macro uc_general_category_t UC_OPEN_PUNCTUATION
163 This represents the general category ``Punctuation, open'', a.k.a. ``start punctuation''.
166 @deftypevr Constant uc_general_category_t UC_CATEGORY_Pe
167 @deftypevrx Macro uc_general_category_t UC_CLOSE_PUNCTUATION
168 This represents the general category ``Punctuation, close'', a.k.a. ``end punctuation''.
171 @deftypevr Constant uc_general_category_t UC_CATEGORY_Pi
172 @deftypevrx Macro uc_general_category_t UC_INITIAL_QUOTE_PUNCTUATION
173 This represents the general category ``Punctuation, initial quote''.
176 @deftypevr Constant uc_general_category_t UC_CATEGORY_Pf
177 @deftypevrx Macro uc_general_category_t UC_FINAL_QUOTE_PUNCTUATION
178 This represents the general category ``Punctuation, final quote''.
181 @deftypevr Constant uc_general_category_t UC_CATEGORY_Po
182 @deftypevrx Macro uc_general_category_t UC_OTHER_PUNCTUATION
183 This represents the general category ``Punctuation, other''.
186 @deftypevr Constant uc_general_category_t UC_CATEGORY_S
187 @deftypevrx Macro uc_general_category_t UC_SYMBOL
188 This represents the general category ``Symbol''.
191 @deftypevr Constant uc_general_category_t UC_CATEGORY_Sm
192 @deftypevrx Macro uc_general_category_t UC_MATH_SYMBOL
193 This represents the general category ``Symbol, math''.
196 @deftypevr Constant uc_general_category_t UC_CATEGORY_Sc
197 @deftypevrx Macro uc_general_category_t UC_CURRENCY_SYMBOL
198 This represents the general category ``Symbol, currency''.
201 @deftypevr Constant uc_general_category_t UC_CATEGORY_Sk
202 @deftypevrx Macro uc_general_category_t UC_MODIFIER_SYMBOL
203 This represents the general category ``Symbol, modifier''.
206 @deftypevr Constant uc_general_category_t UC_CATEGORY_So
207 @deftypevrx Macro uc_general_category_t UC_OTHER_SYMBOL
208 This represents the general category ``Symbol, other''.
211 @deftypevr Constant uc_general_category_t UC_CATEGORY_Z
212 @deftypevrx Macro uc_general_category_t UC_SEPARATOR
213 This represents the general category ``Separator''.
216 @deftypevr Constant uc_general_category_t UC_CATEGORY_Zs
217 @deftypevrx Macro uc_general_category_t UC_SPACE_SEPARATOR
218 This represents the general category ``Separator, space''.
221 @deftypevr Constant uc_general_category_t UC_CATEGORY_Zl
222 @deftypevrx Macro uc_general_category_t UC_LINE_SEPARATOR
223 This represents the general category ``Separator, line''.
226 @deftypevr Constant uc_general_category_t UC_CATEGORY_Zp
227 @deftypevrx Macro uc_general_category_t UC_PARAGRAPH_SEPARATOR
228 This represents the general category ``Separator, paragraph''.
231 @deftypevr Constant uc_general_category_t UC_CATEGORY_C
232 @deftypevrx Macro uc_general_category_t UC_OTHER
233 This represents the general category ``Other''.
236 @deftypevr Constant uc_general_category_t UC_CATEGORY_Cc
237 @deftypevrx Macro uc_general_category_t UC_CONTROL
238 This represents the general category ``Other, control''.
241 @deftypevr Constant uc_general_category_t UC_CATEGORY_Cf
242 @deftypevrx Macro uc_general_category_t UC_FORMAT
243 This represents the general category ``Other, format''.
246 @deftypevr Constant uc_general_category_t UC_CATEGORY_Cs
247 @deftypevrx Macro uc_general_category_t UC_SURROGATE
248 This represents the general category ``Other, surrogate''.
249 All code points in this category are invalid characters.
252 @deftypevr Constant uc_general_category_t UC_CATEGORY_Co
253 @deftypevrx Macro uc_general_category_t UC_PRIVATE_USE
254 This represents the general category ``Other, private use''.
257 @deftypevr Constant uc_general_category_t UC_CATEGORY_Cn
258 @deftypevrx Macro uc_general_category_t UC_UNASSIGNED
259 This represents the general category ``Other, not assigned''.
260 Some code points in this category are invalid characters.
263 The following functions combine general categories, like in a boolean algebra,
264 except that there is no @samp{not} operation.
266 @deftypefun uc_general_category_t uc_general_category_or (uc_general_category_t@tie{}@var{category1}, uc_general_category_t@tie{}@var{category2})
267 Returns the union of two general categories.
268 This corresponds to the unions of the two sets of characters.
271 @deftypefun uc_general_category_t uc_general_category_and (uc_general_category_t@tie{}@var{category1}, uc_general_category_t@tie{}@var{category2})
272 Returns the intersection of two general categories as bit masks.
273 This @emph{does not} correspond to the intersection of the two sets of
278 @deftypefun uc_general_category_t uc_general_category_and_not (uc_general_category_t@tie{}@var{category1}, uc_general_category_t@tie{}@var{category2})
279 Returns the intersection of a general category with the complement of a
280 second general category, as bit masks.
281 This @emph{does not} correspond to the intersection with complement, when
282 viewing the categories as sets of characters.
286 The following functions associate general categories with their name.
288 @deftypefun {const char *} uc_general_category_name (uc_general_category_t@tie{}@var{category})
289 Returns the name of a general category, more precisely, the abbreviated name.
290 Returns NULL if the general category corresponds to a bit mask that does not
294 @deftypefun {const char *} uc_general_category_long_name (uc_general_category_t@tie{}@var{category})
295 Returns the long name of a general category.
296 Returns NULL if the general category corresponds to a bit mask that does not
300 @deftypefun uc_general_category_t uc_general_category_byname (const@tie{}char@tie{}*@var{category_name})
301 Returns the general category given by name, e.g@. @code{"Lu"}, or by long
302 name, e.g@. @code{"Uppercase Letter"}.
303 This lookup ignores spaces, underscores, or hyphens as word separators and is
307 The following functions view general categories as sets of Unicode characters.
309 @deftypefun uc_general_category_t uc_general_category (ucs4_t@tie{}@var{uc})
310 Returns the general category of a Unicode character.
312 This function uses a big table.
315 @deftypefun bool uc_is_general_category (ucs4_t@tie{}@var{uc}, uc_general_category_t@tie{}@var{category})
316 Tests whether a Unicode character belongs to a given category.
317 The @var{category} argument can be a predefined general category or the
318 combination of several predefined general categories.
322 @subsection The bit mask API for general category
The following are the predefined general category values as bit masks.
325 Additional general categories may be added in the future.
327 @deftypevr Macro uint32_t UC_CATEGORY_MASK_L
328 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_LC
329 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Lu
330 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Ll
331 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Lt
332 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Lm
333 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Lo
334 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_M
335 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Mn
336 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Mc
337 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Me
338 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_N
339 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Nd
340 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Nl
341 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_No
342 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_P
343 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Pc
344 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Pd
345 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Ps
346 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Pe
347 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Pi
348 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Pf
349 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Po
350 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_S
351 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Sm
352 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Sc
353 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Sk
354 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_So
355 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Z
356 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Zs
357 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Zl
358 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Zp
359 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_C
360 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Cc
361 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Cf
362 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Cs
363 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Co
364 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Cn
367 The following function views general categories as sets of Unicode characters.
369 @deftypefun bool uc_is_general_category_withtable (ucs4_t@tie{}@var{uc}, uint32_t@tie{}@var{bitmask})
370 Tests whether a Unicode character belongs to a given category.
371 The @var{bitmask} argument can be a predefined general category bitmask or the
372 combination of several predefined general category bitmasks.
374 This function uses a big table comprising all general categories.
377 @node Canonical combining class
378 @section Canonical combining class
380 @cindex canonical combining class
381 @cindex Unicode character, canonical combining class
382 Every Unicode character or code point has a @emph{canonical combining class}
385 What is the meaning of the canonical combining class? Essentially, it
386 indicates the priority with which a combining character is attached to its
387 base character. The characters for which the canonical combining class is 0
388 are the base characters, and the characters for which it is greater than 0 are
the combining characters. Combining characters are rendered near, attached
to, or around their base character, and combining characters with small
combining classes are attached ``first'' or ``closer'' to the base character.
393 The canonical combining class of a character is a number in the range
394 0..255. The possible values are described in the Unicode Character Database
395 @texnl{}@url{https://www.unicode.org/Public/UNIDATA/UCD.html}. The list here is
396 not definitive; more values can be added in future versions.
398 @deftypevr Constant int UC_CCC_NR
399 The canonical combining class value for ``Not Reordered'' characters.
403 @deftypevr Constant int UC_CCC_OV
404 The canonical combining class value for ``Overlay'' characters.
407 @deftypevr Constant int UC_CCC_NK
408 The canonical combining class value for ``Nukta'' characters.
411 @deftypevr Constant int UC_CCC_KV
412 The canonical combining class value for ``Kana Voicing'' characters.
415 @deftypevr Constant int UC_CCC_VR
416 The canonical combining class value for ``Virama'' characters.
419 @deftypevr Constant int UC_CCC_ATBL
420 The canonical combining class value for ``Attached Below Left'' characters.
423 @deftypevr Constant int UC_CCC_ATB
424 The canonical combining class value for ``Attached Below'' characters.
427 @deftypevr Constant int UC_CCC_ATA
428 The canonical combining class value for ``Attached Above'' characters.
431 @deftypevr Constant int UC_CCC_ATAR
432 The canonical combining class value for ``Attached Above Right'' characters.
435 @deftypevr Constant int UC_CCC_BL
436 The canonical combining class value for ``Below Left'' characters.
439 @deftypevr Constant int UC_CCC_B
440 The canonical combining class value for ``Below'' characters.
443 @deftypevr Constant int UC_CCC_BR
444 The canonical combining class value for ``Below Right'' characters.
447 @deftypevr Constant int UC_CCC_L
448 The canonical combining class value for ``Left'' characters.
451 @deftypevr Constant int UC_CCC_R
452 The canonical combining class value for ``Right'' characters.
455 @deftypevr Constant int UC_CCC_AL
456 The canonical combining class value for ``Above Left'' characters.
459 @deftypevr Constant int UC_CCC_A
460 The canonical combining class value for ``Above'' characters.
463 @deftypevr Constant int UC_CCC_AR
464 The canonical combining class value for ``Above Right'' characters.
467 @deftypevr Constant int UC_CCC_DB
468 The canonical combining class value for ``Double Below'' characters.
471 @deftypevr Constant int UC_CCC_DA
472 The canonical combining class value for ``Double Above'' characters.
475 @deftypevr Constant int UC_CCC_IS
476 The canonical combining class value for ``Iota Subscript'' characters.
479 The following functions associate canonical combining classes with their name.
481 @deftypefun {const char *} uc_combining_class_name (int@tie{}@var{ccc})
482 Returns the name of a canonical combining class, more precisely, the
484 Returns NULL if the canonical combining class is a numeric value without a
488 @deftypefun {const char *} uc_combining_class_long_name (int@tie{}@var{ccc})
489 Returns the long name of a canonical combining class.
490 Returns NULL if the canonical combining class is a numeric value without a
494 @deftypefun int uc_combining_class_byname (const@tie{}char@tie{}*@var{ccc_name})
495 Returns the canonical combining class given by name, e.g@. @code{"BL"}, or by
496 long name, e.g@. @code{"Below Left"}.
497 This lookup ignores spaces, underscores, or hyphens as word separators and is
501 The following function looks up the canonical combining class of a character.
503 @deftypefun int uc_combining_class (ucs4_t@tie{}@var{uc})
504 Returns the canonical combining class of a Unicode character.
511 @cindex bidirectional category
512 @cindex Unicode character, bidi class
513 @cindex Unicode character, bidirectional category
514 Every Unicode character or code point has a @emph{bidi class} assigned to it.
515 Before Unicode 4.0, this concept was known as @emph{bidirectional category}.
517 The bidi class guides the bidirectional algorithm@texnl{}
518 (@url{https://www.unicode.org/reports/tr9/}). The possible values are
521 @deftypevr Constant int UC_BIDI_L
The bidi class for ``Left-to-Right'' characters.
525 @deftypevr Constant int UC_BIDI_LRE
526 The bidi class for ``Left-to-Right Embedding'' characters.
529 @deftypevr Constant int UC_BIDI_LRO
530 The bidi class for ``Left-to-Right Override'' characters.
533 @deftypevr Constant int UC_BIDI_R
534 The bidi class for ``Right-to-Left'' characters.
537 @deftypevr Constant int UC_BIDI_AL
538 The bidi class for ``Right-to-Left Arabic'' characters.
541 @deftypevr Constant int UC_BIDI_RLE
542 The bidi class for ``Right-to-Left Embedding'' characters.
545 @deftypevr Constant int UC_BIDI_RLO
546 The bidi class for ``Right-to-Left Override'' characters.
549 @deftypevr Constant int UC_BIDI_PDF
550 The bidi class for ``Pop Directional Format'' characters.
553 @deftypevr Constant int UC_BIDI_EN
554 The bidi class for ``European Number'' characters.
557 @deftypevr Constant int UC_BIDI_ES
558 The bidi class for ``European Number Separator'' characters.
561 @deftypevr Constant int UC_BIDI_ET
562 The bidi class for ``European Number Terminator'' characters.
565 @deftypevr Constant int UC_BIDI_AN
566 The bidi class for ``Arabic Number'' characters.
569 @deftypevr Constant int UC_BIDI_CS
570 The bidi class for ``Common Number Separator'' characters.
573 @deftypevr Constant int UC_BIDI_NSM
574 The bidi class for ``Non-Spacing Mark'' characters.
577 @deftypevr Constant int UC_BIDI_BN
578 The bidi class for ``Boundary Neutral'' characters.
581 @deftypevr Constant int UC_BIDI_B
582 The bidi class for ``Paragraph Separator'' characters.
585 @deftypevr Constant int UC_BIDI_S
586 The bidi class for ``Segment Separator'' characters.
589 @deftypevr Constant int UC_BIDI_WS
590 The bidi class for ``Whitespace'' characters.
593 @deftypevr Constant int UC_BIDI_ON
594 The bidi class for ``Other Neutral'' characters.
597 @deftypevr Constant int UC_BIDI_LRI
598 The bidi class for ``Left-to-Right Isolate'' characters.
601 @deftypevr Constant int UC_BIDI_RLI
602 The bidi class for ``Right-to-Left Isolate'' characters.
605 @deftypevr Constant int UC_BIDI_FSI
606 The bidi class for ``First Strong Isolate'' characters.
609 @deftypevr Constant int UC_BIDI_PDI
610 The bidi class for ``Pop Directional Isolate'' characters.
613 The following functions implement the association between a bidirectional
614 category and its name.
616 @deftypefun {const char *} uc_bidi_class_name (int@tie{}@var{bidi_class})
617 @deftypefunx {const char *} uc_bidi_category_name (int@tie{}@var{category})
618 Returns the name of a bidi class, more precisely, the abbreviated name.
621 @deftypefun {const char *} uc_bidi_class_long_name (int@tie{}@var{bidi_class})
622 Returns the long name of a bidi class.
625 @deftypefun int uc_bidi_class_byname (const@tie{}char@tie{}*@var{bidi_class_name})
626 @deftypefunx int uc_bidi_category_byname (const@tie{}char@tie{}*@var{category_name})
627 Returns the bidi class given by name, e.g@. @code{"LRE"}, or by long name,
628 e.g@. @code{"Left-to-Right Embedding"}.
629 This lookup ignores spaces, underscores, or hyphens as word separators and is
633 The following functions view bidirectional categories as sets of Unicode
636 @deftypefun int uc_bidi_class (ucs4_t@tie{}@var{uc})
637 @deftypefunx int uc_bidi_category (ucs4_t@tie{}@var{uc})
638 Returns the bidi class of a Unicode character.
641 @deftypefun bool uc_is_bidi_class (ucs4_t@tie{}@var{uc}, int@tie{}@var{bidi_class})
642 @deftypefunx bool uc_is_bidi_category (ucs4_t@tie{}@var{uc}, int@tie{}@var{category})
643 Tests whether a Unicode character belongs to a given bidi class.
646 @node Decimal digit value
647 @section Decimal digit value
649 @cindex value, of Unicode character
650 @cindex Unicode character, value
651 Decimal digits (like the digits from @samp{0} to @samp{9}) exist in many
652 scripts. The following function converts a decimal digit character to its
655 @deftypefun int uc_decimal_value (ucs4_t@tie{}@var{uc})
656 Returns the decimal digit value of a Unicode character.
657 The return value is an integer in the range 0..9, or -1 for characters that
658 do not represent a decimal digit.
664 @cindex value, of Unicode character
665 @cindex Unicode character, value
666 Digit characters are like decimal digit characters, possibly in special forms,
such as superscript, subscript, or circled. The following function converts a
668 digit character to its numerical value.
670 @deftypefun int uc_digit_value (ucs4_t@tie{}@var{uc})
671 Returns the digit value of a Unicode character.
672 The return value is an integer in the range 0..9, or -1 for characters that
673 do not represent a digit.
677 @section Numeric value
679 @cindex value, of Unicode character
680 @cindex Unicode character, value
681 There are also characters that represent numbers without a digit system, like
682 the Roman numerals, and fractional numbers, like 1/4 or 3/4.
684 The following type represents the numeric value of a Unicode character.
685 @deftp Type uc_fraction_t
686 This is a structure type with the following fields:
691 An integer @var{n} is represented by @code{numerator = @var{n}},
692 @code{denominator = 1}.
695 The following function converts a number character to its numerical value.
697 @deftypefun uc_fraction_t uc_numeric_value (ucs4_t@tie{}@var{uc})
698 Returns the numeric value of a Unicode character.
699 The return value is a fraction, or the pseudo-fraction @code{@{ 0, 0 @}} for
700 characters that do not represent a number.
703 @node Mirrored character
704 @section Mirrored character
706 @cindex mirroring, of Unicode character
707 @cindex Unicode character, mirroring
Character mirroring is used to associate the closing parenthesis character
with the opening parenthesis character, the closing brace character with the
710 opening brace character, and so on.
712 The following function looks up the mirrored character of a Unicode character.
714 @deftypefun bool uc_mirror_char (ucs4_t@tie{}@var{uc}, ucs4_t@tie{}*@var{puc})
715 Stores the mirrored character of a Unicode character @var{uc} in
716 @code{*@var{puc}} and returns @code{true}, if it exists. Otherwise it
717 stores @var{uc} unmodified in @code{*@var{puc}} and returns @code{false}.
721 @section Arabic shaping
723 @cindex Arabic shaping
724 @cindex joining of Arabic characters
725 When Arabic characters are rendered, after bidi reordering has taken
place, the shapes of the glyphs are modified so that many adjacent glyphs
727 are joined. Two character properties describe how this ``Arabic shaping''
728 takes place: the joining type and the joining group.
736 @subsection Joining type of Arabic characters
739 The joining type of a character describes on which of the left and right
740 neighbour characters the character's shape depends, and which of the two
741 neighbour characters are rendered depending on this character.
743 The joining type has the following possible values:
745 @deftypevr Constant int UC_JOINING_TYPE_U
746 ``Non joining'': Characters of this joining type prohibit joining.
749 @deftypevr Constant int UC_JOINING_TYPE_T
750 ``Transparent'': Characters of this joining type are skipped when
754 @deftypevr Constant int UC_JOINING_TYPE_C
755 ``Join causing'': Characters of this joining type cause their neighbour
756 characters to change their shapes but don't change their own shape.
759 @deftypevr Constant int UC_JOINING_TYPE_L
760 ``Left joining'': Characters of this joining type have two shapes,
761 isolated and initial. Such characters currently don't exist.
764 @deftypevr Constant int UC_JOINING_TYPE_R
765 ``Right joining'': Characters of this joining type have two shapes,
769 @deftypevr Constant int UC_JOINING_TYPE_D
770 ``Dual joining'': Characters of this joining type have four shapes,
771 initial, medial, final, and isolated.
774 The following functions implement the association between a joining type
777 @deftypefun {const char *} uc_joining_type_name (int@tie{}@var{joining_type})
778 Returns the name of a joining type.
781 @deftypefun {const char *} uc_joining_type_long_name (int@tie{}@var{joining_type})
782 Returns the long name of a joining type.
785 @deftypefun int uc_joining_type_byname (const@tie{}char@tie{}*@var{joining_type_name})
786 Returns the joining type given by name, e.g@. @code{"D"}, or by long name,
e.g@. @code{"Dual Joining"}.
788 This lookup ignores spaces, underscores, or hyphens as word separators and is
792 The following function gives the joining type of every Unicode character.
794 @deftypefun int uc_joining_type (ucs4_t@tie{}@var{uc})
795 Returns the joining type of a Unicode character.
799 @subsection Joining group of Arabic characters
801 @cindex joining group
802 The joining group of a character describes how the character's shape
803 is modified in the four contexts of dual-joining characters or in the
804 two contexts of right-joining characters.
806 The joining group has the following possible values:
808 @deftypevr Constant int UC_JOINING_GROUP_NONE
809 @deftypevrx Constant int UC_JOINING_GROUP_AIN
810 @deftypevrx Constant int UC_JOINING_GROUP_ALAPH
811 @deftypevrx Constant int UC_JOINING_GROUP_ALEF
812 @deftypevrx Constant int UC_JOINING_GROUP_BEH
813 @deftypevrx Constant int UC_JOINING_GROUP_BETH
814 @deftypevrx Constant int UC_JOINING_GROUP_BURUSHASKI_YEH_BARREE
815 @deftypevrx Constant int UC_JOINING_GROUP_DAL
816 @deftypevrx Constant int UC_JOINING_GROUP_DALATH_RISH
817 @deftypevrx Constant int UC_JOINING_GROUP_E
818 @deftypevrx Constant int UC_JOINING_GROUP_FARSI_YEH
819 @deftypevrx Constant int UC_JOINING_GROUP_FE
820 @deftypevrx Constant int UC_JOINING_GROUP_FEH
821 @deftypevrx Constant int UC_JOINING_GROUP_FINAL_SEMKATH
822 @deftypevrx Constant int UC_JOINING_GROUP_GAF
823 @deftypevrx Constant int UC_JOINING_GROUP_GAMAL
824 @deftypevrx Constant int UC_JOINING_GROUP_HAH
825 @deftypevrx Constant int UC_JOINING_GROUP_HE
826 @deftypevrx Constant int UC_JOINING_GROUP_HEH
827 @deftypevrx Constant int UC_JOINING_GROUP_HEH_GOAL
828 @deftypevrx Constant int UC_JOINING_GROUP_HETH
829 @deftypevrx Constant int UC_JOINING_GROUP_KAF
830 @deftypevrx Constant int UC_JOINING_GROUP_KAPH
831 @deftypevrx Constant int UC_JOINING_GROUP_KHAPH
832 @deftypevrx Constant int UC_JOINING_GROUP_KNOTTED_HEH
833 @deftypevrx Constant int UC_JOINING_GROUP_LAM
834 @deftypevrx Constant int UC_JOINING_GROUP_LAMADH
835 @deftypevrx Constant int UC_JOINING_GROUP_MEEM
836 @deftypevrx Constant int UC_JOINING_GROUP_MIM
837 @deftypevrx Constant int UC_JOINING_GROUP_NOON
838 @deftypevrx Constant int UC_JOINING_GROUP_NUN
839 @deftypevrx Constant int UC_JOINING_GROUP_NYA
840 @deftypevrx Constant int UC_JOINING_GROUP_PE
841 @deftypevrx Constant int UC_JOINING_GROUP_QAF
842 @deftypevrx Constant int UC_JOINING_GROUP_QAPH
843 @deftypevrx Constant int UC_JOINING_GROUP_REH
844 @deftypevrx Constant int UC_JOINING_GROUP_REVERSED_PE
845 @deftypevrx Constant int UC_JOINING_GROUP_SAD
846 @deftypevrx Constant int UC_JOINING_GROUP_SADHE
847 @deftypevrx Constant int UC_JOINING_GROUP_SEEN
848 @deftypevrx Constant int UC_JOINING_GROUP_SEMKATH
849 @deftypevrx Constant int UC_JOINING_GROUP_SHIN
850 @deftypevrx Constant int UC_JOINING_GROUP_SWASH_KAF
851 @deftypevrx Constant int UC_JOINING_GROUP_SYRIAC_WAW
852 @deftypevrx Constant int UC_JOINING_GROUP_TAH
853 @deftypevrx Constant int UC_JOINING_GROUP_TAW
854 @deftypevrx Constant int UC_JOINING_GROUP_TEH_MARBUTA
855 @deftypevrx Constant int UC_JOINING_GROUP_TEH_MARBUTA_GOAL
856 @deftypevrx Constant int UC_JOINING_GROUP_TETH
857 @deftypevrx Constant int UC_JOINING_GROUP_WAW
858 @deftypevrx Constant int UC_JOINING_GROUP_YEH
859 @deftypevrx Constant int UC_JOINING_GROUP_YEH_BARREE
860 @deftypevrx Constant int UC_JOINING_GROUP_YEH_WITH_TAIL
861 @deftypevrx Constant int UC_JOINING_GROUP_YUDH
862 @deftypevrx Constant int UC_JOINING_GROUP_YUDH_HE
863 @deftypevrx Constant int UC_JOINING_GROUP_ZAIN
864 @deftypevrx Constant int UC_JOINING_GROUP_ZHAIN
865 @deftypevrx Constant int UC_JOINING_GROUP_ROHINGYA_YEH
866 @deftypevrx Constant int UC_JOINING_GROUP_STRAIGHT_WAW
867 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_ALEPH
868 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_BETH
869 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_GIMEL
870 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_DALETH
871 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_WAW
872 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_ZAYIN
873 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_HETH
874 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_TETH
875 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_YODH
876 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_KAPH
877 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_LAMEDH
878 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_DHAMEDH
879 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_THAMEDH
880 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_MEM
881 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_NUN
882 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_SAMEKH
883 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_AYIN
884 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_PE
885 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_SADHE
886 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_QOPH
887 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_RESH
888 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_TAW
889 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_ONE
890 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_FIVE
891 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_TEN
892 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_TWENTY
893 @deftypevrx Constant int UC_JOINING_GROUP_MANICHAEAN_HUNDRED
894 @deftypevrx Constant int UC_JOINING_GROUP_AFRICAN_FEH
895 @deftypevrx Constant int UC_JOINING_GROUP_AFRICAN_QAF
896 @deftypevrx Constant int UC_JOINING_GROUP_AFRICAN_NOON
897 @deftypevrx Constant int UC_JOINING_GROUP_MALAYALAM_NGA
898 @deftypevrx Constant int UC_JOINING_GROUP_MALAYALAM_JA
899 @deftypevrx Constant int UC_JOINING_GROUP_MALAYALAM_NYA
900 @deftypevrx Constant int UC_JOINING_GROUP_MALAYALAM_TTA
901 @deftypevrx Constant int UC_JOINING_GROUP_MALAYALAM_NNA
902 @deftypevrx Constant int UC_JOINING_GROUP_MALAYALAM_NNNA
903 @deftypevrx Constant int UC_JOINING_GROUP_MALAYALAM_BHA
904 @deftypevrx Constant int UC_JOINING_GROUP_MALAYALAM_RA
905 @deftypevrx Constant int UC_JOINING_GROUP_MALAYALAM_LLA
906 @deftypevrx Constant int UC_JOINING_GROUP_MALAYALAM_LLLA
907 @deftypevrx Constant int UC_JOINING_GROUP_MALAYALAM_SSA
908 @deftypevrx Constant int UC_JOINING_GROUP_HANIFI_ROHINGYA_PA
909 @deftypevrx Constant int UC_JOINING_GROUP_HANIFI_ROHINGYA_KINNA_YA
910 @deftypevrx Constant int UC_JOINING_GROUP_THIN_YEH
911 @deftypevrx Constant int UC_JOINING_GROUP_VERTICAL_TAIL
The following functions implement the association between a joining group
and its name.
917 @deftypefun {const char *} uc_joining_group_name (int@tie{}@var{joining_group})
918 Returns the name of a joining group.
921 @deftypefun int uc_joining_group_byname (const@tie{}char@tie{}*@var{joining_group_name})
922 Returns the joining group given by name, e.g@. @code{"Teh_Marbuta"}.
This lookup ignores spaces, underscores, or hyphens as word separators and is
case-insensitive.
927 The following function gives the joining group of every Unicode character.
929 @deftypefun int uc_joining_group (ucs4_t@tie{}@var{uc})
930 Returns the joining group of a Unicode character.
936 @cindex properties, of Unicode character
937 @cindex Unicode character, properties
This section defines boolean properties of Unicode characters: a character
either has a given property or does not have it. In other words, a property
can be viewed as a subset of the set of Unicode characters.
943 The GNU libunistring library provides two kinds of API for working with
944 properties. The object oriented API uses a type @code{uc_property_t}
to designate a property. In the function-based API, which is slightly
lower level, a property is simply a function.
949 * Properties as objects::
950 * Properties as functions::
953 @node Properties as objects
954 @subsection Properties as objects -- the object oriented API
956 The following type designates a property on Unicode characters.
958 @deftp Type uc_property_t
959 This data type denotes a boolean property on Unicode characters. It is an
960 immediate type that can be copied by simple assignment, without involving
961 memory allocation. It is not an array type.
964 Many Unicode properties are predefined.
966 The following are general properties.
968 @deftypevr Constant uc_property_t UC_PROPERTY_WHITE_SPACE
969 @deftypevrx Constant uc_property_t UC_PROPERTY_ALPHABETIC
970 @deftypevrx Constant uc_property_t UC_PROPERTY_OTHER_ALPHABETIC
971 @deftypevrx Constant uc_property_t UC_PROPERTY_NOT_A_CHARACTER
972 @deftypevrx Constant uc_property_t UC_PROPERTY_DEFAULT_IGNORABLE_CODE_POINT
973 @deftypevrx Constant uc_property_t UC_PROPERTY_OTHER_DEFAULT_IGNORABLE_CODE_POINT
974 @deftypevrx Constant uc_property_t UC_PROPERTY_DEPRECATED
975 @deftypevrx Constant uc_property_t UC_PROPERTY_LOGICAL_ORDER_EXCEPTION
976 @deftypevrx Constant uc_property_t UC_PROPERTY_VARIATION_SELECTOR
977 @deftypevrx Constant uc_property_t UC_PROPERTY_PRIVATE_USE
978 @deftypevrx Constant uc_property_t UC_PROPERTY_UNASSIGNED_CODE_VALUE
981 The following properties are related to case folding.
983 @deftypevr Constant uc_property_t UC_PROPERTY_UPPERCASE
984 @deftypevrx Constant uc_property_t UC_PROPERTY_OTHER_UPPERCASE
985 @deftypevrx Constant uc_property_t UC_PROPERTY_LOWERCASE
986 @deftypevrx Constant uc_property_t UC_PROPERTY_OTHER_LOWERCASE
987 @deftypevrx Constant uc_property_t UC_PROPERTY_TITLECASE
988 @deftypevrx Constant uc_property_t UC_PROPERTY_CASED
989 @deftypevrx Constant uc_property_t UC_PROPERTY_CASE_IGNORABLE
990 @deftypevrx Constant uc_property_t UC_PROPERTY_CHANGES_WHEN_LOWERCASED
991 @deftypevrx Constant uc_property_t UC_PROPERTY_CHANGES_WHEN_UPPERCASED
992 @deftypevrx Constant uc_property_t UC_PROPERTY_CHANGES_WHEN_TITLECASED
993 @deftypevrx Constant uc_property_t UC_PROPERTY_CHANGES_WHEN_CASEFOLDED
994 @deftypevrx Constant uc_property_t UC_PROPERTY_CHANGES_WHEN_CASEMAPPED
995 @deftypevrx Constant uc_property_t UC_PROPERTY_SOFT_DOTTED
998 The following properties are related to identifiers.
1000 @deftypevr Constant uc_property_t UC_PROPERTY_ID_START
1001 @deftypevrx Constant uc_property_t UC_PROPERTY_OTHER_ID_START
1002 @deftypevrx Constant uc_property_t UC_PROPERTY_ID_CONTINUE
1003 @deftypevrx Constant uc_property_t UC_PROPERTY_OTHER_ID_CONTINUE
1004 @deftypevrx Constant uc_property_t UC_PROPERTY_XID_START
1005 @deftypevrx Constant uc_property_t UC_PROPERTY_XID_CONTINUE
1006 @deftypevrx Constant uc_property_t UC_PROPERTY_PATTERN_WHITE_SPACE
1007 @deftypevrx Constant uc_property_t UC_PROPERTY_PATTERN_SYNTAX
1010 The following properties have an influence on shaping and rendering.
1012 @deftypevr Constant uc_property_t UC_PROPERTY_JOIN_CONTROL
1013 @deftypevrx Constant uc_property_t UC_PROPERTY_GRAPHEME_BASE
1014 @deftypevrx Constant uc_property_t UC_PROPERTY_GRAPHEME_EXTEND
1015 @deftypevrx Constant uc_property_t UC_PROPERTY_OTHER_GRAPHEME_EXTEND
1016 @deftypevrx Constant uc_property_t UC_PROPERTY_GRAPHEME_LINK
1019 The following properties relate to bidirectional reordering.
1021 @deftypevr Constant uc_property_t UC_PROPERTY_BIDI_CONTROL
1022 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_LEFT_TO_RIGHT
1023 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_HEBREW_RIGHT_TO_LEFT
1024 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_ARABIC_RIGHT_TO_LEFT
1025 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_EUROPEAN_DIGIT
1026 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_EUR_NUM_SEPARATOR
1027 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_EUR_NUM_TERMINATOR
1028 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_ARABIC_DIGIT
1029 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_COMMON_SEPARATOR
1030 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_BLOCK_SEPARATOR
1031 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_SEGMENT_SEPARATOR
1032 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_WHITESPACE
1033 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_NON_SPACING_MARK
1034 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_BOUNDARY_NEUTRAL
1035 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_PDF
1036 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_EMBEDDING_OR_OVERRIDE
1037 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_OTHER_NEUTRAL
1040 The following properties deal with number representations.
1042 @deftypevr Constant uc_property_t UC_PROPERTY_HEX_DIGIT
1043 @deftypevrx Constant uc_property_t UC_PROPERTY_ASCII_HEX_DIGIT
1046 The following properties deal with CJK.
1048 @deftypevr Constant uc_property_t UC_PROPERTY_IDEOGRAPHIC
1049 @deftypevrx Constant uc_property_t UC_PROPERTY_UNIFIED_IDEOGRAPH
1050 @deftypevrx Constant uc_property_t UC_PROPERTY_RADICAL
1051 @deftypevrx Constant uc_property_t UC_PROPERTY_IDS_BINARY_OPERATOR
1052 @deftypevrx Constant uc_property_t UC_PROPERTY_IDS_TRINARY_OPERATOR
1055 The following properties deal with pictographic symbols.
1057 @deftypevr Constant uc_property_t UC_PROPERTY_EMOJI
1058 @deftypevrx Constant uc_property_t UC_PROPERTY_EMOJI_PRESENTATION
1059 @deftypevrx Constant uc_property_t UC_PROPERTY_EMOJI_MODIFIER
1060 @deftypevrx Constant uc_property_t UC_PROPERTY_EMOJI_MODIFIER_BASE
1061 @deftypevrx Constant uc_property_t UC_PROPERTY_EMOJI_COMPONENT
1062 @deftypevrx Constant uc_property_t UC_PROPERTY_EXTENDED_PICTOGRAPHIC
1065 Other miscellaneous properties are:
1067 @deftypevr Constant uc_property_t UC_PROPERTY_ZERO_WIDTH
1068 @deftypevrx Constant uc_property_t UC_PROPERTY_SPACE
1069 @deftypevrx Constant uc_property_t UC_PROPERTY_NON_BREAK
1070 @deftypevrx Constant uc_property_t UC_PROPERTY_ISO_CONTROL
1071 @deftypevrx Constant uc_property_t UC_PROPERTY_FORMAT_CONTROL
1072 @deftypevrx Constant uc_property_t UC_PROPERTY_DASH
1073 @deftypevrx Constant uc_property_t UC_PROPERTY_HYPHEN
1074 @deftypevrx Constant uc_property_t UC_PROPERTY_PUNCTUATION
1075 @deftypevrx Constant uc_property_t UC_PROPERTY_LINE_SEPARATOR
1076 @deftypevrx Constant uc_property_t UC_PROPERTY_PARAGRAPH_SEPARATOR
1077 @deftypevrx Constant uc_property_t UC_PROPERTY_QUOTATION_MARK
1078 @deftypevrx Constant uc_property_t UC_PROPERTY_SENTENCE_TERMINAL
1079 @deftypevrx Constant uc_property_t UC_PROPERTY_TERMINAL_PUNCTUATION
1080 @deftypevrx Constant uc_property_t UC_PROPERTY_CURRENCY_SYMBOL
1081 @deftypevrx Constant uc_property_t UC_PROPERTY_MATH
1082 @deftypevrx Constant uc_property_t UC_PROPERTY_OTHER_MATH
1083 @deftypevrx Constant uc_property_t UC_PROPERTY_PAIRED_PUNCTUATION
1084 @deftypevrx Constant uc_property_t UC_PROPERTY_LEFT_OF_PAIR
1085 @deftypevrx Constant uc_property_t UC_PROPERTY_COMBINING
1086 @deftypevrx Constant uc_property_t UC_PROPERTY_COMPOSITE
1087 @deftypevrx Constant uc_property_t UC_PROPERTY_DECIMAL_DIGIT
1088 @deftypevrx Constant uc_property_t UC_PROPERTY_NUMERIC
1089 @deftypevrx Constant uc_property_t UC_PROPERTY_DIACRITIC
1090 @deftypevrx Constant uc_property_t UC_PROPERTY_EXTENDER
1091 @deftypevrx Constant uc_property_t UC_PROPERTY_IGNORABLE_CONTROL
1092 @deftypevrx Constant uc_property_t UC_PROPERTY_REGIONAL_INDICATOR
1095 The following function looks up a property by its name.
1097 @deftypefun uc_property_t uc_property_byname (const@tie{}char@tie{}*@var{property_name})
1098 Returns the property given by name, e.g@. @code{"White space"}. If a property
1099 with the given name exists, the result will satisfy the
1100 @code{uc_property_is_valid} predicate. Otherwise the result will not satisfy
1101 this predicate and must not be passed to functions that expect an
1102 @code{uc_property_t} argument.
This lookup ignores spaces, underscores, or hyphens as word separators, is
case-insensitive, and supports the aliases listed in Unicode's
1106 @file{PropertyAliases.txt} file.
1108 This function references a big table of all predefined properties. Its use
1109 can significantly increase the size of your application.
@deftypefun bool uc_property_is_valid (uc_property_t@tie{}@var{property})
Returns @code{true} when the given property is valid, or @code{false}
otherwise.
1117 The following function views a property as a set of Unicode characters.
1119 @deftypefun bool uc_is_property (ucs4_t@tie{}@var{uc}, uc_property_t@tie{}@var{property})
1120 Tests whether the Unicode character @var{uc} has the given property.
1123 @node Properties as functions
1124 @subsection Properties as functions -- the functional API
1126 The following are general properties.
1128 @deftypefun bool uc_is_property_white_space (ucs4_t@tie{}@var{uc})
1129 @deftypefunx bool uc_is_property_alphabetic (ucs4_t@tie{}@var{uc})
1130 @deftypefunx bool uc_is_property_other_alphabetic (ucs4_t@tie{}@var{uc})
1131 @deftypefunx bool uc_is_property_not_a_character (ucs4_t@tie{}@var{uc})
1132 @deftypefunx bool uc_is_property_default_ignorable_code_point (ucs4_t@tie{}@var{uc})
1133 @deftypefunx bool uc_is_property_other_default_ignorable_code_point (ucs4_t@tie{}@var{uc})
1134 @deftypefunx bool uc_is_property_deprecated (ucs4_t@tie{}@var{uc})
1135 @deftypefunx bool uc_is_property_logical_order_exception (ucs4_t@tie{}@var{uc})
1136 @deftypefunx bool uc_is_property_variation_selector (ucs4_t@tie{}@var{uc})
1137 @deftypefunx bool uc_is_property_private_use (ucs4_t@tie{}@var{uc})
1138 @deftypefunx bool uc_is_property_unassigned_code_value (ucs4_t@tie{}@var{uc})
1141 The following properties are related to case folding.
1143 @deftypefun bool uc_is_property_uppercase (ucs4_t@tie{}@var{uc})
1144 @deftypefunx bool uc_is_property_other_uppercase (ucs4_t@tie{}@var{uc})
1145 @deftypefunx bool uc_is_property_lowercase (ucs4_t@tie{}@var{uc})
1146 @deftypefunx bool uc_is_property_other_lowercase (ucs4_t@tie{}@var{uc})
1147 @deftypefunx bool uc_is_property_titlecase (ucs4_t@tie{}@var{uc})
1148 @deftypefunx bool uc_is_property_cased (ucs4_t@tie{}@var{uc})
1149 @deftypefunx bool uc_is_property_case_ignorable (ucs4_t@tie{}@var{uc})
1150 @deftypefunx bool uc_is_property_changes_when_lowercased (ucs4_t@tie{}@var{uc})
1151 @deftypefunx bool uc_is_property_changes_when_uppercased (ucs4_t@tie{}@var{uc})
1152 @deftypefunx bool uc_is_property_changes_when_titlecased (ucs4_t@tie{}@var{uc})
1153 @deftypefunx bool uc_is_property_changes_when_casefolded (ucs4_t@tie{}@var{uc})
1154 @deftypefunx bool uc_is_property_changes_when_casemapped (ucs4_t@tie{}@var{uc})
1155 @deftypefunx bool uc_is_property_soft_dotted (ucs4_t@tie{}@var{uc})
1158 The following properties are related to identifiers.
1160 @deftypefun bool uc_is_property_id_start (ucs4_t@tie{}@var{uc})
1161 @deftypefunx bool uc_is_property_other_id_start (ucs4_t@tie{}@var{uc})
1162 @deftypefunx bool uc_is_property_id_continue (ucs4_t@tie{}@var{uc})
1163 @deftypefunx bool uc_is_property_other_id_continue (ucs4_t@tie{}@var{uc})
1164 @deftypefunx bool uc_is_property_xid_start (ucs4_t@tie{}@var{uc})
1165 @deftypefunx bool uc_is_property_xid_continue (ucs4_t@tie{}@var{uc})
1166 @deftypefunx bool uc_is_property_pattern_white_space (ucs4_t@tie{}@var{uc})
1167 @deftypefunx bool uc_is_property_pattern_syntax (ucs4_t@tie{}@var{uc})
1170 The following properties have an influence on shaping and rendering.
1172 @deftypefun bool uc_is_property_join_control (ucs4_t@tie{}@var{uc})
1173 @deftypefunx bool uc_is_property_grapheme_base (ucs4_t@tie{}@var{uc})
1174 @deftypefunx bool uc_is_property_grapheme_extend (ucs4_t@tie{}@var{uc})
1175 @deftypefunx bool uc_is_property_other_grapheme_extend (ucs4_t@tie{}@var{uc})
1176 @deftypefunx bool uc_is_property_grapheme_link (ucs4_t@tie{}@var{uc})
1179 The following properties relate to bidirectional reordering.
1181 @deftypefun bool uc_is_property_bidi_control (ucs4_t@tie{}@var{uc})
1182 @deftypefunx bool uc_is_property_bidi_left_to_right (ucs4_t@tie{}@var{uc})
1183 @deftypefunx bool uc_is_property_bidi_hebrew_right_to_left (ucs4_t@tie{}@var{uc})
1184 @deftypefunx bool uc_is_property_bidi_arabic_right_to_left (ucs4_t@tie{}@var{uc})
1185 @deftypefunx bool uc_is_property_bidi_european_digit (ucs4_t@tie{}@var{uc})
1186 @deftypefunx bool uc_is_property_bidi_eur_num_separator (ucs4_t@tie{}@var{uc})
1187 @deftypefunx bool uc_is_property_bidi_eur_num_terminator (ucs4_t@tie{}@var{uc})
1188 @deftypefunx bool uc_is_property_bidi_arabic_digit (ucs4_t@tie{}@var{uc})
1189 @deftypefunx bool uc_is_property_bidi_common_separator (ucs4_t@tie{}@var{uc})
1190 @deftypefunx bool uc_is_property_bidi_block_separator (ucs4_t@tie{}@var{uc})
1191 @deftypefunx bool uc_is_property_bidi_segment_separator (ucs4_t@tie{}@var{uc})
1192 @deftypefunx bool uc_is_property_bidi_whitespace (ucs4_t@tie{}@var{uc})
1193 @deftypefunx bool uc_is_property_bidi_non_spacing_mark (ucs4_t@tie{}@var{uc})
1194 @deftypefunx bool uc_is_property_bidi_boundary_neutral (ucs4_t@tie{}@var{uc})
1195 @deftypefunx bool uc_is_property_bidi_pdf (ucs4_t@tie{}@var{uc})
1196 @deftypefunx bool uc_is_property_bidi_embedding_or_override (ucs4_t@tie{}@var{uc})
1197 @deftypefunx bool uc_is_property_bidi_other_neutral (ucs4_t@tie{}@var{uc})
1200 The following properties deal with number representations.
1202 @deftypefun bool uc_is_property_hex_digit (ucs4_t@tie{}@var{uc})
1203 @deftypefunx bool uc_is_property_ascii_hex_digit (ucs4_t@tie{}@var{uc})
1206 The following properties deal with CJK.
1208 @deftypefun bool uc_is_property_ideographic (ucs4_t@tie{}@var{uc})
1209 @deftypefunx bool uc_is_property_unified_ideograph (ucs4_t@tie{}@var{uc})
1210 @deftypefunx bool uc_is_property_radical (ucs4_t@tie{}@var{uc})
1211 @deftypefunx bool uc_is_property_ids_binary_operator (ucs4_t@tie{}@var{uc})
1212 @deftypefunx bool uc_is_property_ids_trinary_operator (ucs4_t@tie{}@var{uc})
1215 The following properties deal with pictographic symbols.
1217 @deftypefun bool uc_is_property_emoji (ucs4_t@tie{}@var{uc})
1218 @deftypefunx bool uc_is_property_emoji_presentation (ucs4_t@tie{}@var{uc})
1219 @deftypefunx bool uc_is_property_emoji_modifier (ucs4_t@tie{}@var{uc})
1220 @deftypefunx bool uc_is_property_emoji_modifier_base (ucs4_t@tie{}@var{uc})
1221 @deftypefunx bool uc_is_property_emoji_component (ucs4_t@tie{}@var{uc})
1222 @deftypefunx bool uc_is_property_extended_pictographic (ucs4_t@tie{}@var{uc})
1225 Other miscellaneous properties are:
1227 @deftypefun bool uc_is_property_zero_width (ucs4_t@tie{}@var{uc})
1228 @deftypefunx bool uc_is_property_space (ucs4_t@tie{}@var{uc})
1229 @deftypefunx bool uc_is_property_non_break (ucs4_t@tie{}@var{uc})
1230 @deftypefunx bool uc_is_property_iso_control (ucs4_t@tie{}@var{uc})
1231 @deftypefunx bool uc_is_property_format_control (ucs4_t@tie{}@var{uc})
1232 @deftypefunx bool uc_is_property_dash (ucs4_t@tie{}@var{uc})
1233 @deftypefunx bool uc_is_property_hyphen (ucs4_t@tie{}@var{uc})
1234 @deftypefunx bool uc_is_property_punctuation (ucs4_t@tie{}@var{uc})
1235 @deftypefunx bool uc_is_property_line_separator (ucs4_t@tie{}@var{uc})
1236 @deftypefunx bool uc_is_property_paragraph_separator (ucs4_t@tie{}@var{uc})
1237 @deftypefunx bool uc_is_property_quotation_mark (ucs4_t@tie{}@var{uc})
1238 @deftypefunx bool uc_is_property_sentence_terminal (ucs4_t@tie{}@var{uc})
1239 @deftypefunx bool uc_is_property_terminal_punctuation (ucs4_t@tie{}@var{uc})
1240 @deftypefunx bool uc_is_property_currency_symbol (ucs4_t@tie{}@var{uc})
1241 @deftypefunx bool uc_is_property_math (ucs4_t@tie{}@var{uc})
1242 @deftypefunx bool uc_is_property_other_math (ucs4_t@tie{}@var{uc})
1243 @deftypefunx bool uc_is_property_paired_punctuation (ucs4_t@tie{}@var{uc})
1244 @deftypefunx bool uc_is_property_left_of_pair (ucs4_t@tie{}@var{uc})
1245 @deftypefunx bool uc_is_property_combining (ucs4_t@tie{}@var{uc})
1246 @deftypefunx bool uc_is_property_composite (ucs4_t@tie{}@var{uc})
1247 @deftypefunx bool uc_is_property_decimal_digit (ucs4_t@tie{}@var{uc})
1248 @deftypefunx bool uc_is_property_numeric (ucs4_t@tie{}@var{uc})
1249 @deftypefunx bool uc_is_property_diacritic (ucs4_t@tie{}@var{uc})
1250 @deftypefunx bool uc_is_property_extender (ucs4_t@tie{}@var{uc})
1251 @deftypefunx bool uc_is_property_ignorable_control (ucs4_t@tie{}@var{uc})
1252 @deftypefunx bool uc_is_property_regional_indicator (ucs4_t@tie{}@var{uc})
1259 The Unicode characters are subdivided into scripts.
1261 The following type is used to represent a script:
1263 @deftp Type uc_script_t
1264 This data type is a structure type that refers to statically allocated
1265 read-only data. It contains the following fields:
1270 The @code{name} field contains the name of the script.
1273 @cindex Unicode character, script
1274 The following functions look up a script.
1276 @deftypefun {const uc_script_t *} uc_script (ucs4_t@tie{}@var{uc})
Returns the script of a Unicode character. Returns @code{NULL} if @var{uc}
does not belong to any script.
1281 @deftypefun {const uc_script_t *} uc_script_byname (const@tie{}char@tie{}*@var{script_name})
Returns the script given by its name, e.g@. @code{"HAN"}. Returns @code{NULL}
if a script with the given name does not exist.
1286 The following function views a script as a set of Unicode characters.
1288 @deftypefun bool uc_is_script (ucs4_t@tie{}@var{uc}, const@tie{}uc_script_t@tie{}*@var{script})
1289 Tests whether a Unicode character belongs to a given script.
1292 The following gives a global picture of all scripts.
1294 @deftypefun void uc_all_scripts (const@tie{}uc_script_t@tie{}**@var{scripts}, size_t@tie{}*@var{count})
1295 Get the list of all scripts. Stores a pointer to an array of all scripts in
1296 @code{*@var{scripts}} and the length of this array in @code{*@var{count}}.
1303 The Unicode characters are subdivided into blocks. A block is an interval of
1304 Unicode code points.
1306 The following type is used to represent a block.
1308 @deftp Type uc_block_t
1309 This data type is a structure type that refers to statically allocated data.
1310 It contains the following fields:
1317 The @code{start} field is the first Unicode code point in the block.
1319 The @code{end} field is the last Unicode code point in the block.
1321 The @code{name} field is the name of the block.
1324 @cindex Unicode character, block
1325 The following function looks up a block.
1327 @deftypefun {const uc_block_t *} uc_block (ucs4_t@tie{}@var{uc})
1328 Returns the block a character belongs to.
1331 The following function views a block as a set of Unicode characters.
1333 @deftypefun bool uc_is_block (ucs4_t@tie{}@var{uc}, const@tie{}uc_block_t@tie{}*@var{block})
1334 Tests whether a Unicode character belongs to a given block.
The following gives a global picture of all blocks.
1339 @deftypefun void uc_all_blocks (const@tie{}uc_block_t@tie{}**@var{blocks}, size_t@tie{}*@var{count})
1340 Get the list of all blocks. Stores a pointer to an array of all blocks in
1341 @code{*@var{blocks}} and the length of this array in @code{*@var{count}}.
1344 @node ISO C and Java syntax
1345 @section ISO C and Java syntax
1347 @cindex C, programming language
1348 @cindex Java, programming language
1350 The following properties are taken from language standards. The supported
1351 language standards are ISO C 99 and Java.
1353 @deftypefun bool uc_is_c_whitespace (ucs4_t@tie{}@var{uc})
1354 Tests whether a Unicode character is considered whitespace in ISO C 99.
1357 @deftypefun bool uc_is_java_whitespace (ucs4_t@tie{}@var{uc})
1358 Tests whether a Unicode character is considered whitespace in Java.
1361 The following enumerated values are the possible return values of the functions
1362 @code{uc_c_ident_category} and @code{uc_java_ident_category}.
1364 @deftypevr Constant int UC_IDENTIFIER_START
1365 This return value means that the given character is valid as first or
1366 subsequent character in an identifier.
1369 @deftypevr Constant int UC_IDENTIFIER_VALID
This return value means that the given character is valid only as a subsequent
character in an identifier.
1374 @deftypevr Constant int UC_IDENTIFIER_INVALID
1375 This return value means that the given character is not valid in an identifier.
1378 @deftypevr Constant int UC_IDENTIFIER_IGNORABLE
1379 This return value (only for Java) means that the given character is ignorable.
The following functions determine whether a given character can be a constituent
1383 of an identifier in the given programming language.
1385 @cindex Unicode character, validity in C identifiers
1386 @deftypefun int uc_c_ident_category (ucs4_t@tie{}@var{uc})
Returns the categorization of a Unicode character with respect to the ISO C 99
identifier syntax.
1391 @cindex Unicode character, validity in Java identifiers
1392 @deftypefun int uc_java_ident_category (ucs4_t@tie{}@var{uc})
Returns the categorization of a Unicode character with respect to the Java
identifier syntax.
1397 @node Classifications like in ISO C
1398 @section Classifications like in ISO C
1401 @cindex Unicode character, classification like in C
1402 The following character classifications mimic those declared in the ISO C
1403 header files @code{<ctype.h>} and @code{<wctype.h>}. These functions are
1404 deprecated, because this set of functions was designed with ASCII in mind and
1405 cannot reflect the more diverse reality of the Unicode character set. But
1406 they can be a quick-and-dirty porting aid when migrating from @code{wchar_t}
1407 APIs to Unicode strings.
1409 @deftypefun bool uc_is_alnum (ucs4_t@tie{}@var{uc})
Tests for any character for which @code{uc_is_alpha} or @code{uc_is_digit} is
true.
1414 @deftypefun bool uc_is_alpha (ucs4_t@tie{}@var{uc})
1415 Tests for any character for which @code{uc_is_upper} or @code{uc_is_lower} is
1416 true, or any character that is one of a locale-specific set of characters for
1417 which none of @code{uc_is_cntrl}, @code{uc_is_digit}, @code{uc_is_punct}, or
1418 @code{uc_is_space} is true.
1421 @deftypefun bool uc_is_cntrl (ucs4_t@tie{}@var{uc})
1422 Tests for any control character.
1425 @deftypefun bool uc_is_digit (ucs4_t@tie{}@var{uc})
1426 Tests for any character that corresponds to a decimal-digit character.
1429 @deftypefun bool uc_is_graph (ucs4_t@tie{}@var{uc})
1430 Tests for any character for which @code{uc_is_print} is true and
1431 @code{uc_is_space} is false.
1434 @deftypefun bool uc_is_lower (ucs4_t@tie{}@var{uc})
1435 Tests for any character that corresponds to a lowercase letter or is one
1436 of a locale-specific set of characters for which none of @code{uc_is_cntrl},
1437 @code{uc_is_digit}, @code{uc_is_punct}, or @code{uc_is_space} is true.
1440 @deftypefun bool uc_is_print (ucs4_t@tie{}@var{uc})
1441 Tests for any printing character.
1444 @deftypefun bool uc_is_punct (ucs4_t@tie{}@var{uc})
1445 Tests for any printing character that is one of a locale-specific set of
1446 characters for which neither @code{uc_is_space} nor @code{uc_is_alnum} is true.
1449 @deftypefun bool uc_is_space (ucs4_t@tie{}@var{uc})
Tests for any character that corresponds to a locale-specific set of characters
for which none of @code{uc_is_alnum}, @code{uc_is_graph}, or @code{uc_is_punct}
is true.
1455 @deftypefun bool uc_is_upper (ucs4_t@tie{}@var{uc})
1456 Tests for any character that corresponds to an uppercase letter or is one
1457 of a locale-specific set of characters for which none of @code{uc_is_cntrl},
1458 @code{uc_is_digit}, @code{uc_is_punct}, or @code{uc_is_space} is true.
1461 @deftypefun bool uc_is_xdigit (ucs4_t@tie{}@var{uc})
1462 Tests for any character that corresponds to a hexadecimal-digit character.
1465 @deftypefun bool uc_is_blank (ucs4_t@tie{}@var{uc})
1466 Tests for any character that corresponds to a standard blank character or
1467 a locale-specific set of characters for which @code{uc_is_alnum} is false.