2 @chapter Unicode character classification and properties @code{<unictype.h>}
4 This include file declares functions that classify Unicode characters
5 and that test whether Unicode characters have specific properties.
7 The classification assigns a ``general category'' to every Unicode
8 character. This is similar to the classification provided by ISO C in
11 Properties are the data that guides various text processing algorithms
12 in the presence of specific Unicode characters.
16 * Canonical combining class::
17 * Bidirectional category::
18 * Decimal digit value::
21 * Mirrored character::
25 * ISO C and Java syntax::
26 * Classifications like in ISO C::
29 @node General category
30 @section General category
32 @cindex general category
33 @cindex Unicode character, general category
34 @cindex Unicode character, classification
35 Every Unicode character or code point has a @emph{general category} assigned
36 to it. This classification is important for most algorithms that work on
39 The GNU libunistring library provides two kinds of API for working with
40 general categories. The object oriented API uses a variable to denote
41 every predefined general category value or combinations thereof. The
42 low-level API uses a bit mask instead. The advantage of the object oriented
43 API is that if only a few predefined general category values are used,
44 the data tables are relatively small. When you combine general category
45 values (using @code{uc_general_category_or}, @code{uc_general_category_and},
or @code{uc_general_category_and_not}), or when you use the low-level
bit masks, a big table is used that holds the complete general category
48 information for all Unicode characters.
51 * Object oriented API::
55 @node Object oriented API
56 @subsection The object oriented API for general category
58 @deftp Type uc_general_category_t
59 This data type denotes a general category value. It is an immediate type that
60 can be copied by simple assignment, without involving memory allocation. It is
The following are the predefined general category values. Additional general
65 categories may be added in the future.
67 @deftypevr Constant uc_general_category_t UC_CATEGORY_L
68 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Lu
69 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Ll
70 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Lt
71 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Lm
72 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Lo
73 @deftypevrx Constant uc_general_category_t UC_CATEGORY_M
74 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Mn
75 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Mc
76 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Me
77 @deftypevrx Constant uc_general_category_t UC_CATEGORY_N
78 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Nd
79 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Nl
80 @deftypevrx Constant uc_general_category_t UC_CATEGORY_No
81 @deftypevrx Constant uc_general_category_t UC_CATEGORY_P
82 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Pc
83 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Pd
84 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Ps
85 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Pe
86 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Pi
87 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Pf
88 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Po
89 @deftypevrx Constant uc_general_category_t UC_CATEGORY_S
90 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Sm
91 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Sc
92 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Sk
93 @deftypevrx Constant uc_general_category_t UC_CATEGORY_So
94 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Z
95 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Zs
96 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Zl
97 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Zp
98 @deftypevrx Constant uc_general_category_t UC_CATEGORY_C
99 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Cc
100 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Cf
101 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Cs
102 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Co
103 @deftypevrx Constant uc_general_category_t UC_CATEGORY_Cn
The following are alias names for predefined general category values.
108 @deftypevr Macro uc_general_category_t UC_LETTER
109 This is another name for @code{UC_CATEGORY_L}.
112 @deftypevr Macro uc_general_category_t UC_UPPERCASE_LETTER
113 This is another name for @code{UC_CATEGORY_Lu}.
116 @deftypevr Macro uc_general_category_t UC_LOWERCASE_LETTER
117 This is another name for @code{UC_CATEGORY_Ll}.
120 @deftypevr Macro uc_general_category_t UC_TITLECASE_LETTER
121 This is another name for @code{UC_CATEGORY_Lt}.
124 @deftypevr Macro uc_general_category_t UC_MODIFIER_LETTER
125 This is another name for @code{UC_CATEGORY_Lm}.
128 @deftypevr Macro uc_general_category_t UC_OTHER_LETTER
129 This is another name for @code{UC_CATEGORY_Lo}.
132 @deftypevr Macro uc_general_category_t UC_MARK
133 This is another name for @code{UC_CATEGORY_M}.
136 @deftypevr Macro uc_general_category_t UC_NON_SPACING_MARK
137 This is another name for @code{UC_CATEGORY_Mn}.
140 @deftypevr Macro uc_general_category_t UC_COMBINING_SPACING_MARK
141 This is another name for @code{UC_CATEGORY_Mc}.
144 @deftypevr Macro uc_general_category_t UC_ENCLOSING_MARK
145 This is another name for @code{UC_CATEGORY_Me}.
148 @deftypevr Macro uc_general_category_t UC_NUMBER
149 This is another name for @code{UC_CATEGORY_N}.
152 @deftypevr Macro uc_general_category_t UC_DECIMAL_DIGIT_NUMBER
153 This is another name for @code{UC_CATEGORY_Nd}.
156 @deftypevr Macro uc_general_category_t UC_LETTER_NUMBER
157 This is another name for @code{UC_CATEGORY_Nl}.
160 @deftypevr Macro uc_general_category_t UC_OTHER_NUMBER
161 This is another name for @code{UC_CATEGORY_No}.
164 @deftypevr Macro uc_general_category_t UC_PUNCTUATION
165 This is another name for @code{UC_CATEGORY_P}.
168 @deftypevr Macro uc_general_category_t UC_CONNECTOR_PUNCTUATION
169 This is another name for @code{UC_CATEGORY_Pc}.
172 @deftypevr Macro uc_general_category_t UC_DASH_PUNCTUATION
173 This is another name for @code{UC_CATEGORY_Pd}.
176 @deftypevr Macro uc_general_category_t UC_OPEN_PUNCTUATION
177 This is another name for @code{UC_CATEGORY_Ps} (``start punctuation'').
180 @deftypevr Macro uc_general_category_t UC_CLOSE_PUNCTUATION
181 This is another name for @code{UC_CATEGORY_Pe} (``end punctuation'').
184 @deftypevr Macro uc_general_category_t UC_INITIAL_QUOTE_PUNCTUATION
185 This is another name for @code{UC_CATEGORY_Pi}.
188 @deftypevr Macro uc_general_category_t UC_FINAL_QUOTE_PUNCTUATION
189 This is another name for @code{UC_CATEGORY_Pf}.
192 @deftypevr Macro uc_general_category_t UC_OTHER_PUNCTUATION
193 This is another name for @code{UC_CATEGORY_Po}.
196 @deftypevr Macro uc_general_category_t UC_SYMBOL
197 This is another name for @code{UC_CATEGORY_S}.
200 @deftypevr Macro uc_general_category_t UC_MATH_SYMBOL
201 This is another name for @code{UC_CATEGORY_Sm}.
204 @deftypevr Macro uc_general_category_t UC_CURRENCY_SYMBOL
205 This is another name for @code{UC_CATEGORY_Sc}.
208 @deftypevr Macro uc_general_category_t UC_MODIFIER_SYMBOL
209 This is another name for @code{UC_CATEGORY_Sk}.
212 @deftypevr Macro uc_general_category_t UC_OTHER_SYMBOL
213 This is another name for @code{UC_CATEGORY_So}.
216 @deftypevr Macro uc_general_category_t UC_SEPARATOR
217 This is another name for @code{UC_CATEGORY_Z}.
220 @deftypevr Macro uc_general_category_t UC_SPACE_SEPARATOR
221 This is another name for @code{UC_CATEGORY_Zs}.
224 @deftypevr Macro uc_general_category_t UC_LINE_SEPARATOR
225 This is another name for @code{UC_CATEGORY_Zl}.
228 @deftypevr Macro uc_general_category_t UC_PARAGRAPH_SEPARATOR
229 This is another name for @code{UC_CATEGORY_Zp}.
232 @deftypevr Macro uc_general_category_t UC_OTHER
233 This is another name for @code{UC_CATEGORY_C}.
236 @deftypevr Macro uc_general_category_t UC_CONTROL
237 This is another name for @code{UC_CATEGORY_Cc}.
240 @deftypevr Macro uc_general_category_t UC_FORMAT
241 This is another name for @code{UC_CATEGORY_Cf}.
244 @deftypevr Macro uc_general_category_t UC_SURROGATE
245 This is another name for @code{UC_CATEGORY_Cs}. All code points in this
246 category are invalid characters.
249 @deftypevr Macro uc_general_category_t UC_PRIVATE_USE
250 This is another name for @code{UC_CATEGORY_Co}.
253 @deftypevr Macro uc_general_category_t UC_UNASSIGNED
254 This is another name for @code{UC_CATEGORY_Cn}. Some code points in this
255 category are invalid characters.
258 The following functions combine general categories, like in a boolean algebra,
259 except that there is no @samp{not} operation.
261 @deftypefun uc_general_category_t uc_general_category_or (uc_general_category_t @var{category1}, uc_general_category_t @var{category2})
262 Returns the union of two general categories.
263 This corresponds to the unions of the two sets of characters.
266 @deftypefun uc_general_category_t uc_general_category_and (uc_general_category_t @var{category1}, uc_general_category_t @var{category2})
267 Returns the intersection of two general categories as bit masks.
268 This @emph{does not} correspond to the intersection of the two sets of
273 @deftypefun uc_general_category_t uc_general_category_and_not (uc_general_category_t @var{category1}, uc_general_category_t @var{category2})
274 Returns the intersection of a general category with the complement of a
275 second general category, as bit masks.
276 This @emph{does not} correspond to the intersection with complement, when
277 viewing the categories as sets of characters.
281 The following functions associate general categories with their name.
283 @deftypefun {const char *} uc_general_category_name (uc_general_category_t @var{category})
284 Returns the name of a general category.
Returns @code{NULL} if the general category corresponds to a bit mask that does not
289 @deftypefun uc_general_category_t uc_general_category_byname (const char *@var{category_name})
290 Returns the general category given by name, e.g@. @code{"Lu"}.
293 The following functions view general categories as sets of Unicode characters.
295 @deftypefun uc_general_category_t uc_general_category (ucs4_t @var{uc})
296 Returns the general category of a Unicode character.
298 This function uses a big table.
301 @deftypefun bool uc_is_general_category (ucs4_t @var{uc}, uc_general_category_t @var{category})
302 Tests whether a Unicode character belongs to a given category.
303 The @var{category} argument can be a predefined general category or the
304 combination of several predefined general categories.
308 @subsection The bit mask API for general category
The following are the predefined general category values as bit masks.
311 Additional general categories may be added in the future.
313 @deftypevr Macro uint32_t UC_CATEGORY_MASK_L
314 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Lu
315 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Ll
316 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Lt
317 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Lm
318 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Lo
319 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_M
320 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Mn
321 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Mc
322 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Me
323 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_N
324 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Nd
325 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Nl
326 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_No
327 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_P
328 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Pc
329 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Pd
330 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Ps
331 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Pe
332 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Pi
333 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Pf
334 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Po
335 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_S
336 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Sm
337 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Sc
338 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Sk
339 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_So
340 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Z
341 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Zs
342 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Zl
343 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Zp
344 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_C
345 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Cc
346 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Cf
347 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Cs
348 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Co
349 @deftypevrx Macro uint32_t UC_CATEGORY_MASK_Cn
352 The following function views general categories as sets of Unicode characters.
354 @deftypefun bool uc_is_general_category_withtable (ucs4_t @var{uc}, uint32_t @var{bitmask})
355 Tests whether a Unicode character belongs to a given category.
356 The @var{bitmask} argument can be a predefined general category bitmask or the
357 combination of several predefined general category bitmasks.
359 This function uses a big table comprising all general categories.
362 @node Canonical combining class
363 @section Canonical combining class
365 @cindex canonical combining class
366 @cindex Unicode character, canonical combining class
367 Every Unicode character or code point has a @emph{canonical combining class}
370 What is the meaning of the canonical combining class? Essentially, it
371 indicates the priority with which a combining character is attached to its
372 base character. The characters for which the canonical combining class is 0
373 are the base characters, and the characters for which it is greater than 0 are
374 the combining characters. Combining characters are rendered
375 near/attached/around their base character, and combining characters with small
combining classes are attached ``first'' or ``closer'' to the base character.
378 The canonical combining class of a character is a number in the range
379 0..255. The possible values are described in the Unicode Character Database
380 @texnl{}@url{http://www.unicode.org/Public/UNIDATA/UCD.html}. The list here is
381 not definitive; more values can be added in future versions.
383 @deftypevr Constant int UC_CCC_NR
384 The canonical combining class value for ``Not Reordered'' characters.
388 @deftypevr Constant int UC_CCC_OV
389 The canonical combining class value for ``Overlay'' characters.
392 @deftypevr Constant int UC_CCC_NK
393 The canonical combining class value for ``Nukta'' characters.
396 @deftypevr Constant int UC_CCC_KV
397 The canonical combining class value for ``Kana Voicing'' characters.
400 @deftypevr Constant int UC_CCC_VR
401 The canonical combining class value for ``Virama'' characters.
404 @deftypevr Constant int UC_CCC_ATBL
405 The canonical combining class value for ``Attached Below Left'' characters.
408 @deftypevr Constant int UC_CCC_ATB
409 The canonical combining class value for ``Attached Below'' characters.
412 @deftypevr Constant int UC_CCC_ATAR
413 The canonical combining class value for ``Attached Above Right'' characters.
416 @deftypevr Constant int UC_CCC_BL
417 The canonical combining class value for ``Below Left'' characters.
420 @deftypevr Constant int UC_CCC_B
421 The canonical combining class value for ``Below'' characters.
424 @deftypevr Constant int UC_CCC_BR
425 The canonical combining class value for ``Below Right'' characters.
428 @deftypevr Constant int UC_CCC_L
429 The canonical combining class value for ``Left'' characters.
432 @deftypevr Constant int UC_CCC_R
433 The canonical combining class value for ``Right'' characters.
436 @deftypevr Constant int UC_CCC_AL
437 The canonical combining class value for ``Above Left'' characters.
440 @deftypevr Constant int UC_CCC_A
441 The canonical combining class value for ``Above'' characters.
444 @deftypevr Constant int UC_CCC_AR
445 The canonical combining class value for ``Above Right'' characters.
448 @deftypevr Constant int UC_CCC_DB
449 The canonical combining class value for ``Double Below'' characters.
452 @deftypevr Constant int UC_CCC_DA
453 The canonical combining class value for ``Double Above'' characters.
456 @deftypevr Constant int UC_CCC_IS
457 The canonical combining class value for ``Iota Subscript'' characters.
460 The following function looks up the canonical combining class of a character.
462 @deftypefun int uc_combining_class (ucs4_t @var{uc})
463 Returns the canonical combining class of a Unicode character.
466 @node Bidirectional category
467 @section Bidirectional category
469 @cindex bidirectional category
470 @cindex Unicode character, bidirectional category
471 Every Unicode character or code point has a @emph{bidirectional category}
474 The bidirectional category guides the bidirectional algorithm@texnl{}
475 (@url{http://www.unicode.org/reports/tr9/}). The possible values are
478 @deftypevr Constant int UC_BIDI_L
The bidirectional category for ``Left-to-Right'' characters.
482 @deftypevr Constant int UC_BIDI_LRE
483 The bidirectional category for ``Left-to-Right Embedding'' characters.
486 @deftypevr Constant int UC_BIDI_LRO
487 The bidirectional category for ``Left-to-Right Override'' characters.
490 @deftypevr Constant int UC_BIDI_R
491 The bidirectional category for ``Right-to-Left'' characters.
494 @deftypevr Constant int UC_BIDI_AL
495 The bidirectional category for ``Right-to-Left Arabic'' characters.
498 @deftypevr Constant int UC_BIDI_RLE
499 The bidirectional category for ``Right-to-Left Embedding'' characters.
502 @deftypevr Constant int UC_BIDI_RLO
503 The bidirectional category for ``Right-to-Left Override'' characters.
506 @deftypevr Constant int UC_BIDI_PDF
507 The bidirectional category for ``Pop Directional Format'' characters.
510 @deftypevr Constant int UC_BIDI_EN
511 The bidirectional category for ``European Number'' characters.
514 @deftypevr Constant int UC_BIDI_ES
515 The bidirectional category for ``European Number Separator'' characters.
518 @deftypevr Constant int UC_BIDI_ET
519 The bidirectional category for ``European Number Terminator'' characters.
522 @deftypevr Constant int UC_BIDI_AN
523 The bidirectional category for ``Arabic Number'' characters.
526 @deftypevr Constant int UC_BIDI_CS
527 The bidirectional category for ``Common Number Separator'' characters.
530 @deftypevr Constant int UC_BIDI_NSM
531 The bidirectional category for ``Non-Spacing Mark'' characters.
534 @deftypevr Constant int UC_BIDI_BN
535 The bidirectional category for ``Boundary Neutral'' characters.
538 @deftypevr Constant int UC_BIDI_B
539 The bidirectional category for ``Paragraph Separator'' characters.
542 @deftypevr Constant int UC_BIDI_S
543 The bidirectional category for ``Segment Separator'' characters.
546 @deftypevr Constant int UC_BIDI_WS
547 The bidirectional category for ``Whitespace'' characters.
550 @deftypevr Constant int UC_BIDI_ON
551 The bidirectional category for ``Other Neutral'' characters.
554 The following functions implement the association between a bidirectional
555 category and its name.
557 @deftypefun {const char *} uc_bidi_category_name (int @var{category})
558 Returns the name of a bidirectional category.
561 @deftypefun int uc_bidi_category_byname (const char *@var{category_name})
562 Returns the bidirectional category given by name, e.g@. @code{"LRE"}.
565 The following functions view bidirectional categories as sets of Unicode
568 @deftypefun int uc_bidi_category (ucs4_t @var{uc})
569 Returns the bidirectional category of a Unicode character.
572 @deftypefun bool uc_is_bidi_category (ucs4_t @var{uc}, int @var{category})
573 Tests whether a Unicode character belongs to a given bidirectional category.
576 @node Decimal digit value
577 @section Decimal digit value
579 @cindex value, of Unicode character
580 @cindex Unicode character, value
581 Decimal digits (like the digits from @samp{0} to @samp{9}) exist in many
582 scripts. The following function converts a decimal digit character to its
585 @deftypefun int uc_decimal_value (ucs4_t @var{uc})
586 Returns the decimal digit value of a Unicode character.
587 The return value is an integer in the range 0..9, or -1 for characters that
588 do not represent a decimal digit.
594 @cindex value, of Unicode character
595 @cindex Unicode character, value
596 Digit characters are like decimal digit characters, possibly in special forms,
such as superscript, subscript, or circled. The following function converts a
598 digit character to its numerical value.
600 @deftypefun int uc_digit_value (ucs4_t @var{uc})
601 Returns the digit value of a Unicode character.
602 The return value is an integer in the range 0..9, or -1 for characters that
603 do not represent a digit.
607 @section Numeric value
609 @cindex value, of Unicode character
610 @cindex Unicode character, value
611 There are also characters that represent numbers without a digit system, like
612 the Roman numerals, and fractional numbers, like 1/4 or 3/4.
614 The following type represents the numeric value of a Unicode character.
615 @deftp Type uc_fraction_t
616 This is a structure type with the following fields:
621 An integer @var{n} is represented by @code{numerator = @var{n}},
622 @code{denominator = 1}.
625 The following function converts a number character to its numerical value.
627 @deftypefun uc_fraction_t uc_numeric_value (ucs4_t @var{uc})
628 Returns the numeric value of a Unicode character.
629 The return value is a fraction, or the pseudo-fraction @code{@{ 0, 0 @}} for
630 characters that do not represent a number.
633 @node Mirrored character
634 @section Mirrored character
636 @cindex mirroring, of Unicode character
637 @cindex Unicode character, mirroring
638 Character mirroring is used to associate the closing parenthesis character
with the opening parenthesis character, the closing brace character with the
640 opening brace character, and so on.
642 The following function looks up the mirrored character of a Unicode character.
644 @deftypefun bool uc_mirror_char (ucs4_t @var{uc}, ucs4_t *@var{puc})
645 Stores the mirrored character of a Unicode character @var{uc} in
646 @code{*@var{puc}} and returns @code{true}, if it exists. Otherwise it
647 stores @var{uc} unmodified in @code{*@var{puc}} and returns @code{false}.
653 @cindex properties, of Unicode character
654 @cindex Unicode character, properties
This section defines boolean properties of Unicode characters. That is,
a character either has a given property or does not have it.
657 In other words, the property can be viewed as a subset of the set of
660 The GNU libunistring library provides two kinds of API for working with
661 properties. The object oriented API uses a type @code{uc_property_t}
to designate a property. In the function-based API, which is lower level,
a property is merely a function.
666 * Properties as objects::
667 * Properties as functions::
670 @node Properties as objects
671 @subsection Properties as objects -- the object oriented API
673 The following type designates a property on Unicode characters.
675 @deftp Type uc_property_t
676 This data type denotes a boolean property on Unicode characters. It is an
677 immediate type that can be copied by simple assignment, without involving
678 memory allocation. It is not an array type.
681 Many Unicode properties are predefined.
683 The following are general properties.
685 @deftypevr Constant uc_property_t UC_PROPERTY_WHITE_SPACE
686 @deftypevrx Constant uc_property_t UC_PROPERTY_ALPHABETIC
687 @deftypevrx Constant uc_property_t UC_PROPERTY_OTHER_ALPHABETIC
688 @deftypevrx Constant uc_property_t UC_PROPERTY_NOT_A_CHARACTER
689 @deftypevrx Constant uc_property_t UC_PROPERTY_DEFAULT_IGNORABLE_CODE_POINT
690 @deftypevrx Constant uc_property_t UC_PROPERTY_OTHER_DEFAULT_IGNORABLE_CODE_POINT
691 @deftypevrx Constant uc_property_t UC_PROPERTY_DEPRECATED
692 @deftypevrx Constant uc_property_t UC_PROPERTY_LOGICAL_ORDER_EXCEPTION
693 @deftypevrx Constant uc_property_t UC_PROPERTY_VARIATION_SELECTOR
694 @deftypevrx Constant uc_property_t UC_PROPERTY_PRIVATE_USE
695 @deftypevrx Constant uc_property_t UC_PROPERTY_UNASSIGNED_CODE_VALUE
698 The following properties are related to case folding.
700 @deftypevr Constant uc_property_t UC_PROPERTY_UPPERCASE
701 @deftypevrx Constant uc_property_t UC_PROPERTY_OTHER_UPPERCASE
702 @deftypevrx Constant uc_property_t UC_PROPERTY_LOWERCASE
703 @deftypevrx Constant uc_property_t UC_PROPERTY_OTHER_LOWERCASE
704 @deftypevrx Constant uc_property_t UC_PROPERTY_TITLECASE
705 @deftypevrx Constant uc_property_t UC_PROPERTY_SOFT_DOTTED
708 The following properties are related to identifiers.
710 @deftypevr Constant uc_property_t UC_PROPERTY_ID_START
711 @deftypevrx Constant uc_property_t UC_PROPERTY_OTHER_ID_START
712 @deftypevrx Constant uc_property_t UC_PROPERTY_ID_CONTINUE
713 @deftypevrx Constant uc_property_t UC_PROPERTY_OTHER_ID_CONTINUE
714 @deftypevrx Constant uc_property_t UC_PROPERTY_XID_START
715 @deftypevrx Constant uc_property_t UC_PROPERTY_XID_CONTINUE
716 @deftypevrx Constant uc_property_t UC_PROPERTY_PATTERN_WHITE_SPACE
717 @deftypevrx Constant uc_property_t UC_PROPERTY_PATTERN_SYNTAX
720 The following properties have an influence on shaping and rendering.
722 @deftypevr Constant uc_property_t UC_PROPERTY_JOIN_CONTROL
723 @deftypevrx Constant uc_property_t UC_PROPERTY_GRAPHEME_BASE
724 @deftypevrx Constant uc_property_t UC_PROPERTY_GRAPHEME_EXTEND
725 @deftypevrx Constant uc_property_t UC_PROPERTY_OTHER_GRAPHEME_EXTEND
726 @deftypevrx Constant uc_property_t UC_PROPERTY_GRAPHEME_LINK
729 The following properties relate to bidirectional reordering.
731 @deftypevr Constant uc_property_t UC_PROPERTY_BIDI_CONTROL
732 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_LEFT_TO_RIGHT
733 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_HEBREW_RIGHT_TO_LEFT
734 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_ARABIC_RIGHT_TO_LEFT
735 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_EUROPEAN_DIGIT
736 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_EUR_NUM_SEPARATOR
737 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_EUR_NUM_TERMINATOR
738 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_ARABIC_DIGIT
739 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_COMMON_SEPARATOR
740 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_BLOCK_SEPARATOR
741 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_SEGMENT_SEPARATOR
742 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_WHITESPACE
743 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_NON_SPACING_MARK
744 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_BOUNDARY_NEUTRAL
745 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_PDF
746 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_EMBEDDING_OR_OVERRIDE
747 @deftypevrx Constant uc_property_t UC_PROPERTY_BIDI_OTHER_NEUTRAL
750 The following properties deal with number representations.
752 @deftypevr Constant uc_property_t UC_PROPERTY_HEX_DIGIT
753 @deftypevrx Constant uc_property_t UC_PROPERTY_ASCII_HEX_DIGIT
756 The following properties deal with CJK.
758 @deftypevr Constant uc_property_t UC_PROPERTY_IDEOGRAPHIC
759 @deftypevrx Constant uc_property_t UC_PROPERTY_UNIFIED_IDEOGRAPH
760 @deftypevrx Constant uc_property_t UC_PROPERTY_RADICAL
761 @deftypevrx Constant uc_property_t UC_PROPERTY_IDS_BINARY_OPERATOR
762 @deftypevrx Constant uc_property_t UC_PROPERTY_IDS_TRINARY_OPERATOR
765 Other miscellaneous properties are:
767 @deftypevr Constant uc_property_t UC_PROPERTY_ZERO_WIDTH
768 @deftypevrx Constant uc_property_t UC_PROPERTY_SPACE
769 @deftypevrx Constant uc_property_t UC_PROPERTY_NON_BREAK
770 @deftypevrx Constant uc_property_t UC_PROPERTY_ISO_CONTROL
771 @deftypevrx Constant uc_property_t UC_PROPERTY_FORMAT_CONTROL
772 @deftypevrx Constant uc_property_t UC_PROPERTY_DASH
773 @deftypevrx Constant uc_property_t UC_PROPERTY_HYPHEN
774 @deftypevrx Constant uc_property_t UC_PROPERTY_PUNCTUATION
775 @deftypevrx Constant uc_property_t UC_PROPERTY_LINE_SEPARATOR
776 @deftypevrx Constant uc_property_t UC_PROPERTY_PARAGRAPH_SEPARATOR
777 @deftypevrx Constant uc_property_t UC_PROPERTY_QUOTATION_MARK
778 @deftypevrx Constant uc_property_t UC_PROPERTY_SENTENCE_TERMINAL
779 @deftypevrx Constant uc_property_t UC_PROPERTY_TERMINAL_PUNCTUATION
780 @deftypevrx Constant uc_property_t UC_PROPERTY_CURRENCY_SYMBOL
781 @deftypevrx Constant uc_property_t UC_PROPERTY_MATH
782 @deftypevrx Constant uc_property_t UC_PROPERTY_OTHER_MATH
783 @deftypevrx Constant uc_property_t UC_PROPERTY_PAIRED_PUNCTUATION
784 @deftypevrx Constant uc_property_t UC_PROPERTY_LEFT_OF_PAIR
785 @deftypevrx Constant uc_property_t UC_PROPERTY_COMBINING
786 @deftypevrx Constant uc_property_t UC_PROPERTY_COMPOSITE
787 @deftypevrx Constant uc_property_t UC_PROPERTY_DECIMAL_DIGIT
788 @deftypevrx Constant uc_property_t UC_PROPERTY_NUMERIC
789 @deftypevrx Constant uc_property_t UC_PROPERTY_DIACRITIC
790 @deftypevrx Constant uc_property_t UC_PROPERTY_EXTENDER
791 @deftypevrx Constant uc_property_t UC_PROPERTY_IGNORABLE_CONTROL
794 The following function looks up a property by its name.
796 @deftypefun uc_property_t uc_property_byname (const char *@var{property_name})
Returns the property given by name, e.g@. @code{"White space"}. If a property
798 with the given name exists, the result will satisfy the
799 @code{uc_property_is_valid} predicate. Otherwise the result will not satisfy
800 this predicate and must not be passed to functions that expect an
801 @code{uc_property_t} argument.
803 This function references a big table of all predefined properties. Its use
804 can significantly increase the size of your application.
@deftypefun bool uc_property_is_valid (uc_property_t @var{property})
808 Returns @code{true} when the given property is valid, or @code{false}
812 The following function views a property as a set of Unicode characters.
814 @deftypefun bool uc_is_property (ucs4_t @var{uc}, uc_property_t @var{property})
815 Tests whether the Unicode character @var{uc} has the given property.
818 @node Properties as functions
819 @subsection Properties as functions -- the functional API
821 The following are general properties.
823 @deftypefun bool uc_is_property_white_space (ucs4_t @var{uc})
824 @deftypefunx bool uc_is_property_alphabetic (ucs4_t @var{uc})
825 @deftypefunx bool uc_is_property_other_alphabetic (ucs4_t @var{uc})
826 @deftypefunx bool uc_is_property_not_a_character (ucs4_t @var{uc})
827 @deftypefunx bool uc_is_property_default_ignorable_code_point (ucs4_t @var{uc})
828 @deftypefunx bool uc_is_property_other_default_ignorable_code_point (ucs4_t @var{uc})
829 @deftypefunx bool uc_is_property_deprecated (ucs4_t @var{uc})
830 @deftypefunx bool uc_is_property_logical_order_exception (ucs4_t @var{uc})
831 @deftypefunx bool uc_is_property_variation_selector (ucs4_t @var{uc})
832 @deftypefunx bool uc_is_property_private_use (ucs4_t @var{uc})
833 @deftypefunx bool uc_is_property_unassigned_code_value (ucs4_t @var{uc})
836 The following properties are related to case folding.
838 @deftypefun bool uc_is_property_uppercase (ucs4_t @var{uc})
839 @deftypefunx bool uc_is_property_other_uppercase (ucs4_t @var{uc})
840 @deftypefunx bool uc_is_property_lowercase (ucs4_t @var{uc})
841 @deftypefunx bool uc_is_property_other_lowercase (ucs4_t @var{uc})
842 @deftypefunx bool uc_is_property_titlecase (ucs4_t @var{uc})
843 @deftypefunx bool uc_is_property_soft_dotted (ucs4_t @var{uc})
846 The following properties are related to identifiers.
848 @deftypefun bool uc_is_property_id_start (ucs4_t @var{uc})
849 @deftypefunx bool uc_is_property_other_id_start (ucs4_t @var{uc})
850 @deftypefunx bool uc_is_property_id_continue (ucs4_t @var{uc})
851 @deftypefunx bool uc_is_property_other_id_continue (ucs4_t @var{uc})
852 @deftypefunx bool uc_is_property_xid_start (ucs4_t @var{uc})
853 @deftypefunx bool uc_is_property_xid_continue (ucs4_t @var{uc})
854 @deftypefunx bool uc_is_property_pattern_white_space (ucs4_t @var{uc})
855 @deftypefunx bool uc_is_property_pattern_syntax (ucs4_t @var{uc})
858 The following properties have an influence on shaping and rendering.
860 @deftypefun bool uc_is_property_join_control (ucs4_t @var{uc})
861 @deftypefunx bool uc_is_property_grapheme_base (ucs4_t @var{uc})
862 @deftypefunx bool uc_is_property_grapheme_extend (ucs4_t @var{uc})
863 @deftypefunx bool uc_is_property_other_grapheme_extend (ucs4_t @var{uc})
864 @deftypefunx bool uc_is_property_grapheme_link (ucs4_t @var{uc})
867 The following properties relate to bidirectional reordering.
869 @deftypefun bool uc_is_property_bidi_control (ucs4_t @var{uc})
870 @deftypefunx bool uc_is_property_bidi_left_to_right (ucs4_t @var{uc})
871 @deftypefunx bool uc_is_property_bidi_hebrew_right_to_left (ucs4_t @var{uc})
872 @deftypefunx bool uc_is_property_bidi_arabic_right_to_left (ucs4_t @var{uc})
873 @deftypefunx bool uc_is_property_bidi_european_digit (ucs4_t @var{uc})
874 @deftypefunx bool uc_is_property_bidi_eur_num_separator (ucs4_t @var{uc})
875 @deftypefunx bool uc_is_property_bidi_eur_num_terminator (ucs4_t @var{uc})
876 @deftypefunx bool uc_is_property_bidi_arabic_digit (ucs4_t @var{uc})
877 @deftypefunx bool uc_is_property_bidi_common_separator (ucs4_t @var{uc})
878 @deftypefunx bool uc_is_property_bidi_block_separator (ucs4_t @var{uc})
879 @deftypefunx bool uc_is_property_bidi_segment_separator (ucs4_t @var{uc})
880 @deftypefunx bool uc_is_property_bidi_whitespace (ucs4_t @var{uc})
881 @deftypefunx bool uc_is_property_bidi_non_spacing_mark (ucs4_t @var{uc})
882 @deftypefunx bool uc_is_property_bidi_boundary_neutral (ucs4_t @var{uc})
883 @deftypefunx bool uc_is_property_bidi_pdf (ucs4_t @var{uc})
884 @deftypefunx bool uc_is_property_bidi_embedding_or_override (ucs4_t @var{uc})
885 @deftypefunx bool uc_is_property_bidi_other_neutral (ucs4_t @var{uc})
888 The following properties deal with number representations.
890 @deftypefun bool uc_is_property_hex_digit (ucs4_t @var{uc})
891 @deftypefunx bool uc_is_property_ascii_hex_digit (ucs4_t @var{uc})
894 The following properties deal with CJK.
896 @deftypefun bool uc_is_property_ideographic (ucs4_t @var{uc})
897 @deftypefunx bool uc_is_property_unified_ideograph (ucs4_t @var{uc})
898 @deftypefunx bool uc_is_property_radical (ucs4_t @var{uc})
899 @deftypefunx bool uc_is_property_ids_binary_operator (ucs4_t @var{uc})
900 @deftypefunx bool uc_is_property_ids_trinary_operator (ucs4_t @var{uc})
903 Other miscellaneous properties are:
905 @deftypefun bool uc_is_property_zero_width (ucs4_t @var{uc})
906 @deftypefunx bool uc_is_property_space (ucs4_t @var{uc})
907 @deftypefunx bool uc_is_property_non_break (ucs4_t @var{uc})
908 @deftypefunx bool uc_is_property_iso_control (ucs4_t @var{uc})
909 @deftypefunx bool uc_is_property_format_control (ucs4_t @var{uc})
910 @deftypefunx bool uc_is_property_dash (ucs4_t @var{uc})
911 @deftypefunx bool uc_is_property_hyphen (ucs4_t @var{uc})
912 @deftypefunx bool uc_is_property_punctuation (ucs4_t @var{uc})
913 @deftypefunx bool uc_is_property_line_separator (ucs4_t @var{uc})
914 @deftypefunx bool uc_is_property_paragraph_separator (ucs4_t @var{uc})
915 @deftypefunx bool uc_is_property_quotation_mark (ucs4_t @var{uc})
916 @deftypefunx bool uc_is_property_sentence_terminal (ucs4_t @var{uc})
917 @deftypefunx bool uc_is_property_terminal_punctuation (ucs4_t @var{uc})
918 @deftypefunx bool uc_is_property_currency_symbol (ucs4_t @var{uc})
919 @deftypefunx bool uc_is_property_math (ucs4_t @var{uc})
920 @deftypefunx bool uc_is_property_other_math (ucs4_t @var{uc})
921 @deftypefunx bool uc_is_property_paired_punctuation (ucs4_t @var{uc})
922 @deftypefunx bool uc_is_property_left_of_pair (ucs4_t @var{uc})
923 @deftypefunx bool uc_is_property_combining (ucs4_t @var{uc})
924 @deftypefunx bool uc_is_property_composite (ucs4_t @var{uc})
925 @deftypefunx bool uc_is_property_decimal_digit (ucs4_t @var{uc})
926 @deftypefunx bool uc_is_property_numeric (ucs4_t @var{uc})
927 @deftypefunx bool uc_is_property_diacritic (ucs4_t @var{uc})
928 @deftypefunx bool uc_is_property_extender (ucs4_t @var{uc})
929 @deftypefunx bool uc_is_property_ignorable_control (ucs4_t @var{uc})
936 The Unicode characters are subdivided into scripts.
938 The following type is used to represent a script:
940 @deftp Type uc_script_t
941 This data type is a structure type that refers to statically allocated
942 read-only data. It contains the following fields:
947 The @code{name} field contains the name of the script.
950 @cindex Unicode character, script
951 The following functions look up a script.
953 @deftypefun {const uc_script_t *} uc_script (ucs4_t @var{uc})
954 Returns the script of a Unicode character. Returns @code{NULL} if @var{uc} does not
955 belong to any script.
958 @deftypefun {const uc_script_t *} uc_script_byname (const char *@var{script_name})
959 Returns the script given by its name, e.g@. @code{"HAN"}. Returns @code{NULL} if a
960 script with the given name does not exist.
963 The following function views a script as a set of Unicode characters.
965 @deftypefun bool uc_is_script (ucs4_t @var{uc}, const uc_script_t *@var{script})
966 Tests whether a Unicode character belongs to a given script.
969 The following gives a global picture of all scripts.
971 @deftypefun void uc_all_scripts (const uc_script_t **@var{scripts}, size_t *@var{count})
972 Get the list of all scripts. Stores a pointer to an array of all scripts in
973 @code{*@var{scripts}} and the length of this array in @code{*@var{count}}.
980 The Unicode characters are subdivided into blocks. A block is an interval of Unicode code points.
983 The following type is used to represent a block.
985 @deftp Type uc_block_t
986 This data type is a structure type that refers to statically allocated data.
987 It contains the following fields:
994 The @code{start} field is the first Unicode code point in the block.
996 The @code{end} field is the last Unicode code point in the block.
998 The @code{name} field is the name of the block.
1001 @cindex Unicode character, block
1002 The following function looks up a block.
1004 @deftypefun {const uc_block_t *} uc_block (ucs4_t @var{uc})
1005 Returns the block a character belongs to.
1008 The following function views a block as a set of Unicode characters.
1010 @deftypefun bool uc_is_block (ucs4_t @var{uc}, const uc_block_t *@var{block})
1011 Tests whether a Unicode character belongs to a given block.
1014 The following gives a global picture of all blocks.
1016 @deftypefun void uc_all_blocks (const uc_block_t **@var{blocks}, size_t *@var{count})
1017 Get the list of all blocks. Stores a pointer to an array of all blocks in
1018 @code{*@var{blocks}} and the length of this array in @code{*@var{count}}.
1021 @node ISO C and Java syntax
1022 @section ISO C and Java syntax
1024 @cindex C, programming language
1025 @cindex Java, programming language
1027 The following properties are taken from language standards. The supported
1028 language standards are ISO C 99 and Java.
1030 @deftypefun bool uc_is_c_whitespace (ucs4_t @var{uc})
1031 Tests whether a Unicode character is considered whitespace in ISO C 99.
1034 @deftypefun bool uc_is_java_whitespace (ucs4_t @var{uc})
1035 Tests whether a Unicode character is considered whitespace in Java.
1038 The following enumerated values are the possible return values of the functions
1039 @code{uc_c_ident_category} and @code{uc_java_ident_category}.
1041 @deftypevr Constant int UC_IDENTIFIER_START
1042 This return value means that the given character is valid as first or
1043 subsequent character in an identifier.
1046 @deftypevr Constant int UC_IDENTIFIER_VALID
1047 This return value means that the given character is valid as subsequent character only.
1051 @deftypevr Constant int UC_IDENTIFIER_INVALID
1052 This return value means that the given character is not valid in an identifier.
1055 @deftypevr Constant int UC_IDENTIFIER_IGNORABLE
1056 This return value (only for Java) means that the given character is ignorable.
1059 The following functions determine whether a given character can be a constituent
1060 of an identifier in the given programming language.
1062 @cindex Unicode character, validity in C identifiers
1063 @deftypefun int uc_c_ident_category (ucs4_t @var{uc})
1064 Returns the categorization of a Unicode character with respect to the ISO C 99 identifier syntax.
1068 @cindex Unicode character, validity in Java identifiers
1069 @deftypefun int uc_java_ident_category (ucs4_t @var{uc})
1070 Returns the categorization of a Unicode character with respect to the Java identifier syntax.
1074 @node Classifications like in ISO C
1075 @section Classifications like in ISO C
1078 @cindex Unicode character, classification like in C
1079 The following character classifications mimic those declared in the ISO C
1080 header files @code{<ctype.h>} and @code{<wctype.h>}. These functions are
1081 deprecated, because this set of functions was designed with ASCII in mind and
1082 cannot reflect the more diverse reality of the Unicode character set. But
1083 they can be a quick-and-dirty porting aid when migrating from @code{wchar_t}
1084 APIs to Unicode strings.
1086 @deftypefun bool uc_is_alnum (ucs4_t @var{uc})
1087 Tests for any character for which @code{uc_is_alpha} or @code{uc_is_digit} is true.
1091 @deftypefun bool uc_is_alpha (ucs4_t @var{uc})
1092 Tests for any character for which @code{uc_is_upper} or @code{uc_is_lower} is
1093 true, or any character that is one of a locale-specific set of characters for
1094 which none of @code{uc_is_cntrl}, @code{uc_is_digit}, @code{uc_is_punct}, or
1095 @code{uc_is_space} is true.
1098 @deftypefun bool uc_is_cntrl (ucs4_t @var{uc})
1099 Tests for any control character.
1102 @deftypefun bool uc_is_digit (ucs4_t @var{uc})
1103 Tests for any character that corresponds to a decimal-digit character.
1106 @deftypefun bool uc_is_graph (ucs4_t @var{uc})
1107 Tests for any character for which @code{uc_is_print} is true and
1108 @code{uc_is_space} is false.
1111 @deftypefun bool uc_is_lower (ucs4_t @var{uc})
1112 Tests for any character that corresponds to a lowercase letter or is one
1113 of a locale-specific set of characters for which none of @code{uc_is_cntrl},
1114 @code{uc_is_digit}, @code{uc_is_punct}, or @code{uc_is_space} is true.
1117 @deftypefun bool uc_is_print (ucs4_t @var{uc})
1118 Tests for any printing character.
1121 @deftypefun bool uc_is_punct (ucs4_t @var{uc})
1122 Tests for any printing character that is one of a locale-specific set of
1123 characters for which neither @code{uc_is_space} nor @code{uc_is_alnum} is true.
1126 @deftypefun bool uc_is_space (ucs4_t @var{uc})
1127 Tests for any character that corresponds to a locale-specific set of characters
1128 for which none of @code{uc_is_alnum}, @code{uc_is_graph}, or @code{uc_is_punct} is true.
1132 @deftypefun bool uc_is_upper (ucs4_t @var{uc})
1133 Tests for any character that corresponds to an uppercase letter or is one
1134 of a locale-specific set of characters for which none of @code{uc_is_cntrl},
1135 @code{uc_is_digit}, @code{uc_is_punct}, or @code{uc_is_space} is true.
1138 @deftypefun bool uc_is_xdigit (ucs4_t @var{uc})
1139 Tests for any character that corresponds to a hexadecimal-digit character.
1142 @deftypefun bool uc_is_blank (ucs4_t @var{uc})
1143 Tests for any character that corresponds to a standard blank character or
1144 a locale-specific set of characters for which @code{uc_is_alnum} is false.