8. Unicode character classification and properties <unictype.h>

This include file declares functions that classify Unicode characters and that test whether Unicode characters have specific properties.

The classification assigns a “general category” to every Unicode character. This is similar to the classification provided by ISO C in <wctype.h>.

Properties are the data that guides various text processing algorithms in the presence of specific Unicode characters.


8.1 General category

Every Unicode character or code point has a general category assigned to it. This classification is important for most algorithms that work on Unicode text.

The GNU libunistring library provides two kinds of API for working with general categories. The object oriented API uses a variable to denote every predefined general category value or combinations thereof. The low-level API uses a bit mask instead. The advantage of the object oriented API is that if only a few predefined general category values are used, the data tables are relatively small. When you combine general category values (using uc_general_category_or, uc_general_category_and, or uc_general_category_and_not), or when you use the low-level bit masks, a big table is used that holds the complete general category information for all Unicode characters.


8.1.1 The object oriented API for general category

Type: uc_general_category_t

This data type denotes a general category value. It is an immediate type that can be copied by simple assignment, without involving memory allocation. It is not an array type.

The following are the predefined general category values. Additional general categories may be added in the future.

Constant: uc_general_category_t UC_CATEGORY_L
Constant: uc_general_category_t UC_CATEGORY_LC
Constant: uc_general_category_t UC_CATEGORY_Lu
Constant: uc_general_category_t UC_CATEGORY_Ll
Constant: uc_general_category_t UC_CATEGORY_Lt
Constant: uc_general_category_t UC_CATEGORY_Lm
Constant: uc_general_category_t UC_CATEGORY_Lo
Constant: uc_general_category_t UC_CATEGORY_M
Constant: uc_general_category_t UC_CATEGORY_Mn
Constant: uc_general_category_t UC_CATEGORY_Mc
Constant: uc_general_category_t UC_CATEGORY_Me
Constant: uc_general_category_t UC_CATEGORY_N
Constant: uc_general_category_t UC_CATEGORY_Nd
Constant: uc_general_category_t UC_CATEGORY_Nl
Constant: uc_general_category_t UC_CATEGORY_No
Constant: uc_general_category_t UC_CATEGORY_P
Constant: uc_general_category_t UC_CATEGORY_Pc
Constant: uc_general_category_t UC_CATEGORY_Pd
Constant: uc_general_category_t UC_CATEGORY_Ps
Constant: uc_general_category_t UC_CATEGORY_Pe
Constant: uc_general_category_t UC_CATEGORY_Pi
Constant: uc_general_category_t UC_CATEGORY_Pf
Constant: uc_general_category_t UC_CATEGORY_Po
Constant: uc_general_category_t UC_CATEGORY_S
Constant: uc_general_category_t UC_CATEGORY_Sm
Constant: uc_general_category_t UC_CATEGORY_Sc
Constant: uc_general_category_t UC_CATEGORY_Sk
Constant: uc_general_category_t UC_CATEGORY_So
Constant: uc_general_category_t UC_CATEGORY_Z
Constant: uc_general_category_t UC_CATEGORY_Zs
Constant: uc_general_category_t UC_CATEGORY_Zl
Constant: uc_general_category_t UC_CATEGORY_Zp
Constant: uc_general_category_t UC_CATEGORY_C
Constant: uc_general_category_t UC_CATEGORY_Cc
Constant: uc_general_category_t UC_CATEGORY_Cf
Constant: uc_general_category_t UC_CATEGORY_Cs
Constant: uc_general_category_t UC_CATEGORY_Co
Constant: uc_general_category_t UC_CATEGORY_Cn

The following are alias names for predefined general category values.

Macro: uc_general_category_t UC_LETTER

This is another name for UC_CATEGORY_L.

Macro: uc_general_category_t UC_CASED_LETTER

This is another name for UC_CATEGORY_LC.

Macro: uc_general_category_t UC_UPPERCASE_LETTER

This is another name for UC_CATEGORY_Lu.

Macro: uc_general_category_t UC_LOWERCASE_LETTER

This is another name for UC_CATEGORY_Ll.

Macro: uc_general_category_t UC_TITLECASE_LETTER

This is another name for UC_CATEGORY_Lt.

Macro: uc_general_category_t UC_MODIFIER_LETTER

This is another name for UC_CATEGORY_Lm.

Macro: uc_general_category_t UC_OTHER_LETTER

This is another name for UC_CATEGORY_Lo.

Macro: uc_general_category_t UC_MARK

This is another name for UC_CATEGORY_M.

Macro: uc_general_category_t UC_NON_SPACING_MARK

This is another name for UC_CATEGORY_Mn.

Macro: uc_general_category_t UC_COMBINING_SPACING_MARK

This is another name for UC_CATEGORY_Mc.

Macro: uc_general_category_t UC_ENCLOSING_MARK

This is another name for UC_CATEGORY_Me.

Macro: uc_general_category_t UC_NUMBER

This is another name for UC_CATEGORY_N.

Macro: uc_general_category_t UC_DECIMAL_DIGIT_NUMBER

This is another name for UC_CATEGORY_Nd.

Macro: uc_general_category_t UC_LETTER_NUMBER

This is another name for UC_CATEGORY_Nl.

Macro: uc_general_category_t UC_OTHER_NUMBER

This is another name for UC_CATEGORY_No.

Macro: uc_general_category_t UC_PUNCTUATION

This is another name for UC_CATEGORY_P.

Macro: uc_general_category_t UC_CONNECTOR_PUNCTUATION

This is another name for UC_CATEGORY_Pc.

Macro: uc_general_category_t UC_DASH_PUNCTUATION

This is another name for UC_CATEGORY_Pd.

Macro: uc_general_category_t UC_OPEN_PUNCTUATION

This is another name for UC_CATEGORY_Ps (“start punctuation”).

Macro: uc_general_category_t UC_CLOSE_PUNCTUATION

This is another name for UC_CATEGORY_Pe (“end punctuation”).

Macro: uc_general_category_t UC_INITIAL_QUOTE_PUNCTUATION

This is another name for UC_CATEGORY_Pi.

Macro: uc_general_category_t UC_FINAL_QUOTE_PUNCTUATION

This is another name for UC_CATEGORY_Pf.

Macro: uc_general_category_t UC_OTHER_PUNCTUATION

This is another name for UC_CATEGORY_Po.

Macro: uc_general_category_t UC_SYMBOL

This is another name for UC_CATEGORY_S.

Macro: uc_general_category_t UC_MATH_SYMBOL

This is another name for UC_CATEGORY_Sm.

Macro: uc_general_category_t UC_CURRENCY_SYMBOL

This is another name for UC_CATEGORY_Sc.

Macro: uc_general_category_t UC_MODIFIER_SYMBOL

This is another name for UC_CATEGORY_Sk.

Macro: uc_general_category_t UC_OTHER_SYMBOL

This is another name for UC_CATEGORY_So.

Macro: uc_general_category_t UC_SEPARATOR

This is another name for UC_CATEGORY_Z.

Macro: uc_general_category_t UC_SPACE_SEPARATOR

This is another name for UC_CATEGORY_Zs.

Macro: uc_general_category_t UC_LINE_SEPARATOR

This is another name for UC_CATEGORY_Zl.

Macro: uc_general_category_t UC_PARAGRAPH_SEPARATOR

This is another name for UC_CATEGORY_Zp.

Macro: uc_general_category_t UC_OTHER

This is another name for UC_CATEGORY_C.

Macro: uc_general_category_t UC_CONTROL

This is another name for UC_CATEGORY_Cc.

Macro: uc_general_category_t UC_FORMAT

This is another name for UC_CATEGORY_Cf.

Macro: uc_general_category_t UC_SURROGATE

This is another name for UC_CATEGORY_Cs. All code points in this category are invalid characters.

Macro: uc_general_category_t UC_PRIVATE_USE

This is another name for UC_CATEGORY_Co.

Macro: uc_general_category_t UC_UNASSIGNED

This is another name for UC_CATEGORY_Cn. Some code points in this category are invalid characters.

The following functions combine general categories, like in a boolean algebra, except that there is no ‘not’ operation.

Function: uc_general_category_t uc_general_category_or (uc_general_category_t category1, uc_general_category_t category2)

Returns the union of two general categories. This corresponds to the union of the two sets of characters.

Function: uc_general_category_t uc_general_category_and (uc_general_category_t category1, uc_general_category_t category2)

Returns the intersection of two general categories as bit masks. This does not correspond to the intersection of the two sets of characters.

Function: uc_general_category_t uc_general_category_and_not (uc_general_category_t category1, uc_general_category_t category2)

Returns the intersection of a general category with the complement of a second general category, as bit masks. This does not correspond to the intersection with complement, when viewing the categories as sets of characters.

The following functions associate general categories with their name.

Function: const char * uc_general_category_name (uc_general_category_t category)

Returns the name of a general category, more precisely, the abbreviated name. Returns NULL if the general category corresponds to a bit mask that does not have a name.

Function: const char * uc_general_category_long_name (uc_general_category_t category)

Returns the long name of a general category. Returns NULL if the general category corresponds to a bit mask that does not have a name.

Function: uc_general_category_t uc_general_category_byname (const char *category_name)

Returns the general category given by name, e.g. "Lu", or by long name, e.g. "Uppercase Letter". This lookup ignores spaces, underscores, or hyphens as word separators and is case-insensitive.

The following functions view general categories as sets of Unicode characters.

Function: uc_general_category_t uc_general_category (ucs4_t uc)

Returns the general category of a Unicode character.

This function uses a big table.

Function: bool uc_is_general_category (ucs4_t uc, uc_general_category_t category)

Tests whether a Unicode character belongs to a given category. The category argument can be a predefined general category or the combination of several predefined general categories.


8.1.2 The bit mask API for general category

The following are the predefined general category values as bit masks. Additional general categories may be added in the future.

Macro: uint32_t UC_CATEGORY_MASK_L
Macro: uint32_t UC_CATEGORY_MASK_LC
Macro: uint32_t UC_CATEGORY_MASK_Lu
Macro: uint32_t UC_CATEGORY_MASK_Ll
Macro: uint32_t UC_CATEGORY_MASK_Lt
Macro: uint32_t UC_CATEGORY_MASK_Lm
Macro: uint32_t UC_CATEGORY_MASK_Lo
Macro: uint32_t UC_CATEGORY_MASK_M
Macro: uint32_t UC_CATEGORY_MASK_Mn
Macro: uint32_t UC_CATEGORY_MASK_Mc
Macro: uint32_t UC_CATEGORY_MASK_Me
Macro: uint32_t UC_CATEGORY_MASK_N
Macro: uint32_t UC_CATEGORY_MASK_Nd
Macro: uint32_t UC_CATEGORY_MASK_Nl
Macro: uint32_t UC_CATEGORY_MASK_No
Macro: uint32_t UC_CATEGORY_MASK_P
Macro: uint32_t UC_CATEGORY_MASK_Pc
Macro: uint32_t UC_CATEGORY_MASK_Pd
Macro: uint32_t UC_CATEGORY_MASK_Ps
Macro: uint32_t UC_CATEGORY_MASK_Pe
Macro: uint32_t UC_CATEGORY_MASK_Pi
Macro: uint32_t UC_CATEGORY_MASK_Pf
Macro: uint32_t UC_CATEGORY_MASK_Po
Macro: uint32_t UC_CATEGORY_MASK_S
Macro: uint32_t UC_CATEGORY_MASK_Sm
Macro: uint32_t UC_CATEGORY_MASK_Sc
Macro: uint32_t UC_CATEGORY_MASK_Sk
Macro: uint32_t UC_CATEGORY_MASK_So
Macro: uint32_t UC_CATEGORY_MASK_Z
Macro: uint32_t UC_CATEGORY_MASK_Zs
Macro: uint32_t UC_CATEGORY_MASK_Zl
Macro: uint32_t UC_CATEGORY_MASK_Zp
Macro: uint32_t UC_CATEGORY_MASK_C
Macro: uint32_t UC_CATEGORY_MASK_Cc
Macro: uint32_t UC_CATEGORY_MASK_Cf
Macro: uint32_t UC_CATEGORY_MASK_Cs
Macro: uint32_t UC_CATEGORY_MASK_Co
Macro: uint32_t UC_CATEGORY_MASK_Cn

The following function views general categories as sets of Unicode characters.

Function: bool uc_is_general_category_withtable (ucs4_t uc, uint32_t bitmask)

Tests whether a Unicode character belongs to a given category. The bitmask argument can be a predefined general category bitmask or the combination of several predefined general category bitmasks.

This function uses a big table comprising all general categories.


8.2 Canonical combining class

Every Unicode character or code point has a canonical combining class assigned to it.

What is the meaning of the canonical combining class? Essentially, it indicates the priority with which a combining character is attached to its base character. The characters for which the canonical combining class is 0 are the base characters, and the characters for which it is greater than 0 are the combining characters. Combining characters are rendered near, attached to, or around their base character, and combining characters with small combining classes are attached "first" or "closer" to the base character.

The canonical combining class of a character is a number in the range 0..255. The possible values are described in the Unicode Character Database http://www.unicode.org/Public/UNIDATA/UCD.html. The list here is not definitive; more values can be added in future versions.

Constant: int UC_CCC_NR

The canonical combining class value for “Not Reordered” characters. The value is 0.

Constant: int UC_CCC_OV

The canonical combining class value for “Overlay” characters.

Constant: int UC_CCC_NK

The canonical combining class value for “Nukta” characters.

Constant: int UC_CCC_KV

The canonical combining class value for “Kana Voicing” characters.

Constant: int UC_CCC_VR

The canonical combining class value for “Virama” characters.

Constant: int UC_CCC_ATBL

The canonical combining class value for “Attached Below Left” characters.

Constant: int UC_CCC_ATB

The canonical combining class value for “Attached Below” characters.

Constant: int UC_CCC_ATA

The canonical combining class value for “Attached Above” characters.

Constant: int UC_CCC_ATAR

The canonical combining class value for “Attached Above Right” characters.

Constant: int UC_CCC_BL

The canonical combining class value for “Below Left” characters.

Constant: int UC_CCC_B

The canonical combining class value for “Below” characters.

Constant: int UC_CCC_BR

The canonical combining class value for “Below Right” characters.

Constant: int UC_CCC_L

The canonical combining class value for “Left” characters.

Constant: int UC_CCC_R

The canonical combining class value for “Right” characters.

Constant: int UC_CCC_AL

The canonical combining class value for “Above Left” characters.

Constant: int UC_CCC_A

The canonical combining class value for “Above” characters.

Constant: int UC_CCC_AR

The canonical combining class value for “Above Right” characters.

Constant: int UC_CCC_DB

The canonical combining class value for “Double Below” characters.

Constant: int UC_CCC_DA

The canonical combining class value for “Double Above” characters.

Constant: int UC_CCC_IS

The canonical combining class value for “Iota Subscript” characters.

The following functions associate canonical combining classes with their name.

Function: const char * uc_combining_class_name (int ccc)

Returns the name of a canonical combining class, more precisely, the abbreviated name. Returns NULL if the canonical combining class is a numeric value without a name.

Function: const char * uc_combining_class_long_name (int ccc)

Returns the long name of a canonical combining class. Returns NULL if the canonical combining class is a numeric value without a name.

Function: int uc_combining_class_byname (const char *ccc_name)

Returns the canonical combining class given by name, e.g. "BL", or by long name, e.g. "Below Left". This lookup ignores spaces, underscores, or hyphens as word separators and is case-insensitive.

The following function looks up the canonical combining class of a character.

Function: int uc_combining_class (ucs4_t uc)

Returns the canonical combining class of a Unicode character.


8.3 Bidi class

Every Unicode character or code point has a bidi class assigned to it. Before Unicode 4.0, this concept was known as bidirectional category.

The bidi class guides the bidirectional algorithm (http://www.unicode.org/reports/tr9/). The possible values are the following.

Constant: int UC_BIDI_L

The bidi class for “Left-to-Right” characters.

Constant: int UC_BIDI_LRE

The bidi class for “Left-to-Right Embedding” characters.

Constant: int UC_BIDI_LRO

The bidi class for “Left-to-Right Override” characters.

Constant: int UC_BIDI_R

The bidi class for “Right-to-Left” characters.

Constant: int UC_BIDI_AL

The bidi class for “Right-to-Left Arabic” characters.

Constant: int UC_BIDI_RLE

The bidi class for “Right-to-Left Embedding” characters.

Constant: int UC_BIDI_RLO

The bidi class for “Right-to-Left Override” characters.

Constant: int UC_BIDI_PDF

The bidi class for “Pop Directional Format” characters.

Constant: int UC_BIDI_EN

The bidi class for “European Number” characters.

Constant: int UC_BIDI_ES

The bidi class for “European Number Separator” characters.

Constant: int UC_BIDI_ET

The bidi class for “European Number Terminator” characters.

Constant: int UC_BIDI_AN

The bidi class for “Arabic Number” characters.

Constant: int UC_BIDI_CS

The bidi class for “Common Number Separator” characters.

Constant: int UC_BIDI_NSM

The bidi class for “Non-Spacing Mark” characters.

Constant: int UC_BIDI_BN

The bidi class for “Boundary Neutral” characters.

Constant: int UC_BIDI_B

The bidi class for “Paragraph Separator” characters.

Constant: int UC_BIDI_S

The bidi class for “Segment Separator” characters.

Constant: int UC_BIDI_WS

The bidi class for “Whitespace” characters.

Constant: int UC_BIDI_ON

The bidi class for “Other Neutral” characters.

The following functions implement the association between a bidirectional category and its name.

Function: const char * uc_bidi_class_name (int bidi_class)
Function: const char * uc_bidi_category_name (int category)

Returns the name of a bidi class, more precisely, the abbreviated name.

Function: const char * uc_bidi_class_long_name (int bidi_class)

Returns the long name of a bidi class.

Function: int uc_bidi_class_byname (const char *bidi_class_name)
Function: int uc_bidi_category_byname (const char *category_name)

Returns the bidi class given by name, e.g. "LRE", or by long name, e.g. "Left-to-Right Embedding". This lookup ignores spaces, underscores, or hyphens as word separators and is case-insensitive.

The following functions view bidirectional categories as sets of Unicode characters.

Function: int uc_bidi_class (ucs4_t uc)
Function: int uc_bidi_category (ucs4_t uc)

Returns the bidi class of a Unicode character.

Function: bool uc_is_bidi_class (ucs4_t uc, int bidi_class)
Function: bool uc_is_bidi_category (ucs4_t uc, int category)

Tests whether a Unicode character belongs to a given bidi class.


8.4 Decimal digit value

Decimal digits (like the digits from ‘0’ to ‘9’) exist in many scripts. The following function converts a decimal digit character to its numerical value.

Function: int uc_decimal_value (ucs4_t uc)

Returns the decimal digit value of a Unicode character. The return value is an integer in the range 0..9, or -1 for characters that do not represent a decimal digit.


8.5 Digit value

Digit characters are like decimal digit characters, possibly in special forms, such as superscript, subscript, or circled. The following function converts a digit character to its numerical value.

Function: int uc_digit_value (ucs4_t uc)

Returns the digit value of a Unicode character. The return value is an integer in the range 0..9, or -1 for characters that do not represent a digit.


8.6 Numeric value

There are also characters that represent numbers without a digit system, like the Roman numerals, and fractional numbers, like 1/4 or 3/4.

The following type represents the numeric value of a Unicode character.

Type: uc_fraction_t

This is a structure type with the following fields:

 
int numerator;
int denominator;

An integer n is represented by numerator = n, denominator = 1.

The following function converts a number character to its numerical value.

Function: uc_fraction_t uc_numeric_value (ucs4_t uc)

Returns the numeric value of a Unicode character. The return value is a fraction, or the pseudo-fraction { 0, 0 } for characters that do not represent a number.


8.7 Mirrored character

Character mirroring is used to associate the closing parenthesis character with the opening parenthesis character, the closing brace character with the opening brace character, and so on.

The following function looks up the mirrored character of a Unicode character.

Function: bool uc_mirror_char (ucs4_t uc, ucs4_t *puc)

Stores the mirrored character of a Unicode character uc in *puc and returns true, if it exists. Otherwise it stores uc unmodified in *puc and returns false.


8.8 Arabic shaping

When Arabic characters are rendered, after bidi reordering has taken place, the shapes of the glyphs are modified so that many adjacent glyphs are joined. Two character properties describe how this “Arabic shaping” takes place: the joining type and the joining group.


8.8.1 Joining type of Arabic characters

The joining type of a character describes on which of the left and right neighbour characters the character's shape depends, and which of the two neighbour characters are rendered depending on this character.

The joining type has the following possible values:

Constant: int UC_JOINING_TYPE_U

“Non joining”: Characters of this joining type prohibit joining.

Constant: int UC_JOINING_TYPE_T

“Transparent”: Characters of this joining type are skipped when considering joining.

Constant: int UC_JOINING_TYPE_C

“Join causing”: Characters of this joining type cause their neighbour characters to change their shapes but don't change their own shape.

Constant: int UC_JOINING_TYPE_L

“Left joining”: Characters of this joining type have two shapes, isolated and initial. Such characters currently don't exist.

Constant: int UC_JOINING_TYPE_R

“Right joining”: Characters of this joining type have two shapes, isolated and final.

Constant: int UC_JOINING_TYPE_D

“Dual joining”: Characters of this joining type have four shapes, initial, medial, final, and isolated.

The following functions implement the association between a joining type and its name.

Function: const char * uc_joining_type_name (int joining_type)

Returns the name of a joining type.

Function: const char * uc_joining_type_long_name (int joining_type)

Returns the long name of a joining type.

Function: int uc_joining_type_byname (const char *joining_type_name)

Returns the joining type given by name, e.g. "D", or by long name, e.g. "Dual Joining". This lookup ignores spaces, underscores, or hyphens as word separators and is case-insensitive.

The following function gives the joining type of every Unicode character.

Function: int uc_joining_type (ucs4_t uc)

Returns the joining type of a Unicode character.


8.8.2 Joining group of Arabic characters

The joining group of a character describes how the character's shape is modified in the four contexts of dual-joining characters or in the two contexts of right-joining characters.

The joining group has the following possible values:

Constant: int UC_JOINING_GROUP_NONE
Constant: int UC_JOINING_GROUP_AIN
Constant: int UC_JOINING_GROUP_ALAPH
Constant: int UC_JOINING_GROUP_ALEF
Constant: int UC_JOINING_GROUP_BEH
Constant: int UC_JOINING_GROUP_BETH
Constant: int UC_JOINING_GROUP_BURUSHASKI_YEH_BARREE
Constant: int UC_JOINING_GROUP_DAL
Constant: int UC_JOINING_GROUP_DALATH_RISH
Constant: int UC_JOINING_GROUP_E
Constant: int UC_JOINING_GROUP_FARSI_YEH
Constant: int UC_JOINING_GROUP_FE
Constant: int UC_JOINING_GROUP_FEH
Constant: int UC_JOINING_GROUP_FINAL_SEMKATH
Constant: int UC_JOINING_GROUP_GAF
Constant: int UC_JOINING_GROUP_GAMAL
Constant: int UC_JOINING_GROUP_HAH
Constant: int UC_JOINING_GROUP_HE
Constant: int UC_JOINING_GROUP_HEH
Constant: int UC_JOINING_GROUP_HEH_GOAL
Constant: int UC_JOINING_GROUP_HETH
Constant: int UC_JOINING_GROUP_KAF
Constant: int UC_JOINING_GROUP_KAPH
Constant: int UC_JOINING_GROUP_KHAPH
Constant: int UC_JOINING_GROUP_KNOTTED_HEH
Constant: int UC_JOINING_GROUP_LAM
Constant: int UC_JOINING_GROUP_LAMADH
Constant: int UC_JOINING_GROUP_MEEM
Constant: int UC_JOINING_GROUP_MIM
Constant: int UC_JOINING_GROUP_NOON
Constant: int UC_JOINING_GROUP_NUN
Constant: int UC_JOINING_GROUP_NYA
Constant: int UC_JOINING_GROUP_PE
Constant: int UC_JOINING_GROUP_QAF
Constant: int UC_JOINING_GROUP_QAPH
Constant: int UC_JOINING_GROUP_REH
Constant: int UC_JOINING_GROUP_REVERSED_PE
Constant: int UC_JOINING_GROUP_SAD
Constant: int UC_JOINING_GROUP_SADHE
Constant: int UC_JOINING_GROUP_SEEN
Constant: int UC_JOINING_GROUP_SEMKATH
Constant: int UC_JOINING_GROUP_SHIN
Constant: int UC_JOINING_GROUP_SWASH_KAF
Constant: int UC_JOINING_GROUP_SYRIAC_WAW
Constant: int UC_JOINING_GROUP_TAH
Constant: int UC_JOINING_GROUP_TAW
Constant: int UC_JOINING_GROUP_TEH_MARBUTA
Constant: int UC_JOINING_GROUP_TEH_MARBUTA_GOAL
Constant: int UC_JOINING_GROUP_TETH
Constant: int UC_JOINING_GROUP_WAW
Constant: int UC_JOINING_GROUP_YEH
Constant: int UC_JOINING_GROUP_YEH_BARREE
Constant: int UC_JOINING_GROUP_YEH_WITH_TAIL
Constant: int UC_JOINING_GROUP_YUDH
Constant: int UC_JOINING_GROUP_YUDH_HE
Constant: int UC_JOINING_GROUP_ZAIN
Constant: int UC_JOINING_GROUP_ZHAIN

The following functions implement the association between a joining group and its name.

Function: const char * uc_joining_group_name (int joining_group)

Returns the name of a joining group.

Function: int uc_joining_group_byname (const char *joining_group_name)

Returns the joining group given by name, e.g. "Teh_Marbuta". This lookup ignores spaces, underscores, or hyphens as word separators and is case-insensitive.

The following function gives the joining group of every Unicode character.

Function: int uc_joining_group (ucs4_t uc)

Returns the joining group of a Unicode character.


8.9 Properties

This section defines boolean properties of Unicode characters. That is, a character either has the given property or does not have it. In other words, the property can be viewed as a subset of the set of Unicode characters.

The GNU libunistring library provides two kinds of API for working with properties. The object oriented API uses a type uc_property_t to designate a property. In the function-based API, which is a bit more low level, a property is merely a function.


8.9.1 Properties as objects – the object oriented API

The following type designates a property on Unicode characters.

Type: uc_property_t

This data type denotes a boolean property on Unicode characters. It is an immediate type that can be copied by simple assignment, without involving memory allocation. It is not an array type.

Many Unicode properties are predefined.

The following are general properties.

Constant: uc_property_t UC_PROPERTY_WHITE_SPACE
Constant: uc_property_t UC_PROPERTY_ALPHABETIC
Constant: uc_property_t UC_PROPERTY_OTHER_ALPHABETIC
Constant: uc_property_t UC_PROPERTY_NOT_A_CHARACTER
Constant: uc_property_t UC_PROPERTY_DEFAULT_IGNORABLE_CODE_POINT
Constant: uc_property_t UC_PROPERTY_OTHER_DEFAULT_IGNORABLE_CODE_POINT
Constant: uc_property_t UC_PROPERTY_DEPRECATED
Constant: uc_property_t UC_PROPERTY_LOGICAL_ORDER_EXCEPTION
Constant: uc_property_t UC_PROPERTY_VARIATION_SELECTOR
Constant: uc_property_t UC_PROPERTY_PRIVATE_USE
Constant: uc_property_t UC_PROPERTY_UNASSIGNED_CODE_VALUE

The following properties are related to case folding.

Constant: uc_property_t UC_PROPERTY_UPPERCASE
Constant: uc_property_t UC_PROPERTY_OTHER_UPPERCASE
Constant: uc_property_t UC_PROPERTY_LOWERCASE
Constant: uc_property_t UC_PROPERTY_OTHER_LOWERCASE
Constant: uc_property_t UC_PROPERTY_TITLECASE
Constant: uc_property_t UC_PROPERTY_CASED
Constant: uc_property_t UC_PROPERTY_CASE_IGNORABLE
Constant: uc_property_t UC_PROPERTY_CHANGES_WHEN_LOWERCASED
Constant: uc_property_t UC_PROPERTY_CHANGES_WHEN_UPPERCASED
Constant: uc_property_t UC_PROPERTY_CHANGES_WHEN_TITLECASED
Constant: uc_property_t UC_PROPERTY_CHANGES_WHEN_CASEFOLDED
Constant: uc_property_t UC_PROPERTY_CHANGES_WHEN_CASEMAPPED
Constant: uc_property_t UC_PROPERTY_SOFT_DOTTED

The following properties are related to identifiers.

Constant: uc_property_t UC_PROPERTY_ID_START
Constant: uc_property_t UC_PROPERTY_OTHER_ID_START
Constant: uc_property_t UC_PROPERTY_ID_CONTINUE
Constant: uc_property_t UC_PROPERTY_OTHER_ID_CONTINUE
Constant: uc_property_t UC_PROPERTY_XID_START
Constant: uc_property_t UC_PROPERTY_XID_CONTINUE
Constant: uc_property_t UC_PROPERTY_PATTERN_WHITE_SPACE
Constant: uc_property_t UC_PROPERTY_PATTERN_SYNTAX

The following properties have an influence on shaping and rendering.

Constant: uc_property_t UC_PROPERTY_JOIN_CONTROL
Constant: uc_property_t UC_PROPERTY_GRAPHEME_BASE
Constant: uc_property_t UC_PROPERTY_GRAPHEME_EXTEND
Constant: uc_property_t UC_PROPERTY_OTHER_GRAPHEME_EXTEND
Constant: uc_property_t UC_PROPERTY_GRAPHEME_LINK

The following properties relate to bidirectional reordering.

Constant: uc_property_t UC_PROPERTY_BIDI_CONTROL
Constant: uc_property_t UC_PROPERTY_BIDI_LEFT_TO_RIGHT
Constant: uc_property_t UC_PROPERTY_BIDI_HEBREW_RIGHT_TO_LEFT
Constant: uc_property_t UC_PROPERTY_BIDI_ARABIC_RIGHT_TO_LEFT
Constant: uc_property_t UC_PROPERTY_BIDI_EUROPEAN_DIGIT
Constant: uc_property_t UC_PROPERTY_BIDI_EUR_NUM_SEPARATOR
Constant: uc_property_t UC_PROPERTY_BIDI_EUR_NUM_TERMINATOR
Constant: uc_property_t UC_PROPERTY_BIDI_ARABIC_DIGIT
Constant: uc_property_t UC_PROPERTY_BIDI_COMMON_SEPARATOR
Constant: uc_property_t UC_PROPERTY_BIDI_BLOCK_SEPARATOR
Constant: uc_property_t UC_PROPERTY_BIDI_SEGMENT_SEPARATOR
Constant: uc_property_t UC_PROPERTY_BIDI_WHITESPACE
Constant: uc_property_t UC_PROPERTY_BIDI_NON_SPACING_MARK
Constant: uc_property_t UC_PROPERTY_BIDI_BOUNDARY_NEUTRAL
Constant: uc_property_t UC_PROPERTY_BIDI_PDF
Constant: uc_property_t UC_PROPERTY_BIDI_EMBEDDING_OR_OVERRIDE
Constant: uc_property_t UC_PROPERTY_BIDI_OTHER_NEUTRAL

The following properties deal with number representations.

Constant: uc_property_t UC_PROPERTY_HEX_DIGIT
Constant: uc_property_t UC_PROPERTY_ASCII_HEX_DIGIT

The following properties deal with CJK.

Constant: uc_property_t UC_PROPERTY_IDEOGRAPHIC
Constant: uc_property_t UC_PROPERTY_UNIFIED_IDEOGRAPH
Constant: uc_property_t UC_PROPERTY_RADICAL
Constant: uc_property_t UC_PROPERTY_IDS_BINARY_OPERATOR
Constant: uc_property_t UC_PROPERTY_IDS_TRINARY_OPERATOR

Other miscellaneous properties are:

Constant: uc_property_t UC_PROPERTY_ZERO_WIDTH
Constant: uc_property_t UC_PROPERTY_SPACE
Constant: uc_property_t UC_PROPERTY_NON_BREAK
Constant: uc_property_t UC_PROPERTY_ISO_CONTROL
Constant: uc_property_t UC_PROPERTY_FORMAT_CONTROL
Constant: uc_property_t UC_PROPERTY_DASH
Constant: uc_property_t UC_PROPERTY_HYPHEN
Constant: uc_property_t UC_PROPERTY_PUNCTUATION
Constant: uc_property_t UC_PROPERTY_LINE_SEPARATOR
Constant: uc_property_t UC_PROPERTY_PARAGRAPH_SEPARATOR
Constant: uc_property_t UC_PROPERTY_QUOTATION_MARK
Constant: uc_property_t UC_PROPERTY_SENTENCE_TERMINAL
Constant: uc_property_t UC_PROPERTY_TERMINAL_PUNCTUATION
Constant: uc_property_t UC_PROPERTY_CURRENCY_SYMBOL
Constant: uc_property_t UC_PROPERTY_MATH
Constant: uc_property_t UC_PROPERTY_OTHER_MATH
Constant: uc_property_t UC_PROPERTY_PAIRED_PUNCTUATION
Constant: uc_property_t UC_PROPERTY_LEFT_OF_PAIR
Constant: uc_property_t UC_PROPERTY_COMBINING
Constant: uc_property_t UC_PROPERTY_COMPOSITE
Constant: uc_property_t UC_PROPERTY_DECIMAL_DIGIT
Constant: uc_property_t UC_PROPERTY_NUMERIC
Constant: uc_property_t UC_PROPERTY_DIACRITIC
Constant: uc_property_t UC_PROPERTY_EXTENDER
Constant: uc_property_t UC_PROPERTY_IGNORABLE_CONTROL

The following function looks up a property by its name.

Function: uc_property_t uc_property_byname (const char *property_name)

Returns the property given by name, e.g. "White space". If a property with the given name exists, the result will satisfy the uc_property_is_valid predicate. Otherwise the result will not satisfy this predicate and must not be passed to functions that expect a uc_property_t argument.

This lookup ignores spaces, underscores, or hyphens as word separators, is case-insensitive, and supports the aliases listed in Unicode's ‘PropertyAliases.txt’ file.

This function references a big table of all predefined properties. Its use can significantly increase the size of your application.

Function: bool uc_property_is_valid (uc_property_t property)

Returns true when the given property is valid, or false otherwise.

The following function views a property as a set of Unicode characters.

Function: bool uc_is_property (ucs4_t uc, uc_property_t property)

Tests whether the Unicode character uc has the given property.


8.9.2 Properties as functions – the functional API

The following are general properties.

Function: bool uc_is_property_white_space (ucs4_t uc)
Function: bool uc_is_property_alphabetic (ucs4_t uc)
Function: bool uc_is_property_other_alphabetic (ucs4_t uc)
Function: bool uc_is_property_not_a_character (ucs4_t uc)
Function: bool uc_is_property_default_ignorable_code_point (ucs4_t uc)
Function: bool uc_is_property_other_default_ignorable_code_point (ucs4_t uc)
Function: bool uc_is_property_deprecated (ucs4_t uc)
Function: bool uc_is_property_logical_order_exception (ucs4_t uc)
Function: bool uc_is_property_variation_selector (ucs4_t uc)
Function: bool uc_is_property_private_use (ucs4_t uc)
Function: bool uc_is_property_unassigned_code_value (ucs4_t uc)

The following properties are related to case folding.

Function: bool uc_is_property_uppercase (ucs4_t uc)
Function: bool uc_is_property_other_uppercase (ucs4_t uc)
Function: bool uc_is_property_lowercase (ucs4_t uc)
Function: bool uc_is_property_other_lowercase (ucs4_t uc)
Function: bool uc_is_property_titlecase (ucs4_t uc)
Function: bool uc_is_property_cased (ucs4_t uc)
Function: bool uc_is_property_case_ignorable (ucs4_t uc)
Function: bool uc_is_property_changes_when_lowercased (ucs4_t uc)
Function: bool uc_is_property_changes_when_uppercased (ucs4_t uc)
Function: bool uc_is_property_changes_when_titlecased (ucs4_t uc)
Function: bool uc_is_property_changes_when_casefolded (ucs4_t uc)
Function: bool uc_is_property_changes_when_casemapped (ucs4_t uc)
Function: bool uc_is_property_soft_dotted (ucs4_t uc)

The following properties are related to identifiers.

Function: bool uc_is_property_id_start (ucs4_t uc)
Function: bool uc_is_property_other_id_start (ucs4_t uc)
Function: bool uc_is_property_id_continue (ucs4_t uc)
Function: bool uc_is_property_other_id_continue (ucs4_t uc)
Function: bool uc_is_property_xid_start (ucs4_t uc)
Function: bool uc_is_property_xid_continue (ucs4_t uc)
Function: bool uc_is_property_pattern_white_space (ucs4_t uc)
Function: bool uc_is_property_pattern_syntax (ucs4_t uc)

The following properties have an influence on shaping and rendering.

Function: bool uc_is_property_join_control (ucs4_t uc)
Function: bool uc_is_property_grapheme_base (ucs4_t uc)
Function: bool uc_is_property_grapheme_extend (ucs4_t uc)
Function: bool uc_is_property_other_grapheme_extend (ucs4_t uc)
Function: bool uc_is_property_grapheme_link (ucs4_t uc)

The following properties relate to bidirectional reordering.

Function: bool uc_is_property_bidi_control (ucs4_t uc)
Function: bool uc_is_property_bidi_left_to_right (ucs4_t uc)
Function: bool uc_is_property_bidi_hebrew_right_to_left (ucs4_t uc)
Function: bool uc_is_property_bidi_arabic_right_to_left (ucs4_t uc)
Function: bool uc_is_property_bidi_european_digit (ucs4_t uc)
Function: bool uc_is_property_bidi_eur_num_separator (ucs4_t uc)
Function: bool uc_is_property_bidi_eur_num_terminator (ucs4_t uc)
Function: bool uc_is_property_bidi_arabic_digit (ucs4_t uc)
Function: bool uc_is_property_bidi_common_separator (ucs4_t uc)
Function: bool uc_is_property_bidi_block_separator (ucs4_t uc)
Function: bool uc_is_property_bidi_segment_separator (ucs4_t uc)
Function: bool uc_is_property_bidi_whitespace (ucs4_t uc)
Function: bool uc_is_property_bidi_non_spacing_mark (ucs4_t uc)
Function: bool uc_is_property_bidi_boundary_neutral (ucs4_t uc)
Function: bool uc_is_property_bidi_pdf (ucs4_t uc)
Function: bool uc_is_property_bidi_embedding_or_override (ucs4_t uc)
Function: bool uc_is_property_bidi_other_neutral (ucs4_t uc)

The following properties deal with number representations.

Function: bool uc_is_property_hex_digit (ucs4_t uc)
Function: bool uc_is_property_ascii_hex_digit (ucs4_t uc)

The following properties deal with CJK.

Function: bool uc_is_property_ideographic (ucs4_t uc)
Function: bool uc_is_property_unified_ideograph (ucs4_t uc)
Function: bool uc_is_property_radical (ucs4_t uc)
Function: bool uc_is_property_ids_binary_operator (ucs4_t uc)
Function: bool uc_is_property_ids_trinary_operator (ucs4_t uc)

Other miscellaneous properties are:

Function: bool uc_is_property_zero_width (ucs4_t uc)
Function: bool uc_is_property_space (ucs4_t uc)
Function: bool uc_is_property_non_break (ucs4_t uc)
Function: bool uc_is_property_iso_control (ucs4_t uc)
Function: bool uc_is_property_format_control (ucs4_t uc)
Function: bool uc_is_property_dash (ucs4_t uc)
Function: bool uc_is_property_hyphen (ucs4_t uc)
Function: bool uc_is_property_punctuation (ucs4_t uc)
Function: bool uc_is_property_line_separator (ucs4_t uc)
Function: bool uc_is_property_paragraph_separator (ucs4_t uc)
Function: bool uc_is_property_quotation_mark (ucs4_t uc)
Function: bool uc_is_property_sentence_terminal (ucs4_t uc)
Function: bool uc_is_property_terminal_punctuation (ucs4_t uc)
Function: bool uc_is_property_currency_symbol (ucs4_t uc)
Function: bool uc_is_property_math (ucs4_t uc)
Function: bool uc_is_property_other_math (ucs4_t uc)
Function: bool uc_is_property_paired_punctuation (ucs4_t uc)
Function: bool uc_is_property_left_of_pair (ucs4_t uc)
Function: bool uc_is_property_combining (ucs4_t uc)
Function: bool uc_is_property_composite (ucs4_t uc)
Function: bool uc_is_property_decimal_digit (ucs4_t uc)
Function: bool uc_is_property_numeric (ucs4_t uc)
Function: bool uc_is_property_diacritic (ucs4_t uc)
Function: bool uc_is_property_extender (ucs4_t uc)
Function: bool uc_is_property_ignorable_control (ucs4_t uc)

8.10 Scripts

The Unicode characters are subdivided into scripts.

The following type is used to represent a script:

Type: uc_script_t

This data type is a structure type that refers to statically allocated read-only data. It contains the following fields:

 
const char *name;

The name field contains the name of the script.

The following functions look up a script.

Function: const uc_script_t * uc_script (ucs4_t uc)

Returns the script of a Unicode character. Returns NULL if uc does not belong to any script.

Function: const uc_script_t * uc_script_byname (const char *script_name)

Returns the script given by its name, e.g. "HAN". Returns NULL if a script with the given name does not exist.

The following function views a script as a set of Unicode characters.

Function: bool uc_is_script (ucs4_t uc, const uc_script_t *script)

Tests whether a Unicode character belongs to a given script.

The following gives a global picture of all scripts.

Function: void uc_all_scripts (const uc_script_t **scripts, size_t *count)

Gets the list of all scripts. Stores a pointer to an array of all scripts in *scripts and the length of this array in *count.


8.11 Blocks

The Unicode characters are subdivided into blocks. A block is an interval of Unicode code points.

The following type is used to represent a block.

Type: uc_block_t

This data type is a structure type that refers to statically allocated data. It contains the following fields:

 
ucs4_t start;
ucs4_t end;
const char *name;

The start field is the first Unicode code point in the block.

The end field is the last Unicode code point in the block.

The name field is the name of the block.

The following function looks up a block.

Function: const uc_block_t * uc_block (ucs4_t uc)

Returns the block a character belongs to.

The following function views a block as a set of Unicode characters.

Function: bool uc_is_block (ucs4_t uc, const uc_block_t *block)

Tests whether a Unicode character belongs to a given block.

The following gives a global picture of all blocks.

Function: void uc_all_blocks (const uc_block_t **blocks, size_t *count)

Gets the list of all blocks. Stores a pointer to an array of all blocks in *blocks and the length of this array in *count.


8.12 ISO C and Java syntax

The following properties are taken from language standards. The supported language standards are ISO C 99 and Java.

Function: bool uc_is_c_whitespace (ucs4_t uc)

Tests whether a Unicode character is considered whitespace in ISO C 99.

Function: bool uc_is_java_whitespace (ucs4_t uc)

Tests whether a Unicode character is considered whitespace in Java.

The following enumerated values are the possible return values of the functions uc_c_ident_category and uc_java_ident_category.

Constant: int UC_IDENTIFIER_START

This return value means that the given character is valid as first or subsequent character in an identifier.

Constant: int UC_IDENTIFIER_VALID

This return value means that the given character is valid as subsequent character only.

Constant: int UC_IDENTIFIER_INVALID

This return value means that the given character is not valid in an identifier.

Constant: int UC_IDENTIFIER_IGNORABLE

This return value (only for Java) means that the given character is ignorable.

The following functions determine whether a given character can be a constituent of an identifier in the given programming language.

Function: int uc_c_ident_category (ucs4_t uc)

Returns the categorization of a Unicode character with respect to the ISO C 99 identifier syntax.

Function: int uc_java_ident_category (ucs4_t uc)

Returns the categorization of a Unicode character with respect to the Java identifier syntax.


8.13 Classifications like in ISO C

The following character classifications mimic those declared in the ISO C header files <ctype.h> and <wctype.h>. These functions are deprecated, because this set of functions was designed with ASCII in mind and cannot reflect the more diverse reality of the Unicode character set. But they can be a quick-and-dirty porting aid when migrating from wchar_t APIs to Unicode strings.

Function: bool uc_is_alnum (ucs4_t uc)

Tests for any character for which uc_is_alpha or uc_is_digit is true.

Function: bool uc_is_alpha (ucs4_t uc)

Tests for any character for which uc_is_upper or uc_is_lower is true, or any character that is one of a locale-specific set of characters for which none of uc_is_cntrl, uc_is_digit, uc_is_punct, or uc_is_space is true.

Function: bool uc_is_cntrl (ucs4_t uc)

Tests for any control character.

Function: bool uc_is_digit (ucs4_t uc)

Tests for any character that corresponds to a decimal-digit character.

Function: bool uc_is_graph (ucs4_t uc)

Tests for any character for which uc_is_print is true and uc_is_space is false.

Function: bool uc_is_lower (ucs4_t uc)

Tests for any character that corresponds to a lowercase letter or is one of a locale-specific set of characters for which none of uc_is_cntrl, uc_is_digit, uc_is_punct, or uc_is_space is true.

Function: bool uc_is_print (ucs4_t uc)

Tests for any printing character.

Function: bool uc_is_punct (ucs4_t uc)

Tests for any printing character that is one of a locale-specific set of characters for which neither uc_is_space nor uc_is_alnum is true.

Function: bool uc_is_space (ucs4_t uc)

Tests for any character that corresponds to a locale-specific set of characters for which none of uc_is_alnum, uc_is_graph, or uc_is_punct is true.

Function: bool uc_is_upper (ucs4_t uc)

Tests for any character that corresponds to an uppercase letter or is one of a locale-specific set of characters for which none of uc_is_cntrl, uc_is_digit, uc_is_punct, or uc_is_space is true.

Function: bool uc_is_xdigit (ucs4_t uc)

Tests for any character that corresponds to a hexadecimal-digit character.

Function: bool uc_is_blank (ucs4_t uc)

Tests for any character that corresponds to a standard blank character or a locale-specific set of characters for which uc_is_alnum is false.



This document was generated by Daiki Ueno on September 1, 2014 using texi2html 1.78a.