<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Unicode character categories: HarfBuzz Manual</title>
<meta name="generator" content="DocBook XSL Stylesheets Vsnapshot">
<link rel="home" href="index.html" title="HarfBuzz Manual">
<link rel="up" href="shaping-concepts.html" title="Shaping concepts">
<link rel="prev" href="shaping-operations.html" title="Shaping operations">
<link rel="next" href="text-runs.html" title="Text runs">
<meta name="generator" content="GTK-Doc V1.32.1 (XML mode)">
<link rel="stylesheet" href="style.css" type="text/css">
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<table class="navigation" id="top" width="100%" summary="Navigation header" cellpadding="2" cellspacing="5"><tr valign="middle">
<td width="100%" align="left" class="shortcuts"></td>
<td><a accesskey="h" href="index.html"><img src="home.png" width="16" height="16" border="0" alt="Home"></a></td>
<td><a accesskey="u" href="shaping-concepts.html"><img src="up.png" width="16" height="16" border="0" alt="Up"></a></td>
<td><a accesskey="p" href="shaping-operations.html"><img src="left.png" width="16" height="16" border="0" alt="Prev"></a></td>
<td><a accesskey="n" href="text-runs.html"><img src="right.png" width="16" height="16" border="0" alt="Next"></a></td>
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="unicode-character-categories"></a>Unicode character categories</h2></div></div></div>
<p>
Shaping models are typically specified with respect to how
scripts are defined in the Unicode standard.
</p>
<p>
Every codepoint in the Unicode Character Database (UCD) is
assigned a <span class="emphasis"><em>Unicode General Category</em></span> (UGC),
which provides the most fundamental information about the
codepoint: whether it represents a
<span class="emphasis"><em>Letter</em></span>, a <span class="emphasis"><em>Mark</em></span>, a
<span class="emphasis"><em>Number</em></span>, <span class="emphasis"><em>Punctuation</em></span>, a
<span class="emphasis"><em>Symbol</em></span>, a <span class="emphasis"><em>Separator</em></span>,
or something else (<span class="emphasis"><em>Other</em></span>).
</p>
<p>
These seven classes are the "Major" categories of the UGC
property. Each codepoint is further assigned to a "minor"
category within its Major category, such as
"Letter, uppercase" (<code class="literal">Lu</code>) or
"Letter, modifier" (<code class="literal">Lm</code>).
</p>
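<p>
As a rough illustration of the Major/minor split, the sketch below
uses Python's standard <code class="literal">unicodedata</code> module, not
HarfBuzz itself. The first letter of the two-letter code names the
Major category and the full code names the minor one. The helper
<code class="literal">describe</code> and the table
<code class="literal">MAJOR_NAMES</code> are names made up for this example.
</p>

```python
import unicodedata

# Names of the seven Major categories, keyed by the first letter of the
# two-letter General Category code.
MAJOR_NAMES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def describe(char):
    """Return (Major category name, two-letter minor code) for one character."""
    minor = unicodedata.category(char)  # for example "Lu"
    return MAJOR_NAMES[minor[0]], minor


# U+02B0 is MODIFIER LETTER SMALL H, a "Letter, modifier" codepoint.
for ch in ["A", "a", "\u02b0", "5", "!", "+", " "]:
    major, minor = describe(ch)
    print("U+%04X  %-12s %s" % (ord(ch), major, minor))
```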
<p>
Shaping models are concerned primarily with Letter and Mark
codepoints. The minor categories of Mark codepoints are
particularly important for shaping. Marks can be nonspacing
(<code class="literal">Mn</code>), spacing combining
(<code class="literal">Mc</code>), or enclosing (<code class="literal">Me</code>).
</p>
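<p>
The three kinds of Mark can be told apart with the same standard
<code class="literal">unicodedata</code> module. This is only a minimal sketch:
the helper <code class="literal">mark_kind</code> is a hypothetical name, and the
sample codepoints were chosen for illustration.
</p>

```python
import unicodedata

# The three minor categories of the Mark Major category.
MARK_KINDS = {
    "Mn": "nonspacing",
    "Mc": "spacing combining",
    "Me": "enclosing",
}


def mark_kind(char):
    """Return the kind of Mark for char, or None if char is not a Mark."""
    return MARK_KINDS.get(unicodedata.category(char))


samples = [
    "\u0301",  # COMBINING ACUTE ACCENT
    "\u093e",  # DEVANAGARI VOWEL SIGN AA
    "\u20dd",  # COMBINING ENCLOSING CIRCLE
    "\u0915",  # DEVANAGARI LETTER KA (a Letter, not a Mark)
]
for ch in samples:
    print("U+%04X  %-28s %s" % (ord(ch), unicodedata.name(ch), mark_kind(ch)))
```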
<p>
In addition to the UGC property, codepoints in the Indic and
Southeast Asian scripts are also assigned
<span class="emphasis"><em>Unicode Indic Syllabic Category</em></span> (UISC) and
<span class="emphasis"><em>Unicode Indic Positional Category</em></span> (UIPC)
properties that provide more detailed information needed for
shaping.
</p>
<p>
The UISC property sub-categorizes Letters and Marks according to
common script-shaping behaviors. For example, UISC distinguishes
between consonant letters, vowel letters, and vowel marks. The
UIPC property sub-categorizes Mark codepoints by the relative visual
position that they occupy (above, below, right, left, or in
combination).
</p>
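<p>
Python's standard library does not expose the UISC and UIPC
properties, so the sketch below hard-codes a small hand-copied
excerpt for a few Devanagari codepoints. The values are intended to
mirror the UCD data files
<code class="literal">IndicSyllabicCategory.txt</code> and
<code class="literal">IndicPositionalCategory.txt</code> and should be checked
against them. The UCD value names <code class="literal">Top</code> and
<code class="literal">Bottom</code> correspond to "above" and "below". The
table <code class="literal">INDIC_PROPS</code> and the helper
<code class="literal">indic_props</code> are hypothetical names.
</p>

```python
# Hand-copied excerpt: codepoint -> (syllabic category, positional category).
# Illustrative only; verify against the UCD data files before relying on it.
INDIC_PROPS = {
    0x0905: ("Vowel_Independent", "NA"),    # DEVANAGARI LETTER A
    0x0915: ("Consonant", "NA"),            # DEVANAGARI LETTER KA
    0x093E: ("Vowel_Dependent", "Right"),   # DEVANAGARI VOWEL SIGN AA
    0x093F: ("Vowel_Dependent", "Left"),    # DEVANAGARI VOWEL SIGN I
    0x0941: ("Vowel_Dependent", "Bottom"),  # DEVANAGARI VOWEL SIGN U
    0x0947: ("Vowel_Dependent", "Top"),     # DEVANAGARI VOWEL SIGN E
    0x094D: ("Virama", "Bottom"),           # DEVANAGARI SIGN VIRAMA
}


def indic_props(codepoint):
    """Look up (UISC, UIPC); codepoints missing from the excerpt get a default."""
    return INDIC_PROPS.get(codepoint, ("Other", "NA"))


# KA followed by VOWEL SIGN I: the vowel sign comes second in the text
# but its positional category says it is drawn to the left of the consonant.
for ch in "\u0915\u093f":
    syllabic, positional = indic_props(ord(ch))
    print("U+%04X  %-18s %s" % (ord(ch), syllabic, positional))
```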
<p>
Some complex scripts require that the text run be split into
syllables. What constitutes a valid syllable in these
scripts is specified by regular expressions over the
Letter and Mark codepoints that take the UISC and UIPC
properties into account.
</p>
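<p>
To make the idea concrete, the sketch below maps each codepoint to a
one-letter class and runs a regular expression over the resulting
class string. It is a toy grammar for Devanagari only, not the
grammar used by HarfBuzz or by any shaping specification: it uses
syllabic classes alone, ignores positional categories, and leaves out
nuktas, modifiers, and joiners. The names
<code class="literal">CLASS_RANGES</code>, <code class="literal">classify</code>,
<code class="literal">SYLLABLE</code>, and
<code class="literal">split_syllables</code> are all made up for this example.
</p>

```python
import re

# One-letter stand-ins for a few syllabic categories, Devanagari only:
#   C = Consonant, V = Vowel_Independent, M = Vowel_Dependent,
#   H = Virama, X = anything else.
CLASS_RANGES = [
    (range(0x0904, 0x0915), "V"),  # independent vowels
    (range(0x0915, 0x093A), "C"),  # consonants KA..HA
    (range(0x093E, 0x094D), "M"),  # dependent vowel signs
    (range(0x094D, 0x094E), "H"),  # virama
]


def classify(char):
    """Return the one-letter class for a single character."""
    cp = ord(char)
    for cp_range, label in CLASS_RANGES:
        if cp in cp_range:
            return label
    return "X"


# Toy syllable grammar: a consonant cluster joined by viramas, then an
# optional vowel sign and an optional final virama; or a lone independent
# vowel; or any single leftover codepoint.
SYLLABLE = re.compile(r"(?:CH)*CM?H?|V|.")


def split_syllables(text):
    """Split text into syllables by matching SYLLABLE on its class string."""
    classes = "".join(classify(ch) for ch in text)
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]


# NA, MA, SA + VIRAMA + TA + VOWEL SIGN E
print(split_syllables("\u0928\u092e\u0938\u094d\u0924\u0947"))
```

<p>
For the sample word, the class string is
<code class="literal">CCCHCM</code>, which this grammar splits into three
syllables: NA, MA, and the cluster SA + VIRAMA + TA + VOWEL SIGN E.
</p>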
<hr>Generated by GTK-Doc V1.32.1</div>