docs/html/unicode-character-categories.html

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Unicode character categories: HarfBuzz Manual</title>
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="index.html" title="HarfBuzz Manual">
<link rel="up" href="shaping-concepts.html" title="Shaping concepts">
<link rel="prev" href="shaping-operations.html" title="Shaping operations">
<link rel="next" href="text-runs.html" title="Text runs">
<meta name="generator" content="GTK-Doc V1.25 (XML mode)">
<link rel="stylesheet" href="style.css" type="text/css">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<table class="navigation" id="top" width="100%" summary="Navigation header" cellpadding="2" cellspacing="5"><tr valign="middle">
<td width="100%" align="left" class="shortcuts"></td>
<td><a accesskey="h" href="index.html"><img src="home.png" width="16" height="16" border="0" alt="Home"></a></td>
<td><a accesskey="u" href="shaping-concepts.html"><img src="up.png" width="16" height="16" border="0" alt="Up"></a></td>
<td><a accesskey="p" href="shaping-operations.html"><img src="left.png" width="16" height="16" border="0" alt="Prev"></a></td>
<td><a accesskey="n" href="text-runs.html"><img src="right.png" width="16" height="16" border="0" alt="Next"></a></td>
</tr></table>
<div class="section">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="unicode-character-categories"></a>Unicode character categories</h2></div></div></div>
<p>
      Shaping models are typically specified with respect to how
      scripts are defined in the Unicode standard.
    </p>
<p>
      Every codepoint in the Unicode Character Database (UCD) is
      assigned a <span class="emphasis"><em>Unicode General Category</em></span> (UGC),
      which provides the most fundamental information about the
      codepoint: whether the codepoint represents a
      <span class="emphasis"><em>Letter</em></span>, a <span class="emphasis"><em>Mark</em></span>, a
      <span class="emphasis"><em>Number</em></span>, <span class="emphasis"><em>Punctuation</em></span>, a
      <span class="emphasis"><em>Symbol</em></span>, a <span class="emphasis"><em>Separator</em></span>,
      or something else (<span class="emphasis"><em>Other</em></span>).
    </p>
<p>
      The seven classes listed above are the "Major" categories. Each
      codepoint is further assigned to a "minor" category within its
      Major category, such as "Letter, uppercase"
      (<code class="literal">Lu</code>) or "Letter, modifier"
      (<code class="literal">Lm</code>). The first letter of the
      two-letter abbreviation identifies the Major category and the
      second identifies the minor category.
    </p>
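<p>
      As a small illustration of the Major/minor split, the sketch
      below looks up the two-letter category of a few codepoints with
      Python's standard <code class="literal">unicodedata</code> module
      and maps the first letter to the Major category name. The
      <code class="literal">MAJOR_NAMES</code> table and the
      <code class="literal">describe</code> helper are names made up for
      this example and are not part of HarfBuzz.
    </p>

```python
import unicodedata

# Hypothetical lookup table for this example: the first letter of the
# two-letter General Category value names the Major category.
MAJOR_NAMES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def describe(ch):
    """Return (two-letter category, Major category name) for one character."""
    ugc = unicodedata.category(ch)
    return ugc, MAJOR_NAMES[ugc[0]]


if __name__ == "__main__":
    # U+02B0 is MODIFIER LETTER SMALL H, U+0301 is COMBINING ACUTE ACCENT.
    for ch in ["A", "a", "\u02B0", "5", "\u0301", " ", "$"]:
        ugc, major = describe(ch)
        print(f"U+{ord(ch):04X} {ugc} {major}")
```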
<p>
      Shaping models are concerned primarily with Letter and Mark
      codepoints. The minor categories of Mark codepoints are
      particularly important for shaping. Marks can be nonspacing
      (<code class="literal">Mn</code>), spacing combining
      (<code class="literal">Mc</code>), or enclosing (<code class="literal">Me</code>).
    </p>
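<p>
      The three kinds of Mark can be told apart with the same
      standard-library lookup. The following is a minimal sketch,
      assuming Python's <code class="literal">unicodedata</code> module;
      <code class="literal">MARK_KINDS</code> and
      <code class="literal">mark_kind</code> are hypothetical names used
      only for illustration.
    </p>

```python
import unicodedata

# Hypothetical table for this example: the three minor categories of Mark.
MARK_KINDS = {
    "Mn": "nonspacing",
    "Mc": "spacing combining",
    "Me": "enclosing",
}


def mark_kind(ch):
    """Return the kind of Mark for ch, or None if ch is not a Mark."""
    return MARK_KINDS.get(unicodedata.category(ch))


if __name__ == "__main__":
    samples = [
        "\u0301",  # COMBINING ACUTE ACCENT
        "\u093E",  # DEVANAGARI VOWEL SIGN AA
        "\u20DD",  # COMBINING ENCLOSING CIRCLE
        "a",       # a Letter, so not a Mark
    ]
    for ch in samples:
        print(f"U+{ord(ch):04X} {mark_kind(ch)}")
```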
<p>
      In addition to the UGC property, codepoints in the Indic and
      Southeast Asian scripts are also assigned
      <span class="emphasis"><em>Unicode Indic Syllabic Category</em></span> (UISC) and
      <span class="emphasis"><em>Unicode Indic Positional Category</em></span> (UIPC)
      properties that provide more detailed information needed for
      shaping.
    </p>
<p>
      The UISC property sub-categorizes Letters and Marks according to
      common script-shaping behaviors. For example, UISC distinguishes
      between consonant letters, vowel letters, and vowel marks. The
      UIPC property sub-categorizes Mark codepoints by the visual
      position that they occupy relative to the base character (above,
      below, right, left, or in multiple positions).
    </p>
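<p>
      Python's standard library does not expose the UISC and UIPC
      properties, so the sketch below uses a tiny hand-copied table to
      show the shape of the data. The values are transcribed by hand
      from the UCD files
      <code class="literal">IndicSyllabicCategory.txt</code> and
      <code class="literal">IndicPositionalCategory.txt</code> and should
      be checked against the current files before being relied on. Note
      that the UCD spells "above" and "below" as
      <code class="literal">Top</code> and
      <code class="literal">Bottom</code>. The names
      <code class="literal">INDIC_PROPS</code> and
      <code class="literal">indic_props</code> are hypothetical.
    </p>

```python
# Hypothetical hand-copied excerpt: codepoint -> (UISC value, UIPC value).
# A real implementation would parse the UCD data files instead.
INDIC_PROPS = {
    0x0905: ("Vowel_Independent", "NA"),            # DEVANAGARI LETTER A
    0x0915: ("Consonant", "NA"),                    # DEVANAGARI LETTER KA
    0x093E: ("Vowel_Dependent", "Right"),           # DEVANAGARI VOWEL SIGN AA
    0x093F: ("Vowel_Dependent", "Left"),            # DEVANAGARI VOWEL SIGN I
    0x0941: ("Vowel_Dependent", "Bottom"),          # DEVANAGARI VOWEL SIGN U
    0x0947: ("Vowel_Dependent", "Top"),             # DEVANAGARI VOWEL SIGN E
    0x09CB: ("Vowel_Dependent", "Left_And_Right"),  # BENGALI VOWEL SIGN O
}


def indic_props(cp):
    """Return (UISC, UIPC) for a codepoint.

    Falls back to ("Other", "NA"), the default values in the UCD files.
    Because this table is only an excerpt, the fallback is NOT correct
    for Indic codepoints that are simply missing from it.
    """
    return INDIC_PROPS.get(cp, ("Other", "NA"))


if __name__ == "__main__":
    for cp in sorted(INDIC_PROPS):
        uisc, uipc = indic_props(cp)
        print(f"U+{cp:04X} {uisc} {uipc}")
```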
<p>
      Some complex scripts require that the text run be split into
      syllables before shaping. What constitutes a valid syllable in
      these scripts is specified by regular expressions over the
      Letter and Mark codepoints; the expressions are written in terms
      of the codepoints' UISC and UIPC properties rather than the
      individual codepoints themselves.
    </p>
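<p>
      To make the idea concrete, the sketch below maps each codepoint
      to a one-letter class derived from its UISC value and then
      matches a regular expression against the resulting class string.
      The pattern is deliberately simplified (consonant clusters joined
      by a virama, followed by an optional vowel mark) and is not the
      syllable grammar of any real shaping model. The
      <code class="literal">UISC</code> excerpt,
      <code class="literal">CLASS_OF</code>,
      <code class="literal">SYLLABLE</code>, and
      <code class="literal">split_syllables</code> are all hypothetical
      names, and the property values are hand-copied for a few
      Devanagari codepoints only.
    </p>

```python
import re

# Hypothetical hand-copied UISC excerpt for a few Devanagari codepoints.
UISC = {
    0x0905: "Vowel_Independent",  # LETTER A
    0x0915: "Consonant",          # LETTER KA
    0x0924: "Consonant",          # LETTER TA
    0x092F: "Consonant",          # LETTER YA
    0x0930: "Consonant",          # LETTER RA
    0x0937: "Consonant",          # LETTER SSA
    0x093F: "Vowel_Dependent",    # VOWEL SIGN I
    0x094D: "Virama",             # SIGN VIRAMA
}

# One class letter per UISC value; anything else becomes "X".
CLASS_OF = {
    "Consonant": "C",
    "Virama": "H",
    "Vowel_Dependent": "M",
    "Vowel_Independent": "V",
}

# Simplified syllable: (consonant + virama)* consonant [vowel mark],
# or a lone independent vowel. The final "." keeps any unmatched
# codepoint as a one-element chunk so that no input is dropped.
SYLLABLE = re.compile(r"(?:CH)*CM?|V|.")


def split_syllables(text):
    """Split text into chunks using the simplified pattern above."""
    classes = "".join(
        CLASS_OF.get(UISC.get(ord(ch), "Other"), "X") for ch in text
    )
    # One class letter per codepoint, so match offsets index text directly.
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]


if __name__ == "__main__":
    # KA VIRAMA SSA, TA VIRAMA RA VOWEL-SIGN-I, YA
    word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"
    for chunk in split_syllables(word):
        print(" ".join(f"U+{ord(ch):04X}" for ch in chunk))
```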
</div>
<div class="footer">
<hr>Generated by GTK-Doc V1.25</div>
</body>
</html>