docs/usermanual-shaping-concepts.xml

<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi  CDATA  #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="shaping-concepts">
  <title>Shaping concepts</title>
  <section id="text-shaping-concepts">
    <title>Text shaping</title>
    <para>
      Text shaping is the process of transforming a sequence of Unicode
      codepoints that represent individual characters (letters,
      diacritics, tone marks, numbers, symbols, etc.) into the
      orthographically and linguistically correct two-dimensional layout
      of glyph shapes taken from a specified font.
    </para>
    <para>
      For some writing systems (or <emphasis>scripts</emphasis>) and
      languages, the process is simple, requiring the shaper to do
      little more than advance the horizontal position forward by the
      correct amount for each successive glyph.
    </para>
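    <para>
      The simple case described above can be sketched in a few lines
      of Python. This is only an illustration: the advance widths are
      made-up numbers in font units, and the helper name
      <literal>layout_simple</literal> is hypothetical rather than
      part of any HarfBuzz API.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Hypothetical advance widths in font units; real values come from the font.
ADVANCES = {"H": 722, "i": 278, "!": 333}


def layout_simple(glyphs, advances=ADVANCES):
    """Place each glyph at the running sum of the preceding advances."""
    positions = []
    x = 0
    for glyph in glyphs:
        positions.append((glyph, x))
        x += advances[glyph]
    return positions, x


positions, total_width = layout_simple("Hi!")
```
]]></programlisting>
    <para>
      Each glyph lands at the x offset where the previous glyph ended,
      and the final value of <literal>x</literal> is the width of the
      whole run.
    </para>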
    <para>
      For <emphasis>complex scripts</emphasis>, by contrast, any
      combination of several shaping operations may be required, and
      the rules for how and when they are applied vary from script to
      script. HarfBuzz and other shaping engines implement these rules.
    </para>
    <para>
      The exact rules and necessary operations for a particular script
      constitute a shaping <emphasis>model</emphasis>. OpenType
      specifies a set of shaping models that covers all of
      Unicode. Other shaping models are available, however, including
      Graphite and Apple Advanced Typography (AAT).
    </para>
  </section>

  <section id="complex-scripts">
    <title>Complex scripts</title>
    <para>
      In text-shaping terminology, scripts are generally classified as
      either <emphasis>complex</emphasis> or <emphasis>non-complex</emphasis>.
    </para>
    <para>
      Complex scripts are those for which transforming the input
      sequence into the final layout requires some combination of
      operations, such as context-dependent substitutions,
      context-dependent mark positioning, glyph-to-glyph joining,
      glyph reordering, or glyph stacking.
    </para>
    <para>
      In some complex scripts, the shaping rules require that a text
      run be divided into syllables before the operations can be
      applied. Other complex scripts may apply shaping operations over
      entire words or over the entire text run, with no subdivision
      required.
    </para>
    <para>
      Non-complex scripts, by definition, do not require these
      operations. However, correctly shaping a text run in a
      non-complex script may still involve Unicode normalization,
      ligature substitutions, mark positioning, kerning, and applying
      other font features. The key difference is that a text run in a
      non-complex script can be processed sequentially and in the same
      order as the input sequence of Unicode codepoints, without
      requiring an analysis stage.
    </para>
  </section>

  <section id="shaping-operations">
    <title>Shaping operations</title>
    <para>
      Shaping a complex-script text run involves transforming the
      input sequence of Unicode codepoints with some combination of
      operations that is specified in the shaping model for the
      script.
    </para>
    <para>
      The specific conditions that trigger a given operation for a
      text run vary from script to script, as do the order in which
      the operations are performed and the codepoints that are
      affected. However, the same general set of shaping operations is
      common to all of the complex-script shaping models.
    </para>

    <itemizedlist>
      <listitem>
        <para>
          A <emphasis>reordering</emphasis> operation moves a glyph
          from its original ("logical") position in the sequence to
          some other ("visual") position.
        </para>
        <para>
          The shaping model for a given complex script might involve
          more than one reordering step.
        </para>
      </listitem>

      <listitem>
        <para>
          A <emphasis>joining</emphasis> operation replaces a glyph
          with an alternate form that is designed to connect with one
          or more of the adjacent glyphs in the sequence.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>substitution</emphasis> operation
          replaces either a single glyph or a subsequence of several
          glyphs with an alternate glyph. This substitution is
          performed when the original glyph or subsequence of glyphs
          occurs in a specified position with respect to the
          surrounding sequence. For example, one substitution might be
          performed only when the target glyph is the first glyph in
          the sequence, while another substitution is performed only
          when a different target glyph occurs immediately after a
          particular string pattern.
        </para>
        <para>
          The shaping model for a given complex script might involve
          multiple contextual-substitution operations, each applying
          to different target glyphs and patterns and each performed
          in a separate step.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>positioning</emphasis> operation
          adjusts the horizontal and/or vertical position of a
          glyph. This adjustment is performed when the glyph
          occurs in a specified position with respect to the
          surrounding sequence.
        </para>
        <para>
          Many contextual positioning operations are used to place
          <emphasis>mark</emphasis> glyphs (such as diacritics, vowel
          signs, and tone markers) with respect to
          <emphasis>base</emphasis> glyphs. However, some complex
          scripts may use contextual positioning operations to
          correctly place base glyphs as well, such as
          when the script uses <emphasis>stacking</emphasis> characters.
        </para>
      </listitem>

    </itemizedlist>
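    <para>
      To make two of these operations concrete, the following Python
      sketch shows a toy reordering step and a toy joining step. It
      is a simplification under stated assumptions: codepoints stand
      in for glyphs, the lookup tables are written by hand rather
      than read from a font or from Unicode data files, and the
      function names are hypothetical. Real shaping models apply many
      more rules than this.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Assumption: codepoints stand in for glyphs, and both tables are hand-written.
PRE_BASE_VOWEL_SIGNS = {"\u093F"}  # DEVANAGARI VOWEL SIGN I, drawn on the left


def reorder_syllable(syllable):
    """Move pre-base vowel signs from their logical to their visual position."""
    marks = [ch for ch in syllable if ch in PRE_BASE_VOWEL_SIGNS]
    rest = [ch for ch in syllable if ch not in PRE_BASE_VOWEL_SIGNS]
    return marks + rest


# Joining types for three Arabic letters: "D" joins on both sides,
# "R" joins only to the preceding letter.
JOINING_TYPE = {"\u0628": "D", "\u0627": "R", "\u062A": "D"}  # beh, alef, teh


def joining_forms(text):
    """Choose a form for each letter from the joining types of its neighbours."""
    forms = []
    for i, ch in enumerate(text):
        own = JOINING_TYPE.get(ch)
        prev_type = JOINING_TYPE.get(text[i - 1]) if i > 0 else None
        next_type = JOINING_TYPE.get(text[i + 1]) if i + 1 < len(text) else None
        joins_prev = prev_type == "D" and own in ("D", "R")
        joins_next = own == "D" and next_type in ("D", "R")
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# Logical order: TA, VIRAMA, RA, VOWEL SIGN I
visual = reorder_syllable(["\u0924", "\u094D", "\u0930", "\u093F"])
# beh, alef, beh
forms = joining_forms("\u0628\u0627\u0628")
```
]]></programlisting>
    <para>
      In the first call the vowel sign moves to the front of the
      syllable. In the second, the alef does not join to the letter
      that follows it, so the last beh is left in its isolated form.
    </para>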
  </section>

  <section id="unicode-character-categories">
    <title>Unicode character categories</title>
    <para>
      Shaping models are typically specified with respect to how
      scripts are defined in the Unicode standard.
    </para>
    <para>
      Every codepoint in the Unicode Character Database (UCD) is
      assigned a <emphasis>Unicode General Category</emphasis> (UGC),
      which provides the most fundamental information about the
      codepoint: whether the codepoint represents a
      <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
      <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
      <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
      or something else (<emphasis>Other</emphasis>).
    </para>
    <para>
      These UGC properties are "Major" categories. Each codepoint is
      further assigned to a "minor" category within its Major
      category, such as "Letter, uppercase" (<literal>Lu</literal>) or
      "Letter, modifier" (<literal>Lm</literal>).
    </para>
    <para>
      Shaping models are concerned primarily with Letter and Mark
      codepoints. The minor categories of Mark codepoints are
      particularly important for shaping. Marks can be nonspacing
      (<literal>Mn</literal>), spacing combining
      (<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
    </para>
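    <para>
      As a quick illustration, the General Category of a codepoint
      can be inspected with the <literal>unicodedata</literal> module
      from the Python standard library. The sample codepoints below
      are chosen only for demonstration. The first letter of each
      two-letter value is the Major category.
    </para>
    <programlisting language="python"><![CDATA[
```python
import unicodedata

# One sample per category discussed above:
# LATIN CAPITAL LETTER A, MODIFIER LETTER SMALL H, COMBINING ACUTE ACCENT,
# DEVANAGARI VOWEL SIGN AA, COMBINING ENCLOSING CIRCLE
samples = ["A", "\u02B0", "\u0301", "\u093E", "\u20DD"]

# Two-letter General Category value for each sample, keyed by codepoint.
categories = {f"U+{ord(ch):04X}": unicodedata.category(ch) for ch in samples}

# The Major category is the first letter of the two-letter value.
major = {cp: cat[0] for cp, cat in categories.items()}
```
]]></programlisting>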
    <para>
      In addition to the UGC property, codepoints in the Indic and
      Southeast Asian scripts are also assigned
      <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
      <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
      properties that provide more detailed information needed for
      shaping.
    </para>
    <para>
      The UISC property sub-categorizes Letters and Marks according to
      common script-shaping behaviors. For example, UISC distinguishes
      between consonant letters, vowel letters, and vowel marks. The
      UIPC property sub-categorizes Mark codepoints by the relative visual
      position that they occupy (above, below, right, left, or in
      multiple positions).
    </para>
    <para>
      Some complex scripts require that the text run be split into
      syllables. What constitutes a valid syllable in these
      scripts is specified in regular expressions, formed from the
      Letter and Mark codepoints, that take the UISC and UIPC
      properties into account.
    </para>
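    <para>
      The idea of a syllable regular expression can be sketched as
      follows. This is a deliberately reduced example, not the
      grammar of any real shaping model: the
      <literal>classify</literal> function is a hypothetical stand-in
      for UISC lookups that covers only a few Devanagari ranges, and
      the pattern recognizes only consonant clusters followed by
      optional vowel signs.
    </para>
    <programlisting language="python"><![CDATA[
```python
import re


def classify(ch):
    """Rough stand-in for a UISC lookup; covers a few Devanagari ranges only."""
    cp = ord(ch)
    if 0x0915 <= cp <= 0x0939:
        return "C"  # consonant letter
    if cp == 0x094D:
        return "H"  # virama (halant)
    if 0x093E <= cp <= 0x094C:
        return "M"  # dependent vowel sign
    return "X"      # anything else


# (consonant + virama)* consonant vowel-sign*; otherwise a single character.
SYLLABLE = re.compile(r"(?:CH)*CM*|.")


def split_syllables(text):
    """Match the pattern against category letters, then slice the text."""
    classes = "".join(classify(ch) for ch in text)
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]


# KA VIRAMA SSA / TA VIRAMA RA VOWEL-SIGN-I / YA
syllables = split_syllables(
    "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"
)
```
]]></programlisting>
    <para>
      Matching against a string of category letters rather than the
      raw text keeps the pattern short and lets the same expression
      be reused for any codepoints that share those categories.
    </para>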

  </section>

  <section id="text-runs">
    <title>Text runs</title>
    <para>
      Real-world text usually contains codepoints from a mixture of
      different Unicode scripts (including punctuation, numbers, symbols,
      white-space characters, and other codepoints that do not belong
      to any script). Real-world text may also be marked up with
      formatting that changes font properties (including the font,
      font style, and font size).
    </para>
    <para>
      For shaping purposes, all real-world text streams must first be
      segmented into runs that have a uniform set of properties.
    </para>
    <para>
      In particular, shaping models always assume that every codepoint
      in a text run has the same <emphasis>direction</emphasis>,
      <emphasis>script</emphasis> tag, and
      <emphasis>language</emphasis> tag.
    </para>
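    <para>
      A minimal sketch of segmentation by script might look like the
      following. It rests on several assumptions: the script ranges
      are a hand-picked, heavily abbreviated approximation of the
      Unicode Script property, the function names are hypothetical,
      and characters without a script of their own simply join the
      preceding run. Direction and language would have to be resolved
      separately (for example from the bidirectional algorithm and
      from markup), which this sketch does not attempt.
    </para>
    <programlisting language="python"><![CDATA[
```python
from itertools import groupby

# Assumption: abbreviated block ranges standing in for the Script property.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"),
    (0x0061, 0x007A, "Latn"),
    (0x0600, 0x06FF, "Arab"),
    (0x0900, 0x097F, "Deva"),
]


def script_of(ch):
    """Return a script code, or None for spaces, digits, punctuation, etc."""
    cp = ord(ch)
    for lo, hi, script in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return script
    return None


def split_runs(text):
    """Split text into (script, substring) runs of uniform script."""
    resolved = []
    current = None
    for ch in text:
        current = script_of(ch) or current  # neutrals inherit the last script
        resolved.append(current)
    # Leading neutrals take the first script seen, or "Zyyy" (Common) if none.
    first = next((s for s in resolved if s), "Zyyy")
    resolved = [s or first for s in resolved]

    runs = []
    pos = 0
    for script, group in groupby(resolved):
        length = len(list(group))
        runs.append((script, text[pos:pos + length]))
        pos += length
    return runs


runs = split_runs("abc \u0915\u093E \u0628\u0627")
```
]]></programlisting>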
  </section>

  <section id="opentype-shaping-models">
    <title>OpenType shaping models</title>
    <para>
      OpenType provides shaping models for the following scripts:
    </para>

    <itemizedlist>
      <listitem>
        <para>
          The <emphasis>default</emphasis> shaping model handles all
          non-complex scripts, and may also be used as a fallback for
          handling unrecognized scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Indic</emphasis> shaping model handles the Indic
          scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
          Malayalam, Oriya, Tamil, Telugu, and Sinhala.
        </para>
        <para>
          The Indic shaping model was revised significantly in
          2005. To denote the change, a new set of <emphasis>script
          tags</emphasis> was assigned for Bengali, Devanagari,
          Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
          Telugu. For the sake of clarity, the term "Indic2" is
          sometimes used to refer to the current, revised shaping
          model.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Arabic</emphasis> shaping model supports
          Arabic, Mongolian, N'Ko, Syriac, and several other connected
          or cursive scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Thai/Lao</emphasis> shaping model supports
          the Thai and Lao scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Khmer</emphasis> shaping model supports the
          Khmer script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Myanmar</emphasis> shaping model supports the
          Myanmar (or Burmese) script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Tibetan</emphasis> shaping model supports the
          Tibetan script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hangul</emphasis> shaping model supports the
          Hangul script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hebrew</emphasis> shaping model supports the
          Hebrew script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Universal Shaping Engine</emphasis> (USE)
          shaping model supports complex scripts not covered by one of
          the script-specific shaping models above, including
          Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
          Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
          Viet, and many others.
        </para>
      </listitem>

      <listitem>
        <para>
          Text runs that do not fall under one of the above shaping
          models may still require processing by a shaping engine. Of
          particular note is <emphasis>Emoji</emphasis> shaping, which
          may involve variation-selector sequences and glyph
          substitution. Emoji shaping is handled by the default
          shaping model.
        </para>
      </listitem>

    </itemizedlist>
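    <para>
      One way to picture the list above is as a lookup from OpenType
      script tag to shaping model, with the default model as the
      fallback. The table below is an illustrative sketch only: it
      lists a handful of tags per model, the model names are informal
      labels, and the function <literal>choose_model</literal> is
      hypothetical. It does not describe how any particular shaping
      engine makes this decision.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Assumption: a small, non-exhaustive sample of OpenType script tags.
MODEL_BY_SCRIPT_TAG = {
    "dev2": "indic", "bng2": "indic", "tml2": "indic", "sinh": "indic",
    "arab": "arabic", "syrc": "arabic", "mong": "arabic", "nko ": "arabic",
    "thai": "thai/lao", "lao ": "thai/lao",
    "khmr": "khmer",
    "mym2": "myanmar",
    "tibt": "tibetan",
    "hang": "hangul",
    "hebr": "hebrew",
    "java": "use", "bali": "use",
}


def choose_model(script_tag):
    """Return the model label for a tag, falling back to the default model."""
    return MODEL_BY_SCRIPT_TAG.get(script_tag, "default")


selected = [choose_model(tag) for tag in ("dev2", "arab", "java", "latn")]
```
]]></programlisting>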

  </section>

  <section id="graphite-shaping">
    <title>Graphite shaping</title>
    <para>
      In contrast to OpenType shaping, Graphite shaping does not
      specify a predefined set of shaping models or a set of supported
      scripts.
    </para>
    <para>
      Instead, each Graphite font contains a complete set of rules that
      implement the required shaping model for the intended
      script. These rules include finite-state machines to match
      sequences of codepoints to the shaping operations to perform.
    </para>
    <para>
      Graphite shaping can perform the same shaping operations used in
      OpenType shaping, as well as other functions that have not been
      defined for OpenType shaping.
    </para>
  </section>

  <section id="aat-shaping">
    <title>AAT shaping</title>
    <para>
      In contrast to OpenType shaping, AAT shaping does not specify a
      predefined set of shaping models or a set of supported scripts.
    </para>
    <para>
      Instead, each AAT font includes a complete set of rules that
      implement the desired shaping model for the intended
      script. These rules include finite-state machines that match
      glyph sequences to the shaping operations to perform.
    </para>
    <para>
      Notably, AAT shaping rules are expressed for glyphs in the font,
      not for Unicode codepoints. AAT shaping can perform the same
      shaping operations used in OpenType shaping, as well as other
      functions that have not been defined for OpenType shaping.
    </para>
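    <para>
      The notion of a finite-state machine that walks a glyph
      sequence and triggers an operation can be illustrated with a
      toy ligature rule. The state names, glyph names, and table
      layout below are invented for this sketch and do not follow the
      actual AAT table formats.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Assumption: invented states and glyph names; not a real font table format.
TRANSITIONS = {
    ("start", "f"): "saw_f",
    ("saw_f", "f"): "saw_f",
}
# (state, incoming glyph) -> ligature glyph that replaces the pending "f"
LIGATURES = {("saw_f", "i"): "f_i"}


def apply_ligatures(glyphs):
    """Walk the glyph sequence and substitute a ligature on a matching state."""
    out = []
    state = "start"
    for glyph in glyphs:
        if (state, glyph) in LIGATURES:
            out[-1] = LIGATURES[(state, glyph)]  # replace the preceding "f"
            state = "start"
        else:
            out.append(glyph)
            state = TRANSITIONS.get((state, glyph), "start")
    return out


result = apply_ligatures(["o", "f", "f", "i", "c", "e"])
```
]]></programlisting>
    <para>
      Note that the machine operates on glyph names rather than
      codepoints, which mirrors the point made above about AAT rules
      being expressed for glyphs in the font.
    </para>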
  </section>
</chapter>