docs/usermanual-shaping-concepts.xml

   1 <?xml version="1.0"?>
   2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
   3                "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
   4   <!ENTITY % local.common.attrib "xmlns:xi  CDATA  #FIXED 'http://www.w3.org/2003/XInclude'">
   5   <!ENTITY version SYSTEM "version.xml">
   6 ]>
   7 <chapter id="shaping-concepts">
   8   <title>Shaping concepts</title>
   9   <section id="text-shaping-concepts">
  10     <title>Text shaping</title>
  11     <para>
  12       Text shaping is the process of transforming a sequence of Unicode
  13       codepoints that represent individual characters (letters,
  14       diacritics, tone marks, numbers, symbols, etc.) into the
  15       orthographically and linguistically correct two-dimensional layout
  16       of glyph shapes taken from a specified font.
  17     </para>
  18     <para>
  19       For some writing systems (or <emphasis>scripts</emphasis>) and
  20       languages, the process is simple, requiring the shaper to do
  21       little more than advance the horizontal position forward by the
  22       correct amount for each successive glyph.
  23     </para>
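    <para>
      As a rough illustration of this simple case, the Python sketch
      below places glyphs left to right by accumulating advance
      widths. It does not use the HarfBuzz API; the function name
      <literal>layout_simple</literal> and the advance values are
      hypothetical and chosen only for the example.
    </para>
    <programlisting language="python"><![CDATA[
def layout_simple(advances):
    """Return the x position of each glyph and the total run width.

    `advances` is a list of horizontal advance widths in font units,
    one per glyph, in the order the glyphs appear in the run.
    """
    x = 0
    positions = []
    for advance in advances:
        positions.append(x)  # this glyph starts where the previous one ended
        x += advance         # move the pen forward by the glyph's advance
    return positions, x


# Hypothetical advances for a three-glyph run.
positions, width = layout_simple([600, 550, 250])
print(positions, width)
]]></programlisting>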
  24     <para>
  25       For other scripts (often unceremoniously called <emphasis>complex
  26       scripts</emphasis>), any combination of several shaping operations
  27       may be required, and the rules for how and when they are applied vary
  28       by script. HarfBuzz and other shaping engines implement these rules.
  29     </para>
  30     <para>
  31       The exact rules and necessary operations for a particular script
  32       constitute a shaping <emphasis>model</emphasis>. OpenType
  33       specifies a set of shaping models that covers all of
  34       Unicode. Other shaping models are available, however, including
  35       Graphite and Apple Advanced Typography (AAT).
  36     </para>
  37   </section>
  38
  39   <section id="script-specific-shaping">
  40     <title>Script-specific shaping</title>
  41     <para>
  42       In many scripts, transforming the input
  43       sequence into the final layout requires some combination of
  44       operations, such as context-dependent substitutions,
  45       context-dependent mark positioning, glyph-to-glyph joining,
  46       glyph reordering, or glyph stacking.
  47     </para>
  48     <para>
  49       In some scripts, the shaping rules require that a text
  50       run be divided into syllables before the operations can be
  51       applied. Other scripts may apply shaping operations over
  52       entire words or over the entire text run, with no subdivision
  53       required.
  54     </para>
  55     <para>
  56       Other scripts do not require these
  57       operations. However, correctly shaping a text run in
  58       any script may still involve Unicode normalization,
  59       ligature substitutions, mark positioning, kerning, and applying
  60       other font features.
  61     </para>
  62   </section>
  63
  64   <section id="shaping-operations">
  65     <title>Shaping operations</title>
  66     <para>
  67       Shaping a text run involves transforming the
  68       input sequence of Unicode codepoints with some combination of
  69       operations that is specified in the shaping model for the
  70       script.
  71     </para>
  72     <para>
  73       The specific conditions that trigger a given operation for a
  74       text run vary from script to script, as do the order in which
  75       the operations are performed and which codepoints are
  76       affected. However, the same general set of shaping operations is
  77       common to all of the script shaping models.
  78     </para>
  79
  80     <itemizedlist>
  81       <listitem>
  82         <para>
  83           A <emphasis>reordering</emphasis> operation moves a glyph
  84           from its original ("logical") position in the sequence to
  85           some other ("visual") position.
  86         </para>
  87         <para>
  88           The shaping model for a given script might involve
  89           more than one reordering step.
  90         </para>
  91       </listitem>
  92
  93       <listitem>
  94         <para>
  95           A <emphasis>joining</emphasis> operation replaces a glyph
  96           with an alternate form that is designed to connect with one
  97           or more of the adjacent glyphs in the sequence.
  98         </para>
  99       </listitem>
 100
 101       <listitem>
 102         <para>
 103           A contextual <emphasis>substitution</emphasis> operation
 104           replaces either a single glyph or a subsequence of several
 105           glyphs with an alternate glyph. This substitution is
 106           performed when the original glyph or subsequence of glyphs
 107           occurs in a specified position with respect to the
 108           surrounding sequence. For example, one substitution might be
 109           performed only when the target glyph is the first glyph in
 110           the sequence, while another substitution is performed only
 111           when a different target glyph occurs immediately after a
 112           particular string pattern.
 113         </para>
 114         <para>
 115           The shaping model for a given script might involve
 116           multiple contextual-substitution operations, each applying
 117           to different target glyphs and patterns and each
 118           performed in a separate step.
 119         </para>
 120       </listitem>
 121
 122       <listitem>
 123         <para>
 124           A contextual <emphasis>positioning</emphasis> operation
 125           adjusts the horizontal and/or vertical position of a
 126           glyph. This adjustment is performed when the glyph
 127           occurs in a specified position with respect to the
 128           surrounding sequence.
 129         </para>
 130         <para>
 131           Many contextual positioning operations are used to place
 132           <emphasis>mark</emphasis> glyphs (such as diacritics, vowel
 133           signs, and tone markers) with respect to
 134           <emphasis>base</emphasis> glyphs. However, some
 135           scripts may use contextual positioning operations to
 136           correctly place base glyphs as well, such as
 137           when the script uses <emphasis>stacking</emphasis> characters.
 138         </para>
 139       </listitem>
 140
 141     </itemizedlist>
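    <para>
      To make the reordering operation concrete, the Python sketch
      below moves a Devanagari vowel sign I (U+093F), which is typed
      after its consonant but drawn to the left of it, in front of
      the preceding codepoint. This is a deliberately simplified
      assumption: it works on codepoints instead of glyphs and swaps
      with a single neighbor, whereas a real shaping model moves the
      sign relative to the base of the whole syllable. The names
      <literal>PRE_BASE</literal> and
      <literal>reorder_pre_base</literal> are hypothetical.
    </para>
    <programlisting language="python"><![CDATA[
# Codepoints that are stored after a consonant but displayed before it.
# Only DEVANAGARI VOWEL SIGN I is listed; this table is illustrative.
PRE_BASE = {0x093F}


def reorder_pre_base(codepoints):
    """Return a copy with each pre-base sign moved before its predecessor."""
    visual = list(codepoints)
    for i in range(1, len(visual)):
        if visual[i] in PRE_BASE:
            # Swap the sign with the codepoint that precedes it logically.
            visual[i - 1], visual[i] = visual[i], visual[i - 1]
    return visual


logical = [0x0915, 0x093F]  # KA followed by VOWEL SIGN I
print([f"U+{cp:04X}" for cp in reorder_pre_base(logical)])
]]></programlisting>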
 142   </section>
 143
 144   <section id="unicode-character-categories">
 145     <title>Unicode character categories</title>
 146     <para>
 147       Shaping models are typically specified with respect to how
 148       scripts are defined in the Unicode standard.
 149     </para>
 150     <para>
 151       Every codepoint in the Unicode Character Database (UCD) is
 152       assigned a <emphasis>Unicode General Category</emphasis> (UGC),
 153       which provides the most fundamental information about the
 154       codepoint: whether the codepoint represents a
 155       <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
 156       <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
 157       <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
 158       or something else (<emphasis>Other</emphasis>).
 159     </para>
 160     <para>
 161       These UGC properties are "Major" categories. Each codepoint is
 162       further assigned to a "minor" category within its Major
 163       category, such as "Letter, uppercase" (<literal>Lu</literal>) or
 164       "Letter, modifier" (<literal>Lm</literal>).
 165     </para>
 166     <para>
 167       Shaping models are concerned primarily with Letter and Mark
 168       codepoints. The minor categories of Mark codepoints are
 169       particularly important for shaping. Marks can be nonspacing
 170       (<literal>Mn</literal>), spacing combining
 171       (<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
 172     </para>
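    <para>
      The General Category of a codepoint can be inspected with any
      library that exposes the UCD. The sketch below uses the
      <literal>unicodedata</literal> module from the Python standard
      library, not the HarfBuzz API; the helper name
      <literal>describe</literal> is hypothetical. The first letter
      of the two-letter value is the Major category.
    </para>
    <programlisting language="python"><![CDATA[
import unicodedata


def describe(codepoint):
    """Return (major, minor) General Category values, e.g. ("L", "Lu")."""
    minor = unicodedata.category(chr(codepoint))
    return minor[0], minor


samples = [
    0x0041,  # LATIN CAPITAL LETTER A
    0x02B0,  # MODIFIER LETTER SMALL H
    0x0301,  # COMBINING ACUTE ACCENT
    0x093F,  # DEVANAGARI VOWEL SIGN I
    0x20DD,  # COMBINING ENCLOSING CIRCLE
]
for cp in samples:
    major, minor = describe(cp)
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {major} / {minor}")
]]></programlisting>
    <para>
      For these five codepoints the minor categories are
      <literal>Lu</literal>, <literal>Lm</literal>,
      <literal>Mn</literal>, <literal>Mc</literal>, and
      <literal>Me</literal>, respectively.
    </para>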
 173     <para>
 174       In addition to the UGC property, codepoints in the Indic and
 175       Southeast Asian scripts are also assigned
 176       <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
 177       <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
 178       properties that provide more detailed information needed for
 179       shaping.
 180     </para>
 181     <para>
 182       The UISC property sub-categorizes Letters and Marks according to
 183       common script-shaping behaviors. For example, UISC distinguishes
 184       between consonant letters, vowel letters, and vowel marks. The
 185       UIPC property sub-categorizes Mark codepoints by the relative visual
 186       position that they occupy (above, below, right, left, or in
 187       multiple positions).
 188     </para>
 189     <para>
 190       Some scripts require that the text run be split into
 191       syllables. What constitutes a valid syllable in these
 192       scripts is specified in regular expressions, formed from the
 193       Letter and Mark codepoints, that take the UISC and UIPC
 194       properties into account.
 195     </para>
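    <para>
      The idea of describing syllables with a regular expression can
      be sketched as follows. Each codepoint is first mapped to a
      one-letter class, and the pattern is then matched against the
      string of classes. The class table covers only a few
      Devanagari ranges, the pattern is far simpler than any real
      shaping model, and UIPC is ignored entirely; the names
      <literal>classify</literal>, <literal>SYLLABLE</literal>, and
      <literal>split_syllables</literal> are hypothetical.
    </para>
    <programlisting language="python"><![CDATA[
import re


def classify(ch):
    """Map a character to a toy class letter (a small Devanagari subset)."""
    cp = ord(ch)
    if 0x0915 <= cp <= 0x0939:
        return "C"  # consonant letters KA..HA
    if cp == 0x094D:
        return "H"  # virama (halant)
    if 0x093E <= cp <= 0x094C:
        return "M"  # dependent vowel signs
    if 0x0904 <= cp <= 0x0914:
        return "V"  # independent vowel letters
    return "X"      # anything the toy table does not cover


# consonant (virama consonant)* optional-vowel-sign, or an independent
# vowel; the final "." keeps every unmatched character as its own cluster.
SYLLABLE = re.compile(r"C(?:HC)*M?|V|.")


def split_syllables(text):
    """Split text into clusters by matching SYLLABLE on the class string."""
    classes = "".join(classify(ch) for ch in text)
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]


word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"
print(split_syllables(word))
]]></programlisting>
    <para>
      Because the class string has one letter per codepoint, match
      offsets in the class string can be used directly to slice the
      original text.
    </para>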
 196
 197   </section>
 198
 199   <section id="text-runs">
 200     <title>Text runs</title>
 201     <para>
 202       Real-world text usually contains codepoints from a mixture of
 203       different Unicode scripts (including punctuation, numbers, symbols,
 204       white-space characters, and other codepoints that do not belong
 205       to any script). Real-world text may also be marked up with
 206       formatting that changes font properties (including the font,
 207       font style, and font size).
 208     </para>
 209     <para>
 210       For shaping purposes, all real-world text streams must first be
 211       segmented into runs that have a uniform set of properties.
 212     </para>
 213     <para>
 214       In particular, shaping models always assume that every codepoint
 215       in a text run has the same <emphasis>direction</emphasis>,
 216       <emphasis>script</emphasis> tag, and
 217       <emphasis>language</emphasis> tag.
 218     </para>
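    <para>
      A minimal sketch of script-based segmentation is shown below.
      It assumes a tiny, hypothetical lookup table
      (<literal>TOY_SCRIPT_RANGES</literal>) in which block ranges
      stand in for the Unicode Script property, and it resolves
      characters without a script of their own by attaching them to
      the preceding run. Direction and language, which a real
      segmenter must also keep uniform, are left out.
    </para>
    <programlisting language="python"><![CDATA[
from itertools import groupby

# Hypothetical, deliberately tiny lookup table. Real segmentation uses the
# Unicode Script property; these block ranges are only a crude stand-in.
TOY_SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"),  # A-Z
    (0x0061, 0x007A, "Latn"),  # a-z
    (0x0600, 0x06FF, "Arab"),  # Arabic block
    (0x0900, 0x097F, "Deva"),  # Devanagari block
]


def toy_script(ch):
    """Return a script tag for ch, or None if the toy table has no entry."""
    cp = ord(ch)
    for low, high, tag in TOY_SCRIPT_RANGES:
        if low <= cp <= high:
            return tag
    return None


def split_runs(text):
    """Split text into (script, substring) runs with one script per run."""
    resolved = []
    current = None
    for ch in text:
        # Characters without a script of their own (spaces, punctuation)
        # join the run of the preceding character.
        current = toy_script(ch) or current
        resolved.append(current)
    # Leading script-less characters join the first run that has a script.
    first = next((tag for tag in resolved if tag is not None), None)
    resolved = [tag if tag is not None else first for tag in resolved]

    runs, start = [], 0
    for tag, group in groupby(resolved):
        length = len(list(group))
        runs.append((tag, text[start:start + length]))
        start += length
    return runs


print(split_runs("abc \u0645\u0631\u062d\u0628\u0627!"))
]]></programlisting>
    <para>
      For the sample string, the sketch yields one Latin run
      (including the trailing space) followed by one Arabic run
      (including the exclamation mark).
    </para>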
 219   </section>
 220
 221   <section id="opentype-shaping-models">
 222     <title>OpenType shaping models</title>
 223     <para>
 224       OpenType provides shaping models for the following scripts:
 225     </para>
 226
 227     <itemizedlist>
 228       <listitem>
 229         <para>
 230           The <emphasis>default</emphasis> shaping model handles all
 231           scripts with no script-specific shaping model, and may also
 232           be used as a fallback for handling unrecognized scripts.
 233         </para>
 234       </listitem>
 235
 236       <listitem>
 237         <para>
 238           The <emphasis>Indic</emphasis> shaping model handles the Indic
 239           scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
 240           Malayalam, Oriya, Tamil, and Telugu.
 241         </para>
 242         <para>
 243           The Indic shaping model was revised significantly in
 244           2005. To denote the change, a new set of <emphasis>script
 245           tags</emphasis> was assigned for Bengali, Devanagari,
 246           Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
 247           Telugu. For the sake of clarity, the term "Indic2" is
 248           sometimes used to refer to the current, revised shaping
 249           model.
 250         </para>
 251       </listitem>
 252
 253       <listitem>
 254         <para>
 255           The <emphasis>Arabic</emphasis> shaping model supports
 256           Arabic, Mongolian, N'Ko, Syriac, and several other connected
 257           or cursive scripts.
 258         </para>
 259       </listitem>
 260
 261       <listitem>
 262         <para>
 263           The <emphasis>Thai/Lao</emphasis> shaping model supports
 264           the Thai and Lao scripts.
 265         </para>
 266       </listitem>
 267
 268       <listitem>
 269         <para>
 270           The <emphasis>Khmer</emphasis> shaping model supports the
 271           Khmer script.
 272         </para>
 273       </listitem>
 274
 275       <listitem>
 276         <para>
 277           The <emphasis>Myanmar</emphasis> shaping model supports the
 278           Myanmar (or Burmese) script.
 279         </para>
 280       </listitem>
 281
 282       <listitem>
 283         <para>
 284           The <emphasis>Tibetan</emphasis> shaping model supports the
 285           Tibetan script.
 286         </para>
 287       </listitem>
 288
 289       <listitem>
 290         <para>
 291           The <emphasis>Hangul</emphasis> shaping model supports the
 292           Hangul script.
 293         </para>
 294       </listitem>
 295
 296       <listitem>
 297         <para>
 298           The <emphasis>Hebrew</emphasis> shaping model supports the
 299           Hebrew script.
 300         </para>
 301       </listitem>
 302
 303       <listitem>
 304         <para>
 305           The <emphasis>Universal Shaping Engine</emphasis> (USE)
 306           shaping model supports scripts not covered by one of
 307           the script-specific shaping models above, including
 308           Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
 309           Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
 310           Viet, and many others.
 311         </para>
 312       </listitem>
 313
 314       <listitem>
 315         <para>
 316           Text runs that do not fall under one of the above shaping
 317           models may still require processing by a shaping engine. Of
 318           particular note is <emphasis>Emoji</emphasis> shaping, which
 319           may involve variation-selector sequences and glyph
 320           substitution. Emoji shaping is handled by the default
 321           shaping model.
 322         </para>
 323       </listitem>
 324
 325     </itemizedlist>
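    <para>
      For reference, the original and revised ("Indic2") script tags
      for the nine Indic scripts can be written as a small lookup
      table. The tag values come from the OpenType script tag
      registry; the helper <literal>candidate_tags</literal> and its
      policy of trying the revised tag before the original one are
      assumptions made for this sketch, not a description of any
      particular shaping engine.
    </para>
    <programlisting language="python"><![CDATA[
# Original OpenType script tag -> revised ("Indic2") script tag.
INDIC2_TAGS = {
    "beng": "bng2",  # Bengali
    "deva": "dev2",  # Devanagari
    "gujr": "gjr2",  # Gujarati
    "guru": "gur2",  # Gurmukhi
    "knda": "knd2",  # Kannada
    "mlym": "mlm2",  # Malayalam
    "orya": "ory2",  # Oriya
    "taml": "tml2",  # Tamil
    "telu": "tel2",  # Telugu
}


def candidate_tags(original_tag):
    """Return the script tags to look for in a font, revised tag first."""
    revised = INDIC2_TAGS.get(original_tag)
    if revised is None:
        return [original_tag]  # not one of the nine revised Indic scripts
    return [revised, original_tag]


print(candidate_tags("deva"))
print(candidate_tags("arab"))
]]></programlisting>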
 326
 327   </section>
 328
 329   <section id="graphite-shaping">
 330     <title>Graphite shaping</title>
 331     <para>
 332       In contrast to OpenType shaping, Graphite shaping does not
 333       specify a predefined set of shaping models or a set of supported
 334       scripts.
 335     </para>
 336     <para>
 337       Instead, each Graphite font contains a complete set of rules that
 338       implement the required shaping model for the intended
 339       script. These rules include finite-state machines to match
 340       sequences of codepoints to the shaping operations to perform.
 341     </para>
 342     <para>
 343       Graphite shaping can perform the same shaping operations used in
 344       OpenType shaping, as well as other functions that have not been
 345       defined for OpenType shaping.
 346     </para>
 347   </section>
 348
 349   <section id="aat-shaping">
 350     <title>AAT shaping</title>
 351     <para>
 352       In contrast to OpenType shaping, AAT shaping does not specify a
 353       predefined set of shaping models or a set of supported scripts.
 354     </para>
 355     <para>
 356       Instead, each AAT font includes a complete set of rules that
 357       implement the desired shaping model for the intended
 358       script. These rules include finite-state machines that match
 359       glyph sequences and specify the shaping operations to perform.
 360     </para>
 361     <para>
 362       Notably, AAT shaping rules are expressed for glyphs in the font,
 363       not for Unicode codepoints. AAT shaping can perform the same
 364       shaping operations used in OpenType shaping, as well as other
 365       functions that have not been defined for OpenType shaping.
 366     </para>
 367   </section>
 368 </chapter>