From 3a27e8fb97f716c17b03e3a4a634a4900bcb6045 Mon Sep 17 00:00:00 2001
From: Nathan Willis
Date: Fri, 12 Oct 2018 18:23:26 -0500
Subject: [PATCH] Docs: usermanual, add Shaping Concepts chapter.

---
 docs/harfbuzz-docs.xml               |   5 +-
 docs/usermanual-shaping-concepts.xml | 368 +++++++++++++++++++++++++++++++++++
 2 files changed, 371 insertions(+), 2 deletions(-)
 create mode 100644 docs/usermanual-shaping-concepts.xml

diff --git a/docs/harfbuzz-docs.xml b/docs/harfbuzz-docs.xml
index 48ea67e..b7d5cec 100644
--- a/docs/harfbuzz-docs.xml
+++ b/docs/harfbuzz-docs.xml
@@ -13,8 +13,8 @@
 HarfBuzz is an OpenType text shaping engine. Using the HarfBuzz library allows
- programs to convert a sequence of Unicode input text into
- properly formatted and positioned output—for any writing
+ programs to convert a sequence of Unicode input into
+ properly formatted and positioned text output—for any writing
 system and language.
@@ -34,6 +34,7 @@
+
diff --git a/docs/usermanual-shaping-concepts.xml b/docs/usermanual-shaping-concepts.xml
new file mode 100644
index 0000000..8c49ab1
--- /dev/null
+++ b/docs/usermanual-shaping-concepts.xml
@@ -0,0 +1,368 @@
+
+ Shaping concepts
+
+ Text shaping + + Text shaping is the process of transforming a sequence of Unicode + codepoints that represent individual characters (letters, + diacritics, tone marks, numbers, symbols, etc.) into the + orthographically and linguistically correct two-dimensional layout + of glyph shapes taken from a specified font. + + + For some writing systems (or scripts) and + languages, the process is simple, requiring the shaper to do + little more than advance the horizontal position forward by the + correct amount for each successive glyph. + + + But, for complex scripts, any combination of + several shaping operations may be required, and the rules for how + and when they are applied vary from script to script. HarfBuzz and + other shaping engines implement these rules. + + + The exact rules and necessary operations for a particular script + constitute a shaping model. OpenType + specifies a set of shaping models that covers all of + Unicode. Other shaping models are available, however, including + Graphite and Apple Advanced Typography (AAT). + +
+ +
+ Complex scripts + + In text-shaping terminology, scripts are generally classified as + either complex or non-complex. + + + Complex scripts are those for which transforming the input + sequence into the final layout requires some combination of + operations—such as context-dependent substitutions, + context-dependent mark positioning, glyph-to-glyph joining, + glyph reordering, or glyph stacking. + + + In some complex scripts, the shaping rules require that a text + run be divided into syllables before the operations can be + applied. Other complex scripts may apply shaping operations over + entire words or over the entire text run, with no subdivision + required. + + + Non-complex scripts, by definition, do not require these + operations. However, correctly shaping a text run in a + non-complex script may still involve Unicode normalization, + ligature substitutions, mark positioning, kerning, and applying + other font features. The key difference is that a text run in a + non-complex script can be processed sequentially and in the same + order as the input sequence of Unicode codepoints, without + requiring an analysis stage. + +
+ +
+ Shaping operations + + Shaping a complex-script text run involves transforming the + input sequence of Unicode codepoints with some combination of + operations that is specified in the shaping model for the + script. + + + The specific conditions that trigger a given operation for a + text run vary from script to script, as do the order in which the + operations are performed and the codepoints that are + affected. However, the same general set of shaping operations is + common to all of the complex-script shaping models. + + + + + + A reordering operation moves a glyph + from its original ("logical") position in the sequence to + some other ("visual") position. + + + The shaping model for a given complex script might involve + more than one reordering step. + + + + + + A joining operation replaces a glyph + with an alternate form that is designed to connect with one + or more of the adjacent glyphs in the sequence. + + + + + + A contextual substitution operation + replaces either a single glyph or a subsequence of several + glyphs with an alternate glyph. This substitution is + performed when the original glyph or subsequence of glyphs + occurs in a specified position with respect to the + surrounding sequence. For example, one substitution might be + performed only when the target glyph is the first glyph in + the sequence, while another substitution is performed only + when a different target glyph occurs immediately after a + particular string pattern. + + + The shaping model for a given complex script might involve + multiple contextual-substitution operations, each applying + to different target glyphs and patterns and each + performed in a separate step. + + + + + + A contextual positioning operation + adjusts the horizontal and/or vertical position of a + glyph. This adjustment is performed when the glyph + occurs in a specified position with respect to the + surrounding sequence. 
+ + + Many contextual positioning operations are used to place + mark glyphs (such as diacritics, vowel + signs, and tone markers) with respect to + base glyphs. However, some complex + scripts may use contextual positioning operations to + correctly place base glyphs as well, such as + when the script uses stacking characters. + + + + +
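As a rough illustration of the reordering operation described above, the following Python sketch moves the Devanagari vowel sign I (U+093F), which follows its consonant in logical order, in front of that consonant. It is deliberately simplified: the function name `reorder_prebase` is hypothetical, it works on codepoints rather than glyphs, and it moves the sign past only one character, whereas a real shaping engine first identifies the syllable and moves the sign before the whole consonant cluster.

```python
# Simplified sketch of a reordering operation (not HarfBuzz's implementation).
# Assumption: the sign is moved past exactly one preceding character.
PRE_BASE_VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I


def reorder_prebase(logical):
    """Return a visual ordering for a list of one-character strings."""
    visual = list(logical)
    for i in range(1, len(visual)):
        if visual[i] == PRE_BASE_VOWEL_SIGN_I:
            # Swap the vowel sign with the character that precedes it.
            visual[i - 1], visual[i] = visual[i], visual[i - 1]
    return visual


if __name__ == "__main__":
    logical = ["\u0915", "\u093F"]  # KA followed by VOWEL SIGN I
    print([f"U+{ord(c):04X}" for c in reorder_prebase(logical)])
```

The input list stays in logical order; only the returned copy carries the visual order.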
+ +
+ Unicode character categories + + Shaping models are typically specified with respect to how + scripts are defined in the Unicode standard. + + + Every codepoint in the Unicode Character Database (UCD) is + assigned a Unicode General Category (UGC), + which provides the most fundamental information about the + codepoint: whether the codepoint represents a + Letter, a Mark, a + Number, Punctuation, a + Symbol, a Separator, + or something else (Other). + + + These UGC properties are "Major" categories. Each codepoint is + further assigned to a "minor" category within its Major + category, such as "Letter, uppercase" (Lu) or + "Letter, modifier" (Lm). + + + Shaping models are concerned primarily with Letter and Mark + codepoints. The minor categories of Mark codepoints are + particularly important for shaping. Marks can be nonspacing + (Mn), spacing combining + (Mc), or enclosing (Me). + + + In addition to the UGC property, codepoints in the Indic and + Southeast Asian scripts are also assigned + Unicode Indic Syllabic Category (UISC) and + Unicode Indic Positional Category (UIPC) + properties that provide more detailed information needed for + shaping. + + + The UISC property sub-categorizes Letters and Marks according to + common script-shaping behaviors. For example, UISC distinguishes + between consonant letters, vowel letters, and vowel marks. The + UIPC property sub-categorizes Mark codepoints by the visual + position that they occupy (above, below, right, left, or in + multiple positions). + + + Some complex scripts require that the text run be split into + syllables, and what constitutes a valid syllable in these + scripts is specified in regular expressions of the Letter and + Mark codepoints that take the UISC and UIPC properties into account. + + + 
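The Major and minor General Category values described above can be inspected with Python's standard `unicodedata` module, as in the short sketch below. The helper name `general_category` and the sample codepoints are chosen here for illustration only. The UISC and UIPC properties are not exposed by the standard library, so they are not shown.

```python
import unicodedata


def general_category(ch):
    """Return (major, minor) General Category codes, e.g. ("L", "Lu")."""
    minor = unicodedata.category(ch)  # two-letter code such as "Lu" or "Mn"
    return minor[0], minor


if __name__ == "__main__":
    samples = [
        "A",       # LATIN CAPITAL LETTER A
        "\u02B0",  # MODIFIER LETTER SMALL H
        "\u0301",  # COMBINING ACUTE ACCENT
        "\u093E",  # DEVANAGARI VOWEL SIGN AA
        "\u20DD",  # COMBINING ENCLOSING CIRCLE
    ]
    for ch in samples:
        major, minor = general_category(ch)
        print(f"U+{ord(ch):04X} major={major} minor={minor}")
```

The three Mark samples cover the nonspacing, spacing combining, and enclosing minor categories mentioned in the text.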
+ +
+ Text runs + + Real-world text usually contains codepoints from a mixture of + different Unicode scripts (including punctuation, numbers, symbols, + white-space characters, and other codepoints that do not belong + to any script). Real-world text may also be marked up with + formatting that changes font properties (including the font, + font style, and font size). + + + For shaping purposes, all real-world text streams must first be + segmented into runs that have a uniform set of properties. + + + In particular, shaping models always assume that every codepoint + in a text run has the same direction, + script tag, and + language tag. + 
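The segmentation step described above can be sketched in Python as follows. This is a minimal illustration under loud assumptions: `SCRIPT_RANGES` is a hypothetical, heavily abbreviated table that approximates scripts by Unicode block (a real itemizer uses the full Unicode Script property), the function names are invented for this example, and only the script is considered, not direction, language, or font. Codepoints outside the table, such as spaces and digits, are attached to the neighboring run.

```python
# Hypothetical, abbreviated table: (first codepoint, last codepoint, script tag).
# Block ranges only approximate the Unicode Script property.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"),
    (0x0061, 0x007A, "Latn"),
    (0x0590, 0x05FF, "Hebr"),
    (0x0600, 0x06FF, "Arab"),
    (0x0900, 0x097F, "Deva"),
]


def script_of(ch):
    """Return a script tag, or None for codepoints not in the table."""
    cp = ord(ch)
    for first, last, tag in SCRIPT_RANGES:
        if first <= cp <= last:
            return tag
    return None


def split_runs(text):
    """Split text into (script, substring) runs with a uniform script."""
    runs = []  # each entry is a mutable [script, substring] pair
    for ch in text:
        script = script_of(ch)
        if runs and (script is None or runs[-1][0] is None
                     or script == runs[-1][0]):
            if runs[-1][0] is None:
                # A run that began with script-less characters adopts
                # the first real script that follows them.
                runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]


if __name__ == "__main__":
    for script, chunk in split_runs("abc \u05E9\u05DC\u05D5\u05DD"):
        print(script, repr(chunk))
```

Each resulting run could then be handed to a shaper together with its direction and language; those properties would need their own segmentation passes, which this sketch omits.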
+ +
+ OpenType shaping models + + OpenType provides shaping models for the following scripts: + + + + + + The default shaping model handles all + non-complex scripts, and may also be used as a fallback for + handling unrecognized scripts. + + + + + + The Indic shaping model handles the Indic + scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, + Malayalam, Oriya, Tamil, Telugu, and Sinhala. + + + The Indic shaping model was revised significantly in + 2005. To denote the change, a new set of script + tags was assigned for Bengali, Devanagari, + Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and + Telugu. For the sake of clarity, the term "Indic2" is + sometimes used to refer to the current, revised shaping + model. + + + + + + The Arabic shaping model supports + Arabic, Mongolian, N'Ko, Syriac, and several other connected + or cursive scripts. + + + + + + The Thai/Lao shaping model supports + the Thai and Lao scripts. + + + + + + The Khmer shaping model supports the + Khmer script. + + + + + + The Myanmar shaping model supports the + Myanmar (or Burmese) script. + + + + + + The Tibetan shaping model supports the + Tibetan script. + + + + + + The Hangul shaping model supports the + Hangul script. + + + + + + The Hebrew shaping model supports the + Hebrew script. + + + + + + The Universal Shaping Engine (USE) + shaping model supports complex scripts not covered by one of + the above, script-specific shaping models, including + Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi, + Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai + Viet, and many others. + + + + + + Text runs that do not fall under one of the above shaping + models may still require processing by a shaping engine. Of + particular note is Emoji shaping, which + may involve variation-selector sequences and glyph + substitution. Emoji shaping is handled by the default + shaping model. + + + + + +
+ +
+ Graphite shaping + + In contrast to OpenType shaping, Graphite shaping does not + specify a predefined set of shaping models or a set of supported + scripts. + + + Instead, each Graphite font contains a complete set of rules that + implement the required shaping model for the intended + script. These rules include finite-state machines to match + sequences of codepoints to the shaping operations to perform. + + + Graphite shaping can perform the same shaping operations used in + OpenType shaping, as well as other functions that have not been + defined for OpenType shaping. + +
+ +
+ AAT shaping + + In contrast to OpenType shaping, AAT shaping does not specify a + predefined set of shaping models or a set of supported scripts. + + + Instead, each AAT font includes a complete set of rules that + implement the desired shaping model for the intended + script. These rules include finite-state machines that match glyph + sequences to the shaping operations to perform. + + + Notably, AAT shaping rules are expressed for glyphs in the font, + not for Unicode codepoints. AAT shaping can perform the same + shaping operations used in OpenType shaping, as well as other + functions that have not been defined for OpenType shaping. + 
+
-- 2.7.4