<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
"http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
<!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
<!ENTITY version SYSTEM "version.xml">
]>
<chapter id="shaping-concepts">
<title>Shaping concepts</title>
<section id="text-shaping-concepts">
<title>Text shaping</title>
<para>
Text shaping is the process of transforming a sequence of Unicode
codepoints that represent individual characters (letters,
diacritics, tone marks, numbers, symbols, etc.) into the
orthographically and linguistically correct two-dimensional layout
of glyph shapes taken from a specified font.
</para>
<para>
For some writing systems (or <emphasis>scripts</emphasis>) and
languages, the process is simple, requiring the shaper to do
little more than advance the horizontal position forward by the
correct amount for each successive glyph.
</para>
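<para>
The simple case described above can be sketched in a few lines. This
is a minimal illustration, not how HarfBuzz is implemented: the
<literal>ADVANCES</literal> table and the
<literal>layout_simple</literal> function are hypothetical names, and
the advance widths are made-up values standing in for what a real
shaper would read from the font.
</para>

```python
from typing import Dict, List, Tuple

# Hypothetical advance widths in font units (assumption for this sketch);
# a real shaper reads these values from the font.
ADVANCES: Dict[str, int] = {"H": 722, "i": 278, "!": 333}


def layout_simple(glyphs: List[str], advances: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (glyph, x_position) pairs by accumulating advance widths."""
    x = 0
    placed = []
    for glyph in glyphs:
        placed.append((glyph, x))  # the glyph is drawn at the current pen position
        x += advances[glyph]       # then the pen moves forward by its advance
    return placed


print(layout_simple(["H", "i", "!"], ADVANCES))
# [('H', 0), ('i', 722), ('!', 1000)]
```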
<para>
For <emphasis>complex scripts</emphasis>, by contrast, any combination of
several shaping operations may be required, and the rules for how
and when they are applied vary from script to script. HarfBuzz and
other shaping engines implement these rules.
</para>
<para>
The exact rules and necessary operations for a particular script
constitute a shaping <emphasis>model</emphasis>. OpenType
specifies a set of shaping models that covers all of
Unicode. Other shaping models are also available, including
Graphite and Apple Advanced Typography (AAT).
</para>
<section id="complex-scripts">
<title>Complex scripts</title>
<para>
In text-shaping terminology, scripts are generally classified as
either <emphasis>complex</emphasis> or <emphasis>non-complex</emphasis>.
</para>
<para>
Complex scripts are those for which transforming the input
sequence into the final layout requires some combination of
operations, such as context-dependent substitutions,
context-dependent mark positioning, glyph-to-glyph joining,
glyph reordering, or glyph stacking.
</para>
<para>
In some complex scripts, the shaping rules require that a text
run be divided into syllables before the operations can be
applied. Other complex scripts may apply shaping operations over
entire words or over the entire text run, with no subdivision
required.
</para>
<para>
Non-complex scripts, by definition, do not require these
operations. However, correctly shaping a text run in a
non-complex script may still involve Unicode normalization,
ligature substitutions, mark positioning, kerning, and applying
other font features. The key difference is that a text run in a
non-complex script can be processed sequentially and in the same
order as the input sequence of Unicode codepoints, without
requiring an analysis stage.
</para>
<section id="shaping-operations">
<title>Shaping operations</title>
<para>
Shaping a complex-script text run involves transforming the
input sequence of Unicode codepoints with some combination of
operations that is specified in the shaping model for the
script.
</para>
<para>
The specific conditions that trigger a given operation for a
text run vary from script to script, as do the order in which the
operations are performed and the codepoints that are
affected. However, the same general set of shaping operations is
common to all of the complex-script shaping models.
</para>
<itemizedlist>
<listitem>
<para>
A <emphasis>reordering</emphasis> operation moves a glyph
from its original ("logical") position in the sequence to
some other ("visual") position.
</para>
<para>
The shaping model for a given complex script might involve
more than one reordering step.
</para>
</listitem>
<listitem>
<para>
A <emphasis>joining</emphasis> operation replaces a glyph
with an alternate form that is designed to connect with one
or more of the adjacent glyphs in the sequence.
</para>
</listitem>
<listitem>
<para>
A contextual <emphasis>substitution</emphasis> operation
replaces either a single glyph or a subsequence of several
glyphs with an alternate glyph. This substitution is
performed when the original glyph or subsequence of glyphs
occurs in a specified position with respect to the
surrounding sequence. For example, one substitution might be
performed only when the target glyph is the first glyph in
the sequence, while another substitution is performed only
when a different target glyph occurs immediately after a
particular string pattern.
</para>
<para>
The shaping model for a given complex script might involve
multiple contextual-substitution operations, each applying
to different target glyphs and patterns and each
performed in a separate step.
</para>
</listitem>
<listitem>
<para>
A contextual <emphasis>positioning</emphasis> operation
adjusts the horizontal position, the vertical position, or
both, of a glyph. The adjustment is performed when the glyph
occurs in a specified position with respect to the
surrounding sequence.
</para>
<para>
Many contextual positioning operations are used to place
<emphasis>mark</emphasis> glyphs (such as diacritics, vowel
signs, and tone markers) with respect to
<emphasis>base</emphasis> glyphs. However, some complex
scripts may use contextual positioning operations to
correctly place base glyphs as well, such as
when the script uses <emphasis>stacking</emphasis> characters.
</para>
</listitem>
</itemizedlist>
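<para>
As an illustration of a reordering operation, the sketch below moves
a left-side vowel sign from its logical position (after the
consonant) to its visual position (before it), using the Devanagari
vowel sign I (U+093F) as the example. It is a deliberate
simplification: it works on codepoints rather than glyphs, it moves
the sign past only one preceding item, and the
<literal>PRE_BASE_SIGNS</literal> set and
<literal>reorder_pre_base</literal> function are hypothetical names
introduced here. A real shaping model moves the sign relative to the
base of the whole syllable.
</para>

```python
from typing import List

# Assumption for this sketch: a hand-picked set of left-side ("pre-base")
# vowel signs. A real shaper derives this from Unicode data and the font.
PRE_BASE_SIGNS = {0x093F}  # DEVANAGARI VOWEL SIGN I


def reorder_pre_base(codepoints: List[int]) -> List[int]:
    """Move each pre-base sign in front of the item that precedes it."""
    out: List[int] = []
    for cp in codepoints:
        if cp in PRE_BASE_SIGNS and out:
            out.insert(len(out) - 1, cp)  # place the sign before the previous item
        else:
            out.append(cp)
    return out


logical = [0x0915, 0x093F]          # KA, VOWEL SIGN I (logical order)
visual = reorder_pre_base(logical)  # VOWEL SIGN I, KA (visual order)
print([f"U+{cp:04X}" for cp in visual])
# ['U+093F', 'U+0915']
```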
<section id="unicode-character-categories">
<title>Unicode character categories</title>
<para>
Shaping models are typically specified with respect to how
scripts are defined in the Unicode standard.
</para>
<para>
Every codepoint in the Unicode Character Database (UCD) is
assigned a <emphasis>Unicode General Category</emphasis> (UGC),
which provides the most fundamental information about the
codepoint: whether the codepoint represents a
<emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
<emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
<emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
or something else (<emphasis>Other</emphasis>).
</para>
<para>
These UGC properties are "Major" categories. Each codepoint is
further assigned to a "minor" category within its Major
category, such as "Letter, uppercase" (<literal>Lu</literal>) or
"Letter, modifier" (<literal>Lm</literal>).
</para>
<para>
Shaping models are concerned primarily with Letter and Mark
codepoints. The minor categories of Mark codepoints are
particularly important for shaping. Marks can be nonspacing
(<literal>Mn</literal>), spacing combining
(<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
</para>
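<para>
The General Category values can be inspected with Python's standard
<literal>unicodedata</literal> module, whose
<literal>category()</literal> function returns the two-letter code:
the first letter is the Major category and the pair is the minor
category. The helper name <literal>general_category</literal> and the
choice of sample codepoints are illustrative assumptions.
</para>

```python
import unicodedata
from typing import Tuple


def general_category(ch: str) -> Tuple[str, str]:
    """Return (major, minor) General Category codes for one character."""
    minor = unicodedata.category(ch)  # e.g. "Lu", "Mn"
    return minor[0], minor


SAMPLES = [
    "A",       # Lu: Letter, uppercase
    "\u02B0",  # Lm: Letter, modifier
    "\u0301",  # Mn: Mark, nonspacing
    "\u093E",  # Mc: Mark, spacing combining
    "\u20DD",  # Me: Mark, enclosing
]

for ch in SAMPLES:
    major, minor = general_category(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: major={major} minor={minor}")
```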
<para>
In addition to the UGC property, codepoints in the Indic and
Southeast Asian scripts are also assigned
<emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
<emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
properties that provide more detailed information needed for
shaping.
</para>
<para>
The UISC property sub-categorizes Letters and Marks according to
common script-shaping behaviors. For example, UISC distinguishes
between consonant letters, vowel letters, and vowel marks. The
UIPC property sub-categorizes Mark codepoints by the relative visual
position that they occupy (above, below, right, left, or in
multiple positions).
</para>
<para>
Some complex scripts require that the text run be split into
syllables. What constitutes a valid syllable in these
scripts is specified in regular expressions, formed from the
Letter and Mark codepoints, that take the UISC and UIPC
properties into account.
</para>
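<para>
The idea of describing syllables with a regular expression over
character classes can be sketched as follows. Everything here is a
toy assumption: the <literal>CLASSES</literal> table is hand-written
for a few Devanagari codepoints rather than taken from the UISC and
UIPC data, and the <literal>SYLLABLE</literal> pattern is far simpler
than the grammar of any real shaping model.
</para>

```python
import re
from typing import Dict, List

# Hypothetical, hand-written classes for a few Devanagari codepoints.
# C = consonant, V = independent vowel, M = dependent vowel sign, H = virama.
CLASSES: Dict[int, str] = {
    0x0905: "V",  # LETTER A
    0x0915: "C",  # LETTER KA
    0x0924: "C",  # LETTER TA
    0x0937: "C",  # LETTER SSA
    0x093E: "M",  # VOWEL SIGN AA
    0x093F: "M",  # VOWEL SIGN I
    0x094D: "H",  # SIGN VIRAMA
}

# Toy grammar: an independent vowel, or consonants joined by viramas
# with an optional trailing vowel sign.
SYLLABLE = re.compile(r"V|(?:CH)*CM?")


def split_syllables(codepoints: List[int]) -> List[List[int]]:
    """Split a run into syllables by matching the class string.

    Each codepoint maps to exactly one class letter, so match offsets in
    the class string are also offsets into the codepoint list. Items that
    fit no syllable (class "X", or a stray mark) are simply skipped here.
    """
    classes = "".join(CLASSES.get(cp, "X") for cp in codepoints)
    return [codepoints[m.start():m.end()] for m in SYLLABLE.finditer(classes)]


run = [0x0915, 0x094D, 0x0937, 0x093F, 0x0924, 0x093E]  # classes: C H C M C M
for syllable in split_syllables(run):
    print([f"U+{cp:04X}" for cp in syllable])
# ['U+0915', 'U+094D', 'U+0937', 'U+093F']
# ['U+0924', 'U+093E']
```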
<section id="text-runs">
<title>Text runs</title>
<para>
Real-world text usually contains codepoints from a mixture of
different Unicode scripts (including punctuation, numbers, symbols,
white-space characters, and other codepoints that do not belong
to any script). Real-world text may also be marked up with
formatting that changes font properties (including the font,
font style, and font size).
</para>
<para>
For shaping purposes, all real-world text streams must first be
segmented into runs that have a uniform set of properties.
</para>
<para>
In particular, shaping models always assume that every codepoint
in a text run has the same <emphasis>direction</emphasis>,
<emphasis>script</emphasis> tag, and
<emphasis>language</emphasis> tag.
</para>
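<para>
A rough sketch of script-based segmentation is shown below. It rests
on several assumptions that a production itemizer would not make: the
<literal>SCRIPT_RANGES</literal> table is a hypothetical, heavily
abridged stand-in for the Unicode Script property, characters outside
the table simply inherit the script of a neighbor, and direction is
derived from the script alone instead of from the Unicode
Bidirectional Algorithm. Language and font changes are ignored.
</para>

```python
from itertools import groupby
from typing import List, Optional, Tuple

# Hypothetical, heavily abridged script ranges (assumption for this sketch);
# real itemizers use the Unicode Script property from the UCD.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"),
    (0x0061, 0x007A, "Latn"),
    (0x0590, 0x05FF, "Hebr"),
    (0x0600, 0x06FF, "Arab"),
    (0x0900, 0x097F, "Deva"),
]
RTL_SCRIPTS = {"Hebr", "Arab"}


def script_of(ch: str) -> Optional[str]:
    """Return the script code for ch, or None if it is not in the table."""
    cp = ord(ch)
    for lo, hi, script in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return script
    return None


def segment(text: str) -> List[Tuple[str, str, str]]:
    """Split text into (script, direction, substring) runs."""
    scripts = [script_of(ch) for ch in text]
    # Unlisted characters inherit the script of the previous character...
    for i in range(1, len(scripts)):
        if scripts[i] is None:
            scripts[i] = scripts[i - 1]
    # ...and leading ones take the script of the character that follows.
    for i in range(len(scripts) - 2, -1, -1):
        if scripts[i] is None:
            scripts[i] = scripts[i + 1]
    runs = []
    pos = 0
    for script, group in groupby(scripts):
        length = len(list(group))
        direction = "rtl" if script in RTL_SCRIPTS else "ltr"
        # "Zyyy" (Common) is used when no character in the text was recognized.
        runs.append((script or "Zyyy", direction, text[pos:pos + length]))
        pos += length
    return runs


for run in segment("abc \u05E9\u05DC\u05D5\u05DD!"):
    print(run)
```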
<section id="opentype-shaping-models">
<title>OpenType shaping models</title>
<para>
OpenType provides shaping models for the following scripts:
</para>
<itemizedlist>
<listitem>
<para>
The <emphasis>default</emphasis> shaping model handles all
non-complex scripts, and may also be used as a fallback for
handling unrecognized scripts.
</para>
</listitem>
<listitem>
<para>
The <emphasis>Indic</emphasis> shaping model handles the Indic
scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
Malayalam, Oriya, Tamil, Telugu, and Sinhala.
</para>
<para>
The Indic shaping model was revised significantly in
2005. To denote the change, a new set of <emphasis>script
tags</emphasis> was assigned for Bengali, Devanagari,
Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
Telugu. For the sake of clarity, the term "Indic2" is
sometimes used to refer to the current, revised shaping
model.
</para>
</listitem>
<listitem>
<para>
The <emphasis>Arabic</emphasis> shaping model supports
Arabic, Mongolian, N'Ko, Syriac, and several other connected
scripts.
</para>
</listitem>
<listitem>
<para>
The <emphasis>Thai/Lao</emphasis> shaping model supports
the Thai and Lao scripts.
</para>
</listitem>
<listitem>
<para>
The <emphasis>Khmer</emphasis> shaping model supports the
Khmer script.
</para>
</listitem>
<listitem>
<para>
The <emphasis>Myanmar</emphasis> shaping model supports the
Myanmar (or Burmese) script.
</para>
</listitem>
<listitem>
<para>
The <emphasis>Tibetan</emphasis> shaping model supports the
Tibetan script.
</para>
</listitem>
<listitem>
<para>
The <emphasis>Hangul</emphasis> shaping model supports the
Hangul script.
</para>
</listitem>
<listitem>
<para>
The <emphasis>Hebrew</emphasis> shaping model supports the
Hebrew script.
</para>
</listitem>
<listitem>
<para>
The <emphasis>Universal Shaping Engine</emphasis> (USE)
shaping model supports complex scripts not covered by one of
the above, script-specific shaping models, including
Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
Viet, and many others.
</para>
</listitem>
</itemizedlist>
<para>
Text runs that do not fall under one of the above shaping
models may still require processing by a shaping engine. Of
particular note is <emphasis>Emoji</emphasis> shaping, which
may involve variation-selector sequences and glyph
substitution. Emoji shaping is handled by the default
shaping model.
</para>
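<para>
The assignment of scripts to shaping models listed above amounts to
a lookup with a default fallback, which might be sketched like this.
The table covers only the scripts named in this section, the model
names are informal labels, and the keys are ISO 15924 script codes
chosen for readability (an assumption of this sketch; OpenType fonts
use their own script tags). The names
<literal>MODEL_SCRIPTS</literal> and
<literal>shaping_model</literal> are hypothetical.
</para>

```python
from typing import Dict, Tuple

# ISO 15924 script codes grouped by shaping model. This is an illustrative
# subset limited to the scripts named in the text, not a complete table.
MODEL_SCRIPTS: Dict[str, Tuple[str, ...]] = {
    "indic": ("Beng", "Deva", "Gujr", "Guru", "Knda",
              "Mlym", "Orya", "Taml", "Telu", "Sinh"),
    "arabic": ("Arab", "Mong", "Nkoo", "Syrc"),
    "thai/lao": ("Thai", "Laoo"),
    "khmer": ("Khmr",),
    "myanmar": ("Mymr",),
    "tibetan": ("Tibt",),
    "hangul": ("Hang",),
    "hebrew": ("Hebr",),
    "use": ("Java", "Bali", "Bugi", "Batk", "Cakm", "Lepc", "Modi",
            "Phag", "Tglg", "Sidd", "Sund", "Tale", "Lana", "Tavt"),
}

# Invert the table so that a script code maps directly to its model.
SCRIPT_TO_MODEL: Dict[str, str] = {
    script: model
    for model, scripts in MODEL_SCRIPTS.items()
    for script in scripts
}


def shaping_model(script: str) -> str:
    """Return the model label for a script, falling back to "default"."""
    return SCRIPT_TO_MODEL.get(script, "default")


print(shaping_model("Deva"))  # indic
print(shaping_model("Java"))  # use
print(shaping_model("Latn"))  # default
```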
<section id="graphite-shaping">
<title>Graphite shaping</title>
<para>
In contrast to OpenType shaping, Graphite shaping does not
specify a predefined set of shaping models or a set of supported
scripts.
</para>
<para>
Instead, each Graphite font contains a complete set of rules that
implement the required shaping model for the intended
script. These rules include finite-state machines to match
sequences of codepoints to the shaping operations to perform.
</para>
<para>
Graphite shaping can perform the same shaping operations used in
OpenType shaping, as well as other functions that have not been
defined for OpenType shaping.
</para>
<section id="aat-shaping">
<title>AAT shaping</title>
<para>
In contrast to OpenType shaping, AAT shaping does not specify a
predefined set of shaping models or a set of supported scripts.
</para>
<para>
Instead, each AAT font includes a complete set of rules that
implement the desired shaping model for the intended
script. These rules include finite-state machines that match glyph
sequences to the shaping operations to perform.
</para>
<para>
Notably, AAT shaping rules are expressed for glyphs in the font,
not for Unicode codepoints. AAT shaping can perform the same
shaping operations used in OpenType shaping, as well as other
functions that have not been defined for OpenType shaping.
</para>
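<para>
To give a feel for how a finite-state machine can match a glyph
sequence and trigger an operation, the sketch below implements a
two-state machine that replaces the glyph pair <literal>f</literal>,
<literal>i</literal> with a single ligature glyph. It is only a toy:
the glyph names, the state names, and the
<literal>apply_ligature_fsm</literal> function are hypothetical, and
real AAT and Graphite fonts encode their state machines in binary
font tables rather than in code like this.
</para>

```python
from typing import List


def apply_ligature_fsm(glyphs: List[str]) -> List[str]:
    """Replace each "f", "i" glyph pair with the ligature glyph "f_i"."""
    out: List[str] = []
    state = "START"
    for glyph in glyphs:
        if state == "START":
            if glyph == "f":
                state = "SAW_F"      # hold the f until the next glyph is known
            else:
                out.append(glyph)
        else:  # state == "SAW_F"
            if glyph == "i":
                out.append("f_i")    # the pair matched: emit the ligature
                state = "START"
            elif glyph == "f":
                out.append("f")      # flush the earlier f, keep holding this one
            else:
                out.extend(["f", glyph])
                state = "START"
    if state == "SAW_F":
        out.append("f")              # flush an f left pending at the end
    return out


print(apply_ligature_fsm(["o", "f", "f", "i", "c", "e"]))
# ['o', 'f', 'f_i', 'c', 'e']
```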