<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
"http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
<!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
<!ENTITY version SYSTEM "version.xml">
<chapter id="shaping-concepts">
<title>Shaping concepts</title>
<section id="text-shaping-concepts">
<title>Text shaping</title>
Text shaping is the process of transforming a sequence of Unicode
codepoints that represent individual characters (letters,
diacritics, tone marks, numbers, symbols, etc.) into the
orthographically and linguistically correct two-dimensional layout
of glyph shapes taken from a specified font.

For some writing systems (or <emphasis>scripts</emphasis>) and
languages, the process is simple, requiring the shaper to do
little more than advance the horizontal position forward by the
correct amount for each successive glyph.
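
The simple case can be sketched in a few lines of Python. This is
a minimal illustration, not how any shaping engine is implemented:
the <literal>ADVANCES</literal> table and the
<literal>layout_simple</literal> function are hypothetical, and the
advance widths are made-up values in font units.

```python
# Hypothetical advance widths in font units; a real font supplies these.
ADVANCES = {"H": 722, "i": 278, "!": 333}


def layout_simple(text, advances=ADVANCES):
    """Place glyphs left to right; return (glyph, x) pairs and the total width."""
    pen_x = 0
    placed = []
    for ch in text:
        placed.append((ch, pen_x))
        pen_x += advances[ch]  # move the pen forward by this glyph's advance
    return placed, pen_x


positions, total_width = layout_simple("Hi!")
# positions   is [("H", 0), ("i", 722), ("!", 1000)]
# total_width is 1333
```

Each glyph is placed at the running pen position, and the pen then
moves forward by that glyph's advance. No substitution, reordering,
or mark placement is involved.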

For other scripts (often unceremoniously called
<emphasis>complex scripts</emphasis>), however, any combination of
several shaping operations may be required, and the rules for how
and when they are applied vary from script to script. HarfBuzz and
other shaping engines implement these rules.

The exact rules and necessary operations for a particular script
constitute a shaping <emphasis>model</emphasis>. OpenType
specifies a set of shaping models that covers all of
Unicode. Other shaping models are available, however, including
Graphite and Apple Advanced Typography (AAT).
<section id="script-specific-shaping">
<title>Script-specific shaping</title>
In many scripts, transforming the input
sequence into the final layout requires some combination of
operations, such as context-dependent substitutions,
context-dependent mark positioning, glyph-to-glyph joining,
glyph reordering, or glyph stacking.

In some scripts, the shaping rules require that a text
run be divided into syllables before the operations can be
applied. Other scripts may apply shaping operations over
entire words or over the entire text run, with no subdivision.

Other scripts do not require these
operations. However, correctly shaping a text run in
any script may still involve Unicode normalization,
ligature substitutions, mark positioning, kerning, and applying
<section id="shaping-operations">
<title>Shaping operations</title>
Shaping a text run involves transforming the
input sequence of Unicode codepoints with some combination of
operations that is specified in the shaping model for the
script.

The specific conditions that trigger a given operation for a
text run vary from script to script, as do the order in which the
operations are performed and the codepoints that are
affected. However, the same general set of shaping operations is
common to all of the script shaping models.

A <emphasis>reordering</emphasis> operation moves a glyph
from its original ("logical") position in the sequence to
some other ("visual") position.

The shaping model for a given script might involve
more than one reordering step.

A <emphasis>joining</emphasis> operation replaces a glyph
with an alternate form that is designed to connect with one
or more of the adjacent glyphs in the sequence.

A contextual <emphasis>substitution</emphasis> operation
replaces either a single glyph or a subsequence of several
glyphs with an alternate glyph. This substitution is
performed when the original glyph or subsequence of glyphs
occurs in a specified position with respect to the
surrounding sequence. For example, one substitution might be
performed only when the target glyph is the first glyph in
the sequence, while another substitution is performed only
when a different target glyph occurs immediately after a
particular string pattern.

The shaping model for a given script might involve
multiple contextual-substitution operations, each applying
to different target glyphs and patterns and each
performed in a separate step.

A contextual <emphasis>positioning</emphasis> operation
adjusts the horizontal position, the vertical position, or both
for a glyph. This adjustment is performed when the glyph
occurs in a specified position with respect to the
surrounding sequence.

Many contextual positioning operations are used to place
<emphasis>mark</emphasis> glyphs (such as diacritics, vowel
signs, and tone markers) with respect to
<emphasis>base</emphasis> glyphs. However, some
scripts may use contextual positioning operations to
correctly place base glyphs as well, such as
when the script uses <emphasis>stacking</emphasis> characters.
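
The four kinds of operation can be illustrated on plain lists of
glyph names. The sketch below is a toy under stated assumptions:
all function names, glyph names, and the
<literal>y_offset</literal> value are hypothetical, every glyph is
assumed to join on both sides, and real shaping models apply far
more detailed rules.

```python
def reorder_prebase(glyphs, prebase):
    """Reordering: move each glyph in `prebase` ahead of the glyph before it."""
    out = list(glyphs)
    for i in range(1, len(out)):
        if out[i] in prebase:
            out[i - 1], out[i] = out[i], out[i - 1]
    return out


def join_forms(glyphs):
    """Joining: pick a positional form, assuming every glyph joins on both sides."""
    n = len(glyphs)
    if n == 1:
        return [glyphs[0] + ".isol"]
    return [
        g + (".init" if i == 0 else ".fina" if i == n - 1 else ".medi")
        for i, g in enumerate(glyphs)
    ]


def substitute_initial(glyphs, target, replacement):
    """Contextual substitution: replace `target` only at the start of the run."""
    if glyphs and glyphs[0] == target:
        return [replacement] + list(glyphs[1:])
    return list(glyphs)


def position_marks(glyphs, marks, y_offset=500):
    """Contextual positioning: return (glyph, dx, dy), raising marks by y_offset."""
    return [(g, 0, y_offset if g in marks else 0) for g in glyphs]


visual = reorder_prebase(["ka", "i_sign"], {"i_sign"})   # ["i_sign", "ka"]
joined = join_forms(["beh", "beh", "beh"])               # init, medi, fina forms
```

The reordering example mirrors the logical-to-visual move described
for reordering operations: a sign that is stored after its consonant
is moved in front of it for display.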
<section id="unicode-character-categories">
<title>Unicode character categories</title>
Shaping models are typically specified with respect to how
scripts are defined in the Unicode standard.

Every codepoint in the Unicode Character Database (UCD) is
assigned a <emphasis>Unicode General Category</emphasis> (UGC),
which provides the most fundamental information about the
codepoint: whether the codepoint represents a
<emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
<emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
<emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
or something else (<emphasis>Other</emphasis>).

These UGC properties are "Major" categories. Each codepoint is
further assigned to a "minor" category within its Major
category, such as "Letter, uppercase" (<literal>Lu</literal>) or
"Letter, modifier" (<literal>Lm</literal>).

Shaping models are concerned primarily with Letter and Mark
codepoints. The minor categories of Mark codepoints are
particularly important for shaping. Marks can be nonspacing
(<literal>Mn</literal>), spacing combining
(<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
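
The General Category of a codepoint can be inspected with the
Python standard library, which is independent of any shaping
engine. In the sketch below, <literal>unicodedata.category</literal>
is the real library call. The <literal>general_category</literal>
helper and the choice of sample characters are only illustrative.

```python
import unicodedata


def general_category(ch):
    """Return (major, minor) for one character, e.g. ("L", "Lu")."""
    minor = unicodedata.category(ch)
    return minor[0], minor


samples = {
    "A": general_category("A"),            # Letter, uppercase
    "U+02B0": general_category("\u02B0"),  # MODIFIER LETTER SMALL H
    "U+0301": general_category("\u0301"),  # COMBINING ACUTE ACCENT
    "U+093E": general_category("\u093E"),  # DEVANAGARI VOWEL SIGN AA
    "U+20DD": general_category("\u20DD"),  # COMBINING ENCLOSING CIRCLE
}
```

The three Mark samples fall into the nonspacing
(<literal>Mn</literal>), spacing combining (<literal>Mc</literal>),
and enclosing (<literal>Me</literal>) minor categories respectively.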

In addition to the UGC property, codepoints in the Indic and
Southeast Asian scripts are also assigned
<emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
<emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
properties that provide more detailed information needed for
shaping.

The UISC property sub-categorizes Letters and Marks according to
common script-shaping behaviors. For example, UISC distinguishes
between consonant letters, vowel letters, and vowel marks. The
UIPC property sub-categorizes Mark codepoints by the relative visual
position that they occupy (above, below, right, left, or in

Some scripts require that the text run be split into
syllables. What constitutes a valid syllable in these
scripts is specified in regular expressions, formed from the
Letter and Mark codepoints, that take the UISC and UIPC
properties into account.
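
The idea of a syllable regular expression can be sketched by
mapping each codepoint to a one-letter class and matching a pattern
over the resulting class string. Everything here is a deliberate
simplification and an assumption: the three Devanagari ranges stand
in for real UISC data, and the pattern is not the syllable grammar
of any actual shaping model.

```python
import re


def classify(ch):
    """Map a character to a toy class: C consonant, M vowel sign, H virama, X other."""
    cp = ord(ch)
    if cp in range(0x0915, 0x093A):   # DEVANAGARI LETTER KA .. HA
        return "C"
    if cp in range(0x093E, 0x094D):   # DEVANAGARI VOWEL SIGN AA .. AU
        return "M"
    if cp == 0x094D:                  # DEVANAGARI SIGN VIRAMA
        return "H"
    return "X"


# Zero or more consonant+virama pairs, a consonant, an optional vowel sign;
# anything that does not fit is emitted as a one-character fallback cluster.
SYLLABLE = re.compile(r"(?:CH)*CM?|.")


def split_syllables(text):
    classes = "".join(classify(ch) for ch in text)
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]


simple = split_syllables("\u0915\u093F\u0924\u093E\u092C")   # KA I, TA AA, BA
conjunct = split_syllables("\u0915\u094D\u0937\u093F")       # KA VIRAMA SSA I
```

Because the class string has exactly one letter per codepoint, the
match offsets can be used directly to slice the original text.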
<section id="text-runs">
<title>Text runs</title>
Real-world text usually contains codepoints from a mixture of
different Unicode scripts (including punctuation, numbers, symbols,
white-space characters, and other codepoints that do not belong
to any script). Real-world text may also be marked up with
formatting that changes font properties (including the font,
font style, and font size).

For shaping purposes, all real-world text streams must first be
segmented into runs that have a uniform set of properties.

In particular, shaping models always assume that every codepoint
in a text run has the same <emphasis>direction</emphasis>,
<emphasis>script</emphasis> tag, and
<emphasis>language</emphasis> tag.
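
A rough sketch of segmenting text by script follows. It rests on a
loudly stated assumption: Python's standard library does not expose
the Unicode Script property, so the hypothetical
<literal>guess_script</literal> helper takes the first word of the
character name as a crude stand-in. Direction and language, which
a real segmenter must also keep uniform, are not handled.

```python
import unicodedata


def guess_script(ch):
    """Crude stand-in for the Unicode Script property (an assumption)."""
    if unicodedata.category(ch)[0] != "L":
        return None  # marks, digits, punctuation, spaces: join the current run
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def split_runs(text):
    """Split text into (script, substring) runs with one script per run."""
    runs = []  # each entry is [script, substring]
    for ch in text:
        script = guess_script(ch)
        if runs and (script is None or script == runs[-1][0]):
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]


runs = split_runs("abc \u043C\u0438\u0440!")
```

Characters without a script of their own are attached to the
preceding run, which is one simple policy among several possible
ones.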
<section id="opentype-shaping-models">
<title>OpenType shaping models</title>
OpenType provides shaping models for the following scripts:

The <emphasis>default</emphasis> shaping model handles all
scripts with no script-specific shaping model, and may also be
used as a fallback for handling unrecognized scripts.

The <emphasis>Indic</emphasis> shaping model handles the Indic
scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
Malayalam, Oriya, Tamil, and Telugu.

The Indic shaping model was revised significantly in
2005. To denote the change, a new set of <emphasis>script
tags</emphasis> was assigned for Bengali, Devanagari,
Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
Telugu. For the sake of clarity, the term "Indic2" is
sometimes used to refer to the current, revised shaping
model.

The <emphasis>Arabic</emphasis> shaping model supports
Arabic, Mongolian, N'Ko, Syriac, and several other connected
scripts.

The <emphasis>Thai/Lao</emphasis> shaping model supports
the Thai and Lao scripts.

The <emphasis>Khmer</emphasis> shaping model supports the
Khmer script.

The <emphasis>Myanmar</emphasis> shaping model supports the
Myanmar (or Burmese) script.

The <emphasis>Tibetan</emphasis> shaping model supports the
Tibetan script.

The <emphasis>Hangul</emphasis> shaping model supports the
Hangul script.

The <emphasis>Hebrew</emphasis> shaping model supports the
Hebrew script.

The <emphasis>Universal Shaping Engine</emphasis> (USE)
shaping model supports scripts not covered by one of
the script-specific shaping models above, including
Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
Viet, and many others.

Text runs that do not fall under one of the above shaping
models may still require processing by a shaping engine. Of
particular note is <emphasis>Emoji</emphasis> shaping, which
may involve variation-selector sequences and glyph
substitution. Emoji shaping is handled by the default
shaping model.
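
Choosing a shaping model for a run can be pictured as a table
lookup keyed on the run's script tag, with the default model as the
fallback. The sketch below is illustrative only: the four-letter
keys are OpenType script tags, but the table is deliberately
partial, the model names are just labels, and
<literal>choose_model</literal> is a hypothetical helper rather
than the dispatch logic of any shaping engine.

```python
# Partial, illustrative table: OpenType script tag to shaping-model label.
MODEL_BY_SCRIPT_TAG = {
    "deva": "indic", "dev2": "indic",   # Devanagari, old and revised tags
    "beng": "indic", "bng2": "indic",   # Bengali, old and revised tags
    "taml": "indic", "tml2": "indic",   # Tamil, old and revised tags
    "arab": "arabic", "syrc": "arabic", "mong": "arabic",
    "thai": "thai/lao",
    "khmr": "khmer",
    "tibt": "tibetan",
    "hang": "hangul",
    "hebr": "hebrew",
    "java": "use", "bali": "use",       # Javanese and Balinese
}


def choose_model(script_tag):
    """Return the model label for a script tag, falling back to the default."""
    return MODEL_BY_SCRIPT_TAG.get(script_tag, "default")


def uses_revised_indic_tag(script_tag):
    """True for the revised ("Indic2") tags in this table, which end in "2"."""
    return choose_model(script_tag) == "indic" and script_tag.endswith("2")
```

For example, <literal>choose_model("dev2")</literal> yields the
Indic label, while a tag that is absent from the table falls
through to the default model.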
<section id="graphite-shaping">
<title>Graphite shaping</title>
In contrast to OpenType shaping, Graphite shaping does not
specify a predefined set of shaping models or a set of supported
scripts.

Instead, each Graphite font contains a complete set of rules that
implement the required shaping model for the intended
script. These rules include finite-state machines to match
sequences of codepoints to the shaping operations to perform.

Graphite shaping can perform the same shaping operations used in
OpenType shaping, as well as other functions that have not been
defined for OpenType shaping.
<section id="aat-shaping">
<title>AAT shaping</title>
In contrast to OpenType shaping, AAT shaping does not specify a
predefined set of shaping models or a set of supported scripts.

Instead, each AAT font includes a complete set of rules that
implement the desired shaping model for the intended
script. These rules include finite-state machines that match glyph
sequences to the shaping operations to perform.
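
The finite-state approach can be sketched with a small transition
table that recognizes one glyph sequence and triggers one action.
This is a toy under stated assumptions: the states, the glyph
names, and the <literal>ligate_fi</literal> action are all
hypothetical, and the table bears no relation to the binary state
tables stored in real fonts.

```python
# (state, glyph) maps to (next state, action); unknown pairs reset to "start".
TRANSITIONS = {
    ("start", "f"): ("saw_f", None),
    ("saw_f", "f"): ("saw_f", None),
    ("saw_f", "i"): ("start", "ligate_fi"),
}


def run_machine(glyphs):
    """Walk the glyph run once, replacing an "f", "i" pair with "f_i"."""
    state = "start"
    out = []
    for g in glyphs:
        state, action = TRANSITIONS.get((state, g), ("start", None))
        if action == "ligate_fi":
            out[-1] = "f_i"   # the pending "f" becomes the ligature glyph
        else:
            out.append(g)
    return out


shaped = run_machine(["o", "f", "f", "i", "x"])   # ["o", "f", "f_i", "x"]
```

The machine operates on glyph names rather than codepoints, and
each step depends only on the current state and the current glyph.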

Notably, AAT shaping rules are expressed for glyphs in the font,
not for Unicode codepoints. AAT shaping can perform the same
shaping operations used in OpenType shaping, as well as other
functions that have not been defined for OpenType shaping.