<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="clusters">
  <title>Clusters</title>
  <section id="clusters-and-shaping">
    <title>Clusters and shaping</title>
    <para>
      In text shaping, a <emphasis>cluster</emphasis> is a sequence of
      characters that needs to be treated as a single, indivisible
      unit. A single letter or symbol can be a cluster of its
      own. Other clusters correspond to longer subsequences of the
      input code points (such as a ligature or conjunct form)
      and require the shaper to ensure that the cluster is not
      broken during the shaping process.
    </para>
    <para>
      A cluster is distinct from a <emphasis>grapheme</emphasis>,
      which is the smallest unit of meaning in a writing system or
      script.
    </para>
    <para>
      The definitions of the two terms are similar. However, clusters
      are only relevant for script shaping and glyph layout. In
      contrast, graphemes are a property of the underlying script, and
      are of interest when client programs implement orthographic
      or linguistic functionality.
    </para>
    <para>
      For example, two individual letters are often two separate
      graphemes. When two letters form a ligature, however, they
      combine into a single glyph. They are then part of the same
      cluster and are treated as a unit by the shaping engine,
      even though the two original, underlying letters remain separate
      graphemes.
    </para>
    <para>
      HarfBuzz is concerned with clusters, <emphasis>not</emphasis>
      with graphemes, although client programs using HarfBuzz
      may still care about graphemes for other reasons.
    </para>
    <para>
      During the shaping process, there are several shaping operations
      that may merge adjacent characters (for example, when two code
      points form a ligature or a conjunct form and are replaced by a
      single glyph) or split one character into several (for example,
      when decomposing a code point through the
      <literal>ccmp</literal> feature). Operations like these alter
      clusters; HarfBuzz tracks the changes to ensure that no clusters
      get lost or broken during shaping.
    </para>
    <para>
      HarfBuzz records cluster information independently from how
      shaping operations affect the individual glyphs returned in an
      output buffer. Consequently, a client program using HarfBuzz can
      use the cluster information to implement features such as:
    </para>
    <itemizedlist>
      <listitem>
        <para>
          Correctly positioning the cursor within a shaped text run,
          even when characters have formed ligatures, composed or
          decomposed, reordered, or undergone other shaping operations.
        </para>
      </listitem>
      <listitem>
        <para>
          Correctly highlighting a text selection that includes some,
          but not all, of the characters in a word.
        </para>
      </listitem>
      <listitem>
        <para>
          Applying text attributes (such as color or underlining) to
          part, but not all, of a word.
        </para>
      </listitem>
      <listitem>
        <para>
          Generating output document formats (such as PDF) with
          embedded text that can be fully extracted.
        </para>
      </listitem>
      <listitem>
        <para>
          Determining the mapping between input characters and output
          glyphs, such as which glyphs are ligatures.
        </para>
      </listitem>
      <listitem>
        <para>
          Performing line-breaking, justification, and other
          line-level or paragraph-level operations that must be done
          after shaping is complete, but which require examining
          character-level properties.
        </para>
      </listitem>
    </itemizedlist>
  </section>
  <section id="working-with-harfbuzz-clusters">
    <title>Working with HarfBuzz clusters</title>
    <para>
      When you add text to a HarfBuzz buffer, each code point must be
      assigned a <emphasis>cluster value</emphasis>.
    </para>
    <para>
      This cluster value is an arbitrary number; HarfBuzz uses it only
      to distinguish between clusters. Many client programs will use
      the index of each code point in the input text stream as the
      cluster value. This is for the sake of convenience; the actual
      value does not matter.
    </para>
    <para>
      Some of the shaping operations performed by HarfBuzz
      (such as reordering, composition, decomposition, and substitution)
      may alter the cluster values of some characters. The
      final cluster values in the buffer at the end of the shaping
      process will indicate to client programs which subsequences of
      glyphs represent a cluster and, therefore, must not be
      separated.
    </para>
    <para>
      In addition, client programs can query the final cluster values
      to discern other potentially important information about the
      glyphs in the output buffer (such as whether or not a ligature
      was formed).
    </para>
    <para>
      For example, if the initial sequence of cluster values was:
    </para>
    <para>
      and the final sequence of cluster values is:
    </para>
    <para>
      then there are two clusters in the output buffer: the first
      cluster includes the first two glyphs, and the second cluster
      includes the third and fourth glyphs. It is also evident that a
      ligature or conjunct has been formed, because there are fewer
      glyphs in the output buffer (four) than there were code points
      in the input buffer (five).
    </para>
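    <para>
      As a minimal sketch of how a client program might read such a
      sequence, the C function below counts the clusters in a shaped run
      by scanning for changes between adjacent cluster values. It works on
      a plain array rather than on a HarfBuzz buffer; the name
      <literal>count_clusters</literal> is hypothetical and is not part of
      the HarfBuzz API.
    </para>

```c
#include <stddef.h>

/* Hypothetical helper, not a HarfBuzz function: counts the runs of equal
 * adjacent cluster values in a shaped glyph sequence. Each run is one
 * cluster, provided the cluster values are monotonic. */
size_t
count_clusters (const unsigned int *clusters, size_t num_glyphs)
{
  size_t count = 0;

  for (size_t i = 0; i < num_glyphs; i++)
    {
      /* A new cluster starts at the first glyph and wherever the
       * cluster value differs from that of the previous glyph. */
      if (i == 0 || clusters[i] != clusters[i - 1])
        count++;
    }

  return count;
}
```

    <para>
      For an illustrative final sequence such as
      <literal>0,0,3,3</literal> (values assumed here for the sake of the
      example), the function reports two clusters. In a real client, the
      values would be read from the <literal>cluster</literal> member of
      each glyph info in the shaped buffer.
    </para>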
    <para>
      Although client programs using HarfBuzz are free to assign
      initial cluster values in any manner they choose, HarfBuzz
      does offer some useful guarantees if the cluster values are
      assigned in a monotonic (either non-decreasing or non-increasing)
      order.
    </para>
    <para>
      For left-to-right scripts (LTR) and top-to-bottom scripts (TTB),
      HarfBuzz will preserve the monotonic property: client programs
      are guaranteed that monotonically increasing initial cluster
      values will be returned as monotonically increasing final
      cluster values.
    </para>
    <para>
      For right-to-left scripts (RTL) and bottom-to-top scripts (BTT),
      the directionality of the buffer itself is reversed for final
      output as a matter of design. Therefore, HarfBuzz inverts the
      monotonic property: client programs are guaranteed that
      monotonically increasing initial cluster values will be
      returned as monotonically <emphasis>decreasing</emphasis> final
      cluster values.
    </para>
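    <para>
      A client program that depends on these guarantees may want to verify
      them on the shaped output. The following sketch classifies a
      sequence of final cluster values as non-decreasing, non-increasing,
      or neither. The function name <literal>cluster_direction</literal>
      and its return convention are assumptions made for illustration.
    </para>

```c
#include <stddef.h>

/* Hypothetical helper, not a HarfBuzz function.
 * Returns  1 if the cluster values never decrease (expected for LTR/TTB),
 *         -1 if they never increase but do decrease (expected for RTL/BTT),
 *          0 if they are not monotonic at all. */
int
cluster_direction (const unsigned int *clusters, size_t num_glyphs)
{
  int non_decreasing = 1;
  int non_increasing = 1;

  for (size_t i = 1; i < num_glyphs; i++)
    {
      if (clusters[i] < clusters[i - 1])
        non_decreasing = 0;
      if (clusters[i] > clusters[i - 1])
        non_increasing = 0;
    }

  if (non_decreasing)
    return 1;
  if (non_increasing)
    return -1;
  return 0;
}
```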
    <para>
      Client programs can adjust how HarfBuzz handles clusters during
      shaping by setting the
      <literal>cluster_level</literal> of the
      buffer. HarfBuzz offers three <emphasis>levels</emphasis> of
      clustering support for this property:
    </para>
    <itemizedlist>
      <listitem>
        <para><emphasis>Level 0</emphasis> is the default and
          reproduces the behavior of the old HarfBuzz library.
        </para>
        <para>
          The distinguishing feature of level 0 behavior is that, at
          the beginning of processing the buffer, all code points that
          are categorized as <emphasis>marks</emphasis>,
          <emphasis>modifier symbols</emphasis>, or
          <emphasis>Emoji extended pictographic</emphasis> modifiers,
          as well as the <emphasis>Zero Width Joiner</emphasis> and
          <emphasis>Zero Width Non-Joiner</emphasis> code points, are
          assigned the cluster value of the closest preceding code
          point from a <emphasis>different</emphasis> category.
        </para>
        <para>
          In essence, whenever a base character is followed by a mark
          character or a sequence of mark characters, those marks are
          reassigned to the same initial cluster value as the base
          character. This reassignment is referred to as
          "merging" the affected clusters. This behavior is based on
          the Grapheme Cluster Boundary specification in <ulink
          url="https://www.unicode.org/reports/tr29/#Regex_Definitions">Unicode
          Technical Report 29</ulink>.
        </para>
        <para>
          Client programs can specify level 0 behavior for a buffer by
          setting its <literal>cluster_level</literal> to
          <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES</literal>.
        </para>
      </listitem>
      <listitem>
        <para>
          <emphasis>Level 1</emphasis> tweaks the old behavior
          slightly to produce better results. Therefore, level 1
          clustering is recommended for code that is not required to
          implement backward compatibility with the old HarfBuzz.
        </para>
        <para>
          Level 1 differs from level 0 by not merging the
          clusters of marks and other modifier code points with the
          preceding "base" code point's cluster. By preserving the
          separate cluster values of these marks and modifier code
          points, script shapers can perform additional operations
          that might lead to improved results (for example, reordering
          a sequence of marks).
        </para>
        <para>
          Client programs can specify level 1 behavior for a buffer by
          setting its <literal>cluster_level</literal> to
          <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS</literal>.
        </para>
      </listitem>
      <listitem>
        <para>
          <emphasis>Level 2</emphasis> differs significantly in how it
          treats cluster values. In level 2, HarfBuzz never merges
          clusters.
        </para>
        <para>
          This difference can be seen most clearly when HarfBuzz processes
          ligature substitutions and glyph decompositions. In level 0
          and level 1, ligatures and glyph decomposition both involve
          merging clusters; in level 2, neither of these operations
          triggers a merge.
        </para>
        <para>
          Client programs can specify level 2 behavior for a buffer by
          setting its <literal>cluster_level</literal> to
          <literal>HB_BUFFER_CLUSTER_LEVEL_CHARACTERS</literal>.
        </para>
      </listitem>
    </itemizedlist>
    <para>
      As mentioned earlier, client programs using HarfBuzz often
      assign initial cluster values in a buffer by reusing the indices
      of the code points in the input text. This gives a sequence of
      cluster values that is monotonically increasing (for example,
      <literal>0,1,2,3,4</literal>).
    </para>
    <para>
      It is not <emphasis>required</emphasis> that the cluster values
      in a buffer be monotonically increasing. However, if the initial
      cluster values in a buffer are monotonic and the buffer is
      configured to use cluster level 0 or 1, then HarfBuzz
      guarantees that the final cluster values in the shaped buffer
      will also be monotonic. No such guarantee is made for cluster
      level 2.
    </para>
    <para>
      In levels 0 and 1, HarfBuzz implements the following conceptual
      model for cluster values:
    </para>
    <itemizedlist spacing="compact">
      <listitem>
        <para>
          If the sequence of input cluster values is monotonic, the
          sequence of cluster values will remain monotonic.
        </para>
      </listitem>
      <listitem>
        <para>
          Each cluster value represents a single cluster.
        </para>
      </listitem>
      <listitem>
        <para>
          Each cluster contains one or more glyphs and one or more
          characters.
        </para>
      </listitem>
    </itemizedlist>
    <para>
      In practice, this model offers several benefits. Assuming that
      the initial cluster values were monotonically increasing
      and distinct before shaping began, then, in the final output:
    </para>
    <itemizedlist spacing="compact">
      <listitem>
        <para>
          All adjacent glyphs having the same final cluster
          value belong to the same cluster.
        </para>
      </listitem>
      <listitem>
        <para>
          Each character belongs to the cluster that has the highest
          cluster value <emphasis>not larger than</emphasis> its
          initial cluster value.
        </para>
      </listitem>
    </itemizedlist>
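    <para>
      The second property can be sketched as a lookup from a character's
      initial cluster value to the final cluster that contains it. The
      helper below is a hypothetical illustration on a plain array, not a
      HarfBuzz API; it assumes the initial values were distinct and
      monotonically increasing, as stated above.
    </para>

```c
#include <stddef.h>

/* Hypothetical helper, not a HarfBuzz function: finds the final cluster
 * to which a character belongs, i.e. the highest final cluster value
 * that is not larger than the character's initial cluster value.
 * Returns 1 and stores the value in *result on success, or returns 0 if
 * every final cluster value is larger than initial_cluster. */
int
cluster_for_character (const unsigned int *final_clusters,
                       size_t              num_glyphs,
                       unsigned int        initial_cluster,
                       unsigned int       *result)
{
  int found = 0;

  /* A linear scan works for increasing and decreasing outputs alike. */
  for (size_t i = 0; i < num_glyphs; i++)
    {
      unsigned int c = final_clusters[i];
      if (c <= initial_cluster && (!found || c > *result))
        {
          *result = c;
          found = 1;
        }
    }

  return found;
}
```

    <para>
      With illustrative final values <literal>0,0,3,3</literal>, the
      character whose initial cluster value was 2 maps to cluster 0, and
      the character whose initial value was 4 maps to cluster 3.
    </para>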
  </section>
  <section id="a-clustering-example-for-levels-0-and-1">
    <title>A clustering example for levels 0 and 1</title>
    <para>
      The basic shaping operations affect clusters in a predictable
      manner when using level 0 or level 1:
    </para>
    <itemizedlist>
      <listitem>
        <para>
          When two or more clusters <emphasis>merge</emphasis>, the
          resulting merged cluster takes as its cluster value the
          <emphasis>minimum</emphasis> of the incoming cluster values.
        </para>
      </listitem>
      <listitem>
        <para>
          When a cluster <emphasis>decomposes</emphasis>, all of the
          resulting child clusters inherit as their cluster value the
          cluster value of the parent cluster.
        </para>
      </listitem>
      <listitem>
        <para>
          When a character is <emphasis>reordered</emphasis>, the
          reordered character and all clusters that the character
          moves past as part of the reordering are merged into one cluster.
        </para>
      </listitem>
    </itemizedlist>
    <para>
      The functionality, guarantees, and benefits of level 0 and level
      1 behavior can be seen with some examples. First, let us examine
      what happens with cluster values when shaping involves cluster
      merging with ligatures and decomposition.
    </para>
    <para>
      Let's say we start with the following character sequence (top row) and
      initial cluster values (bottom row):
    </para>
    <para>
      During shaping, HarfBuzz maps these characters to glyphs from
      the font. For simplicity, let us assume that each character maps
      to the corresponding, identical-looking glyph:
    </para>
    <para>
      Now if, for example, <literal>B</literal> and <literal>C</literal>
      form a ligature, then the clusters to which they belong
      "merge". This merged cluster takes for its cluster
      value the minimum of all the cluster values of the clusters that
      went into the ligature. In this case, we get:
    </para>
    <para>
      because 1 is the minimum of the set {1,2}, which were the
      cluster values of <literal>B</literal> and
      <literal>C</literal>.
    </para>
    <para>
      Next, let us say that the <literal>BC</literal> ligature glyph
      decomposes into three components, and <literal>D</literal> also
      decomposes into two components. Whenever a cluster decomposes,
      its components each inherit the cluster value of their parent:
    </para>
    <programlisting>
      A,BC0,BC1,BC2,D0,D1,E
    </programlisting>
    <para>
      Next, if <literal>BC2</literal> and <literal>D0</literal> form a
      ligature, then their clusters (cluster values 1 and 3) merge into
      <literal>min(1,3) = 1</literal>:
    </para>
    <para>
      Note that the entirety of cluster 3 merges into cluster 1, not
      just the <literal>D0</literal> glyph. This reflects the fact
      that the cluster <emphasis>must</emphasis> be treated as an
      indivisible unit.
    </para>
    <para>
      At this point, cluster 1 means: the character sequence
      <literal>BCD</literal> is represented by glyphs
      <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
      further.
    </para>
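    <para>
      The merge rule used in this walk-through can be modeled on a plain
      array of cluster values. The sketch below is a simplified
      illustration and not HarfBuzz source code; the function name
      <literal>merge_clusters</literal> and its half-open range
      convention are choices made for this example. It assigns the
      minimum cluster value to the requested glyph range, widened so that
      any cluster touching either edge of the range is merged in its
      entirety.
    </para>

```c
#include <stddef.h>

/* Simplified model (an assumption for illustration, not HarfBuzz code):
 * merges the clusters of the glyphs in the half-open range [start, end). */
void
merge_clusters (unsigned int *clusters, size_t num_glyphs,
                size_t start, size_t end)
{
  if (start >= end || end > num_glyphs)
    return;

  /* The merged cluster takes the minimum of the incoming values. */
  unsigned int min = clusters[start];
  for (size_t i = start + 1; i < end; i++)
    if (clusters[i] < min)
      min = clusters[i];

  /* Widen the range so that whole clusters are merged, not just the
   * glyphs that take part in the substitution. */
  while (end < num_glyphs && clusters[end] == clusters[end - 1])
    end++;
  while (start > 0 && clusters[start - 1] == clusters[start])
    start--;

  for (size_t i = start; i < end; i++)
    clusters[i] = min;
}
```

    <para>
      Assuming that <literal>A</literal> and <literal>E</literal> carry
      the cluster values 0 and 4, the glyphs
      <literal>A,BC0,BC1,BC2,D0,D1,E</literal> have the values
      <literal>0,1,1,1,3,3,4</literal> after decomposition. Merging the
      range that covers <literal>BC2</literal> and <literal>D0</literal>
      (indices 3 and 4) sets every glyph from <literal>BC0</literal>
      through <literal>D1</literal> to cluster 1, which matches the
      observation that all of cluster 3 is absorbed.
    </para>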
  </section>
  <section id="reordering-in-levels-0-and-1">
    <title>Reordering in levels 0 and 1</title>
    <para>
      Another common operation in the more complex shapers is glyph
      reordering. In order to maintain a monotonic cluster sequence
      when glyph reordering takes place, HarfBuzz merges the clusters
      of everything in the reordering sequence.
    </para>
    <para>
      For example, let us again start with the character sequence (top
      row) and initial cluster values (bottom row):
    </para>
    <para>
      If <literal>D</literal> is reordered to the position immediately
      before <literal>B</literal>, then HarfBuzz merges the
      <literal>B</literal>, <literal>C</literal>, and
      <literal>D</literal> clusters, that is, all the clusters between
      the final position of the reordered glyph and its original
      position. This means that we get:
    </para>
    <para>
      as the final cluster sequence.
    </para>
    <para>
      Merging this many clusters is not ideal, but it is the only
      sensible way for HarfBuzz to maintain the guarantee that the
      sequence of cluster values remains monotonic and to retain the
      true relationship between glyphs and characters.
    </para>
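    <para>
      A reordering with merge can be modeled in a few lines. The sketch
      below is an illustration on plain arrays, not HarfBuzz code: the
      function name <literal>move_glyph_earlier</literal> is
      hypothetical, glyphs are represented as single characters, and the
      clusters at the edges of the range are assumed not to extend
      beyond it.
    </para>

```c
#include <stddef.h>

/* Simplified model (an assumption for illustration, not HarfBuzz code):
 * moves the glyph at index `from` to the earlier index `to` and merges
 * the clusters of every glyph it passes, including its own. */
void
move_glyph_earlier (char *glyphs, unsigned int *clusters,
                    size_t from, size_t to)
{
  if (to >= from)
    return;

  char moved = glyphs[from];

  /* The merged cluster takes the minimum value found in [to, from]. */
  unsigned int min = clusters[to];
  for (size_t i = to + 1; i <= from; i++)
    if (clusters[i] < min)
      min = clusters[i];

  /* Shift the glyphs that are passed over one position to the right. */
  for (size_t i = from; i > to; i--)
    glyphs[i] = glyphs[i - 1];
  glyphs[to] = moved;

  for (size_t i = to; i <= from; i++)
    clusters[i] = min;
}
```

    <para>
      Assuming the initial cluster values <literal>0,1,2,3,4</literal>
      for <literal>A,B,C,D,E</literal>, moving <literal>D</literal>
      (index 3) to index 1 gives the glyph order
      <literal>A,D,B,C,E</literal> with the values
      <literal>0,1,1,1,4</literal>, which is still monotonic.
    </para>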
  </section>
  <section id="the-distinction-between-levels-0-and-1">
    <title>The distinction between levels 0 and 1</title>
    <para>
      The preceding examples demonstrate the main effects of using
      cluster levels 0 and 1. The only difference between the two
      levels is this: in level 0, at the very beginning of the shaping
      process, HarfBuzz merges the cluster of each base character
      with the clusters of all Unicode marks (combining or not) and
      modifiers that follow it.
    </para>
    <para>
      For example, let us start with the following character sequence
      (top row) and accompanying initial cluster values (bottom row):
    </para>
    <para>
      The <literal>acute</literal> is a Unicode mark. If HarfBuzz is
      using cluster level 0 on this sequence, then the
      <literal>A</literal> and <literal>acute</literal> clusters will
      merge, and the result will become:
    </para>
    <para>
      This merger is performed before any other script-shaping
      operations.
    </para>
    <para>
      This initial cluster merging is the default behavior of the
      Windows shaping engine, and the old HarfBuzz codebase copied
      that behavior to maintain compatibility. Consequently, it has
      remained the default behavior in the new HarfBuzz codebase.
    </para>
    <para>
      But this initial cluster-merging behavior makes it impossible
      for client programs to implement some features (such as
      coloring diacritic marks differently from their base
      characters). That is why, in level 1, HarfBuzz does not perform
      the initial merging step.
    </para>
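    <para>
      The initial merging step that level 0 performs and level 1 skips can
      be approximated as follows. This is a deliberately reduced model
      that only distinguishes marks from non-marks; the function name
      <literal>merge_marks_into_bases</literal> and the
      <literal>is_mark</literal> flag array are assumptions for
      illustration and do not reflect how HarfBuzz classifies code
      points internally.
    </para>

```c
#include <stddef.h>

/* Reduced model (an assumption for illustration, not HarfBuzz code):
 * gives every mark the cluster value of the closest preceding code
 * point that is not a mark. A mark at the very start of the run has no
 * preceding base and keeps its own value. */
void
merge_marks_into_bases (unsigned int *clusters, const int *is_mark,
                        size_t num_code_points)
{
  for (size_t i = 1; i < num_code_points; i++)
    {
      /* clusters[i - 1] already holds the base's value, even when the
       * previous code point is itself a mark. */
      if (is_mark[i])
        clusters[i] = clusters[i - 1];
    }
}
```

    <para>
      For the sequence <literal>A,acute,B</literal> with assumed initial
      values <literal>0,1,2</literal>, the model yields
      <literal>0,0,2</literal>. Under level 1, this step is simply not
      run, so the mark keeps its own cluster value until some later
      shaping operation merges it.
    </para>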
    <para>
      For client programs that rely on HarfBuzz cluster values to
      perform cursor positioning, level 0 is more convenient. But
      relying on cluster boundaries for cursor positioning is wrong: cursor
      positions should be determined based on Unicode grapheme
      boundaries, not on shaping-cluster boundaries. As such, using
      level 1 clustering behavior is recommended.
    </para>
    <para>
      One final facet of levels 0 and 1 is worth noting. HarfBuzz
      currently does not allow any
      <emphasis>multiple-substitution</emphasis> GSUB lookups to
      replace a glyph with zero glyphs (in other words, to delete a
      glyph).
    </para>
    <para>
      But, in some other situations, glyphs can be deleted. In
      those cases, if the glyph being deleted is the last glyph of its
      cluster, HarfBuzz makes sure to merge the deleted glyph's
      cluster with a neighboring cluster.
    </para>
    <para>
      This is done primarily to make sure that the starting cluster of the
      text always has the cluster index pointing to the start of the text
      for the run; more than one client program currently relies on this
      guarantee.
    </para>
    <para>
      Incidentally, Apple's CoreText does something different to
      maintain the same promise: it inserts a glyph with id 65535 at
      the beginning of the glyph string if the glyph corresponding to
      the first character in the run was deleted. HarfBuzz might do
      something similar in the future.
    </para>
  </section>
  <section id="level-2">
    <title>Level 2</title>
    <para>
      HarfBuzz's level 2 cluster behavior uses a significantly
      different model from that of level 0 and level 1.
    </para>
    <para>
      The level 2 behavior is easy to describe, but it may be
      difficult to understand in practical terms. In brief, level 2
      performs no merging of clusters whatsoever.
    </para>
    <para>
      This means that there is no initial base-and-mark merging step
      (as is done in level 0), and it means that reordering moves and
      ligature substitutions do not trigger a cluster merge.
    </para>
    <para>
      Only one shaping operation directly affects clusters when using
      level 2:
    </para>
    <itemizedlist>
      <listitem>
        <para>
          When a cluster <emphasis>decomposes</emphasis>, all of the
          resulting child clusters inherit as their cluster value the
          cluster value of the parent cluster.
        </para>
      </listitem>
    </itemizedlist>
    <para>
      When glyphs do form a ligature (or when some other feature
      substitutes multiple glyphs with one glyph), the cluster value
      of the first glyph is retained as the cluster value for the
      ligature.
    </para>
    <para>
      This occurrence sounds similar to a cluster merge, but it is
      different. In particular, no subsequent characters
      (including marks and modifiers) are affected. They retain
      their previous cluster values.
    </para>
    <para>
      Level 2 cluster behavior is ultimately less complex than level 0
      or level 1, but there are several cases for which processing
      cluster values produced at level 2 may be tricky.
    </para>
    <section id="ligatures-with-combining-marks-in-level-2">
      <title>Ligatures with combining marks in level 2</title>
      <para>
        The first example of how HarfBuzz's level 2 cluster behavior
        can be tricky is when the text to be shaped includes combining
        marks attached to ligatures.
      </para>
      <para>
        Let us start with an input sequence with the following
        characters (top row) and initial cluster values (bottom row):
      </para>
      <programlisting>
        A,acute,B,breve,C,circumflex
      </programlisting>
      <para>
        If the sequence <literal>A,B,C</literal> forms a ligature,
        then these are the cluster values HarfBuzz will return under
        the various cluster levels:
      </para>
      <para>
        <emphasis>Level 0:</emphasis>
      </para>
      <programlisting>
        ABC,acute,breve,circumflex
      </programlisting>
      <para>
        <emphasis>Level 1:</emphasis>
      </para>
      <programlisting>
        ABC,acute,breve,circumflex
      </programlisting>
      <para>
        <emphasis>Level 2:</emphasis>
      </para>
      <programlisting>
        ABC,acute,breve,circumflex
      </programlisting>
      <para>
        Making sense of the level 2 result is the hardest for a client
        program, because there is nothing in the cluster values that
        indicates that <literal>B</literal> and <literal>C</literal>
        formed a ligature with <literal>A</literal>.
      </para>
      <para>
        In contrast, the "merged" cluster values of the mark glyphs
        that are seen in the level 0 and level 1 output are evidence
        that a ligature substitution took place.
      </para>
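      <para>
        The level 2 rule for ligatures (the ligature keeps the cluster
        value of its first component, and nothing else changes) can be
        modeled on a plain array. The sketch below is a simplified
        illustration, not HarfBuzz code; the function name
        <literal>ligate_level2</literal> and the
        <literal>is_component</literal> flag array are hypothetical.
      </para>

```c
#include <stddef.h>

/* Simplified model (an assumption for illustration, not HarfBuzz code):
 * replaces the flagged components with a single ligature glyph at the
 * position of the first component. The ligature keeps that component's
 * cluster value; every other glyph keeps its own value. The array is
 * compacted in place and the new glyph count is returned. */
size_t
ligate_level2 (unsigned int *clusters, const int *is_component,
               size_t num_glyphs)
{
  size_t out = 0;
  int seen_first = 0;

  for (size_t i = 0; i < num_glyphs; i++)
    {
      if (is_component[i])
        {
          if (seen_first)
            continue; /* later components vanish without any merge */
          seen_first = 1;
        }
      clusters[out++] = clusters[i];
    }

  return out;
}
```

      <para>
        Assuming the input <literal>A,acute,B,breve,C,circumflex</literal>
        carries the initial values <literal>0,1,2,3,4,5</literal>, and
        that <literal>A</literal>, <literal>B</literal>, and
        <literal>C</literal> are the components, the model returns four
        glyphs with the values <literal>0,1,3,5</literal>. Nothing in
        that result reveals that the code points with values 2 and 4 were
        absorbed into the first glyph.
      </para>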
    </section>
    <section id="reordering-in-level-2">
      <title>Reordering in level 2</title>
      <para>
        Another example of how HarfBuzz's level 2 cluster behavior
        can be tricky is when glyphs reorder. Consider an input sequence
        with the following characters (top row) and initial cluster
        values (bottom row):
      </para>
      <para>
        Now imagine <literal>D</literal> moves before
        <literal>B</literal> in a reordering operation. The cluster
        values will then be:
      </para>
      <para>
        Next, if <literal>D</literal> forms a ligature with
        <literal>B</literal>, the output is:
      </para>
      <para>
        However, in a different scenario, in which the shaping rules
        of the script instead caused <literal>A</literal> and
        <literal>B</literal> to form a ligature
        <emphasis>before</emphasis> the <literal>D</literal> reordered, the
        result would be:
      </para>
      <para>
        There is no way for a client program to differentiate between
        these two scenarios based on the cluster values
        alone. Consequently, client programs that use level 2 might
        need to undertake additional work in order to manage cursor
        positioning, text attributes, or other desired features.
      </para>
    </section>
    <section id="other-considerations-in-level-2">
      <title>Other considerations in level 2</title>
      <para>
        There may be other problems encountered with ligatures under
        level 2, such as if the direction of the text is forced to
        the opposite of its natural direction (for example, Arabic text
        that is forced into left-to-right directionality). But,
        generally speaking, these other scenarios are minor corner
        cases that are too obscure for most client programs to need to
        worry about.
      </para>
    </section>
  </section>
</chapter>