+<?xml version="1.0"?>
+<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
+ "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
+ <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
+ <!ENTITY version SYSTEM "version.xml">
+]>
<chapter id="what-is-harfbuzz">
<title>What is HarfBuzz?</title>
<para>
- HarfBuzz is a <emphasis>text shaping engine</emphasis>. It solves
- the problem of selecting and positioning glyphs from a font given a
- Unicode string.
+ HarfBuzz is a <emphasis>text-shaping engine</emphasis>. If you
+ give HarfBuzz a font and a string containing a sequence of Unicode
+ codepoints, HarfBuzz selects and positions the corresponding
+ glyphs from the font, applying all of the necessary layout rules
+ and font features. HarfBuzz then returns the resulting sequence of
+ glyphs and their positions, correctly arranged for the language
+ and writing system.
</para>
- <section id="why-do-i-need-it">
- <title>Why do I need it?</title>
+ <para>
+ HarfBuzz can properly shape all of the world's major writing
+ systems. It runs on all major operating systems and software
+ platforms and it supports the major font formats in use
+ today.
+ </para>
+ <section id="what-is-text-shaping">
+ <title>What is text shaping?</title>
+ <para>
+ Text shaping is the process of translating a string of character
+ codes (such as Unicode codepoints) into a properly arranged
+ sequence of glyphs that can be rendered onto a screen or into
+ final output form for inclusion in a document.
+ </para>
+ <para>
+ The shaping process depends on the input string, the active
+ font, and the script (or writing system) and language that the
+ string is in.
+ </para>
+ <para>
+ Modern software systems generally only deal with strings in the
+ Unicode encoding scheme (although legacy systems and documents may
+ involve other encodings).
+ </para>
+ <para>
+ There are several font formats that a program might
+ encounter, each of which has a set of standard text-shaping
+ rules.
+ </para>
+ <para>The dominant format is <ulink
+ url="http://www.microsoft.com/typography/otspec/">OpenType</ulink>. The
+ OpenType specification defines a series of <ulink url="https://github.com/n8willis/opentype-shaping-documents">shaping models</ulink> for
+ various scripts from around the world. These shaping models depend on
+ the font incorporating certain features as
+ <emphasis>lookups</emphasis> in its <literal>GSUB</literal>
+ and <literal>GPOS</literal> tables.
+ </para>
<para>
- Text shaping is an integral part of preparing text for display. It
- is a fairly low level operation; HarfBuzz is used directly by
- graphic rendering libraries such as Pango, and the layout engines
- in Firefox, LibreOffice and Chromium. Unless you are
- <emphasis>writing</emphasis> one of these layout engines yourself,
- you will probably not need to use HarfBuzz - normally higher level
- libraries will turn text into glyphs for you.
+ Alternatively, OpenType fonts can include shaping features for
+ the <ulink url="https://graphite.sil.org/">Graphite</ulink> shaping model.
+ </para>
+ <para>
+ TrueType fonts can also include OpenType shaping
+ features. Alternatively, TrueType fonts can include <ulink url="https://developer.apple.com/fonts/TrueType-Reference-Manual/RM09/AppendixF.html">Apple
+ Advanced Typography</ulink> (AAT) tables to implement shaping
+ support. AAT fonts are generally only found on macOS and iOS systems.
+ </para>
+ <para>
+ Text strings will usually be tagged with a script and language
+ tag that provide the context needed to perform text shaping
+ correctly. The necessary <ulink
+ url="https://docs.microsoft.com/en-us/typography/opentype/spec/scripttags">script</ulink>
+ and <ulink
+ url="https://docs.microsoft.com/en-us/typography/opentype/spec/languagetags">language</ulink>
+ tags are defined by OpenType.
+ </para>
+ </section>
+
+ <section id="why-do-i-need-a-shaping-engine">
+ <title>Why do I need a shaping engine?</title>
+ <para>
+ Text shaping is an integral part of preparing text for
+ display. Before a Unicode sequence can be rendered, the
+ codepoints in the sequence must be mapped to the corresponding
+ glyphs provided in the font, and those glyphs must be positioned
+ correctly relative to each other. For many of the scripts
+ supported in Unicode, these steps involve script-specific layout
+ rules, including complex joining, reordering, and positioning
+ behavior. Implementing these rules is the job of the shaping engine.
+ </para>
+ <para>
+ Text shaping is a fairly low-level operation. HarfBuzz is
+ used directly by text-handling libraries like <ulink
+ url="https://www.pango.org/">Pango</ulink>, as well as by the layout
+ engines in Firefox, LibreOffice, and Chromium. Unless you are
+ <emphasis>writing</emphasis> one of these layout engines
+ yourself, you will probably not need to use HarfBuzz: normally,
+ a layout engine, toolkit, or other library will turn text into
+ glyphs for you.
</para>
<para>
However, if you <emphasis>are</emphasis> writing a layout engine
- or graphics library yourself, you will need to perform text
- shaping, and this is where HarfBuzz can help you. Here are some
- reasons why you need it:
+ or graphics library yourself, then you will need to perform text
+ shaping, and this is where HarfBuzz can help you.
+ </para>
+ <para>
+ Here are some specific scenarios where a text-shaping engine
+ like HarfBuzz helps you:
</para>
<itemizedlist>
<listitem>
<para>
- OpenType fonts contain a set of glyphs, indexed by glyph ID.
- The glyph ID within the font does not necessarily relate to a
- Unicode codepoint. For instance, some fonts have the letter
- "a" as glyph ID 1. To pull the right glyph out of
- the font in order to display it, you need to consult a table
- within the font (the "cmap" table) which maps
- Unicode codepoints to glyph IDs. Text shaping turns codepoints
- into glyph IDs.
+ OpenType fonts contain a set of glyphs (that is, shapes
+ to represent the letters, numbers, punctuation marks, and
+ all other symbols), which are indexed by a <literal>glyph ID</literal>.
+ </para>
+ <para>
+ A particular glyph ID within the font does not necessarily
+ correlate to a predictable Unicode codepoint. For instance,
+ some fonts have the letter "a" as glyph ID 1, but
+ many others do not. In order to retrieve the right glyph
+ from the font to display "a", you need to consult
+ the table inside the font (the <literal>cmap</literal>
+ table) that maps Unicode codepoints to glyph IDs. In other
+ words, <emphasis>text shaping turns codepoints into glyph
+ IDs</emphasis>.
</para>
</listitem>
<listitem>
<para>
Many OpenType fonts contain ligatures: combinations of
- characters which are rendered together. For instance, it's
- common for the <literal>fi</literal> combination to appear in
- print as the single ligature "fi". Whether you should
- render text as <literal>fi</literal> or "fi" does not
- depend on the input text, but on the capabilities of the font
- and the level of ligature application you wish to perform.
- Text shaping involves querying the font's ligature tables and
- determining what substitutions should be made.
+ characters that are rendered as a single unit. For instance,
+ it is common for the "f, i" letter
+ sequence to appear in print as the single ligature glyph
+ "fi".
+ </para>
+ <para>
+ Whether you should render an "f, i" sequence
+ as <literal>fi</literal> or as "fi" does not
+ depend on the input text. Instead, it depends on whether
+ or not the font includes an "fi" glyph and on the
+ level of ligature application you wish to perform. The font
+ and the amount of ligature application used are under your
+ control. In other words, <emphasis>text shaping involves
+ querying the font's ligature tables and determining what
+ substitutions should be made</emphasis>.
</para>
</listitem>
<listitem>
<para>
- While ligatures like "fi" are typographic
- refinements, some languages <emphasis>require</emphasis> such
+ While ligatures like "fi" are optional typographic
+ refinements, some languages <emphasis>require</emphasis> certain
substitutions to be made in order to display text correctly.
- In Tamil, when the letter "TTA" (ட) letter is
- followed by "U" (உ), the combination should appear
- as the single glyph "டு". The sequence of Unicode
- characters "டஉ" needs to be rendered as a single
- glyph from the font - text shaping chooses the correct glyph
- from the sequence of characters provided.
+ </para>
+ <para>
+ For example, in Tamil, when the letter "TTA" (ட)
+ is followed by "U" (உ), the pair
+ must be replaced by the single glyph "டு". The
+ sequence of Unicode characters "டஉ" needs to be
+ substituted with a single "டு" glyph from the
+ font.
+ </para>
+ <para>
+ But "டு" does not have a Unicode codepoint. To
+ find this glyph, you need to consult the table inside
+ the font (the <literal>GSUB</literal> table) that contains
+ substitution information. In other words, <emphasis>text shaping
+ chooses the correct glyph for a sequence of characters
+ provided</emphasis>.
</para>
</listitem>
<listitem>
<para>
- Similarly, each Arabic character has four different variants:
- within a font, there will be glyphs for the initial, medial,
- final, and isolated forms of each letter. Unicode only encodes
- one codepoint per character, and so a Unicode string will not
- tell you which glyph to use. Text shaping chooses the correct
- form of the letter and returns the correct glyph from the font
- that you need to render.
+ Similarly, each Arabic character has four different variants
+ corresponding to the different positions it might appear in
+ within a sequence. Inside a font, there will be separate
+ glyphs for the initial, medial, final, and isolated forms of
+ each letter, each at a different glyph ID.
+ </para>
+ <para>
+ Unicode only assigns one codepoint per character, so a
+ Unicode string will not tell you which glyph variant to use
+ for each character. To decide, you need to analyze the whole
+ string and determine the appropriate glyph for each character
+ based on its position. In other words, <emphasis>text
+ shaping chooses the correct form of the letter by its
+ position and returns the correct glyph from the font</emphasis>.
</para>
</listitem>
<listitem>
<para>
- Other languages have marks and accents which need to be
- rendered in certain positions around a base character. For
- instance, the Moldovan language has the Cyrillic letter
- "zhe" (ж) with a breve accent, like so: ӂ. Some
- fonts will contain this character as an individual glyph,
- whereas other fonts will not contain a zhe-with-breve glyph
- but expect the rendering engine to form the character by
- overlaying the two glyphs ж and ˘. Where you should draw the
- combining breve depends on the height of the preceding glyph.
- Again, for Arabic, the correct positioning of vowel marks
- depends on the height of the character on which you are
- placing the mark. Text shaping tells you whether you have a
- precomposed glyph within your font or if you need to compose a
- glyph yourself out of combining marks, and if so, where to
- position those marks.
+ Other languages involve marks and accents that need to be
+ rendered in specific positions relative to a base character. For
+ instance, the Moldovan language includes the Cyrillic letter
+ "zhe" (ж) with a breve accent, like so: "ӂ".
+ </para>
+ <para>
+ Some fonts will provide this character as a single
+ zhe-with-breve glyph, but other fonts will not and, instead,
+ will expect the rendering engine to form the character by
+ superimposing the separate "ж" and "˘"
+ glyphs.
+ </para>
+ <para>
+ But exactly where you should draw the breve depends on the
+ height and width of the preceding zhe glyph. To find the
+ right position, you need to consult the table inside
+ the font (the <literal>GPOS</literal> table) that contains
+ positioning information.
+ In other words, <emphasis>text shaping tells you whether you
+ have a precomposed glyph within your font or if you need to
+ compose a glyph yourself out of combining marks—and,
+ if so, where to position those marks.</emphasis>
+ </para>
+ </listitem>
+ </itemizedlist>
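+ <para>
+ The first two scenarios above can be sketched as a toy model.
+ The following Python sketch is only an illustration and does not
+ use the HarfBuzz API: the glyph IDs and the
+ <literal>CMAP</literal> and <literal>LIGATURES</literal> tables
+ are invented for this example, whereas a real font stores the
+ equivalent data in its binary <literal>cmap</literal> and
+ <literal>GSUB</literal> tables.
+ </para>
+ <programlisting language="python"><![CDATA[
```python
# Toy model of two shaping steps: cmap lookup, then ligature substitution.
# All glyph IDs and table contents below are invented for illustration;
# real fonts store this data in binary cmap and GSUB tables.

# Hypothetical cmap: Unicode codepoint -> glyph ID.
CMAP = {ord("f"): 71, ord("i"): 74, ord("n"): 79}

# Hypothetical ligature table: pair of glyph IDs -> single glyph ID.
LIGATURES = {(71, 74): 312}  # f + i -> "fi" ligature glyph

NOTDEF = 0  # glyph ID 0 is conventionally the "missing glyph"


def map_codepoints(text):
    """Map each character to a glyph ID via the cmap (0 if unmapped)."""
    return [CMAP.get(ord(ch), NOTDEF) for ch in text]


def apply_ligatures(glyphs, enabled=True):
    """Replace known glyph pairs with their ligature glyph, left to right."""
    if not enabled:
        return list(glyphs)
    out = []
    i = 0
    while i != len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in LIGATURES:
            out.append(LIGATURES[pair])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out


glyphs = map_codepoints("fin")
print(glyphs)                          # [71, 74, 79]
print(apply_ligatures(glyphs))         # [312, 79]
print(apply_ligatures(glyphs, False))  # [71, 74, 79]
```
+ ]]></programlisting>
+ <para>
+ The same input string yields different glyph sequences depending
+ on the font's tables and on whether ligatures are enabled, which
+ is why the decision cannot be made from the text alone.
+ </para>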
+ <para>
+ If tasks like these are something that you need to do, then you
+ need a text-shaping engine. You could use Uniscribe if you are
+ writing Windows software; you could use CoreText on macOS; or
+ you could use HarfBuzz.
+ </para>
+ <note>
+ <para>
+ In the rest of this manual, the text will assume that the reader
+ is the implementor of a text-layout engine.
+ </para>
+ </note>
+ </section>
+
+
+ <section id="what-does-harfbuzz-do">
+ <title>What does HarfBuzz do?</title>
+ <para>
+ HarfBuzz provides text shaping through a cross-platform
+ C API that accepts sequences of Unicode codepoints as input. Currently,
+ the following OpenType shaping models are supported:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Indic (covering Devanagari, Bengali, Gujarati,
+ Gurmukhi, Kannada, Malayalam, Oriya, Tamil, Telugu, and
+ Sinhala)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Arabic (covering Arabic, N'Ko, Syriac, and Mongolian)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Thai and Lao
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Khmer
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Myanmar
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Tibetan
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Hangul
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Hebrew
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ The Universal Shaping Engine or <emphasis>USE</emphasis>
+ (covering complex scripts not covered by the above shaping
+ models)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ A default shaping model for non-complex scripts
+ (covering Latin, Cyrillic, Greek, Armenian, Georgian, Tifinagh,
+ and many others)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Emoji (including emoji modifier sequences, flag sequences,
+ and ZWJ sequences)
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ <para>
+ In addition to OpenType shaping, HarfBuzz supports the latest
+ version of Graphite shaping (the "Graphite 2" model) and AAT
+ shaping.
+ </para>
+
+ <para>
+ HarfBuzz can read and understand TrueType fonts (.ttf), TrueType
+ collections (.ttc), and OpenType fonts (.otf, including those
+ fonts that contain TrueType-style outlines and those that
+ contain PostScript CFF or CFF2 outlines).
+ </para>
+
+ <para>
+ HarfBuzz is designed and tested to run on top of the FreeType
+ font renderer. It can run on Linux, Android, Windows, macOS, and
+ iOS systems.
+ </para>
+
+ <para>
+ In addition to its core shaping functionality, HarfBuzz provides
+ functions for accessing other font features, including optional
+ GSUB and GPOS OpenType features, as well as
+ all color-font formats (<literal>CBDT</literal>,
+ <literal>sbix</literal>, <literal>COLR/CPAL</literal>, and
+ <literal>SVG-OT</literal>) and OpenType variable fonts. HarfBuzz
+ also includes a font-subsetting feature. HarfBuzz can perform
+ some low-level math-shaping operations, although it does not
+ currently perform full shaping for mathematical typesetting.
+ </para>
+
+ <para>
+ A suite of command-line utilities is also provided in the
+ source-code tree, designed to help users test and debug
+ HarfBuzz's features on real-world fonts and input.
+ </para>
+ </section>
+
+ <section id="what-harfbuzz-doesnt-do">
+ <title>What HarfBuzz doesn't do</title>
+ <para>
+ HarfBuzz will take a Unicode string, shape it, and give you the
+ information required to lay it out correctly on a single
+ horizontal (or vertical) line using the font provided. That is the
+ extent of HarfBuzz's responsibility.
+ </para>
+ <para>
+ It is important to note that if you are implementing a complete
+ text-layout engine you may have other responsibilities that
+ HarfBuzz will <emphasis>not</emphasis> help you with. For example:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ HarfBuzz won't help you with bidirectionality. If you want to
+ lay out text that includes a mix of Hebrew and English, you
+ will need to ensure that each buffer provided to HarfBuzz
+ has all of its characters in the same order and that the
+ directionality of the buffer is set correctly. This may mean
+ segmenting the text before it is placed into HarfBuzz buffers. For
+ example, the user might hit the keys in the following
+ sequence:
+ </para>
+ <programlisting>
+ A B C [space] ג ב א [space] D E F
+ </programlisting>
+ <para>
+ but will expect to see in the output:
+ </para>
+ <programlisting>
+ ABC אבג DEF
+ </programlisting>
+ <para>
+ This reordering is called <emphasis>bidi processing</emphasis>
+ ("bidi" is short for bidirectional). The Unicode
+ Bidirectional Algorithm, published as Annex #9 of the Unicode
+ Standard, specifies how to process a string of mixed directionality.
+ Before sending your string to HarfBuzz, you may need to apply the
+ bidi algorithm to it. Libraries such as <ulink
+ url="http://icu-project.org/">ICU</ulink> and <ulink
+ url="http://fribidi.org/">fribidi</ulink> can do this for you.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ HarfBuzz won't help you with text that contains different font
+ properties. For instance, if you have the string "a
+ <emphasis>huge</emphasis> breakfast", and you expect
+ "huge" to be italic, then you will need to send three
+ strings to HarfBuzz: <literal>a</literal>, in your Roman font;
+ <literal>huge</literal> using your italic font; and
+ <literal>breakfast</literal> using your Roman font again.
+ </para>
+ <para>
+ Similarly, if you change the font, font size, script,
+ language, or direction within your string, then you will
+ need to shape each run independently and output them
+ independently. HarfBuzz expects to shape a run of characters
+ that all share the same properties.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ HarfBuzz won't help you with line breaking, hyphenation, or
+ justification. As mentioned above, HarfBuzz lays out the string
+ along a <emphasis>single line</emphasis> of, notionally,
+ infinite length. If you want to find out where the potential
+ word, sentence and line break points are in your text, you
+ could use the ICU library's break iterator functions.
+ </para>
+ <para>
+ HarfBuzz can tell you how wide a shaped piece of text is, which is
+ useful input to a justification algorithm, but it knows nothing
+ about paragraphs, lines or line lengths. Nor will it adjust the
+ space between words to fit them proportionally into a line.
</para>
</listitem>
</itemizedlist>
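+ <para>
+ The segmentation into same-direction runs described above can be
+ sketched with Python's standard <literal>unicodedata</literal>
+ module. This is a deliberately simplified illustration, not the
+ Unicode Bidirectional Algorithm and not part of HarfBuzz: the
+ function name <literal>split_directional_runs</literal> is
+ hypothetical, and characters with no strong direction simply join
+ the current run. A real layout engine should rely on a library
+ such as ICU or fribidi instead.
+ </para>
+ <programlisting language="python"><![CDATA[
```python
import unicodedata

RTL_CLASSES = {"R", "AL"}   # strong right-to-left bidi classes
LTR_CLASSES = {"L"}         # strong left-to-right bidi class


def split_directional_runs(text, base="ltr"):
    """Split text into (direction, substring) runs.

    Characters without a strong direction (spaces, punctuation, digits)
    simply join the current run. This is far simpler than the Unicode
    Bidirectional Algorithm and is meant only as an illustration.
    """
    runs = []
    current_dir = base
    current = ""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            ch_dir = "rtl"
        elif bidi_class in LTR_CLASSES:
            ch_dir = "ltr"
        else:
            ch_dir = current_dir  # neutral: stay in the current run
        if ch_dir != current_dir and current:
            runs.append((current_dir, current))
            current = ""
        current_dir = ch_dir
        current += ch
    if current:
        runs.append((current_dir, current))
    return runs


# Logical (typing) order: Latin, then Hebrew, then Latin again.
for direction, run in split_directional_runs("ABC \u05d0\u05d1\u05d2 DEF"):
    print(direction, repr(run))
```
+ ]]></programlisting>
+ <para>
+ Each resulting run would then be placed in its own HarfBuzz
+ buffer with the matching direction set, and shaped independently.
+ </para>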
<para>
- If this is something that you need to do, then you need a text
- shaping engine: you could use Uniscribe if you are using Windows;
- you could use CoreText on OS X; or you could use HarfBuzz. In the
- rest of this manual, we are going to assume that you are the
- implementor of a text layout engine.
+ If you are a layout-engine implementor, HarfBuzz will help you with
+ the interface between your text and your font, and that's something
+ that you'll need. What you then do with the glyphs that your font
+ returns is up to you.
</para>
</section>
+
<section id="why-is-it-called-harfbuzz">
<title>Why is it called HarfBuzz?</title>
<para>
- HarfBuzz began its life as text shaping code within the FreeType
- project, (and you will see references to the FreeType authors
- within the source code copyright declarations) but was then
- abstracted out to its own project. This project is maintained by
- Behdad Esfahbod, and named HarfBuzz. Originally, it was a shaping
- engine for OpenType fonts - "HarfBuzz" is the Persian
- for "open type".
+ HarfBuzz began its life as text-shaping code within the FreeType
+ project (and you will see references to the FreeType authors
+ within the source code copyright declarations), but was then
+ extracted into its own project. This project is maintained by
+ Behdad Esfahbod, who named it HarfBuzz. Originally, it was a
+ shaping engine for OpenType fonts—"HarfBuzz" is
+ the Persian for "open type".
</para>
</section>
-</chapter>
\ No newline at end of file
+</chapter>