<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi  CDATA  #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="buffers-language-script-and-direction">
  <title>Buffers, language, script and direction</title>
  <para>
    The input to the HarfBuzz shaper is a series of Unicode characters, stored in a
    buffer. In this chapter, we'll look at how to set up a buffer with
    the text that we want and how to customize the properties of the
    buffer. We'll also look at a piece of lower-level machinery that
    you will need to understand before proceeding: the functions that
    HarfBuzz uses to retrieve Unicode information.
  </para>
  <para>
    After shaping is complete, HarfBuzz puts its output back
    into the buffer. But getting that output requires setting up a
    face and a font first, so we will look at that in the next chapter
    instead of here.
  </para>
  <section id="creating-and-destroying-buffers">
    <title>Creating and destroying buffers</title>
    <para>
      As we saw in our <emphasis>Getting Started</emphasis> example, a
      buffer is created and
      initialized with <function>hb_buffer_create()</function>. This
      produces a new, empty buffer object, instantiated with some
      default values and ready to accept your Unicode strings.
    </para>
    <para>
      HarfBuzz manages the memory of objects (such as buffers) that it
      creates, so you don't have to. When you have finished working on
      a buffer, you can call <function>hb_buffer_destroy()</function>:
    </para>
    <programlisting language="C">
      hb_buffer_t *buf = hb_buffer_create();
      ...
      hb_buffer_destroy(buf);
    </programlisting>
    <para>
      This will destroy the object and free its associated memory,
      unless some other part of the program holds a reference to this
      buffer. If you acquire a HarfBuzz buffer from another subsystem
      and want to ensure that it is not freed when someone else
      destroys it, you should increase its reference count:
    </para>
    <programlisting language="C">
      void somefunc(hb_buffer_t *buf) {
        buf = hb_buffer_reference(buf);
        ...
    </programlisting>
    <para>
      And then decrease it once you're done with it:
    </para>
    <programlisting language="C">
        hb_buffer_destroy(buf);
      }
    </programlisting>
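    <para>
      As a minimal, self-contained sketch of this pattern, the
      example below keeps a second reference alive in a hypothetical
      subsystem. The names <function>keep_buffer()</function>,
      <function>release_buffer()</function>, and
      <function>refcount_demo()</function> are invented for
      illustration, and the user-data callback is only there to make
      the moment of destruction observable:
    </para>
    <programlisting language="C"><![CDATA[
      #include <stddef.h>
      #include <hb.h>

      static hb_user_data_key_t freed_key; /* its address identifies the user data */
      static hb_buffer_t *kept;            /* reference held by the hypothetical subsystem */

      /* Called by HarfBuzz when the buffer is actually freed. */
      static void mark_freed (void *data) { *(int *) data = 1; }

      void keep_buffer (hb_buffer_t *buf) { kept = hb_buffer_reference (buf); }
      void release_buffer (void) { hb_buffer_destroy (kept); kept = NULL; }

      /* Returns 1 if the buffer survived the first destroy call and was
       * freed by the second one, 0 otherwise. */
      int refcount_demo (void)
      {
        int freed = 0, freed_early;
        hb_buffer_t *buf = hb_buffer_create ();

        hb_buffer_set_user_data (buf, &freed_key, &freed, mark_freed, 1);
        keep_buffer (buf);        /* two references now exist */
        hb_buffer_destroy (buf);  /* drops ours; the subsystem's remains */
        freed_early = freed;
        release_buffer ();        /* last reference gone, buffer is freed */
        return !freed_early && freed;
      }
    ]]></programlisting>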
    <para>
      While we are on the subject of reference-counting buffers, it is
      worth noting that an individual buffer can only meaningfully be
      used by one thread at a time.
    </para>
    <para>
      To throw away all the data in your buffer and start from scratch,
      call <function>hb_buffer_reset(buf)</function>. If you want to
      throw away the string in the buffer but keep settings such as the
      Unicode functions and the replacement code point, you can
      instead call <function>hb_buffer_clear_contents(buf)</function>.
    </para>
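    <para>
      The difference between the two calls can be sketched as
      follows. The helper
      <function>replacement_after_emptying()</function> is hypothetical,
      and the replacement code point (covered later in this chapter)
      merely stands in for a setting that
      <function>hb_buffer_clear_contents()</function> is expected to
      leave alone:
    </para>
    <programlisting language="C"><![CDATA[
      #include <hb.h>

      /* Hypothetical helper: fills a buffer, empties it with either
       * hb_buffer_clear_contents() (full_reset == 0) or hb_buffer_reset()
       * (full_reset != 0), and returns the replacement code point that
       * the buffer carries afterwards. */
      hb_codepoint_t replacement_after_emptying (int full_reset)
      {
        hb_buffer_t *buf = hb_buffer_create ();
        hb_codepoint_t result;

        hb_buffer_set_replacement_codepoint (buf, 0x003F); /* '?' */
        hb_buffer_add_utf8 (buf, "hello", -1, 0, -1);

        if (full_reset)
          hb_buffer_reset (buf);          /* text and settings are discarded */
        else
          hb_buffer_clear_contents (buf); /* text goes, replacement code point stays */

        result = hb_buffer_get_replacement_codepoint (buf);
        hb_buffer_destroy (buf);
        return result;
      }
    ]]></programlisting>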
  </section>

  <section id="adding-text-to-the-buffer">
    <title>Adding text to the buffer</title>
    <para>
      Now we have a brand new HarfBuzz buffer. Let's start filling it
      with text! From HarfBuzz's perspective, a buffer is just a stream
      of Unicode code points, but your input string is probably in one of
      the standard Unicode character encodings (UTF-8, UTF-16, or
      UTF-32). HarfBuzz provides convenience functions that accept
      each of these encodings:
      <function>hb_buffer_add_utf8()</function>,
      <function>hb_buffer_add_utf16()</function>, and
      <function>hb_buffer_add_utf32()</function>. Other than the
      character encoding they accept, they function identically.
    </para>
    <para>
      You can add UTF-8 text to a buffer by passing in the text array,
      the array's length, an offset into the array for the first
      character to add, and the length of the segment to add:
    </para>
    <programlisting language="C">
      void
      hb_buffer_add_utf8 (hb_buffer_t *buf,
                          const char *text,
                          int text_length,
                          unsigned int item_offset,
                          int item_length);
    </programlisting>
    <para>
      So, in practice, you can say:
    </para>
    <programlisting language="C">
      hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text));
    </programlisting>
    <para>
      This will append your new characters to
      <parameter>buf</parameter>, not replace its existing
      contents. Also, note that you can use <literal>-1</literal> in
      place of the first instance of <function>strlen(text)</function>
      if your text array is NUL-terminated. Similarly, you can also use
      <literal>-1</literal> as the final argument if you want to add
      its full contents.
    </para>
    <para>
      Whatever <parameter>item_offset</parameter> and
      <parameter>item_length</parameter> you provide, HarfBuzz will also
      attempt to grab the five characters <emphasis>before</emphasis>
      the offset point and the five characters
      <emphasis>after</emphasis> the designated end. These are the
      before and after "context" segments, which HarfBuzz uses
      internally to make shaping decisions. They will not be part of
      the final output, but they ensure that HarfBuzz's
      script-specific shaping operations are correct. If there are
      fewer than five characters available for the before or after
      contexts, HarfBuzz will just grab what is there.
    </para>
    <para>
      For longer text runs, such as full paragraphs, it might be
      tempting to only add smaller sub-segments to a buffer and
      shape them in piecemeal fashion. This is generally not a good
      idea, because many shaping decisions depend on this context
      information. For example, in Arabic
      and other connected scripts, HarfBuzz needs to know the code
      points before and after each character in order to correctly
      determine which glyph to return.
    </para>
    <para>
      The safest approach is to add all of the text available, then
      use <parameter>item_offset</parameter> and
      <parameter>item_length</parameter> to indicate which characters you
      want shaped, so that HarfBuzz has access to any context.
    </para>
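    <para>
      A minimal sketch of that approach, assuming the whole paragraph
      is available as a NUL-terminated UTF-8 string. The helper name
      <function>add_run_with_context()</function> is hypothetical:
    </para>
    <programlisting language="C"><![CDATA[
      #include <hb.h>

      /* Hypothetical helper: adds the bytes [offset, offset + length) of a
       * NUL-terminated UTF-8 paragraph to buf. Passing -1 as text_length
       * tells HarfBuzz that the text runs to the terminating NUL, so the
       * characters around the run are available as context. */
      void add_run_with_context (hb_buffer_t *buf, const char *paragraph,
                                 unsigned int offset, int length)
      {
        hb_buffer_add_utf8 (buf, paragraph, -1, offset, length);
      }
    ]]></programlisting>
    <para>
      For instance, calling this helper with the paragraph
      <literal>"hello world"</literal>, an offset of 6, and a length
      of 5 adds only the five characters of
      <literal>world</literal> to the buffer, while the preceding
      characters remain visible to HarfBuzz as context.
    </para>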
    <para>
      You can also add Unicode code points directly with
      <function>hb_buffer_add_codepoints()</function>. The arguments
      to this function are the same as those for the UTF
      functions, except that the text is an array of
      <type>hb_codepoint_t</type> values. It is particularly important
      to note that <function>hb_buffer_add_codepoints()</function> does
      not do validity checking on the code points it is given. The UTF
      functions replace invalid input with the replacement code point,
      but with <function>hb_buffer_add_codepoints()</function> it is up
      to you to do any sanity checking necessary.
    </para>
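    <para>
      The following sketch illustrates the difference in validation
      by feeding a lone surrogate (<literal>U+D800</literal>, which is
      not a valid Unicode scalar value) through both entry points. The
      helper <function>stored_code_point()</function> is hypothetical
      and exists only for this illustration:
    </para>
    <programlisting language="C"><![CDATA[
      #include <stdint.h>
      #include <hb.h>

      /* Hypothetical helper: adds a lone surrogate and returns the code
       * point that ends up in the buffer. validate != 0 goes through
       * hb_buffer_add_utf32(), otherwise hb_buffer_add_codepoints(). */
      hb_codepoint_t stored_code_point (int validate)
      {
        const uint32_t text[1] = { 0xD800 };
        hb_buffer_t *buf = hb_buffer_create ();
        unsigned int count = 0;
        hb_glyph_info_t *infos;
        hb_codepoint_t result;

        if (validate)
          hb_buffer_add_utf32 (buf, text, 1, 0, 1);      /* input is checked */
        else
          hb_buffer_add_codepoints (buf, text, 1, 0, 1); /* input is trusted */

        infos = hb_buffer_get_glyph_infos (buf, &count);
        result = count ? infos[0].codepoint : 0;
        hb_buffer_destroy (buf);
        return result;
      }
    ]]></programlisting>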

  </section>

  <section id="setting-buffer-properties">
    <title>Setting buffer properties</title>
    <para>
      Buffers containing input characters still need several
      properties set before HarfBuzz can shape their text correctly.
    </para>
    <para>
      Initially, all buffers are set to the
      <literal>HB_BUFFER_CONTENT_TYPE_INVALID</literal> content
      type. After adding text, the buffer should be set to
      <literal>HB_BUFFER_CONTENT_TYPE_UNICODE</literal> instead, which
      indicates that it contains un-shaped input
      characters. After shaping, the buffer will have the
      <literal>HB_BUFFER_CONTENT_TYPE_GLYPHS</literal> content type.
    </para>
    <para>
      <function>hb_buffer_add_utf8()</function> and the
      other UTF functions set the content type of their buffer
      automatically. But if you are reusing a buffer you may want to
      check its state with
      <function>hb_buffer_get_content_type(buffer)</function>. If
      necessary you can set the content type with
    </para>
    <programlisting language="C">
      hb_buffer_set_content_type(buf, HB_BUFFER_CONTENT_TYPE_UNICODE);
    </programlisting>
    <para>
      to prepare for shaping.
    </para>
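    <para>
      As a hedged illustration of when the explicit call can matter,
      the sketch below appends code points one at a time with
      <function>hb_buffer_add()</function>, a function this chapter
      does not otherwise cover and which may leave the content type
      unset. The helper name <function>add_one_by_one()</function> is
      hypothetical:
    </para>
    <programlisting language="C"><![CDATA[
      #include <hb.h>

      /* Hypothetical helper: appends n code points individually, using the
       * array index as the cluster value, then makes sure the buffer is
       * marked as containing un-shaped Unicode input. */
      void add_one_by_one (hb_buffer_t *buf, const hb_codepoint_t *cps,
                           unsigned int n)
      {
        unsigned int i;

        for (i = 0; i < n; i++)
          hb_buffer_add (buf, cps[i], i);

        if (hb_buffer_get_content_type (buf) != HB_BUFFER_CONTENT_TYPE_UNICODE)
          hb_buffer_set_content_type (buf, HB_BUFFER_CONTENT_TYPE_UNICODE);
      }
    ]]></programlisting>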
    <para>
      Buffers also need to carry information about the script,
      language, and text direction of their contents. You can set
      these properties individually:
    </para>
    <programlisting language="C">
      hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
      hb_buffer_set_script(buf, HB_SCRIPT_LATIN);
      hb_buffer_set_language(buf, hb_language_from_string("en", -1));
    </programlisting>
    <para>
      However, since these properties are often repeated for
      multiple text runs, you can also save them in a
      <literal>hb_segment_properties_t</literal> for reuse:
    </para>
    <programlisting language="C">
      hb_segment_properties_t savedprops;
      hb_buffer_get_segment_properties (buf, &amp;savedprops);
      ...
      hb_buffer_set_segment_properties (buf2, &amp;savedprops);
    </programlisting>
    <para>
      HarfBuzz also provides getter functions to retrieve a buffer's
      direction, script, and language properties individually.
    </para>
    <para>
      HarfBuzz recognizes four text directions in
      <type>hb_direction_t</type>: left-to-right
      (<literal>HB_DIRECTION_LTR</literal>), right-to-left (<literal>HB_DIRECTION_RTL</literal>),
      top-to-bottom (<literal>HB_DIRECTION_TTB</literal>), and
      bottom-to-top (<literal>HB_DIRECTION_BTT</literal>).  For the
      script property, HarfBuzz uses identifiers based on the
      <ulink
      url="https://unicode.org/iso15924/">ISO 15924
      standard</ulink>. For languages, HarfBuzz uses tags based on the
      <ulink url="https://tools.ietf.org/html/bcp47">IETF BCP 47</ulink> standard.
    </para>
    <para>
      Helper functions are provided to convert character strings into
      the necessary script and language tag types.
    </para>
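    <para>
      A short sketch of those helpers in use, assuming the direction,
      script, and language arrive as strings (for example from a
      configuration file). The helper name
      <function>configure_from_strings()</function> is hypothetical:
    </para>
    <programlisting language="C"><![CDATA[
      #include <hb.h>

      /* Hypothetical helper: sets direction, script, and language from
       * strings such as "rtl", "Arab" (ISO 15924), and "ar" (BCP 47).
       * A length of -1 means the strings are NUL-terminated. */
      void configure_from_strings (hb_buffer_t *buf, const char *direction,
                                   const char *script, const char *language)
      {
        hb_buffer_set_direction (buf, hb_direction_from_string (direction, -1));
        hb_buffer_set_script (buf, hb_script_from_string (script, -1));
        hb_buffer_set_language (buf, hb_language_from_string (language, -1));
      }
    ]]></programlisting>
    <para>
      The values can then be read back with
      <function>hb_buffer_get_direction()</function>,
      <function>hb_buffer_get_script()</function>, and
      <function>hb_buffer_get_language()</function>.
    </para>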
    <para>
      Two additional buffer properties to be aware of are the
      "invisible glyph" and the replacement code point. The
      replacement code point is inserted into buffer output in place of
      any invalid code points encountered in the input. By default, it
      is the Unicode <literal>REPLACEMENT CHARACTER</literal> code
      point, <literal>U+FFFD</literal> "&#xFFFD;". You can change this with
    </para>
    <programlisting language="C">
      hb_buffer_set_replacement_codepoint(buf, replacement);
    </programlisting>
    <para>
      passing in the replacement Unicode code point as the
      <parameter>replacement</parameter> parameter.
    </para>
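    <para>
      To make the effect concrete, the following sketch adds UTF-8
      input that may contain invalid bytes and counts how many buffer
      items ended up as the chosen replacement. The helper name
      <function>count_replaced()</function> is hypothetical:
    </para>
    <programlisting language="C"><![CDATA[
      #include <hb.h>

      /* Hypothetical helper: adds len bytes of UTF-8 to a fresh buffer that
       * uses the given replacement code point, and returns how many items
       * in the buffer are equal to that replacement. */
      unsigned int count_replaced (const char *bytes, int len,
                                   hb_codepoint_t replacement)
      {
        hb_buffer_t *buf = hb_buffer_create ();
        unsigned int count = 0, i, n = 0;
        hb_glyph_info_t *infos;

        hb_buffer_set_replacement_codepoint (buf, replacement);
        hb_buffer_add_utf8 (buf, bytes, len, 0, len);

        infos = hb_buffer_get_glyph_infos (buf, &count);
        for (i = 0; i < count; i++)
          if (infos[i].codepoint == replacement)
            n++;

        hb_buffer_destroy (buf);
        return n;
      }
    ]]></programlisting>
    <para>
      Note that the count also includes any occurrence of the
      replacement code point that was already present in valid input,
      so a rarely used code point makes the result easier to interpret.
    </para>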
    <para>
      The invisible glyph is used to replace all output glyphs that
      are invisible. By default, the standard space character
      <literal>U+0020</literal> is used; you can replace this (for
      example, when using a font that provides script-specific
      spaces) with
    </para>
    <programlisting language="C">
      hb_buffer_set_invisible_glyph(buf, replacement_glyph);
    </programlisting>
    <para>
      Do note that in the <parameter>replacement_glyph</parameter>
      parameter, you must provide the glyph ID of the replacement you
      wish to use, not the Unicode code point.
    </para>
    <para>
      HarfBuzz supports a few additional flags you might want to set
      on your buffer under certain circumstances. The
      <literal>HB_BUFFER_FLAG_BOT</literal> and
      <literal>HB_BUFFER_FLAG_EOT</literal> flags tell HarfBuzz
      that the buffer represents the beginning or end (respectively)
      of a text element (such as a paragraph or other block). Knowing
      this allows HarfBuzz to apply certain contextual font features
      when shaping, such as initial or final variants in connected
      scripts.
    </para>
    <para>
      <literal>HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES</literal>
      tells HarfBuzz not to hide glyphs with the
      <literal>Default_Ignorable</literal> property in Unicode. This
      property designates control characters and other non-printing
      code points, such as joiners and variation selectors. Normally
      HarfBuzz replaces them in the output buffer with zero-width
      space glyphs (using the "invisible glyph" property discussed
      above); setting this flag causes them to be printed, which can
      be helpful for troubleshooting.
    </para>
    <para>
      Conversely, setting the
      <literal>HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES</literal> flag
      tells HarfBuzz to remove <literal>Default_Ignorable</literal>
      glyphs from the output buffer entirely. Finally, setting the
      <literal>HB_BUFFER_FLAG_DO_NOT_INSERT_DOTTED_CIRCLE</literal>
      flag tells HarfBuzz not to insert the dotted-circle glyph
      (<literal>U+25CC</literal>, "&#x25CC;"), which is normally
      inserted into buffer output when broken character sequences are
      encountered (such as combining marks that are not attached to a
      base character).
    </para>
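    <para>
      The text above names the flags but not the calls that apply
      them. As a sketch, flags can be combined with a bitwise OR and
      applied with <function>hb_buffer_set_flags()</function>; the
      helper name <function>mark_whole_paragraph()</function> is
      hypothetical:
    </para>
    <programlisting language="C"><![CDATA[
      #include <hb.h>

      /* Hypothetical helper: marks buf as holding a complete paragraph by
       * adding the BOT and EOT flags to whatever flags are already set. */
      void mark_whole_paragraph (hb_buffer_t *buf)
      {
        unsigned int flags = (unsigned int) hb_buffer_get_flags (buf);

        flags |= HB_BUFFER_FLAG_BOT | HB_BUFFER_FLAG_EOT;
        hb_buffer_set_flags (buf, (hb_buffer_flags_t) flags);
      }
    ]]></programlisting>
    <para>
      Reading the current flags first matters because
      <function>hb_buffer_set_flags()</function> replaces the whole
      flag set rather than adding to it.
    </para>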
  </section>

  <section id="customizing-unicode-functions">
    <title>Customizing Unicode functions</title>
    <para>
      HarfBuzz requires some simple functions for accessing
      information from the Unicode Character Database (such as the
      <literal>General_Category</literal> (gc) and
      <literal>Script</literal> (sc) properties) that is useful
      for shaping, as well as some useful operations like composing and
      decomposing code points.
    </para>
    <para>
      HarfBuzz includes its own internal, lightweight set of Unicode
      functions. At build time, it is also possible to compile support
      for some other options, such as the Unicode functions provided
      by GLib or the International Components for Unicode (ICU)
      library. Generally, this option is only of interest for client
      programs that have specific integration requirements or that do
      a significant amount of customization.
    </para>
    <para>
      If your program has access to other Unicode functions, however,
      such as through a system library or application framework, you
      might prefer to use those instead of the built-in
      options. HarfBuzz supports this by implementing its Unicode
      functions as a set of virtual methods that you can replace
      without otherwise affecting HarfBuzz's functionality.
    </para>
    <para>
      The Unicode functions are specified in a structure called
      <literal>unicode_funcs</literal> which is attached to each
      buffer. But even though <literal>unicode_funcs</literal> is
      associated with a <type>hb_buffer_t</type>, the functions
      themselves are called by other HarfBuzz APIs that access
      buffers, so it would be unwise for you to hook different
      functions into different buffers.
    </para>
    <para>
      In addition, you can mark your <literal>unicode_funcs</literal>
      as immutable by calling
      <function>hb_unicode_funcs_make_immutable (ufuncs)</function>.
      This is especially useful if your code is a
      library or framework that will have its own client programs. By
      marking your Unicode function choices as immutable, you prevent
      your own client programs from changing the
      <literal>unicode_funcs</literal> configuration and introducing
      inconsistencies and errors downstream.
    </para>
    <para>
      You can retrieve the Unicode-functions configuration for
      your buffer by calling <function>hb_buffer_get_unicode_funcs()</function>:
    </para>
    <programlisting language="C">
      hb_unicode_funcs_t *ufunctions;
      ufunctions = hb_buffer_get_unicode_funcs(buf);
    </programlisting>
    <para>
      The current version of <literal>unicode_funcs</literal> uses six functions:
    </para>
    <itemizedlist>
      <listitem>
        <para>
          <function>hb_unicode_combining_class_func_t</function>:
          returns the Canonical Combining Class of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_general_category_func_t</function>:
          returns the General Category (gc) of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_mirroring_func_t</function>: returns
          the Mirroring Glyph code point (for bi-directional
          replacement) of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_script_func_t</function>: returns the
          Script (sc) property of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_compose_func_t</function>: returns the
          canonical composition of a sequence of two code points.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_decompose_func_t</function>: returns
          the canonical decomposition of a code point.
        </para>
      </listitem>
    </itemizedlist>
    <para>
      Note, however, that future HarfBuzz releases may alter this set.
    </para>
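    <para>
      Each of these callbacks can be invoked through a matching
      wrapper that takes the <type>hb_unicode_funcs_t</type> object as
      its first argument. The sketch below queries the functions
      attached to a buffer; the helper names
      <function>is_latin_uppercase()</function> and
      <function>composes_to()</function> are hypothetical:
    </para>
    <programlisting language="C"><![CDATA[
      #include <hb.h>

      /* Hypothetical helper: returns 1 if cp is an uppercase letter of the
       * Latin script according to the buffer's Unicode functions. */
      int is_latin_uppercase (hb_buffer_t *buf, hb_codepoint_t cp)
      {
        hb_unicode_funcs_t *ufuncs = hb_buffer_get_unicode_funcs (buf);

        return hb_unicode_general_category (ufuncs, cp)
                 == HB_UNICODE_GENERAL_CATEGORY_UPPERCASE_LETTER
            && hb_unicode_script (ufuncs, cp) == HB_SCRIPT_LATIN;
      }

      /* Hypothetical helper: returns 1 if base + mark compose into
       * expected and expected decomposes back into base + mark. */
      int composes_to (hb_buffer_t *buf, hb_codepoint_t base,
                       hb_codepoint_t mark, hb_codepoint_t expected)
      {
        hb_unicode_funcs_t *ufuncs = hb_buffer_get_unicode_funcs (buf);
        hb_codepoint_t composed = 0, a = 0, b = 0;

        if (!hb_unicode_compose (ufuncs, base, mark, &composed)
            || composed != expected)
          return 0;
        if (!hb_unicode_decompose (ufuncs, composed, &a, &b))
          return 0;
        return a == base && b == mark;
      }
    ]]></programlisting>
    <para>
      The remaining two properties can be queried in the same way with
      <function>hb_unicode_combining_class()</function> and
      <function>hb_unicode_mirroring()</function>.
    </para>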
    <para>
      Each Unicode function has a corresponding setter, with which you
      can install your replacement function as a callback. For example,
      to replace
      <function>hb_unicode_general_category_func_t</function>, you can call
    </para>
    <programlisting language="C">
      hb_unicode_funcs_set_general_category_func (ufuncs, func, user_data, destroy);
    </programlisting>
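    <para>
      Putting the pieces together, here is a minimal sketch of a
      custom set of Unicode functions. It assumes you want to inherit
      everything from the default functions and override only the
      General Category lookup; the callback
      <function>custom_general_category()</function>, the helper
      <function>make_custom_ufuncs()</function>, and the choice of
      <literal>U+E000</literal> are all invented for illustration:
    </para>
    <programlisting language="C"><![CDATA[
      #include <stddef.h>
      #include <hb.h>

      /* Replacement callback: reports U+E000 (a Private Use code point) as
       * a letter and defers to the parent functions for everything else. */
      static hb_unicode_general_category_t
      custom_general_category (hb_unicode_funcs_t *ufuncs,
                               hb_codepoint_t unicode,
                               void *user_data)
      {
        (void) user_data; /* unused in this sketch */
        if (unicode == 0xE000)
          return HB_UNICODE_GENERAL_CATEGORY_OTHER_LETTER;
        return hb_unicode_general_category (hb_unicode_funcs_get_parent (ufuncs),
                                            unicode);
      }

      /* Hypothetical helper: builds Unicode functions that inherit from the
       * defaults, overrides General_Category, and locks the result. The
       * caller owns the returned reference. */
      hb_unicode_funcs_t *make_custom_ufuncs (void)
      {
        hb_unicode_funcs_t *ufuncs =
          hb_unicode_funcs_create (hb_unicode_funcs_get_default ());

        hb_unicode_funcs_set_general_category_func (ufuncs,
                                                    custom_general_category,
                                                    NULL, NULL);
        hb_unicode_funcs_make_immutable (ufuncs);
        return ufuncs;
      }
    ]]></programlisting>
    <para>
      The result can be attached to a buffer with
      <function>hb_buffer_set_unicode_funcs (buf, ufuncs)</function>,
      after which your own reference can be released with
      <function>hb_unicode_funcs_destroy (ufuncs)</function>. In
      keeping with the advice above, the same object should be
      attached to every buffer you use.
    </para>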
    <para>
      Virtualizing this set of Unicode functions is primarily intended
      to improve portability. There is no need for every client
      program to make the effort to replace the default options, so if
      you are unsure, do not feel any pressure to customize
      <literal>unicode_funcs</literal>.
    </para>
  </section>

</chapter>