@chapter Word breaks in strings @code{<uniwbrk.h>}

@cindex word boundaries
@cindex boundaries, between words
This include file declares functions for determining where in a string
``words'' start and end.  Here ``words'' are not necessarily the same as
entities that can be looked up in dictionaries, but rather groups of
consecutive characters that should not be split by text processing
operations.

@menu
* Word breaks in a string::
* Word break property::
@end menu

@node Word breaks in a string
@section Word breaks in a string

The following functions determine the word breaks in a string.

@deftypefun void u8_wordbreaks (const uint8_t *@var{s}, size_t @var{n}, char *@var{p})
@deftypefunx void u16_wordbreaks (const uint16_t *@var{s}, size_t @var{n}, char *@var{p})
@deftypefunx void u32_wordbreaks (const uint32_t *@var{s}, size_t @var{n}, char *@var{p})
@deftypefunx void ulc_wordbreaks (const char *@var{s}, size_t @var{n}, char *@var{p})
Determines the word break points in @var{s}, an array of @var{n} units, and
stores the result at @code{@var{p}[0..@var{n}-1]}.
@table @asis
@item @code{@var{p}[i] = 1}
means that there is a word boundary between @code{@var{s}[i-1]} and
@code{@var{s}[i]}.
@item @code{@var{p}[i] = 0}
means that @code{@var{s}[i-1]} and @code{@var{s}[i]} must not be separated.
@end table
@code{@var{p}[0]} is always set to 0.  If an application wants to consider a
word break to be present at the beginning of the string (before
@code{@var{s}[0]}) or at the end of the string (after
@code{@var{s}[0..@var{n}-1]}), it has to treat these cases explicitly.
@end deftypefun
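
Because @code{@var{p}[0]} is always 0, code that walks the result array
usually has to add the boundary at the start of the string itself.  The
following is a minimal sketch of one way to do that.  The helper
@code{collect_segment_starts} is hypothetical and not part of this include
file; it only assumes a break array with the layout described above, such
as one filled in by @code{u8_wordbreaks}.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of <uniwbrk.h>.
   P is a break array of N entries with the layout described above:
   p[i] == 1 marks a word boundary between unit i-1 and unit i, and
   p[0] is always 0.
   Stores the start offset of each segment in STARTS (at most MAX_STARTS
   entries are written) and returns the total number of segments.
   The beginning of a non-empty string is treated as an implicit
   boundary; the end of the last segment is N.  */
size_t
collect_segment_starts (const char *p, size_t n,
                        size_t *starts, size_t max_starts)
{
  size_t count = 0;
  size_t i;

  assert (p != NULL || n == 0);

  for (i = 0; i < n; i++)
    if (i == 0 || p[i] == 1)
      {
        /* Unit i begins a new segment.  */
        if (count < max_starts)
          starts[count] = i;
        count++;
      }

  return count;
}
```

In a real program the array @var{p} would have room for @var{n} bytes and
be filled in by a call such as @code{u8_wordbreaks (s, n, p)} before the
helper is called.  Segment @var{k} then spans the units from
@code{starts[k]} up to, but not including, @code{starts[k+1]}, or up to
@var{n} for the last segment.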

@node Word break property
@section Word break property

This is a more low-level API.  The word break property is a property defined
in Unicode Standard Annex #29, section ``Word Boundaries'', see
@url{http://www.unicode.org/reports/tr29/#Word_Boundaries}.@texnl{}  It is
used for determining the word breaks in a string.

The following are the possible values of the word break property.  More values
may be added in the future.

@deftypevr Constant int WBP_OTHER
@deftypevrx Constant int WBP_CR
@deftypevrx Constant int WBP_LF
@deftypevrx Constant int WBP_NEWLINE
@deftypevrx Constant int WBP_EXTEND
@deftypevrx Constant int WBP_FORMAT
@deftypevrx Constant int WBP_KATAKANA
@deftypevrx Constant int WBP_ALETTER
@deftypevrx Constant int WBP_MIDNUMLET
@deftypevrx Constant int WBP_MIDLETTER
@deftypevrx Constant int WBP_MIDNUM
@deftypevrx Constant int WBP_NUMERIC
@deftypevrx Constant int WBP_EXTENDNUMLET
@end deftypevr

The following function looks up the word break property of a character.

@deftypefun int uc_wordbreak_property (ucs4_t @var{uc})
Returns the Word_Break property of a Unicode character.
@end deftypefun