C<qr/\X/>, which matches a Unicode logical character, has been expanded to work
better with various Asian languages. It now is defined as an C<extended
grapheme cluster>. (See L<http://www.unicode.org/reports/tr29/>).
-Anything matched by previously will continue to be matched. But in addition:
+Anything matched previously that made sense will continue to be matched. But
+in addition:
=over
C<\X> will now always match at least one character, including an initial mark.
Marks generally come after a base character, but it is possible in Unicode to
have them in isolation, and C<\X> will now handle that case, for example at the
-beginning of a line or after a C<ZWSP>.
+beginning of a line or after a C<ZWSP>. This is also where C<\X> stops
+matching things it used to match that don't make sense. Formerly, for
+example, you could have the nonsensical case of an accented LF.
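As a rough illustration of the two cases just described, the sketch below checks an isolated mark and a mark that follows a LF. It assumes a perl new enough to use the extended grapheme cluster definition of C<\X>; the variable names are made up for this example.

```perl
use strict;
use warnings;

# A combining mark with no base character before it
# (U+0301 COMBINING ACUTE ACCENT).
my $lone_mark    = "\x{0301}";
my $lone_matches = $lone_mark =~ /\A\X\z/ ? 1 : 0;

# A LF followed by the same mark. Under the extended grapheme cluster
# rules the mark is not attached to the LF, so each \X match should
# consume only one of the two code points.
my @pieces = "\n\x{0301}" =~ /(\X)/g;

printf "lone mark matched: %d, pieces after LF: %d\n",
    $lone_matches, scalar @pieces;
```

On an older perl, where C<\X> still required a base character, the numbers printed may differ.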
=item *
C<\X> matches quite well what normal (non-Unicode-programmer) usage
would consider a single character. As an example, consider a G with some sort
-of accent mark over it (a diacritic). There is no such single character in
-Unicode, but something like one can be constructed by using a G followed by a
-Unicode combining accent, and would be displayed by Unicode-aware software as
-if it were a single character.
+of diacritic mark, such as an arrow. There is no such single character in
+Unicode, but one can be composed using a G followed by a Unicode "COMBINING
+UPWARDS ARROW BELOW", and it would be displayed by Unicode-aware software as if
+it were a single character.
Mnemonic: eI<X>tended Unicode character.
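The G-with-an-arrow case can be sketched as follows. This is only an illustration, assuming U+034E is the code point for COMBINING UPWARDS ARROW BELOW; the variable names are hypothetical.

```perl
use strict;
use warnings;

# "G" followed by U+034E COMBINING UPWARDS ARROW BELOW:
# two code points that display as one character.
my $str = "G\x{034E}";

my $code_points = length $str;        # length() counts code points
my @clusters    = $str =~ /(\X)/g;    # one element per \X match

printf "%d code points, %d cluster(s)\n",
    $code_points, scalar @clusters;
```

Here C<length> sees the individual code points, while C<\X> treats the base letter and its mark as one unit.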
[c] Try the C<:crlf> layer (see L<PerlIO>).
-[d] Avoid C<use warning 'utf8';> (or say C<no warning 'utf8';>) to allow
-U+FFFF (C<\x{FFFF}>).
+[d] U+FFFF will currently generate a warning message if 'utf8' warnings are
+ enabled.
=item *
Level 2 - Extended Unicode Support
RL2.1 Canonical Equivalents - MISSING [10][11]
- RL2.2 Default Grapheme Clusters - MISSING [12][13]
+ RL2.2 Default Grapheme Clusters - MISSING [12]
RL2.3 Default Word Boundaries - MISSING [14]
RL2.4 Default Loose Matches - MISSING [15]
RL2.5 Name Properties - MISSING [16]
A Unicode I<logical> "character" can actually consist of more than one internal
I<actual> "character" or code point. For Western languages, this is adequately
-represented by a I<base character> (like C<LATIN CAPITAL LETTER A>), followed
+modelled by a I<base character> (like C<LATIN CAPITAL LETTER A>) followed
by one or more I<modifiers> (like C<COMBINING ACUTE ACCENT>). This sequence of
base character and modifiers is called a I<combining character
sequence>. Some non-western languages require more complicated
-representations, so Unicode invented a I<grapheme cluster> and then an
-I<extended grapheme cluster>. For example, A Korean Hangul syllable is
+models, so Unicode created the I<grapheme cluster> concept, and then the
+I<extended grapheme cluster>. For example, a Korean Hangul syllable is
considered a single logical character, but most often consists of three actual
-characters: a leading consonant followed by an interior vowel followed by a
-trailing consonant.
+Unicode characters: a leading consonant followed by an interior vowel followed
+by a trailing consonant.
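A minimal sketch of the Hangul case, assuming a perl whose C<\X> follows the extended grapheme cluster rules. The three conjoining jamo chosen here (U+1112, U+1161, U+11AB) are just one example of a leading consonant, vowel, and trailing consonant; the variable names are made up.

```perl
use strict;
use warnings;

# One Hangul syllable written as three conjoining jamo:
#   U+1112 HANGUL CHOSEONG HIEUH   (leading consonant)
#   U+1161 HANGUL JUNGSEONG A      (interior vowel)
#   U+11AB HANGUL JONGSEONG NIEUN  (trailing consonant)
my $syllable = "\x{1112}\x{1161}\x{11AB}";

my $actual_chars  = length $syllable;          # code points
my @logical_chars = $syllable =~ /(\X)/g;      # extended grapheme clusters

printf "%d actual characters, %d logical character(s)\n",
    $actual_chars, scalar @logical_chars;
```

The same string counts as three "characters" or one, depending on which of the two views is taken.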
Whether to call these extended grapheme clusters "characters" depends on your
point of view. If you are a programmer, you probably would tend towards seeing