255 are displayed as "\x{...}", control characters (like "\n") are
displayed as "\x..", and the rest of the characters as themselves.
-sub nice_string {
- join("",
- map { $_ > 255 ? # if wide character...
- sprintf("\\x{%x}", $_) : # \x{...}
- chr($_) =~ /[[:cntrl:]]/ ? # else if control character ...
- sprintf("\\x%02x", $_) : # \x..
- chr($_) } # else as themselves
- unpack("U*", $_[0])); # unpack Unicode characters
-}
-
-For example, C<nice_string("foo\x{100}bar\n")> will return
-C<"foo\x{100}bar\x0a">.
+ sub nice_string {
+ join("",
+ map { $_ > 255 ? # if wide character...
+ sprintf("\\x{%x}", $_) : # \x{...}
+ chr($_) =~ /[[:cntrl:]]/ ? # else if control character ...
+ sprintf("\\x%02x", $_) : # \x..
+ chr($_) # else as themselves
+ } unpack("U*", $_[0])); # unpack Unicode characters
+ }
+
+For example,
+
+ nice_string("foo\x{100}bar\n")
+
+will return:
+
+ "foo\x{100}bar\x0a"
=head2 Special Cases
The short answer is that by default Perl compares equivalence
(C<eq>, C<ne>) based only on code points of the characters.
-In the above case, no (because 0x00C1 != 0x0041). But sometimes any
+In the above case, the answer is no (because 0x00C1 != 0x0041). But sometimes any
CAPITAL LETTER As being considered equal, or even any As of any case,
would be desirable.
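As a minimal sketch of the difference, assuming the C<Unicode::Normalize> module is installed: C<eq> sees two different code points, while decomposing the string and dropping combining marks (only one possible reading of "any CAPITAL LETTER A") makes them compare equal. The helper name C<base_letters> is hypothetical and used here only for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: canonically decompose, then drop combining marks.
sub base_letters {
    my $s = NFD($_[0]);
    $s =~ s/\pM//g;
    return $s;
}

my $a_acute = "\x{C1}";    # LATIN CAPITAL LETTER A WITH ACUTE
my $plain_a = "A";         # LATIN CAPITAL LETTER A (0x0041)

# Plain eq compares code points only: 0x00C1 != 0x0041.
print "eq: ", ($a_acute eq $plain_a ? "equal" : "not equal"), "\n";

# After stripping the accent, only the base letter is compared.
print "base letters: ",
    (base_letters($a_acute) eq $plain_a ? "equal" : "not equal"), "\n";
```

Whether stripping marks is the right notion of equivalence depends on the application; it is shown here only to contrast with the default code point comparison.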
Mappings>, http://www.unicode.org/unicode/reports/tr15/
http://www.unicode.org/unicode/reports/tr21/
-As of Perl 5.8.0, the's regular expression case-ignoring matching
+As of Perl 5.8.0, regular expression case-ignoring matching
implements only 1:1 semantics: one character matches one character.
In I<Case Mappings> both 1:N and N:1 matches are defined.
(Does C<LATIN CAPITAL LETTER A WITH ACUTE> come before or after
C<LATIN CAPITAL LETTER A WITH GRAVE>?)
-The short answer is that by default Perl compares strings (C<lt>,
+The short answer is that by default, Perl compares strings (C<lt>,
C<le>, C<cmp>, C<ge>, C<gt>) based only on the code points of the
-characters. In the above case, after, since 0x00C1 > 0x00C0.
+characters. In the above case, the answer is "after", since 0x00C1 > 0x00C0.
The long answer is that "it depends", and a good answer cannot be
given without knowing (at the very least) the language context.
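The two kinds of comparison can be sketched as follows, assuming the C<Unicode::Collate> module is available. The built-in C<cmp> looks only at code points; the collator consults collation data, so its result depends on that data and on any tailoring, and no particular outcome is claimed here.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $a_acute = "\x{C1}";    # LATIN CAPITAL LETTER A WITH ACUTE
my $a_grave = "\x{C0}";    # LATIN CAPITAL LETTER A WITH GRAVE

# Built-in comparison: purely by code point, so 0x00C1 sorts after 0x00C0.
my $by_code_point = $a_acute cmp $a_grave;

# Collation-based comparison: the result depends on the collation data.
my $collator     = Unicode::Collate->new;
my $by_collation = $collator->cmp($a_acute, $a_grave);

print "code points: $by_code_point, collation: $by_collation\n";
```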
Character ranges in regular expression character classes (C</[a-z]/>)
and in the C<tr///> (also known as C<y///>) operator are not magically
-Unicode-aware. What this means that C<[a-z]> will not magically start
+Unicode-aware. What this means is that C<[A-Za-z]> will not magically start
to mean "all alphabetic letters" (not that it does mean that even for
8-bit characters; you should be using C</[[:alpha:]]/> for that).
-For specifying things like that in regular expressions you can use the
-various Unicode properties, C<\pL> in this particular case. You can
+For specifying things like that in regular expressions, you can use the
+various Unicode properties, C<\pL> or perhaps C<\p{Alphabetic}>, in this particular case. You can
use Unicode code points as the end points of character ranges, but
that means that particular code point range, nothing more. For
further information, see L<perlunicode>.
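A short sketch of the point about ranges versus properties: the range C<[a-z]> covers only the code points between C<a> and C<z>, whereas C<\pL> matches any character with the Unicode Letter property. The sample characters are chosen here purely for illustration.

```perl
use strict;
use warnings;

my @samples = (
    "q",          # plain ASCII letter
    "\x{E9}",     # LATIN SMALL LETTER E WITH ACUTE
    "\x{3B1}",    # GREEK SMALL LETTER ALPHA
);

# Report, for each sample, whether the range and the property match it.
for my $ch (@samples) {
    printf "U+%04X  [a-z]: %-3s  \\pL: %s\n",
        ord($ch),
        ($ch =~ /^[a-z]$/ ? "yes" : "no"),
        ($ch =~ /^\pL$/   ? "yes" : "no");
}
```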
Unicode does define several other decimal (and numeric) characters
than just the familiar 0 to 9, such as the Arabic and Indic digits.
Perl does not support string-to-number conversion for digits other
-than the 0 to 9 (and a to f for hexadecimal).
+than ASCII 0 to 9 (and ASCII a to f for hexadecimal).
=back
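The point about digits in the last item above can be illustrated with a small sketch: a non-ASCII decimal digit is recognized as a digit by a regular expression, but it is not converted to a number in numeric context. The C<numeric> warning is disabled locally only to keep the output quiet.

```perl
use strict;
use warnings;

my $arabic_three = "\x{663}";    # ARABIC-INDIC DIGIT THREE

# The regular expression engine treats it as a decimal digit...
my $is_digit = $arabic_three =~ /^\d$/ ? 1 : 0;

# ...but string-to-number conversion only understands ASCII digits.
my $value;
{
    no warnings 'numeric';
    $value = $arabic_three + 0;
}

print "matches \\d: $is_digit, numeric value: $value\n";
```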