Will the real Unicode encoding please stand up?

author Jeffrey Friedl <jfriedl@regex.info>

Sun, 16 Dec 2001 11:36:32 +0000 (03:36 -0800)

committer Jarkko Hietaniemi <jhi@iki.fi>

Mon, 17 Dec 2001 16:57:57 +0000 (16:57 +0000)
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 0ecfba0..67ce214 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -358,18 +358,23 @@ its argument so that Unicode characters with code points greater than
  255 are displayed as "\x{...}", control characters (like "\n") are
  displayed as "\x..", and the rest of the characters as themselves.
  
-sub nice_string {
-   join("",
-       map { $_ > 255 ?                        # if wide character...
-                 sprintf("\\x{%x}", $_) :      # \x{...}
-                 chr($_) =~ /[[:cntrl:]]/ ?    # else if control character ...
-                     sprintf("\\x%02x", $_) :  # \x..
-                      chr($_) }                        # else as themselves
-            unpack("U*", $_[0]));              # unpack Unicode characters
-}
-
-For example, C<nice_string("foo\x{100}bar\n")> will return
-C<"foo\x{100}bar\x0a">.
+   sub nice_string {
+       join("",
+         map { $_ > 255 ?                  # if wide character...
+               sprintf("\\x{%x}", $_) :    # \x{...}
+               chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
+               sprintf("\\x%02x", $_) :    # \x..
+               chr($_)                     # else as themselves
+         } unpack("U*", $_[0]));           # unpack Unicode characters
+   }
+
+For example,
+
+   nice_string("foo\x{100}bar\n")
+
+will return:
+
+   "foo\x{100}bar\x0a"
  
  =head2 Special Cases
  
@@ -423,7 +428,7 @@ C<LATIN CAPITAL LETTER A>?)
  
  The short answer is that by default Perl compares equivalence
  (C<eq>, C<ne>) based only on code points of the characters.
-In the above case, no (because 0x00C1 != 0x0041).  But sometimes any
+In the above case, the answer is no (because 0x00C1 != 0x0041).  But sometimes any
  CAPITAL LETTER As being considered equal, or even any As of any case,
  would be desirable.
  
@@ -433,7 +438,7 @@ Reports #15 and #21, I<Unicode Normalization Forms> and I<Case
  Mappings>, http://www.unicode.org/unicode/reports/tr15/
  http://www.unicode.org/unicode/reports/tr21/
  
-As of Perl 5.8.0, the's regular expression case-ignoring matching
+As of Perl 5.8.0, regular expression case-ignoring matching
  implements only 1:1 semantics: one character matches one character.
  In I<Case Mappings> both 1:N and N:1 matches are defined.
  
@@ -447,9 +452,9 @@ parlance goes, collated.  But again, what do you mean by collate?
  (Does C<LATIN CAPITAL LETTER A WITH ACUTE> come before or after
  C<LATIN CAPITAL LETTER A WITH GRAVE>?)
  
-The short answer is that by default Perl compares strings (C<lt>,
+The short answer is that by default, Perl compares strings (C<lt>,
  C<le>, C<cmp>, C<ge>, C<gt>) based only on the code points of the
-characters.  In the above case, after, since 0x00C1 > 0x00C0.
+characters.  In the above case, the answer is "after", since 0x00C1 > 0x00C0.
  
  The long answer is that "it depends", and a good answer cannot be
  given without knowing (at the very least) the language context.
@@ -468,12 +473,12 @@ Character Ranges
  
  Character ranges in regular expression character classes (C</[a-z]/>)
  and in the C<tr///> (also known as C<y///>) operator are not magically
-Unicode-aware.  What this means that C<[a-z]> will not magically start
+Unicode-aware.  What this means that C<[A-Za-z]> will not magically start
  to mean "all alphabetic letters" (not that it does mean that even for
  8-bit characters, you should be using C</[[:alpha]]/> for that).
  
-For specifying things like that in regular expressions you can use the
-various Unicode properties, C<\pL> in this particular case.  You can
+For specifying things like that in regular expressions, you can use the
+various Unicode properties, C<\pL> or perhaps C<\p{Alphabetic}>, in this particular case.  You can
  use Unicode code points as the end points of character ranges, but
  that means that particular code point range, nothing more.  For
  further information, see L<perlunicode>.
@@ -485,7 +490,7 @@ String-To-Number Conversions
  Unicode does define several other decimal (and numeric) characters
  than just the familiar 0 to 9, such as the Arabic and Indic digits.
  Perl does not support string-to-number conversion for digits other
-than the 0 to 9 (and a to f for hexadecimal).
+than ASCII 0 to 9 (and ASCII a to f for hexadecimal).
  
  =back
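
The behaviours described in the patched text (code point comparison, literal character ranges versus Unicode properties, and ASCII-only numeric conversion) can be checked with a short script. This is only a sketch, not part of the patch: the variable names are made up for illustration, and it assumes Perl 5.8 or later with no "use locale" in effect.

```perl
use strict;
use warnings;

# Equality and ordering use code points only (no "use locale" in effect).
my $a_plain = "\x{41}";   # LATIN CAPITAL LETTER A
my $a_grave = "\x{C0}";   # LATIN CAPITAL LETTER A WITH GRAVE
my $a_acute = "\x{C1}";   # LATIN CAPITAL LETTER A WITH ACUTE

my $same  = ($a_acute eq $a_plain) ? 1 : 0;     # 0, because 0x00C1 != 0x0041
my $order = $a_acute cmp $a_grave;              # 1, because 0x00C1 > 0x00C0

# A range such as [A-Za-z] means exactly those code points, nothing more;
# \pL is the Unicode "letter" property.
my $wide      = "\x{100}";                      # LATIN CAPITAL LETTER A WITH MACRON
my $in_range  = ($wide =~ /[A-Za-z]/) ? 1 : 0;  # 0
my $is_letter = ($wide =~ /\pL/)      ? 1 : 0;  # 1

# Only ASCII digits take part in string-to-number conversion.
my $ascii_num  = "123" + 0;                     # 123
my $arabic_num = do {
    no warnings 'numeric';                      # silence the "isn't numeric" warning
    "\x{661}\x{662}\x{663}" + 0;                # ARABIC-INDIC DIGITS ONE to THREE: 0
};

print "same=$same order=$order in_range=$in_range ",
      "is_letter=$is_letter ascii=$ascii_num arabic=$arabic_num\n";
```

The test character U+0100 is chosen because it lies above 255, so the property match does not depend on how Perl treats 8-bit characters.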