From 990e18f721a7d2ee48d50ea4262bd5d109e9f89c Mon Sep 17 00:00:00 2001
From: Audrey Tang
Date: Wed, 10 Dec 2003 04:39:16 +0800
Subject: [PATCH] Implicit upgrading docs

Message-ID: <20031209123915.GA1454@not.autrijus.org>
p4raw-id: //depot/perl@21873
---
 ext/Encode/encoding.pm | 19 +++++++++++++++++++
 pod/perlunicode.pod    | 27 +++++++++++++++++++++------
 2 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/ext/Encode/encoding.pm b/ext/Encode/encoding.pm
index f203cb3..9366252 100644
--- a/ext/Encode/encoding.pm
+++ b/ext/Encode/encoding.pm
@@ -192,6 +192,25 @@ not "\x{99F1}\x{99DD} is the symbol of perl.\n".
 
 You can override this by giving extra arguments; see below.
 
+=head2 Implicit upgrading for byte strings
+
+By default, if strings operating under byte semantics and strings
+with Unicode character data are concatenated, the new string will
+be created by decoding the byte strings as I<ISO 8859-1 (Latin-1)>.
+
+The B<encoding> pragma changes this to use the specified encoding
+instead. For example:
+
+    use encoding 'utf8';
+    my $string = chr(20000); # a Unicode string
+    utf8::encode($string);   # now it's a UTF-8 encoded byte string
+    # concatenate with another Unicode string
+    print length($string . chr(20000));
+
+Will print C<2>, because C<$string> is upgraded as UTF-8. Without
+C<use encoding 'utf8';>, it will print C<4> instead, since C<$string>
+is three octets when interpreted as Latin-1.
+
 =head1 FEATURES THAT REQUIRE 5.8.1
 
 Some of the features offered by this pragma requires perl 5.8.1. Most
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 190247a..b6d00d1 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -42,6 +42,21 @@ is needed.> See L<utf8>.
 You can also use the C<encoding> pragma to change the default encoding
 of the data in your script; see L<encoding>.
 
+=item C<use encoding> needed to upgrade non-Latin-1 byte strings
+
+By default, there is a fundamental asymmetry in Perl's Unicode model:
+implicit upgrading from byte strings to Unicode strings assumes that
+they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
+downgraded with UTF-8 encoding. This happens because the first 256
+codepoints in Unicode happen to agree with Latin-1.
+
+If you wish to interpret byte strings as UTF-8 instead, use the
+C<encoding> pragma:
+
+    use encoding 'utf8';
+
+See L</"Byte and Character Semantics"> for more details.
+
 =back
 
 =head2 Byte and Character Semantics
@@ -86,12 +101,12 @@ Otherwise, byte semantics are in effect. The C<bytes> pragma should
 be used to force byte semantics on Unicode data.
 
 If strings operating under byte semantics and strings with Unicode
-character data are concatenated, the new string will be upgraded to
-I<ISO 8859-1 (Latin-1)>, even if the old Unicode string used EBCDIC.
-This translation is done without regard to the system's native 8-bit
-encoding, so to change this for systems with non-Latin-1 and
-non-EBCDIC native encodings use the C<encoding> pragma. See
-L<encoding>.
+character data are concatenated, the new string will be created by
+decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
+old Unicode string used EBCDIC. This translation is done without
+regard to the system's native 8-bit encoding. To change this for
+systems with non-Latin-1 and non-EBCDIC native encodings, use the
+C<encoding> pragma. See L<encoding>.
 
 Under character semantics, many operations that formerly operated on
 bytes now operate on characters. A character in Perl is
-- 
2.7.4