In future, Perl-level operations will expect to work with characters
rather than bytes, in general.
-However, Perl v5.6 aims to provide a safe migration path from byte
-semantics to character semantics for programs. To preserve compatibility
-with earlier versions of Perl which allowed byte semantics in Perl
-operations (owing to the fact that the internal representation for
-characters was in bytes) byte semantics will continue to be in effect
-until a the C<utf8> pragma is used in the C<main> package, or the C<$^U>
-global flag is explicitly set.
+However, strictly as an interim compatibility measure, Perl v5.6 aims to
+provide a safe migration path from byte semantics to character semantics
+for programs. For operations where Perl can unambiguously decide that the
+input data is characters, Perl now switches to character semantics.
+For operations where this determination cannot be made without additional
+information from the user, Perl decides in favor of compatibility, and
+chooses to use byte semantics.
+
+This behavior preserves compatibility with earlier versions of Perl,
+which allowed byte semantics in Perl operations, but only as long as
+none of the program's inputs are marked as being a source of Unicode
+character data. Such data may come from filehandles, from calls to
+external programs, from information provided by the system (such as C<%ENV>),
+or from literals and constants in the source text. Later, in
+L</Character encodings for input and output>, we'll see how such
+inputs may be marked as being Unicode character data sources.
+
+One particular condition will enable character semantics on the entire
+program, bypassing the compatibility mode: if the C<$^U> global flag is
+set to C<1>, nearly all operations will use character semantics by
+default. As an added convenience, if the C<utf8> pragma is used in the
+C<main> package, C<$^U> is enabled automatically. [XXX: Should there
+be a -C switch to enable $^U?]
+
+Regardless of the above, the C<byte> pragma can always be used to force
+byte semantics in a particular lexical scope. See L<byte>.
+
+The C<utf8> pragma is primarily a compatibility device that enables
+recognition of UTF-8 in literals encountered by the parser. It is also
+used for enabling some of the more experimental Unicode support features.
+Note that this pragma is only required until a future version of Perl
+in which character semantics will become the default. This pragma may
+then become a no-op. See L<utf8>.
+
+Unless mentioned otherwise, Perl operators will use character semantics
+when they are dealing with Unicode data, and byte semantics otherwise.
+Thus, character semantics for these operations apply transparently: if
+the input data came from a Unicode source (for example, a filehandle
+with a character encoding discipline attached, or a literal UTF-8
+string constant in the program), character semantics apply; otherwise,
+byte semantics are in effect. To force byte semantics on Unicode data,
+the C<byte> pragma should be used.
Under character semantics, many operations that formerly operated on
bytes change to operating on characters. For ASCII data this makes
sequences of bytes internally, but again, this is just an internal
detail which is hidden at the Perl level.
-The C<byte> pragma can be used to force byte semantics in a particular
-lexical scope. See L<byte>.
-
-The C<utf8> pragma is a compatibility device to enables recognition
-of UTF-8 in literals encountered by the parser. It is also used
-for enabling some experimental Unicode support features. Note that
-this pragma is only required until a future version of Perl in which
-character semantics will become the default. This pragma may then
-become a no-op. See L<utf8>.
+=head2 Effects of character semantics
Character semantics have the following effects:
it comes to using the canonical forms of characters--Perl doesn't (yet)
attempt to canonicalize variable names for you.)
-This also needs C<use utf8> currently. [XXX: Why? High-bit chars were
+This also needs C<use utf8> currently. [XXX: Why?!? High-bit chars were
syntax errors when they occurred within identifiers in previous versions,
-so this should be enabled by default.]
+so this should probably be enabled by default.]
=item *
Unicode support in regular expressions needs C<use utf8> currently.
[XXX: Because the SWASH routines need to be loaded. And the RE engine
-appears to need an overhaul to Unicode by default anyway.]
+appears to need an overhaul to dynamically match Unicode anyway--the
+current RE compiler creates different nodes with and without C<use utf8>.]
=item *
=back
+=head2 Character encodings for input and output
+
+[XXX: This feature is not yet implemented.]
+
=head1 CAVEATS
As of yet, there is no method for automatically coercing input and
output to some encoding other than UTF-8. This is planned in the near
future, however.
-Whether a piece of data will be treated as "characters" or "bytes"
-by internal operations cannot be divined at the current time.
+Whether an arbitrary piece of data will be treated as "characters" or
+"bytes" by internal operations cannot be divined at the current time.
Use of locales with utf8 may lead to odd results. Currently there is
some attempt to apply 8-bit locale info to characters in the range