the UTF- encodings, and a string encoded in UTF-EBCDIC may occupy more bytes
than in UTF-8.
-Also, on some EBCDIC machines, functions that are documented as operating on
-US-ASCII (or Basic Latin in Unicode terminology) may in fact operate on all
-256 characters in the EBCDIC range, not just the subset corresponding to
-US-ASCII.
-
The listing below is alphabetical, case insensitive.
always returned.
Variant C<isFOO_A> (e.g., C<isALPHA_A()>) will return TRUE only if the input is
-also in the ASCII character set. For ASCII platforms, the base function with
-no suffix and the one with the C<_A> suffix are identical. On EBCDIC
-platforms, the C<_A> suffix function will not return true unless the specified
-character also has an ASCII equivalent.
-
-Variant C<isFOO_L1> operates on the full Latin1 character set. For EBCDIC
-platforms, the base function with no suffix and the one with the C<_L1> suffix
-are identical. For ASCII platforms, the C<_L1> suffix imposes the Latin-1
-character set onto the platform. That is, the code points that are ASCII are
-unaffected, since ASCII is a subset of Latin-1. But the non-ASCII code points
-are treated as if they are Latin-1 characters. For example, C<isSPACE_L1()>
-will return true when called with the code point 0xA0, which is the Latin-1
-NO-BREAK SPACE.
+also in the ASCII character set. The base function with no suffix and the one
+with the C<_A> suffix are identical.
+
+Variant C<isFOO_L1> imposes the Latin-1 (or EBCDIC equivalent) character set
+onto the platform. That is, the code points that are ASCII are unaffected,
+since ASCII is a subset of Latin-1. But the non-ASCII code points are treated
+as if they are Latin-1 characters. For example, C<isWORDCHAR_L1()> will return
+true when called with the code point 0xDF, which is a word character in both
+ASCII and EBCDIC (though it represents different characters in each).
Variant C<isFOO_uni> is like the C<isFOO_L1> variant, but accepts any UV code
point as input. If the code point is larger than 255, Unicode rules are used
=item If C<use bytes> is in effect:
-=over
-
-=item On EBCDIC platforms
-
-The results are what the C language system call C<tolower()> returns.
-
-=item On ASCII platforms
-
The results follow ASCII semantics. Only characters C<A-Z> change, to C<a-z>
respectively.
-=back
-
=item Otherwise, if C<use locale> (but not C<use locale ':not_characters'>) is in effect:
Respects current LC_CTYPE locale for code points < 256; and uses Unicode
=item Otherwise:
-=over
-
-=item On EBCDIC platforms
-
-The results are what the C language system call C<tolower()> returns.
-
-=item On ASCII platforms
-
ASCII semantics are used for the case change. The lowercase of any character
outside the ASCII range is the character itself.
=back
-=back
-
=item lcfirst EXPR
X<lcfirst> X<lowercase>
L<http://unicode.org/reports/tr36> for a detailed discussion of Unicode
security issues.
-On the EBCDIC platforms that Perl handles, the native character set is
-equivalent to Latin-1. Thus this modifier changes behavior only when
-the C<"/i"> modifier is also specified, and it turns out it affects only
-two characters, giving them full Unicode semantics: the C<MICRO SIGN>
-will match the Greek capital and small letters C<MU>, otherwise not; and
-the C<LATIN CAPITAL LETTER SHARP S> will match any of C<SS>, C<Ss>,
-C<sS>, and C<ss>, otherwise not.
-
This modifier may be specified to be the default by C<use feature
'unicode_strings'>, C<use locale ':not_characters'>, or
C<L<use 5.012|perlfunc/use VERSION>> (or higher),
become rather infamous, leading to yet another (printable) name for this
modifier, "Dodgy".
-On ASCII platforms, the native rules are ASCII, and on EBCDIC platforms
-(at least the ones that Perl handles), they are Latin-1.
+Unless the pattern or string is encoded in UTF-8, only ASCII characters
+can match positively.
Here are some examples of how that works on an ASCII platform:
C<\w> matches the platform's native underscore character plus whatever
the locale considers to be alphanumeric.
-=item if Unicode rules are in effect or if on an EBCDIC platform ...
+=item if Unicode rules are in effect ...
C<\w> matches exactly what C<\p{Word}> matches.
C<\s> matches whatever the locale considers to be whitespace.
-=item if Unicode rules are in effect or if on an EBCDIC platform ...
+=item if Unicode rules are in effect ...
C<\s> matches exactly the characters shown with an "s" column in the
table below.
The first column gives the Unicode code point of the character (in hex format),
the second column gives the (Unicode) name. The third column indicates
-by which class(es) the character is matched (assuming no locale or EBCDIC code
-page is in effect that changes the C<\s> matching).
+by which class(es) the character is matched (assuming no locale is in
+effect that changes the C<\s> matching).
0x0009 CHARACTER TABULATION h s
0x000a LINE FEED (LF) vs
In the ASCII range, characters whose code points are between 0 and 31 inclusive,
plus 127 (C<DEL>) are control characters.
-On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]>
-to be the EBCDIC equivalents of the ASCII controls, plus the controls
-that in Unicode have code points from 128 through 159.
-
=item [3]
Any character that is I<graphical>, that is, visible. This class consists
C<word> uses the platform's native underscore character, no matter what
the locale is.
-=item if Unicode rules are in effect or if on an EBCDIC platform ...
+=item if Unicode rules are in effect ...
The POSIX class matches the same as the Full-range counterpart.
It is proposed to change this behavior in a future release of Perl so that
whether or not Unicode rules are in effect would not change the
-behavior: Outside of locale or an EBCDIC code page, the POSIX classes
+behavior: Outside of locale, the POSIX classes
would behave like their ASCII-range counterparts. If you wish to
comment on this proposal, send email to C<perl5-porters@perl.org>.
C<use feature 'unicode_strings'> in its scope; see L<perllocale>.)
Otherwise, Perl uses the platform's native
byte semantics for characters whose code points are less than 256, and
-Unicode semantics for those greater than 255. On EBCDIC platforms, this
-is almost seamless, as the EBCDIC code pages that Perl handles are
-equivalent to Unicode's first 256 code points. (The exception is that
-EBCDIC regular expression case-insensitive matching rules are not as
-as robust as Unicode's.) But on ASCII platforms, Perl uses US-ASCII
-(or Basic Latin in Unicode terminology) byte semantics, meaning that characters
-whose ordinal numbers are in the range 128 - 255 are undefined except for their
+Unicode semantics for those greater than 255. That means that non-ASCII
+characters are undefined except for their
ordinal numbers. This means that none have case (upper and lower), nor are any
a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong
to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)
programs knew only about the ASCII character set, and so may not work
properly for additional characters. When a string is encoded in UTF-8,
Perl assumes that the program is prepared to deal with Unicode, but when
-the string isn't, Perl assumes that only ASCII (unless it is an EBCDIC
-platform) is wanted, and so those characters that are not ASCII
+the string isn't, Perl assumes that only ASCII
+is wanted, and so those characters that are not ASCII
characters aren't recognized as to what they would be in Unicode.
C<use feature 'unicode_strings'> tells Perl to treat all characters as
Unicode, whether the string is encoded in UTF-8 or not, thus avoiding