From 0bd5a82d65c4e6a2376313bca55dc77d7694c82d Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Thu, 31 Mar 2011 12:00:13 -0600
Subject: [PATCH] perlretut: Update for 5.14 /a, /u

---
 pod/perlretut.pod | 46 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 34 insertions(+), 12 deletions(-)

diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index 195ce75..ea80594 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -367,8 +367,9 @@ character, or the match fails. Then
 Now, even C<[0-9]> can be a bother to write multiple times, so in the
 interest of saving keystrokes and making regexps more readable, Perl
 has several abbreviations for common character classes, as shown below.
-Since the introduction of Unicode, these character classes match more
-than just a few characters in the ISO 8859-1 range.
+Since the introduction of Unicode, unless the C<//a> modifier is in
+effect, these character classes match more than just a few characters in
+the ASCII range.
 
 =over 4
 
@@ -409,6 +410,15 @@ regardless of whether the modifier C<//s> is in effect.
 
 =back
 
+The C<//a> modifier, available starting in Perl 5.14, is used to
+restrict the matches of \d, \s, and \w to just those in the ASCII range.
+It is useful to keep your program from being needlessly exposed to full
+Unicode (and its accompanying security considerations) when all you want
+is to process English-like text. (The "a" may be doubled, C<//aa>, to
+provide even more restrictions, preventing case-insensitive matching of
+ASCII with non-ASCII characters; otherwise a Unicode "Kelvin Sign"
+would caselessly match a "k" or "K".)
+
 The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
 of character classes. Here are some in use:
 
@@ -1643,6 +1653,10 @@ which is the correct answer. This example illustrates that it is
 important not only to match what is desired, but to reject what is not
 desired.
 
+(There are other regexp modifiers that are available, such as
+C<//o>, C<//d>, and C<//l>, but their specialized uses are beyond the
+scope of this introduction.)
+
 =head3 Search and replace
 
 Regular expressions also play a big role in I<search and replace>
@@ -1911,11 +1925,18 @@ One can also use short names or restrict names to a certain alphabet:
 A list of full names can be found in F<NamesList.txt> in the Unicode standard
 (available at L<http://www.unicode.org/Public/UNIDATA/>).
 
-The answer to requirement 2), as of 5.6.0, is that a regexp uses Unicode
-characters. Internally, this is encoded to bytes using either UTF-8 or a
-native 8 bit encoding, depending on the history of the string, but
-conceptually it is a sequence of characters, not bytes. See
-L<perlunitut> for a tutorial about that.
+The answer to requirement 2), as of 5.6.0, is that a regexp (mostly)
+uses Unicode characters. (For messy backward compatibility reasons,
+most but not all semantics of a match will assume Unicode, unless,
+starting in Perl 5.14, you tell it to use full Unicode. You can do this
+explicitly by using the C<//u> modifier, or you can ask Perl to use the
+modifier implicitly for all regexes in a scope by using C<use 5.012> (or
+higher) or C<use feature 'unicode_strings'>. If you want to handle
+Unicode properly, you should ensure that one of these is the case.)
+Internally, this is encoded to bytes using either UTF-8 or a native 8
+bit encoding, depending on the history of the string, but conceptually
+it is a sequence of characters, not bytes. See L<perlunitut> for a
+tutorial about that.
 
 Let us now discuss Unicode character classes. Just as with Unicode
 characters, there are named Unicode character classes represented by the
@@ -1993,11 +2014,12 @@ character classes. These have the form C<[:name:]>, with C<name> the
 name of the POSIX class. The POSIX classes are C<alpha>, C<alnum>,
 C<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>,
 C<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl
-extension to match C<\w>), and C<blank> (a GNU extension). If
-Unicode is enabled (see C),
-then these classes are defined the same as their
-corresponding Perl Unicode classes: C<[:upper:]> is the same as
-C<\p{IsUpper}>, etc. The C<[:digit:]>, C<[:word:]>, and
+extension to match C<\w>), and C<blank> (a GNU extension). The C<//a>
+modifier restricts these to matching just in the ASCII range; otherwise
+they can match the same as their corresponding Perl Unicode classes:
+C<[:upper:]> is the same as C<\p{IsUpper}>, etc. (There are some
+exceptions and gotchas with this; see L<perlrecharclass> for a full
+discussion.) The C<[:digit:]>, C<[:word:]>, and
 C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s>
 character classes. To negate a POSIX class, put a C<^> in front of
 the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and, under
-- 
2.7.4
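
The behaviour described in the added paragraphs can be checked with a short script. What follows is a minimal sketch rather than part of the patch, assuming Perl 5.14 or later on an ASCII platform; the variable names and the three test characters are chosen here purely for illustration.

```perl
use strict;
use warnings;
use 5.014;    # the /a, /aa, /u and /d modifiers need Perl 5.14 or later

# Illustrative test characters (assumptions of this sketch, not from the patch).
my $arabic_three = "\x{0663}";    # ARABIC-INDIC DIGIT THREE, a Unicode digit
my $kelvin       = "\x{212A}";    # KELVIN SIGN, case-folds to "k"
my $e_acute      = "\xE9";        # e with acute accent, stored as a single byte

my %result = (
    # \d matches any Unicode decimal digit unless /a is in effect.
    digit_default => ( $arabic_three =~ /^\d$/  ? 1 : 0 ),
    digit_a       => ( $arabic_three =~ /^\d$/a ? 1 : 0 ),

    # /a alone does not change case-insensitive matching; /aa forbids an
    # ASCII character from caselessly matching a non-ASCII one.
    kelvin_i   => ( $kelvin =~ /^k$/i   ? 1 : 0 ),
    kelvin_ai  => ( $kelvin =~ /^k$/ai  ? 1 : 0 ),
    kelvin_aai => ( $kelvin =~ /^k$/aai ? 1 : 0 ),

    # For a string that is not UTF-8 encoded internally, the pre-5.14
    # default rules (/d) treat \xE9 as a non-word character, while /u
    # applies Unicode rules to it.
    word_d => ( $e_acute =~ /^\w$/d ? 1 : 0 ),
    word_u => ( $e_acute =~ /^\w$/u ? 1 : 0 ),
);

printf "%-14s %d\n", $_, $result{$_} for sort keys %result;
```

The explicit C<//d> and C<//u> modifiers are used in the last pair so that the comparison does not depend on whether C<use feature 'unicode_strings'> happens to be in scope.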