From ed7efc79ab6ea9f03d275ec3a285b8416f9c9bfa Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Sun, 10 Apr 2011 18:05:52 -0600
Subject: [PATCH] perlre.pod: Update for 5.14

---
 pod/perlre.pod | 299 +++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 225 insertions(+), 74 deletions(-)

diff --git a/pod/perlre.pod b/pod/perlre.pod
index 387c820..fa7f3ec 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -53,12 +53,34 @@ Do case-insensitive pattern matching. If C is in effect, the case map is taken from the current locale for code points less than 255, and from Unicode rules for larger
-code points. See L.
+code points. However, matches that would cross the Unicode
+rules/non-Unicode rules boundary (ords 255/256) will not succeed. See
+L.
+
+There are a number of Unicode characters that match multiple characters
+under C. For example, C
+should match the sequence C. Perl is not
+currently able to do this when the multiple characters are in the pattern and
+are split between groupings, or when one or more are quantified. Thus
+
+ "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i; # Matches
+ "\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i; # Doesn't match!
+ "\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i; # Doesn't match!
+
+ # The below doesn't match, and it isn't clear what $1 and $2 would
+ # be even if it did!!
+ "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i; # Doesn't match!
+
+Also, this matching doesn't fully conform to the current Unicode
+recommendations, which ask that the matching be made upon the NFD
+(Normalization Form Decomposed) of the text. However, Unicode is
+in the process of reconsidering and revising their recommendations.
 =item x X Extend your pattern's legibility by permitting whitespace and comments.
+Details in L
 =item p X

X X
@@ -79,18 +101,21 @@ of the g and c modifiers. X X X X These modifiers, new in 5.14, affect which character-set semantics
-(Unicode, ASCII, etc.) are used, as described below.
+(Unicode, ASCII, etc.) are used, as described below in
+L.
 =back These are usually written as "the C modifier", even though the delimiter in question might not really be a slash. The modifiers C may also be embedded within the regular expression itself using
-the C<(?...)> construct.
+the C<(?...)> construct, see L below.
 The C, C, C, C and C modifiers need a little more explanation.
+=head3 /x
+
 C tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a character class. You can use this to break up
@@ -118,80 +143,211 @@ in C<\p{...}> there can be spaces that follow the Unicode rules, for which see L. X
-C means to use a locale (see L) when pattern matching.
-The locale used will be the one in effect at the time of execution of
-the pattern match. This may not be the same as the compilation-time
-locale, and can differ from one match to another if there is an
-intervening call of the
+=head3 Character set modifiers
+
+C, C, C, and C, available starting in 5.14, are called
+the character set modifiers; they affect the character set semantics
+used for the regular expression.
+
+At any given time, exactly one of these modifiers is in effect. Once
+compiled, the behavior doesn't change regardless of what rules are in
+effect when the regular expression is executed. And if a regular
+expression is interpolated into a larger one, the original's rules
+continue to apply to it, and only it.
+
+=head4 /l
+
+means to use the current locale's rules (see L) when pattern
+matching. For example, C<\w> will match the "word" characters of that
+locale, and C<"/i"> case-insensitive matching will match according to
+the locale's case folding rules. The locale used will be the one in
+effect at the time of execution of the pattern match.
This may not be
+the same as the compilation-time locale, and can differ from one match
+to another if there is an intervening call of the
 L.
-This modifier is automatically set if the regular expression is compiled
-within the scope of a C<"use locale"> pragma.
-Perl only allows single-byte locales. This means that code points above
-255 are treated as Unicode no matter what locale is in effect.
-Under Unicode rules, there are a few case-insensitive matches that cross the
-255/256 boundary. These are disallowed. For example,
-0xFF does not caselessly match the character at 0x178, LATIN CAPITAL
-LETTER Y WITH DIAERESIS, because 0xFF may not be LATIN SMALL LETTER Y
-in the current locale, and Perl has no way of knowing if that character
-even exists in the locale, much less what code point it is.
+
+Perl only supports single-byte locales. This means that code points
+above 255 are treated as Unicode no matter what locale is in effect.
+Under Unicode rules, there are a few case-insensitive matches that cross
+the 255/256 boundary. These are disallowed under C. For example,
+0xFF does not caselessly match the character at 0x178, C, because 0xFF may not be C in the current locale, and Perl has no way of knowing if
+that character even exists in the locale, much less what code point it
+is.
+
+This modifier may be specified to be the default by C, but
+see L.
 X
-C means to use Unicode semantics when pattern matching. It is
-automatically set if the regular expression is encoded in utf8 internally,
-or is compiled within the scope of a
-L|feature> pragma (and isn't also in
-the scope of the L|locale> or the L|bytes>
-pragma). On ASCII platforms, the code points between 128 and 255 take on their
+=head4 /u
+
+means to use Unicode rules when pattern matching. On ASCII platforms,
+this means that the code points between 128 and 255 take on their
 Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's), whereas in strict ASCII their meanings are undefined.
Thus the platform
-effectively becomes a Unicode platform. The ASCII characters remain as
-ASCII characters (since ASCII is a subset of Latin-1 and Unicode). For
-example, when this option is not on, on a non-utf8 string, C<"\w">
-matches precisely C<[A-Za-z0-9_]>. When the option is on, it matches
-not just those, but all the Latin-1 word characters (such as an "n" with
-a tilde). On EBCDIC platforms, which already are equivalent to Latin-1,
-this modifier changes behavior only when the C<"/i"> modifier is also
-specified, and affects only two characters, giving them full Unicode
-semantics: the C will match the Greek capital and
-small letters C; otherwise not; and the C will match any of C, C, C, and C, otherwise not.
-(This last case is buggy, however.)
+effectively becomes a Unicode platform, hence, for example, C<\w> will
+match any of the more than 100_000 word characters in Unicode.
+
+Unlike most locales, which are specific to a language and country pair,
+Unicode classifies all the characters that are letters I as
+C<\w>. For example, your locale might not think that C is a letter (unless you happen to speak Icelandic), but
+Unicode does. Similarly, all the characters that are decimal digits
+somewhere in the world will match C<\d>; this is hundreds, not 10,
+possible matches. And some of those digits look like some of the 10
+ASCII digits, but mean a different number, so a human could easily think
+a number is a different quantity than it really is. For example,
+C (U+09EA) looks very much like an
+C (U+0038). And, C<\d+>, may match strings of digits
+that are a mixture from different writing systems, creating a security
+issue. L can be used to sort this out.
+
+Also, case-insensitive matching works on the full set of Unicode
+characters. The C, for example matches the letters "k" and
+"K"; and C matches the sequence "ff", which,
+if you're not prepared, might make it look like a hexadecimal constant,
+presenting another potential security issue.
See
+L for a detailed discussion of Unicode
+security issues.
+
+On EBCDIC platforms, which already are equivalent to Latin-1 (at least
+the ones that Perl handles), this modifier changes behavior only when
+the C<"/i"> modifier is also specified, and it turns out it affects only
+two characters, giving them full Unicode semantics: the C
+will match the Greek capital and small letters C; otherwise not; and
+the C will match any of C, C,
+C, and C, otherwise not.
+
+This modifier may be specified to be the default by C, but see
+L.
 X
-C is the same as C, except that C<\d>, C<\s>, C<\w>, and the
+=head4 /a
+
+is the same as C, except that C<\d>, C<\s>, C<\w>, and the
 Posix character classes are restricted to matching in the ASCII range only. That is, with this modifier, C<\d> always means precisely the digits C<"0"> to C<"9">; C<\s> means the five characters C<[ \f\n\r\t]>; C<\w> means the 63 characters C<[A-Za-z0-9_]>; and likewise, all the Posix classes such as C<[[:print:]]> match only the appropriate
-ASCII-range characters. As you would expect, this modifier causes, for
-example, C<\D> to mean the same thing as C<[^0-9]>; in fact, all
-non-ASCII characters match C<\D>, C<\S>, and C<\W>. C<\b> still means
-to match at the boundary between C<\w> and C<\W>, using the C<"a">
-definitions of them (similarly for C<\B>). Otherwise, C<"a"> behaves
-like the C<"u"> modifier, in that case-insensitive matching uses Unicode
-semantics; for example, "k" will match the Unicode C<\N{KELVIN SIGN}>
-under C matching, and code points in the Latin1 range, above ASCII
-will have Unicode semantics when it comes to case-insensitive matching.
-But writing two in "a"'s in a row will increase its effect, causing the
-Kelvin sign and all other non-ASCII characters not to match any ASCII
-character under C matching.
+ASCII-range characters.
+
+This modifier is useful for people who only incidentally use Unicode.
+With it, one can write C<\d> with confidence that it will only match
+ASCII characters, and should the need arise to match beyond ASCII, you
+can use C<\p{Digit}>, or C<\p{Word}> for C<\w>. There are similar
+C<\p{...}> constructs that can match white space and Posix classes
+beyond ASCII. See L.
+
+As you would expect, this modifier causes, for example, C<\D> to mean
+the same thing as C<[^0-9]>; in fact, all non-ASCII characters match
+C<\D>, C<\S>, and C<\W>. C<\b> still means to match at the boundary
+between C<\w> and C<\W>, using the C definitions of them (similarly
+for C<\B>).
+
+Otherwise, C behaves like the C modifier, in that
+case-insensitive matching uses Unicode semantics; for example, "k" will
+match the Unicode C<\N{KELVIN SIGN}> under C matching, and code
+points in the Latin1 range, above ASCII will have Unicode rules when it
+comes to case-insensitive matching.
+
+To forbid ASCII/non-ASCII matches (like "k" with C<\N{KELVIN SIGN}>),
+specify the "a" twice, for example C or C
+
+To reiterate, this modifier provides protection for applications that
+don't wish to be exposed to all of Unicode. Specifying it twice
+gives added protection.
+
+This modifier may be specified to be the default by C
+or C, but see
+L.
 X
+X
+
+=head4 /d
+
+This modifier means to use the "Default" native rules of the platform
+except when there is cause to use Unicode rules instead, as follows:
+
+=over 4
+
+=item 1
+
+the target string is encoded in UTF-8; or
+
+=item 2
+
+the pattern is encoded in UTF-8; or
+
+=item 3
+
+the pattern explicitly mentions a code point that is above 255 (say by
+C<\x{100}>); or
+
+=item 4
-C means to use the traditional Perl pattern-matching behavior.
-This is dualistic (hence the name C, which also could stand for
-"depends"). When this is in effect, Perl matches according to the
-platform's native character set rules unless there is something that
-indicates to use Unicode rules.
If either the target string or the
-pattern itself is encoded in UTF-8, Unicode rules are used. Also, if
-the pattern contains Unicode-only features, such as code points above
-255, C<\p()> Unicode properties or C<\N{}> Unicode names, Unicode rules
-will be used. It is automatically selected by default if the regular
-expression is compiled neither within the scope of a C<"use locale">
-pragma nor a pragma.
-This behavior causes a number of glitches, see
-L.
-X
+the pattern uses a Unicode name (C<\N{...}>); or
+
+=item 5
+
+the pattern uses a Unicode property (C<\p{...}>)
+
+=back
+
+Another mnemonic for this modifier is "Depends", as the rules actually
+used depend on various things, and as a result you can get unexpected
+results. See L.
+
+On ASCII platforms, the native rules are ASCII, and on EBCDIC platforms
+(at least the ones that Perl handles), they are Latin-1.
+
+Here are some examples of how that works on an ASCII platform:
+
+ $str = "\xDF"; # $str is not in UTF-8 format.
+ $str =~ /^\w/; # No match, as $str isn't in UTF-8 format.
+ $str .= "\x{0e0b}"; # Now $str is in UTF-8 format.
+ $str =~ /^\w/; # Match! $str is now in UTF-8 format.
+ chop $str;
+ $str =~ /^\w/; # Still a match! $str remains in UTF-8 format.
+
+=head4 Which character set modifier is in effect?
+
+Which of these modifiers is in effect at any given point in a regular
+expression depends on a fairly complex set of interactions. As
+explained below in L it is possible to explicitly
+specify modifiers that apply only to portions of a regular expression.
+The innermost always has priority over any outer ones, and one applying
+to the whole expression has priority over any default settings that are
+described in the next few paragraphs.
+
+The Cfoo'|re/'Eflags' mode">> pragma can be used to set
+default modifiers (including these) for regular expressions compiled
+within its scope. This pragma has precedence over the other pragmas
+that change the defaults, as listed below.
+
+Otherwise, C> sets the default modifier to C;
+and C> or
+C> (or higher) set the default to
+C when not in the same scope as either C>
+or C> .
+
+If none of the above apply, for backwards compatibility reasons, the
+C modifier is the one in effect by default. As this can lead to
+unexpected results, it is best to specify which other rule set should be
+used.
+
+=head4 Character set modifier behavior prior to Perl 5.14
+
+Prior to 5.14, there were no explicit modifiers, but C was implied
+for regexes compiled within the scope of C, and C was
+implied otherwise. However, interpolating a regex into a larger regex
+would ignore the original compilation in favor of whatever was in effect
+at the time of the second compilation. There were a number of
+inconsistencies (bugs) with the C modifier, where Unicode rules
+would be used when inappropriate, and vice versa. C<\p{}> did not imply
+Unicode rules, and neither did all occurrences of C<\N{}>, until 5.12.
 =head2 Regular Expressions
@@ -549,7 +705,7 @@ digits padded with leading zeros, since a leading zero implies an octal constant. The C<\I> notation also works in certain circumstances outside
-the pattern. See L below for details.)
+the pattern. See L below for details.
 Examples:
@@ -733,7 +889,8 @@ But a minus sign is not legal with it. Note that the C, C, C, C

, and C modifiers are special in that they can only be enabled, not disabled, and the C, C, C, and C modifiers are mutually exclusive: specifying one de-specifies the
-others, and a maximum of one may appear in the construct. Thus, for
+others, and a maximum of one (or two C's) may appear in the
+construct. Thus, for
 example, C<(?-p)> will warn when compiled under C; C<(?-d:...)> and C<(?dl:...)> are fatal errors.
@@ -2253,17 +2410,11 @@ Subroutine call to a named capture group. Equivalent to C<< (?&NAME) >>. =head1 BUGS
-There are numerous problems with case-insensitive matching of characters
-outside the ASCII range, especially with those whose folds are multiple
-characters, such as ligatures like C.
-
-In a bracketed character class with case-insensitive matching, ranges only work
-for ASCII characters. For example,
-C
-doesn't match all the Russian upper and lower case letters.
-
 Many regular expression constructs don't work on EBCDIC platforms.
+There are a number of issues with regard to case-insensitive matching
+in Unicode rules. See C under L above.
 This document varies from difficult to understand to completely and utterly opaque. The wandering prose riddled with jargon is hard to fathom in several places.
-- 
2.7.4
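
The character set modifiers documented by this patch can be exercised
with a short script. What follows is a minimal sketch and not part of
the patch itself. It assumes an ASCII platform and Perl 5.14 or later;
the variable names and the %seen hash are hypothetical, chosen only for
illustration. It compares /u with /a on a non-ASCII digit, /a with /aa
on a case fold that crosses the ASCII boundary, /d with /u on a
non-UTF-8 string, and checks that a pattern compiled with /a keeps its
rules when interpolated into a /u pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.014;    # the /a, /aa, /d, /l and /u modifiers first appear in 5.14

# Illustrative test strings (hypothetical names, chosen for this sketch).
my $bengali_four = "\x{09EA}";    # BENGALI DIGIT FOUR
my $kelvin       = "\x{212A}";    # KELVIN SIGN
my $sharp_s      = "\xDF";        # a Latin-1 byte; the string is not UTF-8 encoded

# A pattern compiled under /a; its rules travel with it when interpolated.
my $ascii_digit = qr/\d/a;

my %seen = (
    # \d under Unicode rules accepts digits from any script ...
    u_digit  => ( $bengali_four =~ /\A\d\z/u ) ? 1 : 0,
    # ... but under /a only the ASCII digits 0 to 9.
    a_digit  => ( $bengali_four =~ /\A\d\z/a ) ? 1 : 0,
    # With a single "a", case-insensitive matching still uses Unicode folds.
    a_fold   => ( $kelvin =~ /\Ak\z/ia ) ? 1 : 0,
    # Doubling the "a" forbids ASCII/non-ASCII fold matches.
    aa_fold  => ( $kelvin =~ /\Ak\z/iaa ) ? 1 : 0,
    # Under /d, a non-UTF-8 string gets the platform's native (ASCII) rules.
    d_word   => ( $sharp_s =~ /\A\w\z/d ) ? 1 : 0,
    # Under /u, the same byte takes on its Latin-1 meaning.
    u_word   => ( $sharp_s =~ /\A\w\z/u ) ? 1 : 0,
    # The interpolated /a pattern keeps ASCII rules inside a /u pattern.
    interp   => ( $bengali_four =~ /\A$ascii_digit\z/u ) ? 1 : 0,
);

printf "%-8s %s\n", $_, ( $seen{$_} ? "match" : "no match" )
    for sort keys %seen;
```

On an ASCII platform this should report a match for u_digit, a_fold and
u_word, and no match for a_digit, aa_fold, d_word and interp. Explicit
modifiers are used on every pattern so that the result does not depend
on whichever default the surrounding pragmas select.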