From 0d017f4d564175907ce6698d1a162341a850ea9d Mon Sep 17 00:00:00 2001 From: Wolfgang Laun Date: Sun, 4 Feb 2007 17:26:14 +0100 Subject: [PATCH] minor improvements for perlre.pod From: "Wolfgang Laun" Message-ID: <17de7ee80702040726v23f54266g3c352d353a30c430@mail.gmail.com> p4raw-id: //depot/perl@30126 --- pod/perlre.pod | 191 ++++++++++++++++++++++++++++++--------------------------- 1 file changed, 99 insertions(+), 92 deletions(-) diff --git a/pod/perlre.pod b/pod/perlre.pod index d886d09..d913c80 100644 --- a/pod/perlre.pod +++ b/pod/perlre.pod @@ -16,6 +16,9 @@ operations, plus various examples of the same, see discussions of C, C, C and C in L. + +=head2 Modifiers + Matching operations can have various modifiers. Modifiers that relate to the interpretation of the regular expression inside are listed below. Modifiers that alter the way a regular expression @@ -84,7 +87,7 @@ X =head3 Metacharacters -The patterns used in Perl pattern matching derive from supplied in +The patterns used in Perl pattern matching evolved from the ones supplied in the Version 8 regex routines. (The routines are derived (distantly) from Henry Spencer's freely redistributable reimplementation of the V8 routines.) See L for @@ -149,24 +152,24 @@ many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a "?". Note that the meanings don't change, just the "greediness": -X X X +X X X X X<*?> X<+?> X X<{n}?> X<{n,}?> X<{n,m}?> - *? Match 0 or more times - +? Match 1 or more times - ?? Match 0 or 1 time - {n}? Match exactly n times - {n,}? Match at least n times - {n,m}? Match at least n but not more than m times + *? Match 0 or more times, not greedily + +? Match 1 or more times, not greedily + ?? Match 0 or 1 time, not greedily + {n}? Match exactly n times, not greedily + {n,}? Match at least n times, not greedily + {n,m}? 
Match at least n but not more than m times, not greedily By default, when a quantified subpattern does not allow the rest of the overall pattern to match, Perl will backtrack. However, this behaviour is -sometimes undesirable. Thus Perl provides the "possesive" quantifier form +sometimes undesirable. Thus Perl provides the "possessive" quantifier form as well. - *+ Match 0 or more times and give nothing back - ++ Match 1 or more times and give nothing back - ?+ Match 0 or 1 time and give nothing back + *+ Match 0 or more times and give nothing back + ++ Match 1 or more times and give nothing back + ?+ Match 0 or 1 time and give nothing back {n}+ Match exactly n times and give nothing back (redundant) {n,}+ Match at least n times and give nothing back {n,m}+ Match at least n but not more than m times and give nothing back @@ -183,7 +186,7 @@ string" problem can be most efficiently performed when written as: /"(?:[^"\\]++|\\.)*+"/ -as we know that if the final quote does not match, bactracking will not +as we know that if the final quote does not match, backtracking will not help. See the independent subexpression C<< (?>...) >> for more details; possessive quantifiers are just syntactic sugar for that construct. 
For instance the above example could also be written as follows: @@ -194,7 +197,7 @@ instance the above example could also be written as follows: Because patterns are processed as double quoted strings, the following also work: -X<\t> X<\n> X<\r> X<\f> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q> +X<\t> X<\n> X<\r> X<\f> X<\e> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q> X<\0> X<\c> X<\N> X<\x> \t tab (HT, TAB) @@ -203,10 +206,10 @@ X<\0> X<\c> X<\N> X<\x> \f form feed (FF) \a alarm (bell) (BEL) \e escape (think troff) (ESC) - \033 octal char (think of a PDP-11) - \x1B hex char - \x{263a} wide hex char (Unicode SMILEY) - \c[ control char + \033 octal char (example: ESC) + \x1B hex char (example: ESC) + \x{263a} wide hex char (example: Unicode SMILEY) + \cK control char (example: VT) \N{name} named char \l lowercase next char (think vi) \u uppercase next char (think vi) @@ -227,9 +230,9 @@ You'll need to write something like C. =head3 Character classes In addition, Perl defines the following: -X X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C> -X X +X<\g> X<\k> X<\N> X<\K> X<\v> X<\V> +X X X X \w Match a "word" character (alphanumeric plus "_") \W Match a non-"word" character @@ -265,12 +268,13 @@ to match a string of Perl-identifier characters (which isn't the same as matching an English word). If C is in effect, the list of alphabetic characters generated by C<\w> is taken from the current locale. See L. You may use C<\w>, C<\W>, C<\s>, C<\S>, -C<\d>, and C<\D> within character classes, but if you try to use them -as endpoints of a range, that's not a range, the "-" is understood -literally. If Unicode is in effect, C<\s> matches also "\x{85}", -"\x{2028}, and "\x{2029}", see L for more details about -C<\pP>, C<\PP>, and C<\X>, and L about Unicode in general. -You can define your own C<\p> and C<\P> properties, see L. +C<\d>, and C<\D> within character classes, but they aren't usable +as either end of a range. 
If any of them precedes or follows a "-", +the "-" is understood literally. If Unicode is in effect, C<\s> matches +also "\x{85}", "\x{2028}, and "\x{2029}". See L for more +details about C<\pP>, C<\PP>, C<\X> and the possibility of defining +your own C<\p> and C<\P> properties, and L about Unicode +in general. X<\w> X<\W> X The POSIX character class syntax @@ -278,7 +282,7 @@ X [:class:] -is also available. Note that the C<[> and C<]> braces are I; +is also available. Note that the C<[> and C<]> brackets are I; they must always be used within a character class expression. # this is correct: @@ -317,7 +321,7 @@ A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace". =item [2] Not exactly equivalent to C<\s> since the C<[[:space:]]> includes -also the (very rare) "vertical tabulator", "\ck", chr(11). +also the (very rare) "vertical tabulator", "\cK" or chr(11) in ASCII. =item [3] @@ -331,7 +335,7 @@ whole character class. For example: [01[:alpha:]%] -matches zero, one, any alphabetic character, and the percentage sign. +matches zero, one, any alphabetic character, and the percent sign. The following equivalences to Unicode \p{} constructs and equivalent backslash character classes (if available), will hold: @@ -342,7 +346,7 @@ X X<\p> X<\p{}> alpha IsAlpha alnum IsAlnum ascii IsASCII - blank IsSpace + blank cntrl IsCntrl digit IsDigit \d graph IsGraph @@ -371,7 +375,7 @@ X Any control character. Usually characters that don't produce output as such but instead control the terminal somehow: for example newline and backspace are control characters. All characters with ord() less than -32 are most often classified as control characters (assuming ASCII, +32 are usually classified as control characters (assuming ASCII, the ISO Latin character sets, and Unicode), as is the character with the ord() value of 127 (C). 
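Reviewer sketch, not part of the patch: the POSIX class rules touched by the hunks above can be checked with a few lines of Perl. This assumes an ASCII platform; the variable names are illustrative only, and the count is restricted to code points 0..127 to stay clear of locale and Unicode effects.

```perl
use strict;
use warnings;

# POSIX classes are only valid inside a bracketed character class.
my $class = qr/\A[01[:alpha:]%]\z/;

# Keep the single-character strings that the class accepts.
my @accepted = grep { $_ =~ $class } ('0', '1', 'a', 'Z', '%', '2', ' ');

# On ASCII platforms "\cK" is the vertical tab, chr(11),
# and [[:space:]] includes it.
my $vt_is_space = "\cK" =~ /\A[[:space:]]\z/ ? 1 : 0;

# Count the code points 0..127 that [[:cntrl:]] accepts:
# everything below 32, plus 127 (DEL).
my $cntrl_count = grep { chr($_) =~ /\A[[:cntrl:]]\z/ } 0 .. 127;

print "accepted: @accepted\n";
print "VT in [[:space:]]: $vt_is_space\n";
print "control characters in 0..127: $cntrl_count\n";
```

Writing `[:alpha:]` without the enclosing brackets would instead be read as an ordinary class containing `:`, `a`, `l`, `p` and `h`, which is why the patch stresses that the brackets belong to the construct.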
@@ -422,7 +426,7 @@ X X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G> \b Match a word boundary - \B Match a non-(word boundary) + \B Match except at a word boundary \A Match only at beginning of string \Z Match only at end of string, or before newline at the end \z Match only at end of string @@ -469,9 +473,10 @@ loop. Take care when using patterns that include C<\G> in an alternation. =head3 Capture buffers -The bracketing construct C<( ... )> creates capture buffers. To -refer to the digit'th buffer use \ within the -match. Outside the match use "$" instead of "\". (The +The bracketing construct C<( ... )> creates capture buffers. To refer +to the current contents of a buffer later on, within the same pattern, +use \1 for the first, \2 for the second, and so on. +Outside the match use "$" instead of "\". (The \ notation works in certain circumstances outside the match. See the warning below about \1 vs $1 for details.) Referring back to another part of the match is called a @@ -492,7 +497,7 @@ backreferences. X<\g{1}> X<\g{-1}> X<\g{name}> X X In order to provide a safer and easier way to construct patterns using -backrefs, in Perl 5.10 the C<\g{N}> notation is provided. The curly +backreferences, Perl 5.10 provides the C<\g{N}> notation. The curly brackets are optional, however omitting them is less safe as the meaning of the pattern can be changed by text (such as digits) following it. When N is a positive integer the C<\g{N}> notation is exactly equivalent @@ -517,17 +522,16 @@ and would match the same as C. Additionally, as of Perl 5.10 you may use named capture buffers and named backreferences. The notation is C<< (?...) >> to declare and C<< \k >> -to reference. You may also use single quotes instead of angle brackets to quote the -name; and you may use the bracketed C<< \g{name} >> back reference syntax. 
-The only difference between named capture buffers and unnamed ones is -that multiple buffers may have the same name and that the contents of -named capture buffers are available via the C<%+> hash. When multiple -groups share the same name C<$+{name}> and C<< \k >> refer to the -leftmost defined group, thus it's possible to do things with named capture -buffers that would otherwise require C<(??{})> code to accomplish. Named -capture buffers are numbered just as normal capture buffers are and may be -referenced via the magic numeric variables or via numeric backreferences -as well as by name. +to reference. You may also use apostrophes instead of angle brackets to delimit the +name; and you may use the bracketed C<< \g{name} >> backreference syntax. +It's possible to refer to a named capture buffer by absolute and relative number as well. +Outside the pattern, a named capture buffer is available via the C<%+> hash. +When different buffers within the same pattern have the same name, C<$+{name}> +and C<< \k >> refer to the leftmost defined group. (Thus it's possible +to do things with named capture buffers that would otherwise require C<(??{})> +code to accomplish.) +X X +X<%+> X<$+{name}> X<\k{name}> Examples: @@ -539,7 +543,7 @@ Examples: /(?.)\k/ # ... a different way and print "'$+{char}' is the first doubled character\n"; - /(?.)\1/ # ... mix and match + /(?'char'.)\1/ # ... mix and match and print "'$1' is the first doubled character\n"; if (/Time: (..):(..):(..)/) { # parse out values @@ -567,7 +571,7 @@ X<$+> X<$^N> X<$&> X<$`> X<$'> X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9> -B: failed matches in Perl do not reset the match variables, +B: Failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match. 
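Reviewer sketch, not part of the patch: the reworded capture-buffer paragraphs can be exercised as below. It assumes Perl 5.10 or later; the sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;
use 5.010;    # named captures, \g{N} and %+ need Perl 5.10 or later

my $text = 'bookkeeper';

# Named buffer: referenced by name inside the pattern, via %+ afterwards.
my $by_name = $text =~ /(?<char>.)\k<char>/ ? $+{char} : undef;

# The apostrophe form declares the same kind of buffer; it is still
# buffer number 1, so \1 and $1 work as well.
my $by_number = $text =~ /(?'char'.)\1/ ? $1 : undef;

# Relative backreference: \g{-1} is the most recently opened buffer.
my $relative = $text =~ /(.)\g{-1}/ ? $1 : undef;

# A failed match does not reset the match variables.
my $kept;
if ('abc' =~ /(b)/) {
    my $second = ('abc' =~ /(x)/);    # fails, so $1 is left alone
    $kept = $1;
}

print "by name: $by_name, by number: $by_number, relative: $relative\n";
print "kept after failed match: $kept\n";
```

All three patterns find the first doubled character, which shows that naming a buffer does not take it out of the normal numbering.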
@@ -655,10 +659,10 @@ One or more embedded pattern-match modifiers, to be turned on (or turned off, if preceded by C<->) for the remainder of the pattern or the remainder of the enclosing pattern group (if any). This is particularly useful for dynamic patterns, such as those read in from a -configuration file, read in as an argument, are specified in a table -somewhere, etc. Consider the case that some of which want to be case -sensitive and some do not. The case insensitive ones need to include -merely C<(?i)> at the front of the pattern. For example: +configuration file, taken from an argument, or specified in a table +somewhere. Consider the case where some patterns want to be case +sensitive and some do not: The case insensitive ones merely need to +include C<(?i)> at the front of the pattern. For example: $pattern = "foobar"; if ( /$pattern/i ) { } @@ -672,9 +676,9 @@ These modifiers are restored at the end of the enclosing group. For example, ( (?i) blah ) \s+ \1 -will match a repeated (I!) word C in any -case, assuming C modifier, and no C modifier outside this -group. +will match C in any case, some spaces, and an exact (I!) +repetition of the previous word, assuming the C modifier, and no C +modifier outside this group. Note that the C modifier is special in that it can only be enabled, not disabled, and that its presence anywhere in a pattern has a global @@ -783,17 +787,17 @@ only for fixed-width look-behind. X<< (?) >> X<(?'NAME')> X X A named capture buffer. Identical in every respect to normal capturing -parens C<()> but for the additional fact that C<%+> may be used after +parentheses C<()> but for the additional fact that C<%+> may be used after a succesful match to refer to a named buffer. See C for more details on the C<%+> hash. If multiple distinct capture buffers have the same name then the $+{NAME} will refer to the leftmost defined buffer in the match. -The forms C<(?'NAME'pattern)> and C<(?pattern)> are equivalent. 
+The forms C<(?'NAME'pattern)> and C<< (?pattern) >> are equivalent. B While the notation of this construct is the same as the similar -function in .NET regexes, the behavior is not, in Perl the buffers are +function in .NET regexes, the behavior is not. In Perl the buffers are numbered sequentially regardless of being named or not. Thus in the pattern @@ -808,8 +812,8 @@ its Unicode extension (see L), though it isn't extended by the locale (see L). B In order to make things easier for programmers with experience -with the Python or PCRE regex engines the pattern C<< (?PENAMEEpattern) >> -maybe be used instead of C<< (?pattern) >>; however this form does not +with the Python or PCRE regex engines, the pattern C<< (?PENAMEEpattern) >> +may be used instead of C<< (?pattern) >>; however this form does not support the use of single quotes as a delimiter for the name. This is only available in Perl 5.10 or later. @@ -822,14 +826,14 @@ the group is designated by name and not number. If multiple groups have the same name then it refers to the leftmost defined group in the current match. -It is an error to refer to a name not defined by a C<(?)> +It is an error to refer to a name not defined by a C<< (?) >> earlier in the pattern. Both forms are equivalent. B In order to make things easier for programmers with experience -with the Python or PCRE regex engines the pattern C<< (?P=NAME) >> -maybe be used instead of C<< \k >> in Perl 5.10 or later. +with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >> +may be used instead of C<< \k >> in Perl 5.10 or later. =item C<(?{ code })> X<(?{})> X X X @@ -873,7 +877,7 @@ Cization are undone, so that # location. >x; -will set C<$res = 4>. Note that after the match, $cnt returns to the globally +will set C<$res = 4>. Note that after the match, C<$cnt> returns to the globally introduced value, because the scopes that restrict C operators are unwound. 
@@ -900,7 +904,7 @@ perilous C pragma has been used (see L), or the variables contain results of C operator (see L). -This restriction is because of the wide-spread and remarkably convenient +This restriction is due to the wide-spread and remarkably convenient custom of using run-time determined strings as patterns. For example: $re = <>; @@ -915,7 +919,7 @@ so you should only do so if you are also using taint checking. Better yet, use the carefully constrained evaluation within a Safe compartment. See L for details about both these mechanisms. -Because perl's regex engine is not currently re-entrant, interpolated +Because Perl's regex engine is currently not re-entrant, interpolated code may not invoke the regex engine either directly with C or C), or indirectly with functions such as C. @@ -1036,7 +1040,7 @@ for later use: } B that this pattern does not behave the same way as the equivalent -PCRE or Python construct of the same form. In perl you can backtrack into +PCRE or Python construct of the same form. In Perl you can backtrack into a recursed group, in PCRE and Python the recursed into group is treated as atomic. Also, modifiers are resolved at compile time, so constructs like (?i:(?1)) or (?:(?i)(?1)) do not affect how the sub-pattern will @@ -1045,8 +1049,8 @@ be processed. =item C<(?&NAME)> X<(?&NAME)> -Recurse to a named subpattern. Identical to (?PARNO) except that the -parenthesis to recurse to is determined by name. If multiple parens have +Recurse to a named subpattern. Identical to C<(?PARNO)> except that the +parenthesis to recurse to is determined by name. If multiple parentheses have the same name, then it recurses to the leftmost. It is an error to refer to a name that is not declared somewhere in the @@ -1054,7 +1058,7 @@ pattern. B In order to make things easier for programmers with experience with the Python or PCRE regex engines the pattern C<< (?P>NAME) >> -maybe be used instead of C<< (?&NAME) >> as of Perl 5.10. 
+may be used instead of C<< (?&NAME) >> in Perl 5.10 or later. =item C<(?(condition)yes-pattern|no-pattern)> X<(?()> @@ -1147,7 +1151,7 @@ An example of how this might be used is as follows: )/x Note that capture buffers matched inside of recursion are not accessible -after the recursion returns, so the extra layer of capturing buffers are +after the recursion returns, so the extra layer of capturing buffers is necessary. Thus C<$+{NAME_PAT}> would not be defined even though C<$+{NAME}> would be. @@ -1260,7 +1264,7 @@ to inside of one of these constructs. The following equivalences apply: =head2 Special Backtracking Control Verbs B These patterns are experimental and subject to change or -removal in a future version of perl. Their usage in production code should +removal in a future version of Perl. Their usage in production code should be noted to avoid problems during upgrades. These special patterns are generally of the form C<(*VERB:ARG)>. Unless @@ -1308,7 +1312,7 @@ continues in B, which may also backtrack as necessary; however, should B not match, then no further backtracking will take place, and the pattern will fail outright at the current starting position. -As a shortcut, X<\v> is exactly equivalent to C<(*PRUNE)>. +As a shortcut, C<\v> is exactly equivalent to C<(*PRUNE)>. The following example counts all the possible matching strings in a pattern (without actually matching any of them). @@ -1361,7 +1365,7 @@ of this pattern. This effectively means that the regex engine "skips" forward to this position on failure and tries to match again, (assuming that there is sufficient room to match). -As a shortcut X<\V> is exactly equivalent to C<(*SKIP)>. +As a shortcut C<\V> is exactly equivalent to C<(*SKIP)>. The name of the C<(*SKIP:NAME)> pattern has special significance. If a C<(*MARK:NAME)> was encountered while matching, then it is that position @@ -1498,7 +1502,7 @@ for production code. 
This pattern matches nothing and causes the end of successful matching at the point at which the C<(*ACCEPT)> pattern was encountered, regardless of whether there is actually more to match in the string. When inside of a -nested pattern, such as recursion or a dynamically generated subbpattern +nested pattern, such as recursion, or in a subpattern dynamically generated via C<(??{})>, only the innermost pattern is ended immediately. If the C<(*ACCEPT)> is inside of capturing buffers then the buffers are @@ -1508,7 +1512,7 @@ For instance: 'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x; will match, and C<$1> will be C and C<$2> will be C, C<$3> will not -be set. If another branch in the inner parens were matched, such as in the +be set. If another branch in the inner parentheses were matched, such as in the string 'ACDE', then the C and C would have to be matched as well. =back @@ -1521,11 +1525,11 @@ X X NOTE: This section presents an abstract approximation of regular expression behavior. For a more rigorous (and complicated) view of the rules involved in selecting a match among possible alternatives, -see L. +see L. A fundamental feature of regular expression matching involves the notion called I, which is currently used (when needed) -by all regular expression quantifiers, namely C<*>, C<*?>, C<+>, +by all regular non-possessive expression quantifiers, namely C<*>, C<*?>, C<+>, C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized internally, but the general principle outlined here is valid. @@ -1573,7 +1577,7 @@ and the first "bar" thereafter. if ( /foo(.*?)bar/ ) { print "got <$1>\n" } got -Here's another example: let's say you'd like to match a number at the end +Here's another example. Let's say you'd like to match a number at the end of a string, and you also want to keep the preceding part of the match. So you write this: @@ -1698,9 +1702,9 @@ using the vertical bar. 
C means match "a" AND (then) match "b", although the attempted matches are made at different positions because "a" is not a zero-width assertion, but a one-width assertion. -B: particularly complicated regular expressions can take +B: Particularly complicated regular expressions can take exponential time to solve because of the immense number of possible -ways they can use backtracking to try match. For example, without +ways they can use backtracking to try for a match. For example, without internal optimizations done by the regular expression engine, this will take a painfully long time to run: @@ -1732,9 +1736,12 @@ Any single character matches itself, unless it is a I with a special meaning described here or above. You can cause characters that normally function as metacharacters to be interpreted literally by prefixing them with a "\" (e.g., "\." matches a ".", not any -character; "\\" matches a "\"). A series of characters matches that -series of characters in the target string, so the pattern C -would match "blurfl" in the target string. +character; "\\" matches a "\"). This escape mechanism is also required +for the character used as the pattern delimiter. + +A series of characters matches that series of characters in the target +string, so the pattern C would match "blurfl" in the target +string. You can specify a character class, by enclosing a list of characters in C<[]>, which will match any character from the list. If the @@ -1755,7 +1762,7 @@ a range, the "-" is understood literally. Note also that the whole range idea is rather unportable between character sets--and even within character sets they may cause results you probably didn't expect. A sound principle is to use only ranges -that begin from and end at either alphabets of equal case ([a-e], +that begin from and end at either alphabetics of equal case ([a-e], [A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt, spell out the character sets in full. 
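Reviewer sketch, not part of the patch: the two points added in the hunks above (escaping the pattern delimiter, and keeping ranges within one case or within the digits) look like this in practice. The sample strings and variable names are illustrative assumptions.

```perl
use strict;
use warnings;

# The pattern delimiter must be escaped when it occurs in the pattern ...
my $escaped = () = 'a/b/c' =~ /\//g;

# ... unless a different delimiter is chosen.
my $braces = () = 'a/b/c' =~ m{/}g;

# A "-" placed after \w cannot start a range, so it is a literal hyphen.
my @tokens = 'foo-bar baz_1 x+y' =~ /([\w-]+)/g;

# Ranges within one case, or within the digits, are the portable ones.
my $is_hex = 'c0ffee' =~ /\A[0-9a-f]+\z/ ? 1 : 0;

print "slashes: $escaped and $braces\n";
print "tokens: @tokens\n";
print "hex: $is_hex\n";
```

Choosing `m{...}` over `/.../` is usually the more readable option once a pattern contains several slashes.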
@@ -1800,7 +1807,7 @@ match "0x1234 0x4321", but not "0x1234 01234", because subpattern 1 matched "0x", even though the rule C<0|0x> could potentially match the leading 0 in the second number. -=head2 Warning on \1 vs $1 +=head2 Warning on \1 Instead of $1 Some people get too used to writing things like: @@ -1825,7 +1832,7 @@ C<${1}000>. The operation of interpolation should not be confused with the operation of matching a backreference. Certainly they mean two different things on the I side of the C. -=head2 Repeated patterns matching zero-length substring +=head2 Repeated Patterns Matching a Zero-length Substring B: Difficult material (and prose) ahead. This section needs a rewrite. @@ -1838,7 +1845,7 @@ loops using regular expressions, with something as innocuous as: 'foo' =~ m{ ( o? )* }x; -The C can match at the beginning of C<'foo'>, and since the position +The C matches at the beginning of C<'foo'>, and since the position in the string is not moved by the match, C would match again and again because of the C<*> modifier. Another common way to create a similar cycle is with the looping modifier C: @@ -1901,7 +1908,7 @@ the matched string, and is reset by each assignment to pos(). Zero-length matches at the end of the previous match are ignored during C. -=head2 Combining pieces together +=head2 Combining RE Pieces Each of the elementary pieces of regular expressions which were described before (such as C or C<\Z>) could match at most one substring @@ -2002,13 +2009,13 @@ One more rule is needed to understand how a match is determined for the whole regular expression: a match at an earlier position is always better than a match at a later position. -=head2 Creating custom RE engines +=head2 Creating Custom RE Engines Overloaded constants (see L) provide a simple way to extend the functionality of the RE engine. 
 Suppose that we want to enable a new RE escape-sequence C<\Y|> which
-matches at boundary between whitespace characters and non-whitespace
+matches at a boundary between whitespace characters and non-whitespace
 characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
 at these positions, so we want to have each C<\Y|> in the place of the
 more complicated version. We can create a module C<customre> to do
-- 
2.7.4