Tony Cook [Thu, 13 Feb 2014 02:55:12 +0000 (13:55 +1100)]
[perl #121207] test.pl tempfile() improvements
Brad Gilbert [Tue, 4 Feb 2014 17:55:28 +0000 (11:55 -0600)]
Make sure that tempfile() in t/test.pl removes the temp files
Run a program within t/test_pl/tempfile.t that uses tempfile() to make
sure that the temp file is removed when the tests are done.
Brad Gilbert [Tue, 4 Feb 2014 16:37:02 +0000 (10:37 -0600)]
Add register_tempfile() to t/test.pl
This convenience function causes files to be removed in the same
way that tempfile() does.
It uses the same variable that tempfile() does, to catch and
warn on any collisions.
Brad Gilbert [Tue, 4 Feb 2014 16:31:46 +0000 (10:31 -0600)]
Test that tempfile() in t/test.pl skips files that already exist
Brad Gilbert [Tue, 4 Feb 2014 16:20:26 +0000 (10:20 -0600)]
Improve error diagnostics in t/test_pl/tempfile.t
Tony: add a missing local
Alan Haggai Alavi [Wed, 12 Feb 2014 11:20:49 +0000 (16:50 +0530)]
Replace 'mkpath' (legacy interface) with 'make_path'.
Update email of Alan Haggai Alavi; consolidate all email addresses
in Porting/checkAUTHORS.pl.
Reini Urban [Wed, 12 Feb 2014 17:05:43 +0000 (11:05 -0600)]
NETWARE CopFILE_setn
broken since 5.8.9
Karl Williamson [Wed, 12 Feb 2014 17:26:08 +0000 (10:26 -0700)]
pod/perldebguts: Stress ephemeral nature of regnode types
Karl Williamson [Fri, 7 Feb 2014 17:57:15 +0000 (10:57 -0700)]
regcomp.h: Rmv false comments
I misread the code when I added these comments
Karl Williamson [Thu, 6 Feb 2014 16:57:51 +0000 (09:57 -0700)]
regexec.c, locale.c: Silence some compiler warnings
For regexec.c, one compiler amongst our smokers believes there is a path
where this array can be used uninitialized; it's easiest to just
initialize it, even though I think the compiler is wrong, unless it is
optimizing incorrectly, in which case it would still be best to
initialize it.
For locale.c, this is just the well-known gcc bug that they refuse to
fix concerning a (void) cast when the function has been declared to
require not ignoring the result.
Chris 'BinGOs' Williams [Wed, 12 Feb 2014 09:17:36 +0000 (09:17 +0000)]
Update CGI to CPAN version 3.65
[DELTA]
Version 3.65 Feb 11, 2014
[INTERNALS]
- Update Makefile to refine where CGI.pm gets installed
(Thanks to bingo, rjbs: https://github.com/markstos/CGI.pm/pull/30)
Ricardo Signes [Mon, 10 Feb 2014 23:43:53 +0000 (18:43 -0500)]
acknowledgements: use $^X when running sub-perl
Ricardo Signes [Mon, 10 Feb 2014 23:38:07 +0000 (18:38 -0500)]
tweak Porting/acknowledgements.pl to avoid awk and grep
Rather than assuming that we are on a unix system with the
standard kit, we can rely on this neat little programming
language that provides their behavior and is already assumed
to be present!
Gavin Shelley [Mon, 10 Feb 2014 23:19:14 +0000 (18:19 -0500)]
provide a more limited change-count for acknowledgements.pl
Matthew Horsfall [Fri, 7 Feb 2014 14:41:09 +0000 (09:41 -0500)]
Add some examples to cv_set_call_checker and wrap_op_checker
Chris 'BinGOs' Williams [Mon, 10 Feb 2014 16:54:41 +0000 (16:54 +0000)]
Update IO-Socket-IP to CPAN version 0.28
[DELTA]
0.28 2014/02/10 16:17:59
[CHANGES]
* Renamed internal _configure method to _io_socket_ip__configure to
avoid clashes with modules that try to subclass IO::Socket::INET
[BUGFIXES]
* Disable AI_ADDRCONFIG during one-sided 'v6 tests as sometimes it
would otherwise fail
* Skip the SO_BROADCAST test on OSes that fail with EACCES (RT92502)
Chris 'BinGOs' Williams [Mon, 10 Feb 2014 16:53:35 +0000 (16:53 +0000)]
Add IO-Socket-IP to Maintainers.pl
Steffen Mueller [Mon, 10 Feb 2014 10:40:31 +0000 (11:40 +0100)]
pp_concat: Only call SvPV_force_nolen when needed
If we just did an sv_setpvs on it, the SvPV_force_nolen should not do
anything useful, so let's not.
Side note: s/TARG/left/ in the enclosing block because they are the same
pointer, so why use a define that needs grokking by the reader of the
code if the local variable is guaranteed to be the same?
Chris 'BinGOs' Williams [Sun, 9 Feb 2014 21:33:57 +0000 (21:33 +0000)]
Update Module-Build to CPAN version 0.4205
[DELTA]
0.4205 - Sun Feb 9 17:51:22 CET 2014
[BUG FIXES]
- FIX license code regression for artistic license [Roy Ivy III, Leon Timmermans]
- Don't swallow ExtUtils::CBuilder loading errors [Matthew Horsfall, Leon Timmermans]
- Handle testing on cross-compile builds [Brian Fraser]
- Protect against platforms without getpw{nam,uid} [Brian Fraser]
Chris 'BinGOs' Williams [Sun, 9 Feb 2014 21:23:21 +0000 (21:23 +0000)]
Update Pod-Escapes to CPAN version 1.05
[DELTA]
1.05 2014-02-09 NEILB
* Added PREREQ_PM, MIN_PERL_VERSION, LICENSE and repo to Makefile.PL
* Made strict- and warnings-clean.
* Fixed syntax error in abstract: RT#49985 from JDHEDDEN
* Fixed typo reported in RT#85374 by dsteinbrunner
* Renamed this file to Changes and reformatted as per CPAN::Changes::Spec
* Deleted META.yml so MakeMaker will generate MYMETA.{yml,json}
* Noted in pod that now being maintained by NEILB
* Added link to github repo in pod
Father Chrysostomos [Sun, 9 Feb 2014 01:14:10 +0000 (17:14 -0800)]
Use ‘an’ for $/=[] error message
This says ‘an ARRAY’:
$ perl -Mstrict -e '@{"a"}'
Can't use string ("a") as an ARRAY ref while "strict refs" in use at -e line 1.
This says ‘a ARRAY’:
$ ./miniperl -e '$/=[]'
Setting $/ to a ARRAY reference is forbidden at -e line 1.
It ought to say ‘an’.
Father Chrysostomos [Sun, 9 Feb 2014 01:09:46 +0000 (17:09 -0800)]
Increase $mro::VERSION to 1.15
Father Chrysostomos [Sun, 9 Feb 2014 01:09:14 +0000 (17:09 -0800)]
Use HEKfARG in mro.xs
It’s faster to pass the HEK directly instead of creating and
throwing away SVs.
Father Chrysostomos [Sun, 9 Feb 2014 01:02:23 +0000 (17:02 -0800)]
[perl #120374] Stop for($h{k}||'') from vivifying
Commit 2e73d70e52 broke this (made it vivify) by propagating lvalue
context to the branches of || and && (to fix another bug). It broke
App::JobLog as a result.
Because foreach does not do defelem magic (i.e., it vivifies), this
ends up extending vivification to happen where it did not before.
Fixing foreach to do defelem magic (create ‘deferred element’ scalars,
the way sub calls do, to avoid vivifying immediately) would be another
way to fix this, but it is controversial. See ticket #2166.
So, if either argument to || (or &&) is a vivifying op, don’t propagate
the lvalue context, unless this is the return value of an lvalue
sub (necessary for if/else with implicit return to work correctly in
lvalue subs).
Father Chrysostomos [Sat, 8 Feb 2014 21:34:29 +0000 (13:34 -0800)]
Remove DREFed flag from Concise.pm
0824d667 added the flag.
9026059dcee8 removed the flag but left it in
B::Concise.
Father Chrysostomos [Sat, 8 Feb 2014 21:34:17 +0000 (13:34 -0800)]
Increase $B::Concise::VERSION to 0.992
Father Chrysostomos [Sat, 8 Feb 2014 21:22:01 +0000 (13:22 -0800)]
Expand tabs in diagnostics.pm
Otherwise pod like this:
The second situation is caused by an eval accessing a lexical subroutine
that has gone out of scope, for example,
sub f {
my sub a {...}
sub { eval '\&a' }
}
f()->();
is turned into this:
The second situation is caused by an eval accessing a variable that has
gone out of scope, for example,
sub f {
my $a;
sub { eval '$a' }
}
f()->();
instead of this:
The second situation is caused by an eval accessing a variable that has
gone out of scope, for example,
sub f {
my $a;
sub { eval '$a' }
}
f()->();
I don’t know how to test this without literally copying and pasting
parts of diagnostics.pm into diagnostics.t. But I have tested it
manually and it works.
Father Chrysostomos [Sat, 8 Feb 2014 21:10:59 +0000 (13:10 -0800)]
diagnostics.pm: Eliminate $WHOAMI
This variable only held the package name. __PACKAGE__ is faster,
as it allows constant folding.
diagnostics.pm just happens to be older than __PACKAGE__, which was
introduced as recently as 1997 (68dc074516).
Father Chrysostomos [Sat, 8 Feb 2014 21:04:24 +0000 (13:04 -0800)]
Increase $diagnostics::VERSION to 1.34
Father Chrysostomos [Sat, 8 Feb 2014 20:31:11 +0000 (12:31 -0800)]
perldiag: Wrap long lines
to avoid splain output like this on 80-column terminals:
rewinddir() attempted on invalid dirhandle foo at -e line 1 (#1)
(W io) The dirhandle you tried to do a rewinddir() on is either closed or no
t
really a dirhandle. Check your control flow.
Father Chrysostomos [Sat, 8 Feb 2014 20:20:07 +0000 (12:20 -0800)]
perldiag: Don’t use dev version numbers
Dev versions are an artefact of the development process.
Father Chrysostomos [Sat, 8 Feb 2014 20:09:01 +0000 (12:09 -0800)]
perldiag: Consistent spaces after dots
Also, non-integer should be hyphenated.
Father Chrysostomos [Sat, 8 Feb 2014 20:02:19 +0000 (12:02 -0800)]
Another perldelta typo
Father Chrysostomos [Sat, 8 Feb 2014 15:36:37 +0000 (07:36 -0800)]
Alphabetise perldiag
Father Chrysostomos [Sat, 8 Feb 2014 15:34:04 +0000 (07:34 -0800)]
perldelta typo
David Mitchell [Sat, 8 Feb 2014 15:41:39 +0000 (15:41 +0000)]
[MERGE] fix and refactor re_intuit_start()
Perl_re_intuit_start() is the main run-time optimising function for the
regex engine. It tries to either quickly reject a match, or find a suitable
starting point for the NFA.
Unfortunately it is impenetrable code, with 13 labels and no large scale
loop or other constructs, and has several severe performance issues with
long utf8 strings.
This series of commits attempts to fix the performance issues, audit the
code for utf8 and other correctness, and refactor and simplify the code,
as well as improve the documentation. In particular it fixes RT#120692.
With gcc on x86_64, this branch decreases the binary size of the function
by around 15%.
Much of my work on this branch has been an iterative process of wondering
why a piece of code is the way it is, adding some assertions and seeing
what breaks in the test suite, then using that info to improve the code or
documentation.
This work isn't finished yet; in particular I haven't yet audited and
refactored the stclass block of code towards the end of the function.
I also have more refactorisations and some more optimisations
still to go, as well as general tidying up of the documentation.
David Mitchell [Sat, 8 Feb 2014 13:57:54 +0000 (13:57 +0000)]
re_intuit_start(): add comments about check_ix
if (check_ix)
isn't very clear, so clarify it a bit.
David Mitchell [Sat, 8 Feb 2014 12:54:27 +0000 (12:54 +0000)]
re_intuit_start(): assert fixed+float dont overlap
Currently it appears that the anchored and floating substring ranges don't
overlap. Assert this truth; it will force someone to audit the code first
if they wish to change that assumption.
David Mitchell [Fri, 7 Feb 2014 22:33:27 +0000 (22:33 +0000)]
re_intuit_start(): in MBOL block, eliminate t var
Currently we do:
char *t = memchr(rx_origin, '\n');
if (!t) ...
rx_origin = t + 1;
...
Eliminate the t var and just set rx_origin directly:
rx_origin = memchr(rx_origin, '\n');
if (!rx_origin) ...
rx_origin++;
...
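The snippets above abbreviate the memchr() call. As a minimal standalone sketch (not the code in regexec.c), the "reuse the variable" pattern might look like this in C. The helper name next_line_start and the explicit end pointer are assumptions made for illustration; note that the standard memchr() takes a byte count as its third argument.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from regexec.c: return the position just
 * after the next '\n' in [origin, end), or NULL if there is no '\n'. */
static const char *next_line_start(const char *origin, const char *end)
{
    assert(origin <= end);
    /* Overwrite origin directly instead of going through a temporary. */
    origin = memchr(origin, '\n', (size_t)(end - origin));
    if (!origin)
        return NULL;
    return origin + 1;
}
```

Overwriting the variable is only safe because the failure path no longer needs the old value, which is the property the commit relies on.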
David Mitchell [Fri, 7 Feb 2014 22:25:40 +0000 (22:25 +0000)]
re_intuit_start(): MBOL use char for float max
Do the "maximum place \n can appear within the float range" calculation
in chars rather than bytes. Doing it in bytes is logically incorrect,
although I think the worst outcome is that a string is falsely accepted
by intuit then has to be failed by a full run of the regex engine.
But I couldn't think of a test that would show a significant performance
difference.
David Mitchell [Fri, 7 Feb 2014 21:11:28 +0000 (21:11 +0000)]
re_intuit_start(): MBOL limit in chars not bytes
The calculation for the maximum position \n should be searched for,
strend - prog->minlen, is currently done in bytes. Change it to
chars for correctness. It probably doesn't matter at the moment, because
any overshoot in \n will still fail other constraints (which *do* calculate
the end-point correctly). But in the future that might change, and we
don't want any surprises.
David Mitchell [Fri, 7 Feb 2014 17:04:21 +0000 (17:04 +0000)]
re_intuit_start(): remove other_last = rx_origin
The previous commit made this assignment conditional on other_last
not decreasing; but it turns out that increasing it is pointless (although
harmless), since the next time round the "other" substring block, the
current rx_origin will be >= the old rx_origin (since we never decrease
it), and s >= new rx_origin. So s would already be >= the value we would set
other_last to, so it doesn't make any difference.
David Mitchell [Fri, 7 Feb 2014 16:43:19 +0000 (16:43 +0000)]
re_intuit_start(): don't decrease other_last
The /^../m failure code did an unconditional other_last = rx_origin;
if other_last was already high, it could get shrunk and we'd end
up running fbm over the same bit of string repeatedly.
The following code
$s = "-ab\n" x 500_000;
$s .= 'abx';
$s =~ /^ab.*x/m;
(which went quadratic on length) reduces from minutes to millisecs with
this commit. This is because we'd keep going back to near the beginning
of the string and searching for 'x' again.
David Mitchell [Fri, 7 Feb 2014 14:58:06 +0000 (14:58 +0000)]
re_intuit_start(): make assert unconditional
An assert was originally in the float-then-anchored branch, but not in the
anchored-then-float branch. When the two branches were merged, the assert
was only done if other==anchored. It turns out that the assert should be
true in both cases, so remove the guard.
I've also changed the condition from
prog->minlen > other->min_offset
to
prog->minlen >= other->min_offset
Since they can in fact be equal on a one-char substr with SvTAIL().
David Mitchell [Fri, 7 Feb 2014 14:54:35 +0000 (14:54 +0000)]
re_intuit_start(): update comments in /^../m block
There were some XXX comments about whether to search for next \n or next
substr; I've updated those comments, removed an obsolete comment
(we *do* check for STCLASS next), and re-indented a debugging statement.
David Mitchell [Fri, 7 Feb 2014 14:15:26 +0000 (14:15 +0000)]
re_intuit_start(): use memchr() to find \n
The code that scans for another \n when the /^.../m constraint has failed
currently searches for the \n in a big while loop. Use memchr() instead,
which is likely to be as efficient as possible on any given platform.
Also invert the sense of most of the tests following a found \n,
which removes lots of indented ifs, and leaves us with a clean set of
if (A)
goto X;
if (B)
goto Y;
if (C)
goto Z;
David Mitchell [Thu, 6 Feb 2014 18:24:14 +0000 (18:24 +0000)]
re_intuit_start(): s/i_strpos/strpos/g
Now that strpos is constant, there's no need to save its initial value to
i_strpos for debugging purposes.
David Mitchell [Thu, 6 Feb 2014 18:18:42 +0000 (18:18 +0000)]
re_intuit_start(): keep strpos constant
The /^../m "look for next \n" code set strpos to the new rx_origin when it
found a \n and restarted. This is wrong, as it can trigger the
(rx_origin == strpos) test, falsely concluding that the substrs didn't
help find a position beyond the start of the string.
It's also very confusing.
(I've previously managed to get rid of most other uses of strpos after the
'restart:' position, so I know that strpos isn't needed apart from the
BmUSEFUL test).
David Mitchell [Thu, 6 Feb 2014 16:51:59 +0000 (16:51 +0000)]
re_intuit_start(): re-indent code
Remove one level of indent from a block after the previous commit
removed a pair of braces.
Whitespace-only change
David Mitchell [Thu, 6 Feb 2014 16:42:11 +0000 (16:42 +0000)]
re_intuit_start(): move label after var decls
By moving a var initialisation to after its declaration, we can move a
label to after the var declarations, which allows us to remove a set of
braces and one level of indent. (We do the re-indent in the next commit)
David Mitchell [Thu, 6 Feb 2014 16:36:52 +0000 (16:36 +0000)]
re_intuit_start(): unconditionally init other_last
Initialise other_last to strpos at the top of the function, rather than
initialising it to NULL then later setting it to strpos if NULL.
Makes the code simpler.
Although strpos can currently change during execution, the conditional
assignment always happens before strpos has had a chance to change.
David Mitchell [Wed, 5 Feb 2014 17:32:22 +0000 (17:32 +0000)]
re_intuit_start(): don't decrease rx_origin
When calculating the new rx_origin after a successful check match,
don't set it to a lower value than it already is. This can avoid
having to do repeated HOP(check_at, -max_offset) over the same
section of string, which makes the following take milliseconds rather than
tens of seconds:
$s = "-a-bc" x 250_000;
$s .= "1a1bc";
utf8::upgrade($s);
$s =~ /\da\d{0,30000}bc/ or die;
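The "never move the origin backwards" rule can be sketched in C as a forward-only update. This is an illustration under assumptions, not the actual change: the helper name advance_only is hypothetical, and both pointers are assumed to point into the same string buffer.

```c
#include <assert.h>

/* Hypothetical helper: return whichever of the two positions is later, so a
 * freshly computed candidate can never pull the current origin backwards. */
static const char *advance_only(const char *current, const char *candidate)
{
    assert(current && candidate);
    return candidate > current ? candidate : current;
}
```

With an update of this shape, a stretch of string that has already been ruled out is not hopped over again on the next iteration.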
David Mitchell [Wed, 5 Feb 2014 16:48:21 +0000 (16:48 +0000)]
re_intuit_start(): format a ?: expression better
I kept reading it as assigning other_ix to rx_origin.
Whitespace-only change
David Mitchell [Wed, 5 Feb 2014 12:00:08 +0000 (12:00 +0000)]
re_intuit_start(): remove obsolete comment
David Mitchell [Wed, 5 Feb 2014 11:51:58 +0000 (11:51 +0000)]
re_intuit_start(): eliminate s as func-wide var
All uses of s are now as tmp values local to a specific block,
so make it just local to the block.
David Mitchell [Wed, 5 Feb 2014 10:55:26 +0000 (10:55 +0000)]
re_intuit_start(): pass rx_origin in/out stclass
Currently the start position for the regstclass code is passed to the
start of the block in s. Pass it in rx_origin instead (which already
contains the right value anyway).
Also, use it as the value to exit with, when goto'ing to giveup
David Mitchell [Tue, 4 Feb 2014 22:58:37 +0000 (22:58 +0000)]
re_intuit_start(): localise 's' in abs anch branch
The value of s within the block devoted to finding an absolutely anchored
check substr is used neither on entry or exit from the block, so make it
local.
David Mitchell [Tue, 4 Feb 2014 21:01:58 +0000 (21:01 +0000)]
re_intuit_start(): simplify the /^.../m condition
The last-but-one commit merged two conditions into a single messy one; now
simplify it.
Note that in the case of /.*.../, which sets MBOL and IMPLICIT,
we should never arrive with strpos != rx_origin, since the .* forces
the origin as far back as it will go.
David Mitchell [Tue, 4 Feb 2014 20:38:04 +0000 (20:38 +0000)]
re_intuit_start(): re-indent block after last mod
The previous commit rearranged some code, but left the indentation as-is.
Now re-indent.
Whitespace-only change
David Mitchell [Tue, 4 Feb 2014 20:26:20 +0000 (20:26 +0000)]
re_intuit_start(): rearrange /^/m code
After matching the "check" and "other" strings, we check that
rx_origin is at a \n in the presence of /^../m. The code that
does this is in one half of an if-statement, with a couple of labels and
gotos that get us to and from the other half of the if statement.
Re-arrange the code so that the /^../m is done on its own before the if.
This removes a couple of labels and gotos and makes the code clearer.
Basically we went from:
if (rx_origin != strpos) {
if (ml_anch && COND_A) {
find_anchor:
LOOK_FOR_ANCHOR...
}
REST_A;
}
else {
if (ml_anch && COND_B) {
goto find_anchor;
}
REST_B;
}
to:
if (rx_origin != strpos && (ml_anch && COND_A)
|| rx_origin == strpos && (ml_anch && COND_B))
{
find_anchor:
LOOK_FOR_ANCHOR...
}
...
if (rx_origin != strpos) {
REST_A;
}
else {
REST_B;
}
The next couple of commits will re-indent and simplify the condition a
bit.
David Mitchell [Tue, 4 Feb 2014 20:09:48 +0000 (20:09 +0000)]
re_intuit_start(): remove redundant assignment
we do 'rx_origin = strpos;' in the branch that has the condition
'rx_origin == strpos'
David Mitchell [Tue, 4 Feb 2014 19:08:40 +0000 (19:08 +0000)]
re_intuit_start(): give "other" block it's own 's'
There's an 's' var global to the whole function; but give the "other
substr" code block its own local 's' var, since it's only used as a tmp
var, not to pass values to or from the block. Eventually we'll remove the
global 's' altogether.
David Mitchell [Tue, 4 Feb 2014 19:03:52 +0000 (19:03 +0000)]
re_intuit_start(): eliminate 's' from "check" code
The block that finds the check string uses s as a temporary variable
to hold the result of the fbm search, then at the end, assigns it to
check_at. Just use check_at directly.
David Mitchell [Tue, 4 Feb 2014 18:55:28 +0000 (18:55 +0000)]
re_intuit_start(): eliminate saved_s var
In the "find other substr" block, we enter with s pointing to the "check"
substr. We save s to saved_s, then use its value, then use s for
something else, then finally restore s from saved_s.
However, at entry to this block, we have already set check_at to s,
so use check_at rather than s as the input to the block; then there's
also no need to use saved_s to remember its value. But it turns
out we don't need to set s to the old value anyway, as the next block of
code always assigns to s anyway.
David Mitchell [Tue, 4 Feb 2014 18:52:08 +0000 (18:52 +0000)]
re_intuit_start(): localise t
The function-wide variable t is now only used locally within two
separate blocks, so remove the outer declaration and add two inner
declarations.
David Mitchell [Sun, 2 Feb 2014 19:49:51 +0000 (19:49 +0000)]
re_intuit_start(): remove try_at_* labels
Now that both "other" blocks have been merged into one block, there's
only one occurrence of the following rather than two:
if (rx_origin == strpos)
goto try_at_start;
goto try_at_offset;
which allows us to eliminate these two gotos and just fall through into
the 'if (rx_origin == strpos)' just before the two code blocks marked by
those two labels.
Also introduce another label, 'postprocess_substr_matches',
which is needed by the stclass code now that those other two labels have
gone.
David Mitchell [Sun, 2 Feb 2014 19:29:35 +0000 (19:29 +0000)]
re_intuit_start(): simplify check-only origin test
In the case where there's no "other" substring, we check whether
the regex origin would be at the start of string. However, a few commits
ago we introduced the rx_origin var, and we can now use this value to
simplify the test, which was effectively re-calculating rx_origin each
time.
David Mitchell [Sun, 2 Feb 2014 17:49:57 +0000 (17:49 +0000)]
re_intuit_start(): merge anch and float "other"
When processing the "other" substring, there are two very similar
branches: one if "other" is anchored, the other if it's floating.
Merge these two branches.
The diff output makes it look a lot messier than it actually is; really
it's a bit like
if (other_ix) {
A;
B1;
C;
}
else {
A;
B2;
C;
}
becomes
{
A;
if (other_ix)
B1;
else
B2;
C;
}
where each statement such as B, that differs between the two branches,
is handled separately.
David Mitchell [Sun, 2 Feb 2014 14:47:34 +0000 (14:47 +0000)]
re_intuit_start(): calc fbm_instr() end in bytes
When calculating the end limit of the string to pass to fbm_instr(),
we usually have a pointer to the latest point where the substr could
start, whereas fbm_instr() expects a pointer to the latest point where
the substr could end.
Since fbm_instr() purely matches bytes (it cares not whether those bytes
are part of a utf8 stream or not), the value of the latest end point will
always be:
(latest start point) + SvCUR(sv) - !!SvTAIL(sv)
i.e. work in bytes, even if we have utf8 values.
In some of the places where fbm_instr() is used, the calculation is being
done partially or fully in chars rather than bytes. This is not incorrect,
and indeed may in theory calculate a slightly lower end limit sometimes
and thus stop the fbm earlier. But this comes at the cost of having to do
utf8 length calculations and HOPs back from the end of the string.
So we're trading off not having to do utf8 skips on the last few chars
against the fbm not uselessly searching the last few chars. These roughly
cancel each other out. But since we no longer do HOPs before starting the
fbm, we win every time the fbm doesn't get near the end of the string.
So in conclusion, simpler code and better than or equal performance.
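As a rough illustration of the byte-based formula above, the end limit can be computed with plain pointer arithmetic. This is a sketch under assumptions: struct substr_info and latest_end are hypothetical stand-ins, with the cur and tail fields playing the roles of SvCUR(sv) and SvTAIL(sv).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the substring SV. */
struct substr_info {
    size_t cur;   /* plays the role of SvCUR(sv): length in bytes */
    int    tail;  /* plays the role of SvTAIL(sv): non-zero if the flag is set */
};

/* (latest start point) + SvCUR(sv) - !!SvTAIL(sv), entirely in bytes. */
static const char *latest_end(const char *latest_start,
                              const struct substr_info *sub)
{
    assert(latest_start && sub);
    assert(sub->cur >= (size_t)!!sub->tail);
    return latest_start + sub->cur - !!sub->tail;
}
```

No utf8 length calculation is involved, which is the point of the trade-off described in the commit message.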
David Mitchell [Sun, 2 Feb 2014 14:14:47 +0000 (14:14 +0000)]
re_intuit_start(): move a line of code earlier
This makes no functional difference, but makes the two branches
of "other substr" calculate the three values last1, last, s in the same
order.
David Mitchell [Sun, 2 Feb 2014 14:02:24 +0000 (14:02 +0000)]
re_intuit_start(): re-indent after brace removal
The previous commit removed one level of {} from a block of code;
re-indent to match.
Whitespace-only change
David Mitchell [Sun, 2 Feb 2014 13:56:14 +0000 (13:56 +0000)]
re_intuit_start(): move do_other_anchored label up
The "other substr" code currently looks like
if (anchored) {
do_other_anchored:
{
...
}
}
else {
....
}
Replace it with
do_other_substr:
if (anchored) {
...
}
else {
....
}
and make the two places that currently do 'goto do_other_anchored'
do 'goto do_other_substr' instead, after first asserting that the "other
substr" is indeed always anchored.
This would appear to be infinitesimally less efficient, but it is part
of a plan to make the two branches of the "other substr" code more similar,
eventually allowing them to be merged.
David Mitchell [Sun, 2 Feb 2014 13:44:07 +0000 (13:44 +0000)]
re_intuit_start(): reduce use of *_offset macros
There are a number of macros with definitions like
#define anchored_offset substrs->data[0].min_offset
#define float_min_offset substrs->data[1].min_offset
In the two "other substr" branches, replace uses of these macros, e.g.
{
...
foo = prog->float_min_offset;
...
}
with
{
struct reg_substr_datum *other = &prog->substrs->data[other_ix];
...
foo = other->min_offset;
...
}
As well as making debugging easier (a debugger might display real fields
but not macros), and potentially making the binary more compact and faster
(unless the compiler is clever enough to optimise away every use of the
'prog->substrs->data[0]' dereference), it also helps make the two "other
substr" branches more similar, bringing us closer to eventually merging
them.
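A self-contained sketch of the "index into data[] instead of using a per-substring macro" idea follows. The struct layouts here are cut-down, hypothetical versions written for illustration (the real definitions have more fields), and other_min_offset is an invented helper name.

```c
#include <assert.h>

/* Hypothetical, cut-down layouts used only for this sketch. */
struct substr_datum { int min_offset; int max_offset; };
struct substr_data  { struct substr_datum data[3]; };
struct prog         { struct substr_data *substrs; };

/* Look up the "other" substring's minimum offset through one pointer chosen
 * by index, so the same code serves both the anchored (0) and the
 * floating (1) case. */
static int other_min_offset(const struct prog *prog, int other_ix)
{
    assert(other_ix == 0 || other_ix == 1);
    const struct substr_datum *other = &prog->substrs->data[other_ix];
    return other->min_offset;
}
```

Because the branch-specific part is reduced to the value of other_ix, two otherwise identical branches become textually identical as well.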
David Mitchell [Sat, 1 Feb 2014 00:33:11 +0000 (00:33 +0000)]
re_intuit_start(): harmonise other_last++
In the other=anchored branch, at the end on failure or success, we
set other_last to HOP3(last, 1) or HOP3(s, 1) respectively,
indicating the minimum point we should start matching if we ever
have to try again. Clearly for failure, we know the substring can't be
found at any position up to, or including last, so next time we should try
at last+1. For success, if we return later it means that some other
constraint failed, and we already know that the substr wasn't found at
positions up to s-1, and that if we tried position s again we'd just
repeat the previous failure. So in both cases set to N+1.
In the other=float branch however, other_last is set to last or s on
failure or success, with a big "XXX is this right?" against the
"other_last = s" code. It turns out that "other_last = s" *is* right, for
the special reasons explained in the code comments added by this commit;
while "other_last = last" is changed to be "other_last = HOP3(last,1)".
David Mitchell [Fri, 31 Jan 2014 23:58:14 +0000 (23:58 +0000)]
re_intuit_start(): simplify other=anchored block
This block of code calculates 2 limits: last, last2; plus a third,
last1 = min(last, last2)
It turns out that (as explained below), last is always <= last2, which
allows us to simplify the code. In particular, this means that last always
equals last1, so eliminate last1 and always use last instead.
At the same time, rename last2 to last1, so the vars have the same names /
meanings as in the other=float branch.
Here's the math (ignoring char/byte differences for simplicity's sake):
last = s (== start of just matched float substr)
- float_min_offset
+ anchored_offset
last2 = strend - minlen + anchored_offset
Let
delta = last2 - last
= (strend - minlen + anchored_offset)
- (s - float_min_offset + anchored_offset)
= (strend - s) - (minlen - float_min_offset) [1]
Now, we've just matched a floating substring at s. But this previous
match was constrained to *end* no later than end_shift chars before
strend, so it was constrained to *start* no later than
end_shift + length(float) chars before strend; i.e.
strend - s >= end_shift + length(float) [2].
Also, more or less by definition,
minlen = float_min_offset + length(float) + end_shift
or
end_shift = minlen - float_min_offset - length(float) [3]
So, combining [2] and [3] gives
strend - s >= (minlen - float_min_offset - length(float)) + length(float)
strend - s >= minlen - float_min_offset
Therefore, from [1],
delta >= 0
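The inequality can be spot-checked numerically. The sketch below treats every position as a plain integer offset (ignoring char/byte differences, as the derivation does); struct limits and anchored_limits are hypothetical names introduced for this check only.

```c
#include <assert.h>

/* Both limits, expressed as integer offsets from the start of the string. */
struct limits { int last; int last2; };

/* Compute last and last2 exactly as defined in the derivation above. */
static struct limits anchored_limits(int strend, int s, int minlen,
                                     int float_min_offset, int anchored_offset)
{
    struct limits l;
    assert(s <= strend);
    l.last  = s - float_min_offset + anchored_offset;
    l.last2 = strend - minlen + anchored_offset;
    return l;
}
```

For example, with strend = 20, minlen = 10, float_min_offset = 3 and length(float) = 4, end_shift is 3, so the float match must start at s <= 13; at s = 13 the two limits coincide, and at any earlier s last is strictly less than last2.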
David Mitchell [Fri, 31 Jan 2014 23:45:01 +0000 (23:45 +0000)]
re_intuit_start(): add tmp assertion
This assertion confirms it is safe to strip out some redundant code that
will be removed (and explained) in the next commit
David Mitchell [Thu, 30 Jan 2014 16:12:14 +0000 (16:12 +0000)]
re_intuit_start(): fixup some code comments
Based on some feedback from Hugo, this makes some of the comments I've
added recently less confusing (hopefully).
In particular, it standardises on one set of terminology for string
positions: earliest/first to latest/last, avoiding others like
smallest/least/minimum to greatest/most/maximum, and
bottom/lowest to top/highest.
David Mitchell [Sun, 26 Jan 2014 16:07:17 +0000 (16:07 +0000)]
re_intuit_start(): update rx_origin after check
Previously the code for updating rx_origin after a 'check' match or an
'other' match looked a bit like this:
s = fbm_instr(check);
if (other exists) {
if (other is anchored) {
rx_origin = HOP3c(s, -prog->check_offset_max);
....
}
else {
rx_origin = HOP3c(s, -prog->check_offset_min);
....
}
}
else
rx_origin = HOP3c(s, -prog->check_offset_max);
This commit changes it to
s = fbm_instr(check);
rx_origin = HOP3c(s, -prog->check_offset_max);
if (other exists) {
if (other is anchored) {
....
}
else {
....
}
}
Of course in each case the 'HOP3' code was slightly different, but they
all happened to be equivalent, especially as for an anchored string,
check_offset_min == check_offset_max.
The only complication was a goto do_other_anchored, but it turns
out that setting rx_origin in that case was easy.
David Mitchell [Sun, 26 Jan 2014 14:19:47 +0000 (14:19 +0000)]
regex substrs: record index of check substr
Currently prog->substrs->data[] is a 3 element array of structures.
Elements 0 and 1 record the longest anchored and floating substrings,
while element 2 ('check'), is a copy of the longest of 0 and 1.
Record in a new field, prog->substrs->check_ix, the index of which element
was copied. (Eventually I intend to remove the copy altogether.)
Also for the anchored substr, set max_offset equal to min offset.
Previously it was left as zero and ignored, although if copied to check,
the check copy of max *was* set equal to min. Having this always set will
allow us to make the code simpler.
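To make the bookkeeping concrete, here is a minimal sketch of recording which element was copied into the 'check' slot. Everything in it is hypothetical: the struct and function names, the len field, and in particular the tie-breaking rule (anchored wins when the lengths are equal) are assumptions for illustration, not the logic in regcomp.c.

```c
#include <assert.h>

/* Hypothetical, simplified layout: [0] anchored, [1] floating, [2] check. */
struct sub_datum { int min_offset; int max_offset; int len; };
struct sub_set   { struct sub_datum data[3]; int check_ix; };

static void choose_check(struct sub_set *subs)
{
    assert(subs);
    /* For the anchored substr, max_offset always mirrors min_offset. */
    subs->data[0].max_offset = subs->data[0].min_offset;
    /* Assumed rule: the longer substring becomes the check substring. */
    subs->check_ix = subs->data[1].len > subs->data[0].len ? 1 : 0;
    subs->data[2] = subs->data[subs->check_ix];
}
```

Once the index is stored, code that needs to know whether 'check' is the anchored or the floating substring can test check_ix instead of comparing the copies.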
David Mitchell [Sat, 25 Jan 2014 10:30:51 +0000 (10:30 +0000)]
re_intuit_start(): use the rx_origin var more
Make the rx_origin variable (introduced in the previous commit, and which
specifies the current minimum legal place the regex could match at) to
also be used at the start and end of the "other" substr match: the origin
is now passed in this var to the other parts of the code that use it,
rather than in the anonymous "t" variable, which is slowly being reduced
in function to a temporary generic char pointer.
David Mitchell [Fri, 24 Jan 2014 16:39:40 +0000 (16:39 +0000)]
re_intuit_start(): introduce rx_origin var
re_intuit_start() is a bit of mess. It uses two general function-scope
vars, s and t, to point at string offsets while processing. These vars
mean different things at different times. Introduce a new var, rx_origin,
which indicates the current minimum position that the regex could begin
matching at. It starts off at strpos, and gradually moves up as various
constraints are rejected. It will be the value eventually returned.
For the moment, s and/or t will continue serving that function at various
points in the code; this commit just makes rx_origin valid at the entry to
the 'restart:' block.
David Mitchell [Fri, 24 Jan 2014 14:41:56 +0000 (14:41 +0000)]
re_intuit_start(): use different var for tmp value
Make the anchored branch more similar to the floating branch by using s to
hold the start position for fbm rather than t. Should be functionally
equivalent.
Note that on failure in the anchored branch, we leave with t holding a
different value than before, but it shouldn't matter, since the value of t
is only used in the success case.
David Mitchell [Fri, 24 Jan 2014 13:48:21 +0000 (13:48 +0000)]
re_intuit_start(): substr SV cannot be undef
Commit 7e0d5ad7c removed the code that sometimes set the substr
to &PL_sv_undef, so there's no need to test for that value any more.
David Mitchell [Mon, 20 Jan 2014 16:51:31 +0000 (16:51 +0000)]
re_intuit_start(): simplify fixed offset_max code
Since we now assert that all offsets are non-negative, this code can
be simplified a bit. Also, by using HOP3lim() rather than HOP3(), we can
remove a trailing conditional.
David Mitchell [Mon, 20 Jan 2014 16:28:08 +0000 (16:28 +0000)]
re_intuit_start(): thinko from a few commits ago
A bit of code I modified a few commits ago was supposed to be
subtracting the offset 'start_shift' if it was positive, but the
test condition I coded was 'end_shift > 0' by mistake.
It turns out this is harmless, since start_shift is always positive
anyway, and if the shift wasn't subtracted, it just made the code slightly
less efficient. (So it worked either way.)
Fix it anyway.
David Mitchell [Sun, 19 Jan 2014 00:15:57 +0000 (00:15 +0000)]
Perl_regexec_flags(): use HOP4c in another place
Now that we have this macro, use it.
David Mitchell [Sat, 18 Jan 2014 23:46:49 +0000 (23:46 +0000)]
re_intuit_start(): bias last* vars; revive reghop4
In the "just matched float substr, now match fixed substr" branch,
initially add an extra prog->anchored_offset to the last and last2 vars;
since a lot of the later calculations involve adding anchored_offset,
doing this early to the last* vars means less work in some cases. In
particular, last is calculated from s by a single
HOP4(s, prog->anchored_offset-start_shift,...)
rather than two separate
HOP3(s, -start_shift,...);
HOP3(..., prog->anchored_offset,...);
which may mostly cancel each other out.
Similarly with last2. Later, we can skip adding prog->anchored_offset to
last1, since its antecedents already have the bias added.
In the case of failure, calculating a new start position involves an extra
HOP to s, but removes a HOP from other_last, so the two cancel out.
To make this work, I revived the reghop4() function which had been
commented out, and added a HOP4c() wrapper macro. This is like HOP3c(),
but allows you to specify both lower and upper limits. Useful when you
don't know the sign of the offset in advance.
(Yves had earlier added this function, but had commented it out until such
time as it was actually used.)
I also added some extra comments to this block and removed the comment
about it being maybe broken under utf8, since I'm auditing the code for
utf8-safeness.
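To illustrate why a hop with both limits helps when the sign of the offset
is unknown, here is a minimal sketch. It is not Perl's reghop4() or HOP4c();
hop_clamped() and is_continuation() are hypothetical names, and the sketch
assumes well-formed UTF-8 with s starting on a character boundary.

```c
#include <assert.h>
#include <stddef.h>

/* True if byte b is a UTF-8 continuation byte (10xxxxxx). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper, not Perl's reghop4(): move s by off characters
 * through well-formed UTF-8, never going below llim or above rlim. */
static const unsigned char *
hop_clamped(const unsigned char *s, ptrdiff_t off,
            const unsigned char *llim, const unsigned char *rlim)
{
    assert(llim <= s && s <= rlim);
    if (off >= 0) {
        while (off-- > 0 && s < rlim) {
            s++;                       /* step over the lead byte */
            while (s < rlim && is_continuation(*s))
                s++;                   /* then over its continuation bytes */
        }
    }
    else {
        while (off++ < 0 && s > llim) {
            s--;                       /* step back one byte */
            while (s > llim && is_continuation(*s))
                s--;                   /* keep going until a lead byte */
        }
    }
    return s;
}
```

With a single-limit hop the caller has to pick the lower or the upper bound
before calling, which means testing the sign of the offset first. Passing
both bounds lets one call cover a combined offset such as
anchored_offset - start_shift, whichever way it points.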
David Mitchell [Fri, 17 Jan 2014 16:09:23 +0000 (16:09 +0000)]
re_intuit_start(): add some more code comments
David Mitchell [Thu, 16 Jan 2014 17:07:13 +0000 (17:07 +0000)]
re_intuit_start(): delete srch_(start|end)_shift
remove these two vars; these are now just unmodified copies of
start_shift, end_shift; so just use those two vars directly.
David Mitchell [Thu, 16 Jan 2014 16:59:50 +0000 (16:59 +0000)]
re_intuit_start(): assert substr offsets are >= 0.
Some parts of this function handle the negative offset case, while other
parts don't. Also, nothing in the test suite generates negative offsets.
So for now, assert that all offsets are non-negative, and strip out any
code that handles negative offsets. This will make my current activity
in fixing and refactoring this function easier. If at some future point
someone wants to add support for negative offsets (e.g. with look-behind)
then they'll have to add support fully to re_intuit_start() from scratch.
David Mitchell [Thu, 16 Jan 2014 16:00:41 +0000 (16:00 +0000)]
re_intuit_start(): fix another utf8 slowdown
The code that looks for a floating substr after a fixed substr has
already been found, was very slow on long utf8 strings. For example
this used to take an hour or more, and now takes millisecs:
$s = "ab" x 1_000_000;
utf8::upgrade($s);
$s =~ /ab.{1,2}x/;
When calculating the maximum position at which the floating substr could
start, there are two possible upper limits.
First, the absolute max position, ignoring the results of the previous
fixed substr match - this is the end-of-string less a bit (last1);
Second, float_max_offset on from the current origin of the regex (this
is dependent on where the fixed substr previously matched).
To decide which of these two values to use (the smaller), it used to
calculate the distance in chars from the regex origin to last1, and if
this was greater than float_max_offset, it used origin + float_max_offset
(in chars) instead.
This distance calculation involved doing a utf8 length calculation on the
majority of the string, which for long strings was a big slowdown.
Fix this by instead always using HOP3(origin + float_max_offset), but
using last1 as the upper HOP limit rather than strend, so it's always
limited to <= last1.
If L is number of chars that had to be hopped over for the distance
calculation (which could be most of the string), and if M is the
chars hopped for origin + float_max_offset (typically either small or
infinite), then we:
previously hopped: (M>=L ? L : L+M) chars
now hop: min(L,M) chars; or if M is infinite, hop 0 chars
Which is always less than or equal to the amount of work done previously,
and is a very big win for long strings with smallish maximum float
offsets.
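The hop counts above can be reproduced with a small sketch. This is not the
regexec.c code: next_char(), limit_by_measuring(), limit_by_hopping() and the
hops counter are hypothetical names, the input is assumed to be well-formed
UTF-8, and the "M is infinite" shortcut is not modelled.

```c
#include <assert.h>
#include <stddef.h>

static size_t hops;   /* characters stepped over, to compare the work done */

/* Advance one UTF-8 character from s, but never past lim. */
static const unsigned char *
next_char(const unsigned char *s, const unsigned char *lim)
{
    if (s < lim) {
        s++;
        while (s < lim && (*s & 0xC0) == 0x80)
            s++;                               /* skip continuation bytes */
        hops++;
    }
    return s;
}

/* Old shape: measure the whole distance origin..last1 in chars first. */
static const unsigned char *
limit_by_measuring(const unsigned char *origin, const unsigned char *last1,
                   size_t max_off)
{
    size_t dist = 0;
    const unsigned char *p;

    assert(origin <= last1);
    for (p = origin; p < last1; p = next_char(p, last1))
        dist++;                                /* L hops */
    if (dist <= max_off)
        return last1;
    for (p = origin; max_off-- > 0; )
        p = next_char(p, last1);               /* plus M more hops */
    return p;
}

/* New shape: hop at most max_off chars, using last1 as the hop limit. */
static const unsigned char *
limit_by_hopping(const unsigned char *origin, const unsigned char *last1,
                 size_t max_off)
{
    const unsigned char *p = origin;

    assert(origin <= last1);
    while (max_off-- > 0 && p < last1)
        p = next_char(p, last1);               /* min(L, M) hops */
    return p;
}
```

Both functions return the same position. With L = 8 and M = 2 the first one
steps over 10 characters and the second over 2, matching the L+M versus
min(L,M) figures given above.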
David Mitchell [Thu, 16 Jan 2014 15:16:34 +0000 (15:16 +0000)]
re_intuit_start(): document floating code better
Add some comments to the float-finding code that explain what the t, last
and last1 vars are, and how they're calculated.
Also make the calculation of last separate from last1;
it should logically be the same, but clearer. i.e. change
last = last1 = ...;
if (cond)
last = ....;
to
last1 = ...;
last = cond ? .... : last1;
David Mitchell [Wed, 8 Jan 2014 16:30:32 +0000 (16:30 +0000)]
re_intuit_start(): add more debugging output
Add some debugging output to some parts of the code that lack it, so it's
easier to follow progress through intuit(); also add an initial "we're in
intuit" message.
Make all the debugging output, apart from the initial and final intuit
messages, indented by 2 chars so that they are seen to be things happening
within intuit.
Dump the substrs data array.
Fix up a few of the existing outputs to be more informative.
David Mitchell [Fri, 27 Dec 2013 23:23:12 +0000 (23:23 +0000)]
re_intuit_start(): simplify ml_anch evaluation
rather than enumerating all the anchor flag combos where ml_anch *isn't*
true, enumerate the flags for which it *is* true. This is slightly simpler
logic, and involves one less negation, which makes it easier to
understand.
David Mitchell [Fri, 27 Dec 2013 23:16:23 +0000 (23:16 +0000)]
test for single-line ^ within /m
This combo doesn't appear to be tested anywhere; specifically, adding this
in re_intuit_start() didn't trigger the assertion when run against the
test suite:
if (prog->extflags & RXf_ANCH_BOL)
assert(!multiline);
David Mitchell [Fri, 27 Dec 2013 22:28:31 +0000 (22:28 +0000)]
eliminate RXf_ANCH_SINGLE
This macro defines two flag bits:
#define PREGf_ANCH_SINGLE (PREGf_ANCH_SBOL|PREGf_ANCH_GPOS)
but it is only used twice in core (and not on CPAN),
doesn't really add any value, and increases cognitive complexity.
David Mitchell [Fri, 27 Dec 2013 22:12:31 +0000 (22:12 +0000)]
re_intuit_start(): add comments to a block of code
explain what it does!
David Mitchell [Fri, 27 Dec 2013 22:06:36 +0000 (22:06 +0000)]
re_intuit_start(): refactor an if/else block
change
if (X) { do nothing } else if (Y) { Z }
to
if (!X && Y) { Z }
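The equivalence of the two shapes can be checked exhaustively. A minimal
sketch follows; runs_z_before(), runs_z_after() and check_shapes_agree() are
hypothetical names, and X and Y are modelled as plain booleans rather than
the real conditions in re_intuit_start().

```c
#include <assert.h>
#include <stdbool.h>

/* Original shape: an empty branch for X, then Y guards Z. */
static bool runs_z_before(bool x, bool y)
{
    if (x) {
        /* do nothing */
    }
    else if (y) {
        return true;    /* Z would run here */
    }
    return false;
}

/* Refactored shape: a single combined condition guards Z. */
static bool runs_z_after(bool x, bool y)
{
    return !x && y;
}

/* Assert that both shapes agree on all four input combinations. */
static void check_shapes_agree(void)
{
    for (int x = 0; x <= 1; x++)
        for (int y = 0; y <= 1; y++)
            assert(runs_z_before(x, y) == runs_z_after(x, y));
}
```

Because && short-circuits, Y is still evaluated only when X is false, so the
rewrite also preserves the order and number of evaluations if either
condition has side effects.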
David Mitchell [Thu, 26 Dec 2013 22:47:33 +0000 (22:47 +0000)]
re_intuit_start(): rationalise ml_anch var
Make ml_anch bool rather than I32, since that's all it's used for.
Also, unconditionally initialise it to zero then only set where needed.
This eliminates an else branch that just sets it to zero.
David Mitchell [Thu, 26 Dec 2013 22:29:39 +0000 (22:29 +0000)]
re_intuit_start(): eliminate max_shift var
I introduced this var a few commits ago, but after a bit of refactoring,
I can now eliminate it and just use its antecedents directly.