review.tizen.org Git - platform/upstream/perl.git/commit

author	David Mitchell <davem@iabyn.com>
	Fri, 13 Dec 2013 16:35:14 +0000 (16:35 +0000)
committer	David Mitchell <davem@iabyn.com>
	Fri, 7 Feb 2014 22:39:35 +0000 (22:39 +0000)
commit	d6ef167873919ed43a86136ba20f5a410a05e7ca
tree	cd83b4fc9e11783a3dc2972b85ad2126f290b629
parent	c889ccc800e01645e475131343d53779b3bd79bd

RT#120692 slow intuit with long utf8 strings

Some code in re_intuit_start() that tries to find the range of chars
to which the BM substr find can be applied uses logic that is very
inefficient when the target string is utf8. Basically the code tries to
find the maximum end-point where the substr could be found, by taking
the minimum of:

    * start + prog->check_offset_max + length(substr)
    * end   - prog->check_end_shift

However, these values are in char lengths and need to be converted to
bytes before calling fbm_instr(). The code formerly scanned the whole of
the remaining string to determine how many chars it had. By doing the
calculation a different way, we can avoid that scan.
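
As a rough illustration of the idea (not the code from this commit), the
sketch below computes the same end-point directly as a byte pointer. It
hops forward from the start by at most check_offset_max + length(substr)
chars, hops back from the end by check_end_shift chars, and takes the
lower of the two pointers. The helpers hop_forward(), hop_back() and
search_limit() are hypothetical names, and the input is assumed to be
well-formed UTF-8. The work done is bounded by the two offsets rather
than by the length of the remaining string.

```c
#include <stddef.h>

/* Hypothetical helper: advance p by up to n UTF-8 chars, never past lim. */
static const unsigned char *
hop_forward(const unsigned char *p, size_t n, const unsigned char *lim)
{
    while (n > 0 && p < lim) {
        p++;
        /* skip continuation bytes (10xxxxxx) of the char just entered */
        while (p < lim && (*p & 0xC0) == 0x80)
            p++;
        n--;
    }
    return p;
}

/* Hypothetical helper: step back by up to n UTF-8 chars, never before lim. */
static const unsigned char *
hop_back(const unsigned char *p, size_t n, const unsigned char *lim)
{
    while (n > 0 && p > lim) {
        p--;
        /* keep stepping back until we reach the char's first byte */
        while (p > lim && (*p & 0xC0) == 0x80)
            p--;
        n--;
    }
    return p;
}

/*
 * Latest byte position at which the substr search may end: the minimum of
 *   start + offset_max + substr_len   (in chars, hopped forward)
 *   end   - end_shift                 (in chars, hopped backward)
 * Neither hop needs to count the chars in the whole remaining string.
 */
static const unsigned char *
search_limit(const unsigned char *start, const unsigned char *end,
             size_t offset_max, size_t substr_len, size_t end_shift)
{
    const unsigned char *by_offset =
        hop_forward(start, offset_max + substr_len, end);
    const unsigned char *by_shift = hop_back(end, end_shift, start);
    return by_offset < by_shift ? by_offset : by_shift;
}
```

For a pattern such as /^a{1,2}x/ the forward hop stops after a few chars
however long the string is, which is why the cost no longer grows with
the string length in this sketch.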

This makes the following two regexps each take milliseconds rather than
tens of seconds:

        my $s = 'ab' x 1_000_000;
        utf8::upgrade($s);
        1 while $s =~ m/\Ga+ba+b/g;
        $s =~ /^a{1,2}x/ for 1..10_000;

regexec.c
t/re/pat.t
t/re/re_tests