utf8.c: Prefer binsearch over swash hash for small swashes

A binary swash is a hash of bitmaps used to cache the results of looking
up if a code point matches a Unicode property or regex bracketed
character class.  An inversion list is a data structure that also holds
information about what code points match a Unicode property or character
class.  It is implemented as an SV* to a sorted C array, and hence can
be searched using a binary search.

This patch converts to using a binary search of an inversion list
instead of a hash look-up for inversion lists that are no more than 512
elements (9 iterations of the search loop).  That number can be easily
adjusted, if necessary.
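
As a rough illustration of the lookup described above, here is a minimal sketch of a binary search over an inversion list. It assumes the list is a plain sorted C array in which even-indexed elements begin ranges that are in the set and odd-indexed elements begin ranges that are not. The names invlist_search and invlist_contains are hypothetical and are not the functions in the Perl core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the core implementation.  The array holds the
 * code points at which membership flips: element 0 starts a range that IS
 * in the set, element 1 starts the following range that is NOT, and so on. */

/* Return the index of the largest element <= cp, or -1 if cp is below all. */
static ptrdiff_t invlist_search(const uint32_t *list, size_t len, uint32_t cp)
{
    size_t low = 0, high = len;          /* search window is [low, high) */

    assert(list != NULL || len == 0);    /* an empty list is allowed */
    while (low < high) {
        size_t mid = low + (high - low) / 2;
        if (list[mid] <= cp)
            low = mid + 1;
        else
            high = mid;
    }
    return (ptrdiff_t)low - 1;
}

/* cp is in the set when it falls in a range that starts at an even index. */
static int invlist_contains(const uint32_t *list, size_t len, uint32_t cp)
{
    ptrdiff_t i = invlist_search(list, len, cp);
    return i >= 0 && (i % 2) == 0;
}
```

With the list {0x30, 0x3A, 0x41, 0x5B} (ASCII digits and upper-case letters), '5' and 'A' are reported as members, while ':' and '[' are not.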

Theoretically, a hash is faster than a binary lookup over a very long
period.  So this may negatively impact long-running servers.  But in the
short run, where most programs reside, the binary search is
significantly faster.

A swash is filled as necessary as time goes on, caching each new
distinct code point it is called with.  If it is called with many, many
such code points, its performance can degrade as collisions increase.  A
binary search does not have that drawback.  However, most real-world
scenarios do not have a program being called with huge numbers of
distinct code points.  Mostly, the program will be called with code
points from just one or a few of the world's scripts, so will remain
sparse.  The bitmaps in a swash are each 64 bits long (except for ASCII,
where it is 128).  That means that when the swash is populated, a lookup
of a single code point that hasn't been checked before will have to
lookup the 63 adjoining code points as well, increasing its startup
overhead.  Of course, if one of those 63 code points is later accessed,
no extra populate happens.  This is a typical case where a language's
code points are all near each other.

The bottom line, though, is that in the short term, this patch speeds
up the processing of \X regex matching about 35-40%, with modern Korean
(which has uniquely complicated \X processing) closer to 40%, and other
scripts closer to 35%.

The 512 boundary means that over 90% of the official Unicode properties
are handled using binary search.  I settled on that number by
experimenting with several properties besides \X and with various
powers-of-2 limits.  Until I got that high, performance kept improving
when the property went from being a swash to a binary search.  \X
improved even up to 2048, which encompasses 100% of the official Unicode
properties.

The implementation changes so that an inversion list instead of a swash
is returned by swash_init() when the input flags allow it to do so, for
all inversion lists shorter than the compiled-in constant of 512
(actually <= 512).  The other functions that access swashes have added
intelligence to deal with an object of either type.  Should someone on
CPAN be using the public swash_init() interface, they will not see any
difference, as the option to get an inversion list is not available to
them.

utf8.c: Bypass a subroutine wrapper

Since we are the core here, we might as well call the core swash
initialization directly; the public one merely wraps it.

utf8.c: Add comment about speed-up attempt

This might keep someone later from attempting the speedup, which didn't
actually help, so I didn't commit it.

utf8.c: Shorten hash key for speed

Experiments have shown that longer hash keys impact performance. See
the thread at
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2012-08/msg00869.html

This patch shortens a key used very frequently. There are other keys in
this hash which are used frequently in some circumstances, but I expect
to change to use fewer in the future, so am not changing them now

utf8.c: collapse a function parameter

Now that we have a flags parameter, we can put this parameter in as
just another flag, giving a cleaner interface to this internal-only
function. This also renames the flag parameter to <flag_p> to indicate
it needs to be dereferenced.

regexec.c: Reword comment

This portion of the comment is unnecessary, and doesn't really reflect
the implementation

regexec.c: Use get method instead of internals

A new get method has been written to access the internals of a swash;
it's best to use it.

This also moves the error checking to the method

embed.fnc: Turn null wrapper function into macro

This function only does something on EBCDIC platforms. On ASCII ones,
make it a macro, like similar ones, to avoid useless function nesting.

utf8.c: Revise internal API of swash_init()

This revises the API for the version of swash_init() that is usable
by core Perl.  The external interface is unaffected.  There is now a
flags parameter to allow for future growth.  And the core internal-only
function that returns if a swash has a user-defined property in it or
not has been removed.  This information is now returned via the new
flags parameter upon initialization, and is unavailable afterwards.
This is to prepare for the flexibility to change the swash that is
needed in future commits.

embed.fnc: Mark internal function as "may change"

This function is not designed for a public API, and should have been so
listed.

Add caching to inversion list searches

Benchmarking showed some speed-up when the result of the previous
search in an inversion list is cached, thus potentially avoiding a
search in the next call. This adds a field to each inversion list which
caches its previous search result.
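
The caching idea can be sketched as follows. This is only an illustration under assumptions: the struct layout, the field names and the full_searches counter are hypothetical, and the real inversion list stores its cache differently. The fast path applies when the new code point falls in the same range as the previous one, which is common when text stays within one script.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; the real inversion list is an SV with its own header. */
typedef struct {
    const uint32_t *array;         /* sorted range boundaries */
    size_t          len;
    ptrdiff_t       prev_index;    /* result of the previous search, -1 if none */
    unsigned        full_searches; /* illustration only: counts slow-path searches */
} cached_invlist;

/* Return the index of the largest boundary <= cp, or -1 if there is none. */
static ptrdiff_t cached_search(cached_invlist *il, uint32_t cp)
{
    ptrdiff_t i = il->prev_index;
    size_t low = 0, high = il->len;

    assert(il->array != NULL || il->len == 0);

    /* Fast path: cp lies in the same range [array[i], array[i+1]) as last time. */
    if (i >= 0 && (size_t)i < il->len
        && il->array[i] <= cp
        && ((size_t)i + 1 == il->len || cp < il->array[i + 1]))
        return i;

    /* Slow path: ordinary binary search, then remember the answer. */
    il->full_searches++;
    while (low < high) {
        size_t mid = low + (high - low) / 2;
        if (il->array[mid] <= cp)
            low = mid + 1;
        else
            high = mid;
    }
    il->prev_index = (ptrdiff_t)low - 1;
    return il->prev_index;
}
```

Looking up '1' and then '7' in a digits range performs only one full search; the second call is answered from the cached index.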

regexec.c: Use xor to save a branch

Probably this gets optimized this way anyway.
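
For readers unfamiliar with the trick, a minimal sketch of the general pattern follows. The function names are hypothetical and this is not the code in regexec.c; it only shows that applying an "inverted" flag to a match result can be written as an xor rather than a conditional.

```c
#include <assert.h>
#include <stdbool.h>

/* Branching form: flip the result when the class is inverted. */
static bool match_with_branch(bool in_class, bool inverted)
{
    if (inverted)
        return !in_class;
    return in_class;
}

/* Branch-free form: xor with the inversion flag gives the same result. */
static bool match_with_xor(bool in_class, bool inverted)
{
    return in_class ^ inverted;
}

/* Exhaustively confirm that both forms agree on all four inputs. */
static void check_equivalence(void)
{
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            assert(match_with_branch(a, b) == match_with_xor(a, b));
}
```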

Comment out unused function

In looking at \X handling, I noticed that this function, which is
intended for use in it, actually isn't used. This function may someday
be useful, so I'm leaving the source in.

utf8.c: Speed up \X processing of Korean

\X matches according to a complicated pattern that is hard-coded in
regexec.c.  Part of that pattern involves checking if a code point is a
component of a Hangul Syllable or not.  For Korean code points, this
involves checking against multiple tables.  It turns out that two of
those tables are arranged so that the checks for them can be done via an
arithmetic expression; Unicode publishes algorithms for determining
various characteristics based on their very structured ordering.

This patch converts the routines that check these two tables to instead
use the arithmetic expression.
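
The commit message does not name the two tables. Assuming they are the LV and LVT Hangul syllable types, the arithmetic published in the Unicode Standard can be sketched like this. The constants come from the standard's Hangul syllable algorithm; the function names are hypothetical and not those used in utf8.c.

```c
#include <stdint.h>

/* Constants from the Unicode Standard's Hangul syllable arithmetic. */
#define SBASE  0xAC00u   /* first precomposed Hangul syllable */
#define SCOUNT 11172u    /* number of precomposed syllables   */
#define TCOUNT 28u       /* trailing consonants, plus "none"  */

static int is_hangul_syllable(uint32_t cp)
{
    return cp >= SBASE && cp < SBASE + SCOUNT;
}

/* LV syllables have no trailing consonant: their offset is a multiple of 28. */
static int is_hangul_lv(uint32_t cp)
{
    return is_hangul_syllable(cp) && (cp - SBASE) % TCOUNT == 0;
}

/* Every other precomposed syllable is LVT. */
static int is_hangul_lvt(uint32_t cp)
{
    return is_hangul_syllable(cp) && (cp - SBASE) % TCOUNT != 0;
}
```

For example, U+AC00 and U+AC1C are LV, while U+AC01 and U+D7A3 are LVT; no table lookup is needed for either answer.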

regcomp.c: Move functions to inline_invlist.c

This populates inline_invlist.c with some static inline functions and
macro defines. These are the ones that are anticipated to be needed in
the near term outside regcomp.c

regcomp.c: Rename 2 functions to indicate private nature

These two functions will be moved into a header in a future commit,
where they will be accessible outside regcomp.c.  Prefix their names
with an underscore to emphasize that they are private.

regcomp.c: Silence compiler warning.

The warning that this variable can be used uninitialized is spurious,
but silence it nonetheless.

Add empty inline_invlist.c

This will be used for things needed to handle inversion lists in the three
files that currently use them. I'm putting this in a separate header,
because inversion lists are very internal-only, so should not be grouped
in with things that there is an external API for. It is a dot-c file so
that the functions can continue to be declared with embed.fnc, and
porting/args_assert.t will continue to work, as it looks only in .c
files.

regcomp.c: Add assertion, comments

regcomp.c: Allow search to work on empty inversion lists

You cannot retrieve the array of an empty inversion list, so the code
has to be reordered to do that after the list is known to be non-empty.
I haven't been able to find a case where this currently fails, but
future commits open up the possibility.

regcomp.c: Special case /[UV_MAX]/

The highest code point representable on the machine has to be special
cased. Earlier commits for 5.14 did this for ranges ending in this code
point, but it turns out there needs to be a special-special case when
the range contains just it.

mktables: Fix bug when deleting final range

When a Range_List is emptied, there is a bug which causes a runtime
error when trying to refer to a non-existent element. This avoids that.
A future commit would have run afoul of this bug.

Increase $B::Concise::VERSION to 0.93

Optimise %hash in sub { %hash || ... }

In %hash || $foo, the %hash is in scalar context, so it has to iterate
through the buckets to produce statistics on bucket usage.

If the || is in void context, the value returned by hash is only ever
used as a boolean (as || doesn’t have to return it).  We already opti-
mise it by adding a boolkeys op when it is known at compile time that
|| will be in void context.

In sub { %hash || $foo } it is not known at compile time that it will
be in void context, so it wasn’t optimised.

This commit optimises it by flagging the %hash at compile time as
being possibly in ‘true boolean’ context.  When that flag is set,
the rv2hv and padhv ops call block_gimme() to see whether || is in
void context.

This speeds things up significantly.  Here is what I got after
optimising rv2hv but before doing padhv:

$ time ./miniperl -e '%hash = 1..10000; sub { %hash || 1 }->() for 1..100000'

real 0m0.179s
user 0m0.101s
sys 0m0.005s
$ time ./miniperl -e 'my %hash = 1..10000; sub { %hash || 1 }->() for 1..100000'

real 0m5.446s
user 0m2.419s
sys 0m0.015s

(That example is slightly misleading because of the closure, but the
closure version takes 1 sec. when optimised.)

improve and fix the documentation of the PERL_HASH function

minor doc patches to api stuff

Apply boolkeys optimisation to %hash?:

and consequently if(%hash) followed by else.

Apply boolkeys optimisation to scalar(%hash)

[perl #114576] Optimise if(%hash) in non-void context

The boolkeys optimisation (867fa1e2da1) was only applying to an and
(or if) in void context.  If an if occurs as the last thing in a
subroutine, the void context is not known at compile time so the
optimisation does not apply.

In the case of || (to which the boolkeys optimisation also applies),
we can’t optimise it in non-void context, because someone might be
writing $bucket_info = %hash || '0/0';

In the case of &&, we can optimise it, even in non-void context,
because a true value will always be discarded in %hash && foo.
The false value it returns for an empty hash is always the
integer 0.  That would change if we simply applied boolkeys to
my $ret = %hash && foo; because boolkeys returns &PL_sv_no (the dualvar
you get from !1).  But since boolkeys’ return value is never directly
visible to perl code, we can safely change that.

pp.c: pp_boolkeys does not need to pop

If it’s going to consume and return exactly one item, it doesn’t need
to decrement and increment the stack pointer.

[perl #114572] perl.c: fix locality/rmv redundant nulls in call_sv/eval_sv

Small tweaks to improve locality and give the C compiler more
opportunity to optimize. Also remove redundant nulls, since the OP
structs are null filled a line or 2 before.

parser.t: Move a test above ‘Add new tests here’

pad.h: Rename PadnameSTATE; make it a proper boolean

I used PadnameIs* for OUR, because I was copying
PAD_COMPNAME_FLAGS_isOUR. STATE should be consistent with it. And it
was missing the double bang, making the docs wrong.

rt #111126 - don't empty a file with copy("foo/bar", "foo/");

rt #111126 - TODO test for copy foo/file to foo/

close the Peek.t temp file so the END block can unlink it

This was leaving detritus on Win32 builds

oops, left some debugging code

left from fixing perl #112272

don't use PerlHost's getenv after perl_destruct

On Win32, perl_free calls PerlHost's getenv which calls win32_getenv.
win32_getenv and its children use SVs and the mortal stack. After
perl_destruct, SVs and the mortal stack no longer exist, but the old
Itmps_stack pointer remains unchanged (it is not nulled). Depending on
the memory allocator's behaviour, the previously mortalised SV would be
written to memory that the allocator had freed but whose page was still
allocated, so it silently worked. Recently in 5.17 the page started to
be freed, and now this bug segfaults. This patch fixes the problem by
using PL_perl_destruct_level and calling getenv earlier.

Announcement template - Current development track is 5.17

RMG - CPAN /src and /src/README.html are the same

RMG - corelist.pl uses HTTP::Tiny, not wget or curl

It also fetches files remotely even when using a local CPAN mirror if
the files are missing.

Record the story behind the pack format specifiers H, h, B and b.

Increase $Module::CoreList::VERSION to 2.73

Even though cmp_version.t doesn’t mind 2.72, we need a version bump,
as 2.72 is already on CPAN.

Clean up data for ExtUtils::Miniperl in Module::CoreList

Some corelist data was constructed without ExtUtils::Miniperl being
present, presumably because perl wasn't fully built at the time.

Clean up data for Pod::Perldoc::ToTk in Module::CoreList

It was alternating between 'undef' and undef.

Clean up data for Carp::Heavy in Module::CoreList

It was lagging behind by about one release -- presumably due to it being
based on $Carp::VERSION.

Fix the version of Scalar::Util in corelist for 5.7.3

pad.h: PadnameSTATE

Use FooBAR convention for new pad macros

After a while, I realised that it can be confusing for PAD_ARRAY and
PAD_MAX to take a pad argument, but for PAD_SV to take a number and
PAD_SET_CUR a padlist.

I was copying the HEK_KEY convention, which was probably a bad idea.
This is what we use elsewhere:

   TypeMACRO
   ----=====
     AvMAX
    CopFILE
   PmopSTASH
  StashHANDLER
OpslabREFCNT_dec

Furthermore, heks are not part of the API, so what convention they use
is not so important.

So these:

    PADNAMELIST_*
    PADLIST_*
    PADNAME_*
    PAD_*

are now:

    Padnamelist*
    Padlist*
    Padname*
    Pad*

Increase $B::Deparse::VERSION to 1.17

B::Deparse: Suppress trailing ; in formats

While it doesn’t change the behaviour, nobody writes formats that way,
and this makes the output match 5.17.2 and earlier.

pad.h: Let PADNAME_PV return null

pad.h: typos in macro definitions

It would help to define these macros properly.

pad.h: PADNAME_SV

If CPAN modules should not assume that pad names are SVs, we need
to provide a better way than newSVpvn(PADNAME_PV(pn),PADNAME_LEN(pn))
to get an SV out of it, as, knowing that pad names are just SVs, the
core can do it more efficiently by simply returning the pad name
itself.

pad.[ch]: PADNAME_OUTER

I think this is the last bit of pad-as-sv stuff that was not
abstracted away in pad-specific macros.

toke.c: Extreme paranoia

PATCH: Devel::Peek doesn't compile under C++

Commit c9795579db61c900bacee2790bdceb7bad3dd45d introduced
an error in C++: it's missing a cast.

[perl #114040] Fix here-docs in multiline re-evals

Commit 5097bf9b8 only partially fixed this, or, rather, did the
groundwork for fixing it.

If we have a pattern like this:

/(?{<<foo . baz
bar
foo
})/

Then PL_linestr contains this while we are parsing the block:

"(?{<<foo . baz\nbar\nfoo\n})"

The code for parsing a here-doc in a multiline PL_linestr buffer
(which applies to here-docs in string evals or in quote-like operat-
ors) likes to modify PL_linestr to contain everything after the
<<heredoc marker except the here-doc body, which has been stolen (but
it oddly includes the last character of the marker, which does not
matter, as PL_bufptr is set to PL_linestart+1):

"o . baz\n})"

The regexp block parsing code expects to be able to extract the entire
block (as a string) from PL_linestr after parsing it. So it is not
helpful for S_scan_heredoc to go and modify it like that.

Before modifying PL_linestr, we can set aside a copy of the source
code (in PL_sublex_info.re_eval_str) from the beginning of the regexp
block to the end of PL_linestr, so that the regexp block code can
retrieve the original source from there.

We also adjust PL_sublex_info.re_eval_start so that at the end of the
regexp block PL_bufptr - PL_sublex_info.re_eval_start is the length of
the block.

Instead of clobbering PL_linestr, we can copy everything after the
here-doc to where the body begins. We do this for two reasons: it
requires less allocation (I would have made that change in the end
anyway, for efficiency), and it makes it easier to calculate how much
to subtract from re_eval_start.

This fix does not apply to here-docs in quotes in multiline string
evals, which crash and always have.

Peek.t: Test that DeadCode doesn’t crash

I broke it, but Karl Williamson’s commit (the previous) with my tweaks
fixes it. This function was not at all exercised by the test suite.

Devel::Peek: Fix so compiles under C++

Commit 86b9d29366aea0e71ad75b61d04f56f1fe5b0d4d created a new PADLIST
type. However, this broke the compilation of Devel::Peek with C++.
This commit gets it to compile again, and pass our regression test
suite.

[Modified by the committer to use the correct PADLIST_ macros; other-
wise it will crash.]

toke.c: -DT should report forced tokens under -Dmad

I was wondering why the -DT output was missing things out.
This is why:

#ifdef PERL_MAD
    /* FIXME - can these be merged?  */
    return next_type;
#else
    return REPORT(next_type);
#endif

heredoc.t: Add a CRLF test

I nearly broke this in recent bug fixes

[Merge] New PADLIST type

To fix a bug (db4cf31d1d) and to facilitate the lexical subs I’m work-
ing on, I needed to be able to add extra fields to a padlist. But
padlists are AVs, making that nontrivial.

There is no reason they need to be AVs, and they take less memory when
they are not, so I made a new padlist struct.

This is going to break CPAN modules that manipulate padlists.

To avoid having to patch those modules again later if we change pads
from AVs into their own types, I have added APIs for accessing the
contents of pads.

There is also a new PADNAMELIST type (currently equivalent to AV), in
case the pad holding the names needs to be a different type from a pad
some time in the future.

pad.c: fix pod link

Increase $XS::APItest::VERSION to 0.43

Increase $B::VERSION to 1.38

pad.c: CvPADLIST docs: one more thing

pad.c: Use PAD_ARRAY rather than AvARRAY in curpad docs

Use new types for comppad and comppad_name

I know that a few times I’ve looked at perl source files to find out
what type to use in ‘<type> foo = PL_whatever’. So I am changing
intrpvar.h as well as the api docs.

pad.c: CvPADLIST doc update

More PAD APIs

If we are making padlists their own type, and no longer AVs, it makes
sense to add APIs for pads, too, so that CPAN code that needs to
change now will only have to change once if we ever stop pads them-
selves from being AVs.

There is no reason pad names have to be SVs, so I am adding sep-
arate APIs for pad names, too. The AV containing pad names is
now officially a PADNAMELIST, which is accessed, not via
*PADLIST_ARRAY(padlist), but via PADLIST_NAMES(padlist).

Future optimisations may even merge the padlist with its name list so
I have also added macros to access the parts of the name list directly
from the padlist.

Fix format closure bug with redefined outer sub

CVs close over their outer CVs.  So, when you write:

my $x = 52;
sub foo {
  sub bar {
    sub baz {
      $x
    }
  }
}

baz’s CvOUTSIDE pointer points to bar, bar’s CvOUTSIDE points to foo,
and foo’s to the main cv.

When the inner reference to $x is looked up, the CvOUTSIDE chain is
followed, and each sub’s pad is looked at to see if it has an $x.
(This happens at compile time.)

It can happen that bar is undefined and then redefined:

undef &bar;
eval 'sub bar { my $x = 34 }';

After this, baz will still refer to the main cv’s $x (52), but, if baz
had ‘eval '$x'’ instead of just $x, it would see the new bar’s $x.
(It’s not really a new bar, as its refaddr is the same, but it has a
new body.)

This particular case is harmless, and is obscure enough that we could
define it any way we want, and it could still be considered correct.

The real problem happens when CVs are cloned.

When a CV is cloned, its name pad already contains the offsets into
the parent pad where the values are to be found.  If the outer CV
has been undefined and redefined, those pad offsets can be com-
pletely bogus.

Normally, a CV cannot be cloned except when its outer CV is running.
And the outer CV cannot have been undefined without also throwing
away the op that would have cloned the prototype.

But formats can be cloned when the outer CV is not running.  So it
is possible for cloned formats to close over bogus entries in a new
parent pad.

In this example, \$x gives us an array ref.  It shows ARRAY(0xbaff1ed)
instead of SCALAR(0xdeafbee):

sub foo {
    my $x;
format =
@
($x,warn \$x)[0]
.
}
undef &foo;
eval 'sub foo { my @x; write }';
foo
__END__

And if the offset that the format’s pad closes over is beyond the end
of the parent’s new pad, we can even get a crash, as in this case:

eval
'sub foo {' .
'{my ($a,$b,$c,$d,$e,$f,$g,$h,$i,$j,$k,$l,$m,$n,$o,$p,$q,$r,$s,$t,$u)}'x999
. q|
    my $x;
format =
@
($x,warn \$x)[0]
.
}
|;
undef &foo;
eval 'sub foo { my @x; my $x = 34; write }';
foo();
__END__

So now, instead of using CvROOT to identify clones of
CvOUTSIDE(format), we use the padlist ID instead.  Padlists don’t
actually have an ID, so we give them one.  Any time a sub is cloned,
the new padlist gets the same ID as the old.  The format needs to
remember what its outer sub’s padlist ID was, so we put that in the
padlist struct, too.

Increase $B::Xref::VERSION from 1.03 to 1.04

Stop padlists from being AVs

In order to fix a bug, I need to add new fields to padlists.  But I
cannot easily do that as long as they are AVs.

So I have created a new padlist struct.

This not only allows me to extend the padlist struct with new members
as necessary, but also saves memory, as we now have a three-pointer
struct where before we had a whole SV head (3-4 pointers) + XPVAV (5
pointers).

This will unfortunately break half of CPAN, but the pad API docs
clearly say this:

    NOTE: this function is experimental and may change or be
    removed without notice.

This would have broken B::Debug, but a patch sent upstream has already
been integrated into blead with commit 9d2d23d981.

Use PADLIST in more places

Much code relies on the fact that PADLIST is typedeffed as AV.
PADLIST should be treated as a distinct type.

Move PAD(LIST) typedefs to perl.h

otherwise they can only be used in some header files.

[Merge] Enter inline.h

This is a home for static inline functions that cannot go in other
headers because they depend on proto.h or struct definitions.

This allows us to avoid repeating macros with GCC and non-GCC ver-
sions. It also makes it easier to avoid evaluating macro argu-
ments twice.
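
The double-evaluation problem mentioned above can be shown with a small generic sketch. It is not taken from the Perl headers; MAX_MACRO, max_inline and next_value are hypothetical names used only to contrast a function-like macro with a static inline function.

```c
/* A function-like macro: whichever argument wins is evaluated twice. */
#define MAX_MACRO(a, b) ((a) > (b) ? (a) : (b))

/* A static inline function: each argument is evaluated exactly once. */
static inline int max_inline(int a, int b)
{
    return a > b ? a : b;
}

/* Helper with a visible side effect, to make the difference observable. */
static int calls = 0;

static int next_value(void)
{
    return ++calls;
}
```

Calling MAX_MACRO(next_value(), 0) runs next_value() twice, once for the comparison and once for the result, whereas max_inline(next_value(), 0) runs it once. An inline function also gives a natural place for assert() without needing compiler-specific macro variants.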

I’ve moved just enough things into it to offset the additional lines
added by the comments at the top. The ‘net code removal’ of this
branch is 4 lines.

Move S_CvDEPTHp from cv.h to inline.h; shrink macros

This allows us to use assert() inside S_CvDEPTHp, so we no longer need
GCC and non-GCC variants of the macro that calls it.

Static inline functions for SvPADTMP and SvPADSTALE

This allows non-GCC compilers to have assertions and avoids
repeating the macros.

Use fast SvREFCNT_dec for non-GCC

Use static inline functions for SvREFCNT_inc

This avoids the need to repeat the macros in GCC and non-GCC versions.
For non-GCC compilers capable of inlining, this should speed things up
slightly, too, as PL_Sv is no longer needed.

[perl #113718] Add inline.h

We can put static inline functions here, and they can depend on
function prototypes and struct definitions from other header
files.

Sync Module-CoreList in Maintainers.pl for CPAN release

Update Changes for Module-CoreList and bump to version 2.72

[Merge] Here-doc parsing

I was waiting for 5.17.3 to be released, before merging my work on
padlists (which is blocking lexical subs), since I thought it would be
mean to inflict it on blead at the last minute before a release.

So, in the mean time, I decided to fix a small here-doc parsing bug,
that prevented them from occurring inside regexp code blocks.

As often happens, it turned out to be more involved than that....

I ended up writing a history of here-doc parsing, which you can find
in the commit message for 5097bf9b8d, which shows that the way they
have interacted with other quote-like operators (or other here-docs)
has changed over time in interesting ways.

While I was fixing those, I started to find other bugs.  Since I was
modifying the code, I decided to try applying David Nicol’s patch that
allows a here-doc terminator with no newline after it, to avoid
creating more conflicts through my changes.  The patch didn’t work.  And
while I was resolving what conflicts there were, I figured out a
simpler approach.  So, instead of trying to investigate why the
patch didn’t work, I just wrote my own version, which used less code.
Instead of working back on error to try to see whether we could have
accepted a terminator without a newline, we can just tack a newline on
the string buffer at EOF and let the rest of the code handle it the
usual way.

I continued to find more bugs as I went, till my ‘Yay, another bug!’
started to become ‘What? *Another* bug?’.

In the end:

• I fixed here-doc parsing, such that the body starts on the line fol-
  lowing the <<foo marker, regardless of whether it is inside quotes,
  string evals, or what have you (but see remaining bugs below).  This
  was contrary to the documentation, but the documentation was actu-
  ally wrong half the time, so I corrected it.
• Here-doc terminators no longer require a final newline at EOF.
• You no longer get crashes with edge cases.
• Nulls in comments no longer confuse the here-doc parser.

And, finally, one bug that I fixed was not related to here-docs per
se, but got in the way.  It deserves its own JAPH:

s/${s|||, \""}Just another Perl hacker,
/anything/;
print

There are still two bugs remaining:
• Here-docs whose markers occur in single-line s/// patterns where the
  replacement part is multi-line or starts on a subsequent line are
  still screwed.
• CR and CR LF line terminators are treated inconsistently inside and
  outside of string evals.

I’ve decided to set those aside for later and merge what I’ve
done so far.

perlop.pod: Update here-doc-in-quotes parsing rules

smoke-me diag

nt,hun

toke.c:scan_heredoc: Use PL_tokenbuf less

When scanning for a heredoc terminator in a string eval or quote-like
operator, the first character we are looking for is always a newline.
So instead of setting term to *PL_tokenbuf in those two code paths,
we can just hard-code '\n'.

Fix substitution in substitution pattern

Guess what this prints:

s/${s|||, \""}Just another Perl hacker,
/anything/;
print

And look at this:

$ perl5.6.2 -e 's/${s|||;\""}/foo\n/; print;'
$ perl5.16.0 -e 's/${s|||;\""}/foo\n/; print;'
$ perl5.17.2 -e 's/${s|||;\""}/foo\n/; print;'
Bus error
$ ./miniperl -e 's/${s|||;\""}/foo\n/; print;'
Bus error

The first two gave no output, though they should have shown "foo".
And bleadperl now crashes.

When the lexer parses a quote-like operator, it begins by extracting
what is between the quotes.  It puts it in an SV stored in the varia-
ble PL_lex_stuff.  Then, if it is y/// or s///, it scans the replace-
ment part and puts it in an SV in PL_lex_repl.  When it finishes with
it, it sets PL_lex_repl to NULL.

Now, if you put s/// in the pattern part of s/// (or y in s), the
inner s/// will clobber PL_lex_repl with its own replacement string.
So, when the outer s/// finishes parsing its pattern and wants its
replacement string, it is not there.  It then assumes it has already
parsed it (whether PL_lex_repl is set is how it remembers which half
of s/// it is parsing), and proceeds to feed bad code to the parser,
resulting in a bad op tree.

PL_lex_repl needs to be localised when a quote-like operator is
parsed.  Since localisation for quote-like operators happens in a sep-
arate yylex call (yylex calls sublex_push, which does it) after the
string delimiters are found, at which point PL_lex_repl has already
been set (clobbering the previous value), we change the delim-
iter-scanning code (scan_{str,trans,subst}) to use the new
PL_sublex_info.repl, which sublex_push now copies into PL_lex_repl
after localising the latter.

Fix here-docs in nested quote-like operators

When the lexer encounters a quote-like operator, it extracts the con-
tents of the quotes and starts an inner lexing scope.

To handle eval "s//<<FOO/e\n...", the here-doc parser peeks into the
outer lexing scope’s PL_linestr (current line buffer, which inside an
eval contains the entire string of code being parsed; for quote-like
operators, that is where the contents of the quote are stored).  It
only does this inside a string eval.  When parsing a file, the input
comes in one line at a time.  So the here-doc parser steals lines from
the input stream for s//<<FOO/e outside an eval.

This approach fails in this case, as the peekee is the linestr for
s///, not for the eval:

eval ' s//"${\<<END}"/e; print
Just another Perl hacker,
END
'or die $@
__END__
Can't find string terminator "END" anywhere before EOF at (eval 1) line 1.

We also need to do this peeking stuff outside of a string eval, to
solve this:

s//"${\<<END}"
Just another Perl hacker,
END
/e; print
__END__
Can't find string terminator "END" anywhere before EOF at - line 1.

In the first example above, we need to look not in the parent lexing
scope’s linestr, but in that of the grandparent.

To solve the second example, we need to check whether the outer lexing
scope is a quote-like operator when we are not in an eval.

For parsing here-docs in quotes in eval, we currently store two
things, the former buffer pointer and the former linestr, in
PL_sublex_info.super_{bufp,lines}tr.  The values for upper scopes are
stashed away on the savestack somewhere.

We need to be able to iterate through the outer lexer scopes till we
find one with multiple lines.  Retrieving the information from the
savestack would be too complex and error-prone.

Since PL_linestr is an SV, we can abuse a couple of fields in it.
Upgrading it to PVNV gives it both IVX and NVX fields, which are big
enough to store pointers.

IVX is already used to hold an op number.  So for the innermost quoted
scope we still need to use PL_sublex_info.super_bufptr.  When entering
a new lexing scope (in sublex_push), we can localise the IVX field of
the outer PL_linestr SV and set it to what PL_sublex_info.super_bufptr
was in that scope.  SvIVX(linestr) is only used for an op number when
that linestr’s lexing scope is the innermost one.

PL_sublex_info.super_linestr can be eliminated and replaced with
SvNVX(PL_linestr).

Don’t use strchr when scanning for newline after <<foo

The code that uses this is specifically for parsing <<foo inside a
quote-like operator inside a string eval.

This prints bar:

eval "s//<<foo/e
bar
foo
";
print $_ || $@;

This prints Can't find string terminator blah blah blah:

eval "s//<<foo/e #\0
bar
foo
";
print $_ || $@;

Nulls in comments are allowed elsewhere. This prints bar:

eval "\$_ = <<foo #\0
bar
foo
";
print $_ || $@;

The problem with strchr is that it is specifically for scanning null-
terminated strings. If embedded nulls are permitted (and should be in
this case), memchr should be used.
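
The difference is easy to reproduce in isolation. The sketch below is not the toke.c code; find_newline is a hypothetical helper that searches a buffer of known length, so an embedded null before the newline does not end the scan.

```c
#include <stddef.h>
#include <string.h>

/* Find the first newline in buf[0..len), even if a '\0' byte precedes it.
 * Returns NULL when there is no newline within the given length. */
static const char *find_newline(const char *buf, size_t len)
{
    return memchr(buf, '\n', len);
}
```

Given the buffer "s//<<foo/e #\0\nbar\nfoo\n", strchr(buf, '\n') stops at the embedded null and returns NULL, while find_newline(buf, length) locates the newline immediately after it.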

This code was added by 0244c3a403.

[perl #65838] perlop: remove caveat here-doc without newline

here-doc in quotes in multiline s//.../e in eval

When <<END occurs on the last line of a quote-like operator inside a
string eval ("${\<<END}"), it peeks into the linestr buffer of the
parent lexing scope (quote-like operators start a new lexing scope
with the linestr buffer containing what is between the quotes) to find
the body of the here-doc. It modifies that buffer, stealing however
much it needs.

It was not leaving things in the consistent state that s///e checks
for when it finishes parsing the replacement (to make sure s//}+{/
doesn’t ‘work’). Specifically, it was not shrinking the parent buffer,
so when PL_bufend was reset in sublex_done to the end of the parent
buffer, it was pointing to the wrong spot.
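
The bookkeeping involved can be sketched in plain C.  This is a
simplified model under stated assumptions: struct parent_buf and
steal_range are hypothetical names, and the real code operates on an
SV rather than a raw char array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified parent buffer: bytes [buf, end) are live.  These names are
 * illustrative and do not correspond to fields in the real parser. */
struct parent_buf {
    char *buf;
    char *end;
};

/* Remove the bytes [from, to) from the parent buffer, e.g. a here-doc
 * body that an inner scope has consumed.  The tail is moved down and
 * `end` is pulled back, so any code that later resets its cursor to
 * `end` lands on the true end of the remaining text. */
static void steal_range(struct parent_buf *p, char *from, char *to)
{
    assert(p->buf <= from && from <= to && to <= p->end);
    memmove(from, to, (size_t)(p->end - to));
    p->end -= to - from;
}
```

Skipping the final adjustment of `end` reproduces the kind of
inconsistency described above: the text has moved, but the recorded end
still refers to the old length.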

heredoc after "" in s/// in eval

This works fine:

eval ' s//<<END.""/e; print
Just another Perl hacker,
END
'or die $@
__END__
Just another Perl hacker,

But this doesn’t:

eval ' s//"$1".<<END/e; print
Just another Perl hacker,
END
'or die $@
__END__
Can't find string terminator "END" anywhere before EOF at (eval 1) line 1.

It fails because PL_sublex_info.super_buf*, added by commit
0244c3a403, are not localised, so, after the "", s/// sees its own
buffer pointers in those variables, instead of those of its parent
string eval.

This used to happen only with s///e inside s///e, but that was because
here-docs would peek inside the parent linestr buffer only inside
s///e, and not other quote-like operators. That was fixed in
recent commits.

Simply moving the assignment of super_buf* into sublex_push does solve
the bug for a simple "", as "" does sublex_start, but not sublex_push.
We do need to localise those variables for "${\''}", however.
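
What localising buys here can be shown with a small save-and-restore
sketch.  It is not the savestack mechanism perl actually uses; struct
super_buf, scope_enter and scope_leave are hypothetical names chosen
for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for the pair of pointers the commit message
 * calls PL_sublex_info.super_buf*; the field names are assumptions. */
struct super_buf {
    const char *ptr;
    const char *end;
};

/* Enter an inner lexing scope: record the parent's cursor and end, and
 * hand the previous values back so the caller can restore them. */
static struct super_buf scope_enter(struct super_buf *cur,
                                    const char *parent_ptr,
                                    const char *parent_end)
{
    assert(cur != NULL && parent_ptr <= parent_end);
    struct super_buf saved = *cur;
    cur->ptr = parent_ptr;
    cur->end = parent_end;
    return saved;
}

/* Leave the scope: put back whatever was there before. */
static void scope_leave(struct super_buf *cur, struct super_buf saved)
{
    *cur = saved;
}
```

With the restore in place, a nested "" inside s/// can overwrite the
pointers for its own duration, and s/// sees its parent's buffer again
once the inner scope ends.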

toke.c:S_scan_heredoc: Add comment about <<\FOO

[perl #65838] Allow here-doc with no final newline

When reading a line of input while scanning a here-doc, if the line
does not end in \n, then we know we have reached the end of input.  By
simply tacking a \n on to the buffer, we can meet the expectations of
the rest of the here-doc parsing code.  If it turns out the delimiter
is not found on that line, it does not matter that we modified it, as
we will croak anyway.
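
The "tack on a \n" step might look like the following in plain C.  This
is only a sketch: struct line_buf and ensure_trailing_newline are
hypothetical, and the real lexer keeps the line in an SV, which is not
modelled here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a growable line buffer. */
struct line_buf {
    char  *data;   /* heap-allocated, always NUL-terminated */
    size_t len;    /* bytes in use, not counting the NUL */
};

/* Ensure the buffer ends in '\n'.  Returns 1 if a newline was appended
 * (meaning the input ended without one), 0 if nothing changed, and -1
 * on allocation failure. */
static int ensure_trailing_newline(struct line_buf *b)
{
    assert(b != NULL && b->data != NULL);
    if (b->len > 0 && b->data[b->len - 1] == '\n')
        return 0;
    char *grown = realloc(b->data, b->len + 2);  /* '\n' plus NUL */
    if (grown == NULL)
        return -1;
    grown[b->len++] = '\n';
    grown[b->len] = '\0';
    b->data = grown;
    return 1;
}
```

After this call, code that compares a whole line against the delimiter
followed by a newline can treat the last line like any other.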

I had to add a new flag to lex_next_chunk.  Before commit f0e67a1d2,
S_scan_heredoc would read from the stream itself, without closing any
handles.  So the next time through yylex, the eof code would supply
the final implicit semicolon.

Since f0e67a1d2, S_scan_heredoc has been calling lex_next_chunk, which
takes care of reading from the stream and supplying any final ; at eof.
The here-doc parser will just get confused as a result (<<';' would
work without any terminator).  The new flag tells lex_next_chunk not
to do anything at eof (not even closing handles and resetting the
parser state), but to return false and leave everything as it was.

heredoc.t: Suppress deprecation warnings

Clean up heredoc.t

* Made the tests more independent, mostly by decoupling the use of
  a single $string.  This will make it easier to expand on the test file
  later.

* Replace ok( $foo eq $bar ) with is() for better diagnostics

* Remove unnecessary STDERR redirection.  fresh_perl does that for you.

* Fix fresh_perl to honor progfile and stderr arguments passed in
  rather than just overwriting them

[perl #65838] Tests for here-docs without final newlines

and a few error cases

[perl #114040] Parse here-docs correctly in quoted constructs

When parsing code outside a string eval or quoted construct, the lexer
reads one line at a time into PL_linestr.

To parse a here-doc (hereinafter ‘deer hock’, because I spike lunar-
isms), the lexer has to pull extra lines out of the input stream ahead
of the current line, the value of PL_linestr remaining the same.

In a string eval, the entire piece of code being parsed is in
PL_linestr.

To parse a deer hock inside a string eval, the lexer has to fiddle
with the contents of PL_linestr, scanning for newline characters.

Originally, S_scan_heredoc just followed those two approaches.

When the lexer encounters a quoted construct, it looks for the ending
delimiter (reading from the input stream if necessary), puts the
entire quoted thing (minus quotes) in PL_linestr, and then starts an
inner lexing scope.

This means that deer hocks would not nest properly outside of a string
eval, because the body of the inner deer hock would be pulled out of
the input stream *after* the outer deer hock.

Larry Wall fixed that in commit fd2d095329 (Jan. 1997), so that this
would work:

<<foo
${\<<bar}
ber
bar
foo

He did so by following the string eval approach (looking for the deer
hock body in PL_linestr) if the deer hock was inside another quoted
construct.

Later, commit a2c066523a (Mar. 1998) fixed this:

s/^not /substr(<<EOF, 0, 0)/e;
  Ignored
EOF

by following the string eval approach only if the deer hock was inside
another non-backtick deer hock, not just any quoted construct.

The problem with the string eval approach inside a substitution is
that it only looks in PL_linestr, which only contains
‘substr(<<EOF, 0, 0)’ when the lexer is handling the second part of
the s/// operator.

But that unfortunately broke this:

s/^not /substr(<<EOF, 0, 0)
  Ignored
EOF
/e;

and this:

print <<`EOF`;
${\<<EOG}
echo stuff
EOG
EOF

reverting it to the pre-fd2d095329 behaviour, because the outer quoted
construct was treated as one line.

Later on, commit 0244c3a403 (Mar. 1999) fixed this:

eval 's/.../<<FOO/e
  stuff
FOO
';

which required a new approach not used before.  When the replacement
part of the s/// is being parsed, PL_linestr contains ‘<<FOO’.  The
body of the deer hock is not in the input stream (there isn’t one),
but in what was the previous value of PL_linestr before the lexer
encountered s///.

So 0244c3a403 fixed that by recording pointers into the outer string
and using them in S_scan_heredoc.  That commit, for some reason, was
written such that it applied only to substitutions, and not to other
quoted constructs.

It also failed to take interpolation into account, and did not record
the outer buffer position, but then tried to use it anyway, resulting
in crashes in both these cases:

eval 's/${ <<END }//';
eval 's//${ <<END }//';

It also failed to take multiline s///’s into account, resulting in
neither of these working, because it lost track of the current cursor,
leaving it at 'D' instead of the line break following it:

eval '
s//<<END
/e;
blah blah blah
END
;1' or die $@;

eval '
s//<<END
blah blah blah
END
/e;
;1' or die $@;

S_scan_heredoc currently positions the cursor (s) at the last character
of <<END if there is a line break on the same line.  There is an s++
later on to account for that, but the code added by 0244c3a403
bypassed it.

So, in the end, deer hocks could only be nested in other quoted
constructs if the outer construct was in a string eval and was not
s///, or was a non-backtick deer hock.

This commit hopefully fixes most of the problems. :-)

The s///-in-eval case is a little tricky.  We have to see whether the
deer hock label is on the last line of the s///.  If it is, we have
to peek into the outer buffer.  Otherwise, we have to treat it like a
string eval.
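
The decision described here comes down to whether any line break
follows the label inside the inner buffer.  A minimal sketch, assuming
the buffer is a plain byte range; the enum and the function name are
hypothetical and not taken from toke.c:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Where the here-doc body should be looked for (illustrative names). */
enum body_source {
    BODY_IN_OUTER_BUFFER,   /* label is on the last line of the s/// */
    BODY_IN_INNER_BUFFER    /* more lines follow: treat like string eval */
};

/* `after_label` points just past the <<LABEL token and `end` marks the
 * end of the inner buffer.  If no newline lies between them, the label
 * is on the last line and the body must come from the outer buffer. */
static enum body_source locate_body(const char *after_label,
                                    const char *end)
{
    assert(after_label <= end);
    if (memchr(after_label, '\n', (size_t)(end - after_label)) == NULL)
        return BODY_IN_OUTER_BUFFER;
    return BODY_IN_INNER_BUFFER;
}
```

memchr is used rather than strchr so that the check is bounded by the
buffer's end and is not cut short by an embedded null.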

This commit does not deal with <<END inside the pattern of a multi-
line s/// or in nested quotes.