utf8.c: refactor utf8n_to_uvuni()
The prior version had a number of issues, some of which have been taken
care of in previous commits.
The goal when presented with malformed input is to consume as few bytes
as possible, so as to position the input for the next try to the first
possible byte that could be the beginning of a character. We don't want
to consume too few bytes, so that the next call has us thinking that
what is the middle of a character is really the beginning; nor do we
want to consume too many, so as to skip valid input characters. (The
latter is forbidden by the Unicode standard because of security
considerations.) The previous code could do both of these under various
circumstances.
In some cases it took as a given that the first byte in a character is
correct, and skipped looking at the rest of the bytes in the sequence.
This is wrong when just that first byte is garbled. We have to look at
all bytes in the expected sequence to make sure it hasn't been
prematurely terminated from what we were led to expect by that first
byte.
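
As a rough illustration of this rule (a sketch only, not the code in
utf8.c; implied_len() and bytes_to_consume() are hypothetical names, and
only the 1- to 4-byte forms are handled for brevity), the scan can stop
at the first byte that is not a continuation byte, leaving that byte for
the next call:

```c
#include <stddef.h>

/* Hypothetical helper: sequence length implied by a start byte, or 0 if
 * this sketch does not treat the byte as a start byte. */
static size_t implied_len(unsigned char b)
{
    if (b < 0x80) return 1;     /* ASCII */
    if (b < 0xC0) return 0;     /* continuation byte, not a start byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    return 0;                   /* longer forms omitted in this sketch */
}

/* Hypothetical helper: how many bytes to consume from s[0..avail-1].
 * *complete is set to 1 only if every expected byte was present. */
static size_t bytes_to_consume(const unsigned char *s, size_t avail,
                               int *complete)
{
    size_t want, i;

    *complete = 0;
    if (avail == 0)
        return 0;
    want = implied_len(s[0]);
    if (want == 0)
        return 1;               /* stray byte: skip just this one */
    for (i = 1; i < want; i++) {
        /* Look at every expected byte; stop before the first one that
         * is missing or is not a continuation byte. */
        if (i == avail || (s[i] & 0xC0) != 0x80)
            return i;
    }
    *complete = 1;
    return want;
}
```

Given the bytes E2 82 followed by 'A', this consumes two bytes and
leaves 'A' to be decoded by the next call.
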
Likewise when we get an overflow: we have to keep looking at each byte
in the sequence. It may be that the initial byte was garbled, so that
it appeared that there was going to be overflow, but in reality, the
input was supposed to be a shorter sequence that doesn't overflow. We
want to have an error on that shorter sequence, and advance the pointer
to just beyond it, which is the first position where a valid character
could start.
This fixes a long-standing TODO from an externally supplied utf8 decode
test suite.
And, the old algorithm for finding overflow failed to detect it on some
inputs. This was spotted by Hugo van der Sanden, who suggested the new
algorithm that this commit uses, and which should work in all instances.
For example, on a 32-bit machine, any string beginning with "\xFE" and
having the next byte be either "\x86" or "\x87" overflows, but this was
missed by the old algorithm.
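
To see how such a case can slip through, here is a sketch that simulates
a 32-bit word with uint32_t. overflows_by_compare() is a stand-in for a
wraparound-style check (it is not the old utf8.c code), and
overflows_by_mask() is one way to test, before shifting, whether set
bits would be lost. Whether the latter matches the algorithm in this
commit exactly is an assumption; both helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in check: report overflow only if the accumulated value gets
 * smaller after adding a byte.  s[0] is assumed to be "\xFE", which
 * carries no payload bits, so accumulation starts at 0. */
static bool overflows_by_compare(const unsigned char *s, size_t len)
{
    uint32_t uv = 0;
    size_t i;

    for (i = 1; i < len; i++) {
        uint32_t old = uv;
        uv = (uint32_t)(uv << 6) | (uint32_t)(s[i] & 0x3F);
        if (uv < old)
            return true;
    }
    return false;
}

/* Mask check: if any of the top 6 bits are set, the next 6-bit shift
 * would push them out of the word, so the value cannot fit. */
static bool overflows_by_mask(const unsigned char *s, size_t len)
{
    const uint32_t top6 = (uint32_t)0x3F << 26;
    uint32_t uv = 0;
    size_t i;

    for (i = 1; i < len; i++) {
        if (uv & top6)
            return true;
        uv = (uint32_t)(uv << 6) | (uint32_t)(s[i] & 0x3F);
    }
    return false;
}
```

With the input FE 86 80 80 80 80 80, the comparison check reports no
overflow, because the truncated result 0x80000000 is still larger than
the previous value 0x06000000, while the mask check does report it.
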
Another bug was that the code was careless about what happens when a
malformation occurs that the input flags allow. For example, a sequence
should not start with a continuation byte. If that malformation was
allowed, the code pretended it was a start byte and extracted the "length"
of the sequence from it. But pretending it is a start byte is not the
same thing as it actually being a start byte, and so there is no
extractable length in it, so the number that this code thought was
"length" was bogus.
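
The point can be seen from the bit patterns. In a start byte of the
standard forms, the number of leading 1 bits encodes the sequence
length, whereas a continuation byte always has the form 10xxxxxx, so the
same count means nothing there. A small sketch, with hypothetical helper
names (this is not how utf8.c looks up lengths):

```c
/* True if b has the continuation-byte form 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Count the leading 1 bits of b. */
static unsigned leading_ones(unsigned char b)
{
    unsigned n = 0;

    while (b & 0x80) {
        n++;
        b = (unsigned char)(b << 1);
    }
    return n;
}
```

For the start byte 0xE2 the count is 3, which is the sequence length.
For the continuation byte 0x80 the count is 1, which is only the marker
bit and not a length at all.
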
Yet another bug fixed is that if only the warning subcategories of the
utf8 category were turned on, and not the entire utf8 category itself,
warnings that should have been raised were not.
And yet another change is that given malformed input with warnings
turned off, this function used to return whatever it had computed so
far, which is incomplete or erroneous garbage. This commit changes it
to return the REPLACEMENT CHARACTER instead.
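
A minimal sketch of that behavior, limited to two-byte sequences and
ignoring overlong forms (decode2() is a hypothetical name, not a
function in utf8.c): on any malformation the caller gets U+FFFD rather
than a partially accumulated value.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHARACTER 0xFFFDu

/* Decode a two-byte sequence from s[0..avail-1].  Returns U+FFFD if the
 * input is too short, s[0] is not a two-byte start byte, or s[1] is not
 * a continuation byte. */
static uint32_t decode2(const unsigned char *s, size_t avail)
{
    if (avail < 2 || (s[0] & 0xE0) != 0xC0 || (s[1] & 0xC0) != 0x80)
        return REPLACEMENT_CHARACTER;
    return ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
}
```
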
Thanks to Hugo van der Sanden for reviewing and finding problems with an
earlier version of these commits.