[libc++][format] Improves Unicode decoders.
During the implementation of P2286 a second Unicode decoder was added.
The original decoder was only used for width estimation. Changing an
ill-formed Unicode sequence to the replacement character works properly
for that use case. For P2286 an ill-formed Unicode sequence needs to be
formatted as a sequence of code units. The exact wording in the
Standard was a bit unclear and there was an odd example in the WP,
which made it hard to use the same decoder. SG16 determined the odd
example was a bug and it has been fixed in the WP.
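As a minimal illustration of that requirement (the escaped output shown
below is an assumption based on the direction of P2286, not a quote
from the WP):

  // C++23: format an ill-formed UTF-8 sequence with the escaped
  // string presentation type.
  #include <format>
  #include <iostream>

  int main() {
    // "\xc3\x28": 0xC3 starts a two code unit sequence, but 0x28
    // ('(') is not a valid continuation code unit, so the sequence
    // is ill-formed.
    std::cout << std::format("{:?}", "\xc3\x28") << '\n';
    // Expected along the lines of: "\x{c3}(" -- the lone 0xC3 is
    // escaped as a code unit, the well-formed '(' is kept as-is.
  }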
That fix made it possible to combine the two decoders. The P2286
decoder kept track of the size of an ill-formed sequence. However,
this was not needed: the output algorithm already has to keep track of
the size of both well-formed and ill-formed sequences itself. So this
feature has been removed.
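A toy sketch of why the decoder does not need to report that size
(illustrative code, not libc++'s internal API; decode_one is
deliberately simplified):

  #include <cstddef>
  #include <string_view>

  struct consume_result {
    char32_t code_point;
    bool ok; // false for an ill-formed sequence
  };

  // Deliberately simplified: ASCII is decoded, everything else is
  // treated as a single ill-formed code unit. A real decoder handles
  // multi code unit sequences, but the caller-side bookkeeping stays
  // the same.
  inline consume_result decode_one(const char*& it, const char*) {
    unsigned char c = static_cast<unsigned char>(*it++);
    if (c < 0x80)
      return {c, true};
    return {0xFFFD, false};
  }

  // The caller derives the size of every sequence, well-formed or
  // ill-formed, from the iterator positions around decode_one.
  inline std::size_t ill_formed_code_units(std::string_view input) {
    std::size_t result = 0;
    const char* it   = input.data();
    const char* last = it + input.size();
    while (it != last) {
      const char* first = it;
      consume_result r = decode_one(it, last);
      if (!r.ok)
        result += static_cast<std::size_t>(it - first);
    }
    return result;
  }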
The error status remains since it is needed for P2286; grapheme
clustering can ignore this value. (In general, grapheme clustering only
has specified behaviour for Unicode. When the string is in a
non-Unicode encoding there are no requirements, and ill-formed Unicode
is a non-Unicode encoding. Still, libc++ does a best-effort
estimation.)
The UTF-8 decoder accepted several ill-formed sequences (a sketch of
the corresponding validity checks follows this list):
- Values in the surrogate range U+D800..U+DFFF.
- Overlong encodings, i.e. values encoded in more code units than
  required. For example, U+0020 can in theory be encoded using 1, 2,
  3, or 4 code units and all of these were accepted. The Unicode
  Standard only allows the shortest encoding.
- Values larger than U+10FFFF were not always rejected.
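A hedged sketch of the corresponding checks (the function below is
illustrative; the name and structure are not libc++'s internals):

  // 'value' is the decoded code point, 'code_units' the number of
  // code units its encoding used.
  constexpr bool is_well_formed(char32_t value, int code_units) {
    // Surrogates U+D800..U+DFFF are not valid code points.
    if (value >= 0xD800 && value <= 0xDFFF)
      return false;
    // Values beyond U+10FFFF are not valid code points.
    if (value > 0x10FFFF)
      return false;
    if (code_units < 1 || code_units > 4)
      return false;
    // Overlong encodings: the value must not fit in fewer code
    // units. Maximum value for 1, 2, 3, and 4 code units.
    constexpr char32_t max_value[] = {0x7F, 0x7FF, 0xFFFF, 0x10FFFF};
    if (code_units > 1 && value <= max_value[code_units - 2])
      return false; // e.g. U+0020 encoded in 2, 3, or 4 code units
    return true;
  }

  static_assert(!is_well_formed(0xD800, 3));   // surrogate
  static_assert(!is_well_formed(0x110000, 4)); // above U+10FFFF
  static_assert(!is_well_formed(0x20, 2));     // overlong U+0020
  static_assert( is_well_formed(0x20, 1));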
Reviewed By: #libc, ldionne, tahonermann, Mordante
Differential Revision: https://reviews.llvm.org/D144346