contrib: add unicode/utf8-dump.py
authorDavid Malcolm <dmalcolm@redhat.com>
Fri, 8 Oct 2021 14:53:42 +0000 (10:53 -0400)
committerDavid Malcolm <dmalcolm@redhat.com>
Mon, 1 Nov 2021 15:52:28 +0000 (11:52 -0400)
commitb050653c4cb63fe46b9727af924f9bc2b6475fba
tree483cf51829392380b442fd9c9de5fa29f41bd766
parent429e3b7d8bf6609ddf7c7b1e49244997e9ac76b8
contrib: add unicode/utf8-dump.py

This script may be useful when debugging issues relating to Unicode
encoding (e.g. when investigating source files with bidirectional control
characters).

It dumps a UTF-8 file as a list of numbered lines (mimicking GCC's
diagnostic output format), interleaved with lines per character showing
the Unicode codepoints, the UTF-8 encoding bytes, the name of the
character, and, where printable, the characters themselves.
The lines are printed in logical order, which may help the reader to grok
the relationship between visual and logical ordering in bi-di files.

For example:

$ cat test.c
int གྷ;
const char *אבג = "ALEF-BET-GIMEL";

$ ./contrib/unicode/utf8-dump.py test.c
   1 | int གྷ;
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0074            0x74                     LATIN SMALL LETTER T t
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0F43  0xe0 0xbd 0x83                       TIBETAN LETTER GHA གྷ
     |   U+003B            0x3b                                SEMICOLON ;
     |   U+000A            0x0a                           LINE FEED (LF) (control character)
   2 | const char *אבג = "ALEF-BET-GIMEL";
     |   U+0063            0x63                     LATIN SMALL LETTER C c
     |   U+006F            0x6f                     LATIN SMALL LETTER O o
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0073            0x73                     LATIN SMALL LETTER S s
     |   U+0074            0x74                     LATIN SMALL LETTER T t
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0063            0x63                     LATIN SMALL LETTER C c
     |   U+0068            0x68                     LATIN SMALL LETTER H h
     |   U+0061            0x61                     LATIN SMALL LETTER A a
     |   U+0072            0x72                     LATIN SMALL LETTER R r
     |   U+0020            0x20                                    SPACE (separator)
     |   U+002A            0x2a                                 ASTERISK *
     |   U+05D0       0xd7 0x90                       HEBREW LETTER ALEF א
     |   U+05D1       0xd7 0x91                        HEBREW LETTER BET ב
     |   U+05D2       0xd7 0x92                      HEBREW LETTER GIMEL ג
     |   U+0020            0x20                                    SPACE (separator)
     |   U+003D            0x3d                              EQUALS SIGN =
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0022            0x22                           QUOTATION MARK "
     |   U+0041            0x41                   LATIN CAPITAL LETTER A A
     |   U+004C            0x4c                   LATIN CAPITAL LETTER L L
     |   U+0045            0x45                   LATIN CAPITAL LETTER E E
     |   U+0046            0x46                   LATIN CAPITAL LETTER F F
     |   U+002D            0x2d                             HYPHEN-MINUS -
     |   U+0042            0x42                   LATIN CAPITAL LETTER B B
     |   U+0045            0x45                   LATIN CAPITAL LETTER E E
     |   U+0054            0x54                   LATIN CAPITAL LETTER T T
     |   U+002D            0x2d                             HYPHEN-MINUS -
     |   U+0047            0x47                   LATIN CAPITAL LETTER G G
     |   U+0049            0x49                   LATIN CAPITAL LETTER I I
     |   U+004D            0x4d                   LATIN CAPITAL LETTER M M
     |   U+0045            0x45                   LATIN CAPITAL LETTER E E
     |   U+004C            0x4c                   LATIN CAPITAL LETTER L L
     |   U+0022            0x22                           QUOTATION MARK "
     |   U+003B            0x3b                                SEMICOLON ;
     |   U+000A            0x0a                           LINE FEED (LF) (control character)

Tested with Python 3.8

contrib/ChangeLog:
* unicode/utf8-dump.py: New file.

Signed-off-by: David Malcolm <dmalcolm@redhat.com>
contrib/unicode/utf8-dump.py [new file with mode: 0755]