# This file is derived from
# http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
# which was created by Markus Kuhn <mkuhn@acm.org> - 2000-09-02
# Lines beginning with # and blank lines are ignored.
# Beyond that, this file consists of a series of test cases. Each test case consists of:
# 1. A string
# 2. A status, one of:
#     VALID      : The string is a valid UTF-8 representation of valid Unicode
#     INCOMPLETE : The string has a partial character at the end
#     NOTUNICODE : The string is valid UTF-8, but the characters represented
#                  are not valid Unicode
#     OVERLONG   : The string includes overlong sequences
#     MALFORMED  : The string is not valid UTF-8
# 3. If the status is VALID or NOTUNICODE, the UCS-4 representation of the string,
#    as a series of hex numbers.
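# A minimal sketch of how a harness might assign the five statuses above, written
# in Python purely for illustration. The name `classify`, the precedence between
# statuses, and the exact NOTUNICODE rule (surrogates, U+FFFE/U+FFFF, anything
# above U+10FFFF) are assumptions; this file does not specify them.

```python
# (first lead byte, last lead byte, sequence length, lead payload mask,
#  smallest code point that genuinely needs this length)
LEADS = [
    (0xC0, 0xDF, 2, 0x1F, 0x80),
    (0xE0, 0xEF, 3, 0x0F, 0x800),
    (0xF0, 0xF7, 4, 0x07, 0x10000),
    (0xF8, 0xFB, 5, 0x03, 0x200000),
    (0xFC, 0xFD, 6, 0x01, 0x4000000),
]


def classify(data: bytes) -> str:
    """Return VALID, INCOMPLETE, NOTUNICODE, OVERLONG or MALFORMED (a sketch)."""
    status = "VALID"
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:  # plain ASCII
            i += 1
            continue
        for lo, hi, length, mask, minimum in LEADS:
            if lo <= b <= hi:
                break
        else:
            return "MALFORMED"  # stray continuation byte, or 0xFE / 0xFF
        tail = data[i + 1:i + length]
        if any((t & 0xC0) != 0x80 for t in tail):
            return "MALFORMED"  # a non-continuation byte inside the sequence
        if len(tail) < length - 1:
            return "INCOMPLETE"  # the string ended in the middle of a character
        cp = b & mask
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        if cp < minimum:
            return "OVERLONG"
        # Assumed definition of "not valid Unicode" for this sketch.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF):
            status = "NOTUNICODE"
        i += length
    return status
```

# For example, classify(b"\xc0\xaf") reports OVERLONG and classify(b"\xe2\x82")
# reports INCOMPLETE under these assumptions.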
# 1 Some correct UTF-8 text
03ba 1f79 03c3 03bc 03b5
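# The hex field above can be turned back into text and then into UTF-8 bytes.
# The following is a small illustrative sketch; the helper name
# `parse_ucs4_field` is hypothetical and not part of any test harness.

```python
def parse_ucs4_field(field: str) -> str:
    """Convert a whitespace-separated list of hex code points into a string."""
    return "".join(chr(int(token, 16)) for token in field.split())


text = parse_ucs4_field("03ba 1f79 03c3 03bc 03b5")
encoded = text.encode("utf-8")
print(encoded.hex(" "))  # ce ba e1 bd b9 cf 83 ce bc ce b5
```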
# 2.1 First possible sequence of a certain length
# FIXME - handle NULs?
# 2.3 Other boundary conditions
# 3.1 Unexpected continuation bytes
\80\81\82\83\84\85\86\87\88\89\8a\8b\8c\8d\8e\8f\90\91\92\93\94\95\96\97\98\99\9a\9b\9c\9d\9e\9f\a0¡¢£¤¥¦§¨©ª«¬\ad®¯°±²³´µ¶·¸¹º»¼½¾¿
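# As an illustration of why these bytes are unexpected: a byte in the range
# 0x80-0xBF can only continue a sequence, so on its own it never decodes.
# The sketch below checks this with Python's built-in strict decoder; the helper
# name `is_valid_utf8` is an assumption made for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Every possible continuation byte, each taken on its own.
lone = [bytes([b]) for b in range(0x80, 0xC0)]
rejected = [b for b in lone if not is_valid_utf8(b)]
print(len(lone), len(rejected))  # 64 64
```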
# 3.2 Lonely start characters
À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
à á â ã ä å æ ç è é ê ë ì í î ï
# 3.3 Sequences with last continuation byte missing
# 3.4 Concatenation of incomplete sequences
Àà\80ð\80\80ø\80\80\80ü\80\80\80\80ßï¿÷¿¿û¿¿¿ý¿¿¿¿
# 3.5 Impossible bytes
# Examples of an overlong ASCII character
# Maximum overlong sequences
# Overlong representation of the NUL character
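# To make "overlong" concrete, the sketch below forces a code point into more
# bytes than it needs. The function `encode_forced` is hypothetical, and the
# choice of "/" (U+002F) as the sample ASCII character is only illustrative.

```python
def encode_forced(cp: int, length: int) -> bytes:
    """Encode cp in exactly `length` bytes (2..6), even if that is overlong."""
    lead_marks = {2: 0xC0, 3: 0xE0, 4: 0xF0, 5: 0xF8, 6: 0xFC}
    if length not in lead_marks:
        raise ValueError("length must be between 2 and 6")
    # The lead byte carries (7 - length) payload bits, each tail byte carries 6.
    payload_bits = (7 - length) + 6 * (length - 1)
    if not 0 <= cp < (1 << payload_bits):
        raise ValueError("code point does not fit in that many bytes")
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))  # low six bits go into a tail byte
        cp >>= 6
    return bytes([lead_marks[length] | cp] + tail[::-1])


print(encode_forced(0x2F, 2).hex(" "))  # c0 af
print(encode_forced(0x2F, 3).hex(" "))  # e0 80 af
print(encode_forced(0x7F, 2).hex(" "))  # c1 bf
print(encode_forced(0x00, 2).hex(" "))  # c0 80
```

# A strict decoder is expected to reject all four of these byte strings, even
# though each one carries the bits of a perfectly ordinary ASCII code point.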
# Illegal code positions
# Single UTF-16 surrogates
# Paired UTF-16 surrogates
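# A short sketch of why paired surrogates are still illegal in UTF-8: the pair
# stands for a single supplementary code point, which has its own 4-byte form.
# The helper name `combine_surrogates` is an assumption for this example, and
# Python's "surrogatepass" error handler is used only to produce the illegal bytes.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 high/low surrogate pair to the code point it represents."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = combine_surrogates(0xD800, 0xDC00)
correct = chr(cp).encode("utf-8")
# Encoding each surrogate separately gives six bytes that strict UTF-8 rejects.
paired = b"".join(
    chr(s).encode("utf-8", "surrogatepass") for s in (0xD800, 0xDC00)
)
print(hex(cp), correct.hex(" "), "|", paired.hex(" "))
# 0x10000 f0 90 80 80 | ed a0 80 ed b0 80
```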
# Other illegal code positions
# Some more tests, not from Markus Kuhn's file
# Mixed plane 0 and higher planes
41 00010000 42 10ffff 43
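# As a rough illustration of what "mixed planes" means at the byte level, the
# sketch below reports how many UTF-8 bytes each code point in the field above
# needs. The helper name `utf8_lengths` is hypothetical.

```python
def utf8_lengths(hex_field: str) -> list:
    """UTF-8 byte count for each hex code point in a whitespace-separated field."""
    return [len(chr(int(tok, 16)).encode("utf-8")) for tok in hex_field.split()]


print(utf8_lengths("41 00010000 42 10ffff 43"))  # [1, 4, 1, 4, 1]
```

# Plane 0 ASCII letters take one byte each, while U+10000 and U+10FFFF, the
# first and last code points of the higher planes, take four.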