# ******************************************************************************
# * Copyright (C) 1995-2014, International Business Machines
# * Corporation and others. All Rights Reserved.
# ******************************************************************************
# If this converter alias table looks very confusing, a much easier to
# understand view can be found at this demo:
# http://demo.icu-project.org/icu-bin/convexp
#
# This file is not read directly by ICU. If you change it, you need to
# run gencnval, and eventually run pkgdata to update the representation that
# ICU uses for aliases. The gencnval tool will normally compile this file into
# cnvalias.icu. The gencnval -v verbose option will help you when you edit
# this file.
#
# Please be friendly to the rest of us who edit this table by
# keeping this table free of tabs.
#
# This is an alias file used by the character set converter.
# A lot of converter information can be found in unicode/ucnv.h, but here
# is more information about this file.
#
# If you are adding a new converter to this list and want to include it in the
# ICU data library, please be sure to add an entry to the appropriate ucm*.mk file
# (see ucmfiles.mk for more information).
#
# Here is the file format using BNF-like syntax:
#
# converterTable ::= tags { converterLine* }
# converterLine  ::= converterName [ tags ] { taggedAlias* }'\n'
# taggedAlias    ::= alias [ tags ]
# tags           ::= '{' { tag+ } '}'
# tag            ::= standard['*']
# converterName  ::= [0-9a-zA-Z:_'-']+
# alias          ::= converterName
#
# Except for the converter name, aliases are case insensitive.
# Names are separated by whitespace.
# Line continuation and comment syntax are similar to the GNU make syntax.
# Any lines beginning with whitespace (e.g. U+0020 SPACE or U+0009 HORIZONTAL
# TABULATION) are presumed to be a continuation of the previous line.
# The # symbol starts a comment and the comment continues till the end of
# the line.
#
# All names can be tagged by including a space-separated list of tags in
# curly braces, as in ISO_8859-1:1987{IANA*} iso-8859-1 { MIME* } or
# some-charset{MIME* IANA*}. The order of tags does not matter, and
# whitespace is allowed between the tagged name and the tags list.
#
# The tags can be used to get standard names using ucnv_getStandardName().
#
# The complete list of recognized tags used in this file is defined in
# the affinity list near the beginning of the file.
#
# The * after the standard tag denotes that the previous alias is the
# preferred (default) charset name for that standard. There can only
# be one of these default charset names per converter.
#
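# As an illustrative sketch only, the three lines below show how a tagged
# entry with continuation lines might look. The names example-charset-1,
# Example_Charset, and x-example are hypothetical and do not exist in this
# table; the lines are commented out so that they are not compiled.
#
#     example-charset-1 { UTR22* }        # converter name, default UTR22 name
#         Example_Charset { IANA* MIME* } # preferred IANA and MIME name
#         x-example { IANA }              # additional IANA alias, not a default
#
# Under the rules described here, Example_Charset would be expected to be the
# name reported for this converter under both the IANA and MIME standards.
#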
# The world is getting more complicated...
# Supporting XML parsers, HTML, MIME, and similar applications
# that mark encodings with a charset name can be difficult.
# Many of these applications and operating systems will update
# their codepages over time.
#
# This means that a new codepage, one that differs from an
# old one by changing a code point, e.g., to the Euro sign,
# must not get an old alias, because it would mean that
# old files with this alias would be interpreted differently.
#
# If a codepage gets updated by assigning characters to previously
# unassigned code points, then a new name is not necessary.
# Also, some codepages map unassigned codepage byte values
# to the same numbers in Unicode for roundtripping. It may be
# industry practice to keep the encoding name in such a case, too
# (example: Windows codepages).
#
# The aliases listed in the list of character sets
# that is maintained by the IANA (http://www.iana.org/) must
# not be changed to mean encodings different from what this
# list shows. Currently, the IANA list is at
# http://www.iana.org/assignments/character-sets
# It should also be mentioned that the exact mapping table used for each
# IANA name usually isn't specified. This means that some other applications
# and operating systems are left to interpret the exact mappings for the
# underspecified aliases. For instance, Shift-JIS on a Solaris platform
# may be different from Shift-JIS on a Windows platform. This is why
# some of the aliases can be tagged to differentiate different mapping
# tables with the same alias. If an alias is given to more than one converter,
# it is considered to be an ambiguous alias, and the affinity list will
# choose the converter to use when a standard isn't specified with the alias.
#
# Name matching is case-insensitive. Also, dashes '-', underscores '_'
# and spaces ' ' are ignored in names (thus cs-iso_latin-1, csisolatin1
# and "cs iso latin 1" are the same).
# However, the names in the left column are directly file names
# or names of algorithmic converters, and their case must not
# be changed - or else code and/or file names must also be changed.
# For example, the converter ibm-921 is expected to be the file ibm-921.cnv.
#
# The immediately following list is the affinity list of supported standard tags.
# When multiple converters have the same alias under different standards,
# the standard nearest to the top of this list with that alias will
# be the first converter that will be opened. The ordering of the aliases
# after this affinity list does not affect the preferred alias, but it may
# affect the order of the returned list of aliases for a given converter.
#
# The general ordering is from specific and frequently used to more general
# or rarely used at the bottom.
{ UTR22 # Name format specified by http://www.unicode.org/unicode/reports/tr22/
  HTML  # WHATWG's encoding spec; https://encoding.spec.whatwg.org
  IANA  # Source: http://www.iana.org/assignments/character-sets
  MIME  # Source: http://www.iana.org/assignments/character-sets
}
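#
# As a hypothetical illustration of how this ordering resolves an ambiguous
# alias (conv-a, conv-b, and shared-name are made-up names that are not in
# this table, and the lines are commented out so that they are not compiled):
#
#     conv-a { UTR22* } shared-name { IANA }
#     conv-b { UTR22* } shared-name { MIME }
#
# Because IANA is listed above MIME in the affinity list, opening shared-name
# without specifying a standard would be expected to select conv-a.
#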
UTF-8 { MIME* HTML* }
utf-16be { MIME* HTML* }
utf-16le { MIME* HTML* }

# Keep UTF-32 entries for now until we sort out Blink's behavior when
UTF-32 { IANA* MIME* } ISO-10646-UCS-4 { IANA }
UTF-32BE { IANA* } UTF32_BigEndian
UTF-32LE { IANA* } UTF32_LittleEndian
IBM866 { MIME* HTML* }
ISO-8859-2 { MIME* HTML* }
ISO-8859-3 { MIME* HTML* }
ISO-8859-4 { MIME* HTML* }
ISO-8859-5 { MIME* HTML* }
ISO-8859-6 { MIME* HTML* }
ISO-8859-7 { MIME* HTML* }
ISO-8859-8 { MIME* HTML* }
# adding this one leads to a failure in encoding-labels.html
# This alias has to be dealt with by TextCodecICU unless
# multiple encodings can share a single mapping table.
#ISO-8859-8-I { MIME* HTML* }
ISO-8859-10 { MIME* HTML* }
ISO-8859-13 { MIME* HTML* }
ISO-8859-14 { MIME* HTML* }
ISO-8859-15 { MIME* HTML* }
ISO-8859-16 { MIME* HTML* }
KOI8-R { MIME* HTML* }
KOI8-U { MIME* HTML* }
macintosh { MIME* HTML* }
windows-874 { MIME* HTML* }
windows-1250 { MIME* HTML* }
windows-1251 { MIME* HTML* }
windows-1252 { MIME* HTML* }
windows-1253 { MIME* HTML* }
windows-1254 { MIME* HTML* }
windows-1255 { MIME* HTML* }
windows-1256 { MIME* HTML* }
windows-1257 { MIME* HTML* }
windows-1258 { MIME* HTML* }
x-mac-cyrillic { MIME* HTML* }

# Chrome: Added 4 GB2312 aliases and EUC-CN to Windows-936 to reflect the
# reality of the web (GB2312 is treated synonymously with its
# superset, Windows-936/GBK)
# HTML5 makes GBK an alias for GB18030
# TODO(jshin): Decide if Chrome should follow spec. crbug.com/339862

# GB 18030 is partly algorithmic, using the MBCS converter
gb18030 { IANA* } gb18030 { MIME* } ibm-1392 windows-54936

# Chrome: WHATWG encoding spec has big5-hkscs as an alias for big5
# TODO(jshin): Decide if Chrome should follow spec. crbug.com/277040
ibm-1375_P100-2007 { UTR22* } # Big5-HKSCS-2004 with Unicode 3.1 mappings. This uses supplementary characters.
    Big5-HKSCS { MIME* IANA* }
    HKSCS-BIG5 # From http://www.openi18n.org/localenameguide/

EUC-JP { MIME* HTML* }
ISO_2022,locale=ja,version=0
    ISO-2022-JP { MIME* HTML* }
Shift_JIS { MIME* HTML* }
EUC-KR { MIME* HTML* }

# We need to keep these aliases so that documents labelled with them
# are converted to a single U+FFFD instead of being rendered as gibberish.
ISO-2022-KR { HTML* MIME* } csISO2022KR { IANA }
ISO-2022-CN { IANA* HTML* } csISO2022CN x-ISO-2022-CN-GB
ISO-2022-CN-EXT { IANA* HTML* }
HZ-GB-2312 { HTML* IANA* } HZ