.TH PRECONV @MAN1EXT@ "@MDATE@" "groff @VERSION@"
preconv \- convert encoding of input files to something GNU troff \
.\" Save and disable compatibility mode (for, e.g., Solaris 10/11).
.do nr preconv_C \n[.C]
.\" ====================================================================
.\" ====================================================================
.\" Copyright (C) 2006-2018 Free Software Foundation, Inc.
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\" Permission is granted to copy and distribute modified versions of
.\" this manual under the conditions for verbatim copying, provided that
.\" the entire resulting derived work is distributed under the terms of
.\" a permission notice identical to this one.
.\" Permission is granted to copy and distribute translations of this
.\" manual into another language, under the above conditions for
.\" modified versions, except that this permission notice may be
.\" included in translations approved by the Free Software Foundation
.\" instead of in the original English.
.\" ====================================================================
.\" ====================================================================
.OP \-D default_encoding
.\" ====================================================================
.\" ====================================================================
and converts its encoding(s) to a form GNU
can process, sending the data to standard output.
Currently, this means ASCII characters and \[oq]\e[uXXXX]\[cq]
entities, where \[oq]XXXX\[cq] is a hexadecimal number with four to
six digits, representing a Unicode input code.
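.PP
As a rough illustration of this output form, the following Python sketch
turns an already decoded string into ASCII characters plus
\[oq]\e[uXXXX]\[cq] entities.
The function name to_groff_entities is hypothetical, and the sketch assumes
uppercase hexadecimal digits padded to at least four places;
it is not the code that preconv itself uses.

```python
def to_groff_entities(text: str) -> str:
    """Return text with every non-ASCII character written as \\[uXXXX].

    Hypothetical helper: the name and the uppercase, zero-padded
    formatting are assumptions made for illustration only.
    """
    pieces = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 0x80:
            # Plain ASCII passes through unchanged.
            pieces.append(ch)
        else:
            # At least four hex digits; code points above U+FFFF
            # naturally yield five or six digits.
            pieces.append("\\[u%04X]" % code_point)
    return "".join(pieces)


if __name__ == "__main__":
    print(to_groff_entities("na\u00efve caf\u00e9"))
```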
should be invoked with the
.\" ====================================================================
.\" ====================================================================
Whitespace is permitted between a command-line option and its argument.
Emit debugging messages to standard error (mainly the encoding used).
Specify the default encoding to use if all other methods fail (see below).
Specify input encoding explicitly, overriding all other methods.
uses the algorithm described below to select the input encoding.
Print a help message and exit.
Do not add \&.lf requests.
Print the version number and exit.
.\" ====================================================================
.\" ====================================================================
tries to find the input encoding with the following algorithm.
If the input encoding has been explicitly specified with option
Otherwise, check whether the input starts with a
Otherwise, check whether there is a known
(see below) in either the first or second input line.
(an encoding detector library available on most major distributions)
is available on the system, use it to try to detect the encoding of the file.
If everything fails, use a default encoding as given with option
by the current locale, or \[oq]latin1\[cq] if the locale is set to
\[oq]C\[cq], \[oq]POSIX\[cq], or empty (in that order).
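.PP
The order of the steps above can be sketched in Python as a simple
first-match selection.
The function choose_encoding and all of its parameter names are
hypothetical; each argument stands for the result of one step
(None if that step found nothing), and locale_encoding is assumed to be
None when the locale is \[oq]C\[cq], \[oq]POSIX\[cq], or empty.

```python
def choose_encoding(explicit=None, bom=None, coding_tag=None,
                    detected=None, default=None, locale_encoding=None):
    """Pick an input encoding in the priority order described above.

    Hypothetical sketch: every parameter is the outcome of one step,
    or None when that step produced no answer.
    """
    # 1. explicit option, 2. byte order mark, 3. coding tag,
    # 4. detector library, 5. default encoding option.
    for candidate in (explicit, bom, coding_tag, detected, default):
        if candidate:
            return candidate
    # 6. Encoding of the current locale, if it is a real locale.
    if locale_encoding:
        return locale_encoding
    # 7. Last resort.
    return "latin1"


if __name__ == "__main__":
    print(choose_encoding(coding_tag="latin-2", default="koi8-r"))
```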
environment variable which is eventually expanded to option
.\" ====================================================================
.SS "Byte Order Mark"
.\" ====================================================================
The Unicode Standard defines character U+FEFF as the Byte Order Mark
On the other hand, value U+FFFE is guaranteed not to be a Unicode character at
This allows detection of the byte order within the data stream (either
big-endian or little-endian), and the MIME encodings \%\[oq]UTF-16\[cq]
and \%\[oq]UTF-32\[cq] mandate that the data stream starts with U+FEFF.
Similarly, a data stream encoded as \%\[oq]UTF-8\[cq] might start
with a BOM (to ease the conversion from and to \%UTF-16 and \%UTF-32).
In all cases, the byte order mark is
part of the data but part of the encoding protocol; in other words,
output doesn't contain it.
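.PP
A minimal Python sketch of such a check follows.
It recognizes only the \%UTF-8, \%UTF-16, and \%UTF-32 byte order marks and
returns the data with the mark removed.
The function name split_bom and the returned encoding names are
assumptions for illustration.
The 4-byte \%UTF-32 marks are tested before the 2-byte \%UTF-16 marks
because the little-endian \%UTF-32 mark begins with the same two bytes as
the little-endian \%UTF-16 mark.

```python
import codecs

# Longest marks first, so UTF-32LE is not mistaken for UTF-16LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def split_bom(data: bytes):
    """Return (encoding_name, payload) with a leading BOM stripped.

    Hypothetical helper; returns (None, data) if no BOM is present.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


if __name__ == "__main__":
    print(split_bom(b"\xef\xbb\xbf.TH FOO 1"))
```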
Note that a U+FEFF that does not appear at the start of the input data is
emitted; it then has the meaning of a \[oq]zero width no-break
space\[cq] character \[en] something not normally needed in
.\" ====================================================================
.\" ====================================================================
Editors which support more than a single character encoding need tags
within the input files to mark the file's encoding.
While it is possible to guess the right input encoding with the help of
heuristic algorithms for data that represents a substantial amount of
natural-language text, it is still just a guess.
Additionally, all such algorithms fail easily for input that is either too short
or doesn't represent a natural language.
supports the coding tag convention (with some restrictions) as used by
(and probably other programs too).
are stored in so-called
.IR "File Variables" .
recognizes the following syntax form which must be put into a troff comment
in the first or second line.
The only relevant tag for
is \[oq]coding\[cq] which can take the values listed below.
Here is an example line that tells
to edit a file in troff mode, and to use \%latin-2 as its encoding.
\&.\[rs]" \-*\- mode: troff; coding: latin-2 \-*\-
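.PP
Locating such a tag can be sketched in Python as follows.
The function name find_coding_tag is hypothetical, and the sketch assumes
that the tag line begins with one of the two troff comment introducers;
it shows the general idea, not the exact parsing rules applied by preconv.

```python
import re

# Text between the two "-*-" markers of a file-variables line.
_FILE_VARS = re.compile(r"-\*-(.*?)-\*-")

# Assumed comment introducers: .\" and '\"
_COMMENT_STARTS = (".\\\"", "'\\\"")


def find_coding_tag(text: str):
    """Return the lowercased 'coding' value from line 1 or 2, else None.

    Hypothetical helper written for illustration.
    """
    for line in text.splitlines()[:2]:
        if not line.startswith(_COMMENT_STARTS):
            continue
        match = _FILE_VARS.search(line)
        if not match:
            continue
        # Variables are separated by ';', each has the form 'name: value'.
        for field in match.group(1).split(";"):
            name, sep, value = field.partition(":")
            if sep and name.strip().lower() == "coding":
                return value.strip().lower()
    return None


if __name__ == "__main__":
    print(find_coding_tag('.\\" -*- mode: troff; coding: latin-2 -*-\n.TH FOO 1\n'))
```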
The following list gives all MIME coding tags (either lowercase or
uppercase) supported by
this list is hard-coded in the source.
\%big5, \%cp1047, \%euc-jp, \%euc-kr, \%gb2312, \%iso-8859-1,
\%iso-8859-2, \%iso-8859-5, \%iso-8859-7, \%iso-8859-9, \%iso-8859-13,
\%iso-8859-15, \%koi8-r, \%us-ascii, \%utf-8, \%utf-16, \%utf-16be,
In addition, the following hard-coded list of other tags is recognized;
each eventually maps to a value from the list above.
\%ascii, \%chinese-big5, \%chinese-euc, \%chinese-iso-8bit, \%cn-big5,
\%cn-gb, \%cn-gb-2312, \%cp878, \%csascii, \%csisolatin1,
\%cyrillic-iso-8bit, \%cyrillic-koi8, \%euc-china, \%euc-cn,
\%euc-japan, \%euc-japan-1990, \%euc-korea, \%greek-iso-8bit,
\%iso-10646/utf8, \%iso-10646/utf-8, \%iso-latin-1, \%iso-latin-2,
\%iso-latin-5, \%iso-latin-7, \%iso-latin-9, \%japanese-euc,
\%japanese-iso-8bit, \%jis8, \%koi8, \%korean-euc, \%korean-iso-8bit,
\%latin-0, \%latin1, \%latin-1, \%latin-2, \%latin-5, \%latin-7,
\%latin-9, \%mule-utf-8, \%mule-utf-16, \%mule-utf-16be,
\%mule-utf-16-be, \%mule-utf-16be-with-signature, \%mule-utf-16le,
\%mule-utf-16-le, \%mule-utf-16le-with-signature, \%utf8, \%utf-16-be,
\%utf-16-be-with-signature, \%utf-16be-with-signature, \%utf-16-le,
\%utf-16-le-with-signature, \%utf-16le-with-signature
Those tags are taken from
together with some aliases.
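.PP
How a tag might be normalized before lookup can be sketched in Python.
The sketch lowercases the tag, removes one trailing end-of-line suffix
(\%\[oq]-dos\[cq], \%\[oq]-unix\[cq], or \%\[oq]-mac\[cq]) that such tags
may carry, and then resolves aliases.
The function name normalize_tag is hypothetical, and the alias table is
only a small illustrative subset, not the hard-coded list itself.

```python
# Illustrative subset only; the real alias list is much longer.
_ALIASES = {
    "ascii": "us-ascii",
    "latin-1": "iso-8859-1",
    "latin-2": "iso-8859-2",
    "utf8": "utf-8",
}

# End-of-line convention suffixes that carry no encoding information.
_EOL_SUFFIXES = ("-dos", "-unix", "-mac")


def normalize_tag(tag: str) -> str:
    """Return a canonical tag name (hypothetical helper)."""
    tag = tag.strip().lower()
    for suffix in _EOL_SUFFIXES:
        if tag.endswith(suffix):
            # Strip a single trailing suffix, then stop.
            tag = tag[:-len(suffix)]
            break
    # Unknown names are returned unchanged.
    return _ALIASES.get(tag, tag)


if __name__ == "__main__":
    print(normalize_tag("Latin-2-unix"))
```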
Trailing \%\[oq]-dos\[cq], \%\[oq]-unix\[cq], and \%\[oq]-mac\[cq]
suffixes of coding tags (which give the end-of-line convention used in
the file) are stripped off before the comparison with the above tags
by itself only supports three encodings: \%latin-1, cp1047, and \%UTF-8;
all other encodings are passed to the
At compile time, the build searches for a valid
implementation; a call to \[oq]preconv \-\-version\[cq] shows whether
.\" ====================================================================
.\" ====================================================================
.I "local variable lists"
This is a different syntax form to specify local variables at the end of a
.\" ====================================================================
.\" ====================================================================
.BR groff (@MAN1EXT@)
.\" Restore compatibility mode (for, e.g., Solaris 10/11).
.\" vim: set filetype=groff: