src/preproc/preconv/preconv.man

   1 .TH PRECONV @MAN1EXT@ "@MDATE@" "Groff Version @VERSION@"
   2 .SH NAME
   3 preconv \- convert encoding of input files to something GNU troff understands
   4 .
   5 .
   6 .\" license (copying)
   7 .de co
   8 Copyright \[co] 2006-2014 Free Software Foundation, Inc.
   9
  10 Permission is granted to make and distribute verbatim copies of
  11 this manual provided the copyright notice and this permission notice
  12 are preserved on all copies.
  13
  14 Permission is granted to copy and distribute modified versions of this
  15 manual under the conditions for verbatim copying, provided that the
  16 entire resulting derived work is distributed under the terms of a
  17 permission notice identical to this one.
  18
  19 Permission is granted to copy and distribute translations of this
  20 manual into another language, under the above conditions for modified
  21 versions, except that this permission notice may be included in
  22 translations approved by the Free Software Foundation instead of in
  23 the original English.
  24 ..
  25 .
  26 .\" --------------------------------------------------------------------
  27 .SH SYNOPSIS
  28 .\" --------------------------------------------------------------------
  29 .
  30 .SY preconv
  31 .OP \-dr
  32 .OP \-e encoding
  33 .RI [ files
  34 .IR .\|.\|. ]
  35 .
  36 .SY preconv
  37 .B \-h
  38 |
  39 .B \-\-help
  40 .
  41 .SY preconv
  42 .B \-v
  43 |
  44 .B \-\-version
  45 .YS
  46 .
  47 .
  48 .PP
   49 Whitespace is permitted between the
   50 .B \-e
   51 command-line option and its parameter.
  52 .
  53 .
  54 .\" --------------------------------------------------------------------
  55 .SH DESCRIPTION
  56 .\" --------------------------------------------------------------------
  57 .
  58 .B preconv
  59 reads
  60 .I files
   61 and converts their encoding(s) to a form GNU
  62 .BR troff (@MAN1EXT@)
  63 can process, sending the data to standard output.
  64 .
  65 Currently, this means ASCII characters and \[oq]\e[uXXXX]\[cq]
  66 entities, where \[oq]XXXX\[cq] is a hexadecimal number with four to
   67 six digits, representing a Unicode code point.
  68 .
  69 Normally,
  70 .B preconv
   71 should be invoked with the
   72 .B \-k
   73 or
   74 .B \-K
   75 option of
  76 .BR groff .
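     .
     .PP
     As a minimal sketch, assuming a UTF-8 encoded input file with the
     hypothetical name
     .I page.man
     and the
     .B utf8
     output device, the two invocations below should give the same result;
     the first lets
     .B groff
     run
     .B preconv
     itself, the second runs it by hand.
     .
     .RS
     .PP
     .EX
     groff \-Kutf-8 \-man \-Tutf8 page.man
     preconv \-e utf-8 page.man | groff \-man \-Tutf8
     .EE
     .RE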
  77 .
  78 .
  79 .\" --------------------------------------------------------------------
  80 .SH OPTIONS
  81 .\" --------------------------------------------------------------------
  82 .
  83 .TP
  84 .B \-d
   85 Emit debugging messages to standard error (mainly the encoding used).
  86 .
  87 .TP
  88 .BI \-D encoding
   89 Specify the default encoding to use if all other methods fail (see below).
  90 .
  91 .TP
  92 .BI \-e encoding
  93 Specify input encoding explicitly, overriding all other methods.
  94 .
  95 This corresponds to
  96 .BR groff \[aq]s
  97 .BI \-K encoding
  98 option.
  99 .
 100 Without this switch,
 101 .B preconv
 102 uses the algorithm described below to select the input encoding.
 103 .
 104 .TP
 105 .B \-\-help
 106 .TQ
 107 .B \-h
 108 Print help message.
 109 .
 110 .TP
 111 .B \-r
 112 Do not add \&.lf requests.
 113 .
 114 .TP
 115 .B \-\-version
 116 .TQ
 117 .B \-v
 118 Print version number.
 119 .
 120 .
 121 .\" --------------------------------------------------------------------
 122 .SH USAGE
 123 .\" --------------------------------------------------------------------
 124 .
 125 .B preconv
 126 tries to find the input encoding with the following algorithm.
 127 .
 128 .IP 1.
 129 If the input encoding has been explicitly specified with option
 130 .BR \-e ,
 131 use it.
 132 .
 133 .IP 2.
 134 Otherwise, check whether the input starts with a
 135 .I Byte Order Mark
 136 (BOM, see below).
 137 .
 138 If found, use it.
 139 .
 140 .IP 3.
  141 Otherwise, check whether there is a known
 142 .I coding tag
 143 (see below) in either the first or second input line.
 144 .
 145 If found, use it.
 146 .
 147 .IP 4.
 148 If everything fails, use a default encoding as given with option
 149 .BR \-D ,
 150 by the current locale, or \[oq]latin1\[cq] if the locale is set to
 151 \[oq]C\[cq], \[oq]POSIX\[cq], or empty (in that order).
 152 .
 153 .
 154 .PP
 155 Note that the
 156 .B groff
 157 program supports a
 158 .B GROFF_ENCODING
 159 environment variable which is eventually expanded to option
 160 .BR \-k .
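     .
     .PP
     For illustration, assume a hypothetical input file
     .I page.man
     that has neither a BOM nor a coding tag.
     .
     The first command below supplies a fallback encoding for step 4 and,
     because of
     .BR \-d ,
     should report the encoding actually selected on standard error.
     .
     The second is a sketch that assumes
     .B groff
     hands the value of
     .B GROFF_ENCODING
     on to
     .BR preconv .
     .
     .RS
     .PP
     .EX
     preconv \-d \-Dlatin1 page.man > /dev/null
     GROFF_ENCODING=utf-8 groff \-man \-Tutf8 page.man
     .EE
     .RE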
 161 .
 162 .
 163 .\" --------------------------------------------------------------------
 164 .SS "Byte Order Mark"
 165 .\" --------------------------------------------------------------------
 166 .
 167 The Unicode Standard defines character U+FEFF as the Byte Order Mark
 168 (BOM).
 169 .
  170 On the other hand, value U+FFFE is guaranteed not to be a Unicode character at
  171 all.
  172 .
  173 This makes it possible to detect the byte order within the data stream (either
  174 big-endian or little-endian), and the MIME encodings \%\[oq]UTF-16\[cq]
 175 and \%\[oq]UTF-32\[cq] mandate that the data stream starts with U+FEFF.
 176 .
 177 Similarly, the data stream encoded as \%\[oq]UTF-8\[cq] might start
 178 with a BOM (to ease the conversion from and to \%UTF-16 and \%UTF-32).
 179 .
 180 In all cases, the byte order mark is
 181 .I not
 182 part of the data but part of the encoding protocol; in other words,
 183 .BR preconv \[aq]s
 184 output doesn\[aq]t contain it.
 185 .
 186 .
 187 .PP
  188 Note that a U+FEFF not at the start of the input data is actually
  189 emitted; it then has the meaning of a \[oq]zero width no-break
 190 space\[cq] character \[en] something not needed normally in
 191 .BR groff .
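     .
     .PP
     To check by hand whether a file starts with a BOM, its first bytes can
     be inspected, for example with the POSIX
     .B od
     utility (the file name
     .I input.txt
     is hypothetical).
     .
     A BOM appears as the bytes EF BB BF in \%UTF-8, as FE FF in big-endian
     \%UTF-16, and as FF FE in little-endian \%UTF-16.
     .
     .RS
     .PP
     .EX
     od \-An \-tx1 \-N3 input.txt
     .EE
     .RE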
 192 .
 193 .
 194 .\" --------------------------------------------------------------------
 195 .SS "Coding Tags"
 196 .\" --------------------------------------------------------------------
 197 .
 198 Editors which support more than a single character encoding need tags
 199 within the input files to mark the file\[aq]s encoding.
 200 .
  201 While it is possible to guess the right input encoding with the help of
  202 heuristic algorithms for data containing a substantial amount of natural-language
  203 text, it is still just a guess.
 204 .
 205 Additionally, all algorithms fail easily for input which is either too short
 206 or doesn\[aq]t represent a natural language.
 207 .
 208 .
 209 .PP
 210 For these reasons,
 211 .B preconv
 212 supports the coding tag convention (with some restrictions) as used by
 213 .B "GNU Emacs"
 214 and
 215 .B XEmacs
 216 (and probably other programs too).
 217 .
 218 .
 219 .PP
 220 Coding tags in
 221 .B "GNU Emacs"
 222 and
 223 .B XEmacs
 224 are stored in so-called
 225 .IR "File Variables" .
 226 .
 227 .B preconv
 228 recognizes the following syntax form which must be put into a troff comment
 229 in the first or second line.
 230 .
 231 .RS
 232 .PP
 233 \-*\-
 234 .IR tag1 :
 235 .IR value1 ;
 236 .IR tag2 :
 237 .IR value2 ;
 238 \&.\|.\|.\& \-*\-
 239 .RE
 240 .
 241 .
 242 .PP
 243 The only relevant tag for
 244 .B preconv
 245 is \[oq]coding\[cq] which can take the values listed below.
 246 .
  247 Here is an example line which tells
  248 .B Emacs
  249 to edit a file in troff mode, and to use \%latin-2 as its encoding.
 250 .
 251 .RS
 252 .PP
 253 .EX
 254 \&.\[rs]" \-*\- mode: troff; coding: latin-2 \-*\-\""
 255 .EE
 256 .RE
 257 .
 258 .
 259 .PP
 260 The following list gives all MIME coding tags (either lowercase or
 261 uppercase) supported by
 262 .BR preconv ;
 263 this list is hard-coded in the source.
 264 .
 265 .RS
 266 .PP
 267 .ad l
 268 \%big5, \%cp1047, \%euc-jp, \%euc-kr, \%gb2312, \%iso-8859-1,
 269 \%iso-8859-2, \%iso-8859-5, \%iso-8859-7, \%iso-8859-9, \%iso-8859-13,
 270 \%iso-8859-15, \%koi8-r, \%us-ascii, \%utf-8, \%utf-16, \%utf-16be,
 271 \%utf-16le
 272 .ad
 273 .RE
 274 .
 275 .
 276 .PP
  277 In addition, the following hard-coded list of other tags is recognized;
  278 each eventually maps to a value from the list above.
 279 .
 280 .RS
 281 .PP
 282 .ad l
 283 \%ascii, \%chinese-big5, \%chinese-euc, \%chinese-iso-8bit, \%cn-big5,
  284 \%cn-gb, \%cn-gb-2312, \%cp878, \%csascii, \%csisolatin1,
 285 \%cyrillic-iso-8bit, \%cyrillic-koi8, \%euc-china, \%euc-cn,
 286 \%euc-japan, \%euc-japan-1990, \%euc-korea, \%greek-iso-8bit,
 287 \%iso-10646/utf8, \%iso-10646/utf-8, \%iso-latin-1, \%iso-latin-2,
 288 \%iso-latin-5, \%iso-latin-7, \%iso-latin-9, \%japanese-euc,
 289 \%japanese-iso-8bit, \%jis8, \%koi8, \%korean-euc, \%korean-iso-8bit,
 290 \%latin-0, \%latin1, \%latin-1, \%latin-2, \%latin-5, \%latin-7,
 291 \%latin-9, \%mule-utf-8, \%mule-utf-16, \%mule-utf-16be,
 292 \%mule-utf-16-be, \%mule-utf-16be-with-signature, \%mule-utf-16le,
 293 \%mule-utf-16-le, \%mule-utf-16le-with-signature, \%utf8, \%utf-16-be,
 294 \%utf-16-be-with-signature, \%utf-16be-with-signature, \%utf-16-le,
 295 \%utf-16-le-with-signature, \%utf-16le-with-signature
 296 .ad
 297 .RE
 298 .
 299 .
 300 .PP
 301 Those tags are taken from
 302 .B "GNU Emacs"
 303 and
 304 .BR XEmacs ,
 305 together with some aliases.
 306 .
 307 Trailing \%\[oq]-dos\[cq], \%\[oq]-unix\[cq], and \%\[oq]-mac\[cq]
 308 suffixes of coding tags (which give the end-of-line convention used in
 309 the file) are stripped off before the comparison with the above tags
 310 happens.
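     .
     .PP
     For example, assuming the suffix stripping described above, a first
     line such as the following would be treated as if it specified
     \%utf-8.
     .
     .RS
     .PP
     .EX
     \&.\[rs]" \-*\- coding: utf-8-unix \-*\-
     .EE
     .RE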
 311 .
 312 .SS "Iconv Issues"
 313 .B preconv
 314 by itself only supports three encodings: \%latin-1, cp1047, and \%UTF-8;
 315 all other encodings are passed to the
 316 .B iconv
 317 library functions.
 318 .
  319 At compile time, the build searches for and validates an
 320 .B iconv
 321 implementation; a call to \[oq]preconv \-\-version\[cq] shows whether
 322 .B iconv
 323 is used.
 324 .
 325 .
 326 .\" --------------------------------------------------------------------
 327 .SH BUGS
 328 .\" --------------------------------------------------------------------
 329 .
 330 .B preconv
 331 doesn\[aq]t support
 332 .I "local variable lists"
 333 yet.
 334 .
  335 This is a different syntax form for specifying local variables at the end of a
 336 file.
 337 .
 338 .
 339 .\" --------------------------------------------------------------------
 340 .SH "SEE ALSO"
 341 .\" --------------------------------------------------------------------
 342 .
 343 .BR groff (@MAN1EXT@)
 344 .br
 345 the
 346 .B "GNU Emacs"
 347 and
 348 .B XEmacs
 349 info pages
 350 .
 351 .
 352 .\" --------------------------------------------------------------------
 353 .SH COPYING
 354 .\" --------------------------------------------------------------------
 355 .co
 356 .
 357 .
 358 .\" Emacs setting
 359 .\" Local Variables:
 360 .\" mode: nroff
 361 .\" End: