src/preproc/preconv/preconv.1.man

   1 .TH PRECONV @MAN1EXT@ "@MDATE@" "groff @VERSION@"
   2 .SH NAME
   3 preconv \- convert encoding of input files to something GNU troff \
   4 understands
   5 .
   6 .
   7 .\" Save and disable compatibility mode (for, e.g., Solaris 10/11).
   8 .do nr preconv_C \n[.C]
   9 .cp 0
  10 .
  11 .
  12 .\" ====================================================================
  13 .\" Legal Terms
  14 .\" ====================================================================
  15 .\"
  16 .\" Copyright (C) 2006-2018 Free Software Foundation, Inc.
  17 .\"
  18 .\" Permission is granted to make and distribute verbatim copies of this
  19 .\" manual provided the copyright notice and this permission notice are
  20 .\" preserved on all copies.
  21 .\"
  22 .\" Permission is granted to copy and distribute modified versions of
  23 .\" this manual under the conditions for verbatim copying, provided that
  24 .\" the entire resulting derived work is distributed under the terms of
  25 .\" a permission notice identical to this one.
  26 .\"
  27 .\" Permission is granted to copy and distribute translations of this
  28 .\" manual into another language, under the above conditions for
  29 .\" modified versions, except that this permission notice may be
  30 .\" included in translations approved by the Free Software Foundation
  31 .\" instead of in the original English.
  32 .
  33 .
  34 .\" ====================================================================
  35 .SH SYNOPSIS
  36 .\" ====================================================================
  37 .
  38 .SY preconv
  39 .OP \-dr
  40 .OP \-D default_encoding
  41 .OP \-e encoding
  42 .RI [ file
  43 \&.\|.\|.\&]
  44 .
  45 .SY preconv
  46 .B \-h
  47 .SY preconv
  48 .B \-\-help
  49 .YS
  50 .
  51 .SY preconv
  52 .B \-v
  53 .SY preconv
  54 .B \-\-version
  55 .YS
  56 .
  57 .
  58 .\" ====================================================================
  59 .SH DESCRIPTION
  60 .\" ====================================================================
  61 .
  62 .B preconv
  63 reads
  64 .I files
  65 and converts their encoding(s) to a form GNU
  66 .BR troff (@MAN1EXT@)
  67 can process, sending the data to standard output.
  68 .
  69 Currently, this means ASCII characters and \[oq]\e[uXXXX]\[cq]
  70 entities, where \[oq]XXXX\[cq] is a hexadecimal number with four to
  71 six digits, representing a Unicode input code.
  72 .
  73 Normally,
  74 .B preconv
  75 should be invoked with the
  76 .B \-k
  77 and
  78 .B \-K
  79 options of
  80 .BR groff .
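.
.
.PP
As a rough illustration of the output form described above,
one would expect an input word such as \[oq]caf\['e]\[cq],
in which \[oq]\['e]\[cq] is the character U+00E9,
to be emitted approximately as follows;
the word itself is only an assumed example.
.
.RS
.PP
.EX
caf\e[u00E9]
.EE
.RE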
  81 .
  82 .
  83 .\" ====================================================================
  84 .SH OPTIONS
  85 .\" ====================================================================
  86 .
  87 Whitespace is permitted between a command-line option and its argument.
  88 .
  89 .
  90 .TP
  91 .B \-d
  92 Emit debugging messages to standard error (mainly the encoding used).
  93 .
  94 .TP
  95 .BI \-D encoding
  96 Specify the default encoding to use if all detection methods fail (see below).
  97 .
  98 .TP
  99 .BI \-e encoding
 100 Specify input encoding explicitly, overriding all other methods.
 101 .
 102 This corresponds to
 103 .BR groff 's
 104 .BI \-K encoding
 105 option.
 106 .
 107 Without this switch,
 108 .B preconv
 109 uses the algorithm described below to select the input encoding.
 110 .
 111 .TP
 112 .B \-\-help
 113 .TQ
 114 .B \-h
 115 Print a help message and exit.
 116 .
 117 .TP
 118 .B \-r
 119 Do not add \&.lf requests.
 120 .
 121 .TP
 122 .B \-\-version
 123 .TQ
 124 .B \-v
 125 Print the version number and exit.
 126 .
 127 .
 128 .\" ====================================================================
 129 .SH USAGE
 130 .\" ====================================================================
 131 .
 132 .B preconv
 133 tries to find the input encoding with the following algorithm.
 134 .
 135 .IP 1.
 136 If the input encoding has been explicitly specified with option
 137 .BR \-e ,
 138 use it.
 139 .
 140 .IP 2.
 141 Otherwise, check whether the input starts with a
 142 .I Byte Order Mark
 143 (BOM, see below).
 144 .
 145 If found, use it.
 146 .
 147 .IP 3.
 148 Otherwise, check whether there is a known
 149 .I coding tag
 150 (see below) in either the first or second input line.
 151 .
 152 If found, use it.
 153 .
 154 .IP 4.
 155 Otherwise, if the
 156 .B uchardet
 157 library
 158 (an encoding detector library available on most major distributions)
 159 is available on the system, use it to try to detect the encoding of the file.
 160 .
 161 .IP 5.
 162 If everything fails, use a default encoding as given with option
 163 .BR \-D ,
 164 by the current locale, or \[oq]latin1\[cq] if the locale is set to
 165 \[oq]C\[cq], \[oq]POSIX\[cq], or empty (in that order).
 166 .
 167 .
 168 .PP
 169 Note that the
 170 .B groff
 171 program supports a
 172 .I \%GROFF_ENCODING
 173 environment variable which is eventually expanded to option
 174 .BR \-k .
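.
.
.PP
As a sketch of how these options fit together,
assume a hypothetical input file
.I page.man
encoded in \%latin1.
.
The encoding could then be given explicitly,
either to
.B preconv
itself or through
.BR groff ;
the third command merely asks
.B preconv
to report on standard error which encoding it selected.
.
.RS
.PP
.EX
preconv \-e latin1 page.man | groff \-man \-Tutf8
groff \-K latin1 \-man \-Tutf8 page.man
preconv \-d page.man > /dev/null
.EE
.RE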
 175 .
 176 .
 177 .\" ====================================================================
 178 .SS "Byte Order Mark"
 179 .\" ====================================================================
 180 .
 181 The Unicode Standard defines character U+FEFF as the Byte Order Mark
 182 (BOM).
 183 .
 184 On the other hand, value U+FFFE is guaranteed not to be a Unicode character at
 185 all.
 186 .
 187 This allows detection of the byte order within the data stream (either
 188 big-endian or little-endian), and the MIME encodings \%\[oq]UTF-16\[cq]
 189 and \%\[oq]UTF-32\[cq] mandate that the data stream starts with U+FEFF.
 190 .
 191 Similarly, a data stream encoded as \%\[oq]UTF-8\[cq] might start
 192 with a BOM (to ease the conversion from and to \%UTF-16 and \%UTF-32).
 193 .
 194 In all cases, the byte order mark is
 195 .I not
 196 part of the data but part of the encoding protocol; in other words,
 197 .BR preconv 's
 198 output doesn't contain it.
 199 .
 200 .
 201 .PP
 202 Note that a U+FEFF that is not at the start of the input data is still
 203 emitted; it then has the meaning of a \[oq]zero width no-break
 204 space\[cq] character \[en] something not needed normally in
 205 .BR groff .
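.
.
.PP
For reference,
U+FEFF is serialized as the following byte sequences (in hexadecimal)
at the start of a data stream.
.
This table only illustrates the encoding forms;
it is not meant as a description of how
.B preconv
implements its detection.
.
.RS
.PP
.EX
UTF-8      EF BB BF
UTF-16BE   FE FF
UTF-16LE   FF FE
UTF-32BE   00 00 FE FF
UTF-32LE   FF FE 00 00
.EE
.RE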
 206 .
 207 .
 208 .\" ====================================================================
 209 .SS "Coding Tags"
 210 .\" ====================================================================
 211 .
 212 Editors which support more than a single character encoding need tags
 213 within the input files to mark the file's encoding.
 214 .
 215 While heuristic algorithms can guess the right input encoding for data
 216 that contains a sufficiently large amount of natural-language text,
 217 the result is still just a guess.
 218 .
 219 Additionally, such algorithms fail easily for input that is either too short
 220 or doesn't represent a natural language.
 221 .
 222 .
 223 .PP
 224 For these reasons,
 225 .B preconv
 226 supports the coding tag convention (with some restrictions) as used by
 227 .B "GNU Emacs"
 228 and
 229 .B XEmacs
 230 (and probably other programs too).
 231 .
 232 .
 233 .PP
 234 Coding tags in
 235 .B "GNU Emacs"
 236 and
 237 .B XEmacs
 238 are stored in so-called
 239 .IR "File Variables" .
 240 .
 241 .B preconv
 242 recognizes the following syntax form, which must be put into a troff comment
 243 in the first or second line.
 244 .
 245 .RS
 246 .PP
 247 \-*\-
 248 .IR tag1 :
 249 .IR value1 ;
 250 .IR tag2 :
 251 .IR value2 ;
 252 \&.\|.\|.\& \-*\-
 253 .RE
 254 .
 255 .
 256 .PP
 257 The only relevant tag for
 258 .B preconv
 259 is \[oq]coding\[cq] which can take the values listed below.
 260 .
 261 Here is an example line that tells
 262 .B Emacs
 263 to edit a file in troff mode, and to use \%latin-2 as its encoding.
 264 .
 265 .RS
 266 .PP
 267 .EX
 268 \&.\[rs]" \-*\- mode: troff; coding: latin-2 \-*\-
 269 .EE
 270 .RE
 271 .
 272 .
 273 .PP
 274 The following list gives all MIME coding tags (either lowercase or
 275 uppercase) supported by
 276 .BR preconv ;
 277 this list is hard-coded in the source.
 278 .
 279 .RS
 280 .PP
 281 .ad l
 282 \%big5, \%cp1047, \%euc-jp, \%euc-kr, \%gb2312, \%iso-8859-1,
 283 \%iso-8859-2, \%iso-8859-5, \%iso-8859-7, \%iso-8859-9, \%iso-8859-13,
 284 \%iso-8859-15, \%koi8-r, \%us-ascii, \%utf-8, \%utf-16, \%utf-16be,
 285 \%utf-16le
 286 .ad
 287 .RE
 288 .
 289 .
 290 .PP
 291 In addition, the following hard-coded list of other tags is recognized;
 292 each of them eventually maps to a value from the list above.
 293 .
 294 .RS
 295 .PP
 296 .ad l
 297 \%ascii, \%chinese-big5, \%chinese-euc, \%chinese-iso-8bit, \%cn-big5,
 298 \%cn-gb, \%cn-gb-2312, \%cp878, \%csascii, \%csisolatin1,
 299 \%cyrillic-iso-8bit, \%cyrillic-koi8, \%euc-china, \%euc-cn,
 300 \%euc-japan, \%euc-japan-1990, \%euc-korea, \%greek-iso-8bit,
 301 \%iso-10646/utf8, \%iso-10646/utf-8, \%iso-latin-1, \%iso-latin-2,
 302 \%iso-latin-5, \%iso-latin-7, \%iso-latin-9, \%japanese-euc,
 303 \%japanese-iso-8bit, \%jis8, \%koi8, \%korean-euc, \%korean-iso-8bit,
 304 \%latin-0, \%latin1, \%latin-1, \%latin-2, \%latin-5, \%latin-7,
 305 \%latin-9, \%mule-utf-8, \%mule-utf-16, \%mule-utf-16be,
 306 \%mule-utf-16-be, \%mule-utf-16be-with-signature, \%mule-utf-16le,
 307 \%mule-utf-16-le, \%mule-utf-16le-with-signature, \%utf8, \%utf-16-be,
 308 \%utf-16-be-with-signature, \%utf-16be-with-signature, \%utf-16-le,
 309 \%utf-16-le-with-signature, \%utf-16le-with-signature
 310 .ad
 311 .RE
 312 .
 313 .
 314 .PP
 315 Those tags are taken from
 316 .B "GNU Emacs"
 317 and
 318 .BR XEmacs ,
 319 together with some aliases.
 320 .
 321 Trailing \%\[oq]-dos\[cq], \%\[oq]-unix\[cq], and \%\[oq]-mac\[cq]
 322 suffixes of coding tags (which give the end-of-line convention used in
 323 the file) are stripped off before the comparison with the above tags
 324 happens.
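.
.
.PP
As an illustration of the suffix rule,
the following hypothetical comment line should select \%utf-8,
because the \%\[oq]-unix\[cq] suffix is removed before the tag is
compared with the lists above.
.
.RS
.PP
.EX
\&.\[rs]" \-*\- coding: utf-8-unix \-*\-
.EE
.RE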
 325 .
 326 .SS "Iconv Issues"
 327 .B preconv
 328 by itself only supports three encodings: \%latin-1, cp1047, and \%UTF-8;
 329 all other encodings are passed to the
 330 .B iconv
 331 library functions.
 332 .
 333 At compile time, the build process searches for a valid
 334 .B iconv
 335 implementation; a call to \[oq]preconv \-\-version\[cq] shows whether
 336 .B iconv
 337 is used.
 338 .
 339 .
 340 .\" ====================================================================
 341 .SH BUGS
 342 .\" ====================================================================
 343 .
 344 .B preconv
 345 doesn't support
 346 .I "local variable lists"
 347 yet.
 348 .
 349 This is a different syntax form to specify local variables at the end of a
 350 file.
 351 .
 352 .
 353 .\" ====================================================================
 354 .SH "SEE ALSO"
 355 .\" ====================================================================
 356 .
 357 .BR groff (@MAN1EXT@)
 358 .br
 359 the
 360 .B "GNU Emacs"
 361 and
 362 .B XEmacs
 363 info pages
 364 .
 365 .
 366 .\" Restore compatibility mode (for, e.g., Solaris 10/11).
 367 .cp \n[preconv_C]
 368 .
 369 .
 370 .\" Emacs setting
 371 .\" Local Variables:
 372 .\" mode: nroff
 373 .\" End:
 374 .\" vim: set filetype=groff: