manual/message.texi

   1 @node Message Translation, Searching and Sorting, Locales, Top
   2 @c %MENU% How to make the program speak the user's language
   3 @chapter Message Translation
   4
   5 The program's interface with the user should be designed to make the
   6 user's task easier.  One possibility is to present messages in
   7 whatever language the user prefers.
   8
   9 Printing messages in different languages can be implemented in different
  10 ways.  One could add all the different languages in the source code and
  11 choose among the variants every time a message has to be printed.  This is
  12 certainly not a good solution since extending the set of languages is
  13 difficult (the code must be changed) and the code itself can become
  14 really big with dozens of message sets.
  15
  16 A better solution is to keep the message sets for each language
  17 in separate files which are loaded at runtime depending on the language
  18 selection of the user.
  19
  20 The GNU C Library provides two different sets of functions to support
  21 message translation.  The problem is that neither of the interfaces is
  22 officially defined by the POSIX standard.  The @code{catgets} family of
  23 functions is defined in the X/Open standard but this is derived from
  24 industry decisions and therefore not necessarily based on reasonable
  25 decisions.
  26
  27 As mentioned above the message catalog handling provides easy
  28 extendibility by using external data files which contain the message
  29 translations.  I.e., these files contain for each of the messages used
  30 in the program a translation for the appropriate language.  So the tasks
  31 of the message handling functions are
  32
  33 @itemize @bullet
  34 @item
  35 locate the external data file with the appropriate translations.
  36 @item
  37 load the data and make it possible to address the messages
  38 @item
  39 map a given key to the translated message
  40 @end itemize
  41
  42 The two approaches mainly differ in the implementation of this last
  43 step.  The design decisions made for this step influence everything else.
  44
  45 @menu
  46 * Message catalogs a la X/Open::  The @code{catgets} family of functions.
  47 * The Uniforum approach::         The @code{gettext} family of functions.
  48 @end menu
  49
  50
  51 @node Message catalogs a la X/Open
  52 @section X/Open Message Catalog Handling
  53
  54 The @code{catgets} functions are based on the simple scheme:
  55
  56 @quotation
  57 Associate every message to translate in the source code with a unique
  58 identifier.  To retrieve a message from a catalog file solely the
  59 identifier is used.
  60 @end quotation
  61
  62 This means for the author of the program that s/he will have to make
  63 sure the meaning of the identifier in the program code and in the
  64 message catalogs is always the same.
  65
  66 Before a message can be translated the catalog file must be located.
  67 The user of the program must be able to guide the responsible function
  68 to find whatever catalog the user wants.  This is separated from what
  69 the programmer had in mind.
  70
  71 All the types, constants and functions for the @code{catgets} functions
  72 are defined/declared in the @file{nl_types.h} header file.
  73
  74 @menu
  75 * The catgets Functions::      The @code{catgets} function family.
  76 * The message catalog files::  Format of the message catalog files.
  77 * The gencat program::         How to generate message catalogs files which
  78                                 can be used by the functions.
  79 * Common Usage::               How to use the @code{catgets} interface.
  80 @end menu
  81
  82
  83 @node The catgets Functions
  84 @subsection The @code{catgets} function family
  85
  86 @comment nl_types.h
  87 @comment X/Open
  88 @deftypefun nl_catd catopen (const char *@var{cat_name}, int @var{flag})
  89 The @code{catopen} function tries to locate the message data file named
  90 @var{cat_name} and loads it when found.  The return value is of an
  91 opaque type and can be used in calls to the other functions to refer to
  92 this loaded catalog.
  93
  94 The return value is @code{(nl_catd) -1} in case the function failed and
  95 no catalog was loaded.  The global variable @var{errno} contains a code
  96 for the error causing the failure.  But even if the function call
  97 succeeded this does not mean that all messages can be translated.
  98
  99 Locating the catalog file must happen in a way which lets the user of
 100 the program influence the decision.  It is up to the user to decide
 101 about the language to use and sometimes it is useful to use alternate
 102 catalog files.  All this can be specified by the user by setting some
 103 environment variables.
 104
 105 The first problem is to find out where all the message catalogs are
 106 stored.  Every program could have its own place to keep all the
 107 different files but usually the catalog files are grouped by languages
 108 and the catalogs for all programs are kept in the same place.
 109
 110 @cindex NLSPATH environment variable
 111 To tell the @code{catopen} function where the catalog for the program
 112 can be found the user can set the environment variable @code{NLSPATH} to
 113 a value which describes her/his choice.  Since this value must be usable
 114 for different languages and locales it cannot be a simple string.
 115 Instead it is a format string (similar to @code{printf}'s).  An example
 116 is
 117
 118 @smallexample
 119 /usr/share/locale/%L/%N:/usr/share/locale/%L/LC_MESSAGES/%N
 120 @end smallexample
 121
 122 First one can see that more than one directory can be specified (with
 123 the usual syntax of separating them by colons).  The next things to
 124 observe are the format elements, @code{%L} and @code{%N} in this case.
 125 The @code{catopen} function knows about several of them and each is
 126 replaced by a different value.
 127
 128 @table @code
 129 @item %N
 130 This format element is substituted with the name of the catalog file.
 131 This is the value of the @var{cat_name} argument given to
 132 @code{catopen}.
 133
 134 @item %L
 135 This format element is substituted with the name of the currently
 136 selected locale for translating messages.  How this is determined is
 137 explained below.
 138
 139 @item %l
 140 (This is the lowercase ell.) This format element is substituted with the
 141 language element of the locale name.  The string describing the selected
 142 locale is expected to have the form
 143 @code{@var{lang}[_@var{terr}[.@var{codeset}]]} and this format uses the
 144 first part @var{lang}.
 145
 146 @item %t
 147 This format element is substituted by the territory part @var{terr} of
 148 the name of the currently selected locale.  See the explanation of the
 149 format above.
 150
 151 @item %c
 152 This format element is substituted by the codeset part @var{codeset} of
 153 the name of the currently selected locale.  See the explanation of the
 154 format above.
 155
 156 @item %%
 157 Since @code{%} is used as a meta character there must be a way to
 158 express the @code{%} character in the result itself.  Using @code{%%}
 159 does this just like it works for @code{printf}.
 160 @end table
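
The substitution rules in the table above can be sketched in a few lines
of C.  This is only an illustration of the rules, not the code used by
@code{catopen}; the function name @code{expand_nlspath_template} and its
interface are hypothetical, and the locale name is assumed to have the
form @code{@var{lang}[_@var{terr}[.@var{codeset}]]}.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: expand one NLSPATH template TMPL into OUT, which
   has room for OUTLEN bytes.  LOCALE is expected to look like
   lang[_terr[.codeset]].  Returns 0 on success and -1 if the result
   does not fit.  */
int
expand_nlspath_template (const char *tmpl, const char *cat_name,
                         const char *locale, char *out, size_t outlen)
{
  /* Split the locale name into language, territory, and codeset.  */
  size_t lang_len = strcspn (locale, "_.");
  const char *terr = "";
  const char *codeset = "";
  size_t terr_len = 0;
  size_t codeset_len = 0;
  const char *p = locale + lang_len;

  if (outlen == 0)
    return -1;
  if (*p == '_')
    {
      terr = p + 1;
      terr_len = strcspn (terr, ".");
      p = terr + terr_len;
    }
  if (*p == '.')
    {
      codeset = p + 1;
      codeset_len = strlen (codeset);
    }

  size_t used = 0;
  for (; *tmpl != '\0'; ++tmpl)
    {
      /* By default copy the current character unchanged.  */
      const char *piece = tmpl;
      size_t piece_len = 1;

      if (*tmpl == '%' && tmpl[1] != '\0')
        {
          ++tmpl;
          switch (*tmpl)
            {
            case 'N': piece = cat_name; piece_len = strlen (cat_name); break;
            case 'L': piece = locale; piece_len = strlen (locale); break;
            case 'l': piece = locale; piece_len = lang_len; break;
            case 't': piece = terr; piece_len = terr_len; break;
            case 'c': piece = codeset; piece_len = codeset_len; break;
            case '%': piece = "%"; piece_len = 1; break;
            /* Unknown element: keep it as written (an assumption).  */
            default: piece = tmpl - 1; piece_len = 2; break;
            }
        }
      if (used + piece_len >= outlen)
        return -1;
      memcpy (out + used, piece, piece_len);
      used += piece_len;
    }
  out[used] = '\0';
  return 0;
}
```

Called with the template @code{/usr/share/locale/%L/%N}, the catalog
name @code{hello.cat}, and the locale name @code{de_DE.UTF-8}, the
sketch produces @code{/usr/share/locale/de_DE.UTF-8/hello.cat}.
Splitting the colon-separated list into single templates is left to the
caller.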
 161
 162
 163 Using @code{NLSPATH} allows arbitrary directories to be
 164 searched for message catalogs while still allowing different languages
 165 to be used.  If the @code{NLSPATH} environment variable is not set the
 166 default value is
 167
 168 @smallexample
 169 @var{prefix}/share/locale/%L/%N:@var{prefix}/share/locale/%L/LC_MESSAGES/%N
 170 @end smallexample
 171
 172 @noindent
 173 where @var{prefix} is given to @code{configure} while installing the GNU
 174 C Library (this value is in many cases @code{/usr} or the empty string).
 175
 176 The remaining problem is to decide which locale must be used.  Its name
 177 determines the substitution of the format elements mentioned above.
 178 First of all the user can specify a path in the message catalog name
 179 (i.e., the name contains a slash character).  In this situation the
 180 @code{NLSPATH} environment variable is not used.  The catalog must exist
 181 as specified in the program, perhaps relative to the current working
 182 directory.  This situation is not desirable and catalog names should
 183 never be written this way.  Besides, this behaviour is not portable
 184 to all other platforms providing the @code{catgets} interface.
 185
 186 @cindex LC_ALL environment variable
 187 @cindex LC_MESSAGES environment variable
 188 @cindex LANG environment variable
 189 Otherwise the values of environment variables from the standard
 190 environment are examined (@pxref{Standard Environment}).  Which
 191 variables are examined is decided by the @var{flag} parameter of
 192 @code{catopen}.  If the value is @code{NL_CAT_LOCALE} (which is defined
 193 in @file{nl_types.h}) then the @code{catopen} function examines the
 194 environment variables @code{LC_ALL}, @code{LC_MESSAGES}, and @code{LANG}
 195 in this order.  The first variable which is set in the current
 196 environment will be used.
 197
 198 If @var{flag} is zero only the @code{LANG} environment variable is
 199 examined.  This is a left-over from the early days of this function
 200 when the other environment variables were not known.
 201
 202 In any case the environment variable should have a value of the form
 203 @code{@var{lang}[_@var{terr}[.@var{codeset}]]} as explained above.  If
 204 no environment variable is set the @code{"C"} locale is used which
 205 prevents any translation.
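
The lookup order described above can be written down as a small sketch.
The function @code{select_catalog_locale} is a hypothetical helper that
only illustrates the rules; it is not part of the library, and treating
an empty value like an unset variable is an assumption made here.

```c
#include <nl_types.h>
#include <stdlib.h>

/* Hypothetical helper: return the locale name which would be used to
   expand NLSPATH for the given catopen FLAG value.  */
const char *
select_catalog_locale (int flag)
{
  static const char *const vars[] = { "LC_ALL", "LC_MESSAGES", "LANG" };
  /* With NL_CAT_LOCALE all three variables are examined in order,
     otherwise only LANG.  */
  size_t first = (flag == NL_CAT_LOCALE) ? 0 : 2;

  for (size_t i = first; i < sizeof vars / sizeof vars[0]; ++i)
    {
      const char *value = getenv (vars[i]);
      /* Assumption: an empty value counts as "not set".  */
      if (value != NULL && value[0] != '\0')
        return value;
    }
  /* Nothing is set: fall back to the "C" locale, i.e., no translation.  */
  return "C";
}
```

With @code{LC_ALL=de_DE} and @code{LANG=it_IT} in the environment the
sketch selects @code{de_DE} for @code{NL_CAT_LOCALE} and @code{it_IT}
for a @var{flag} value of zero.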
 206
 207 The return value of @code{catgets} is in any case a valid string.  Either
 208 it is a translation from a message catalog or it is the same as the
 209 @var{string} parameter.  So a piece of code to decide whether a
 210 translation actually happened must look like this:
 211
 212 @smallexample
 213 @{
 214   char *trans = catgets (desc, set, msg, input_string);
 215   if (trans == input_string)
 216     @{
 217       /* Something went wrong.  */
 218     @}
 219 @}
 220 @end smallexample
 221
 222 @noindent
 223 When an error occurred the global variable @var{errno} is set to
 224
 225 @table @var
 226 @item EBADF
 227 The catalog does not exist.
 228 @item ENOMSG
 229 The set/message tuple does not name an existing element in the
 230 message catalog.
 231 @end table
 232
 233 While it sometimes can be useful to test for errors programs normally
 234 will avoid any test.  If the translation is not available it is no big
 235 problem if the original, untranslated message is printed.  Either the
 236 user understands this as well or s/he will look for the reason why the
 237 messages are not translated.
 238 @end deftypefun
 239
 240 Please note that the currently selected locale does not depend on a call
 241 to the @code{setlocale} function.  It is not necessary that the locale
 242 data files for this locale exist and calling @code{setlocale} succeeds.
 243 The @code{catopen} function directly reads the values of the environment
 244 variables.
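
Programs which do want to know whether a catalog was found can compare
the descriptor with @code{(nl_catd) -1}.  The following is a minimal
sketch; the wrapper name @code{open_catalog_verbose} is hypothetical and
reporting the problem with @code{perror} is just one possible choice.

```c
#include <nl_types.h>
#include <stdio.h>

/* Hypothetical wrapper: open CAT_NAME and print a diagnostic on stderr
   if no catalog could be loaded.  The descriptor is returned in either
   case; it is (nl_catd) -1 on failure.  */
nl_catd
open_catalog_verbose (const char *cat_name)
{
  nl_catd catdesc = catopen (cat_name, NL_CAT_LOCALE);

  if (catdesc == (nl_catd) -1)
    /* errno was set by catopen; perror turns it into a message.  */
    perror (cat_name);
  return catdesc;
}
```

Since @code{catgets} falls back to its @var{string} argument, a program
can keep running with the returned descriptor even after a failure.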
 245
 246
 247 @deftypefun {char *} catgets (nl_catd @var{catalog_desc}, int @var{set}, int @var{message}, const char *@var{string})
 248 The function @code{catgets} has to be used to access the message catalog
 249 previously opened using the @code{catopen} function.  The
 250 @var{catalog_desc} parameter must be a value previously returned by
 251 @code{catopen}.
 252
 253 The next two parameters, @var{set} and @var{message}, reflect the
 254 internal organization of the message catalog files.  This will be
 255 explained in detail below.  For now it is interesting to know that a
 256 catalog can consist of several sets and the messages in each set are
 257 individually numbered.  Neither the set numbers nor the message
 258 numbers need to be consecutive.  They can be arbitrarily chosen.
 259 But each message (unless equal to another one) must have its own unique
 260 pair of set and message number.
 261
 262 Since it is not guaranteed that the message catalog for the language
 263 selected by the user exists the last parameter @var{string} helps to
 264 handle this case gracefully.  If no matching string can be found
 265 @var{string} is returned.  This means for the programmer that
 266
 267 @itemize @bullet
 268 @item
 269 the @var{string} parameters should contain reasonable text (this also
 270 helps to understand the program since otherwise there would be no hint
 271 on the string which is expected to be returned).
 272 @item
 273 all @var{string} arguments should be written in the same language.
 274 @end itemize
 275 @end deftypefun
 276
 277 It is somewhat uncomfortable to write a program using the @code{catgets}
 278 functions if no supporting functionality is available.  Since each
 279 set/message number tuple must be unique the programmer must keep lists
 280 of the messages at the same time the code is written.  And the work
 281 between several people working on the same project must be coordinated.
 282 We will see later how these problems can be relaxed a bit (@pxref{Common
 283 Usage}).
 284
 285 @deftypefun int catclose (nl_catd @var{catalog_desc})
 286 The @code{catclose} function can be used to free the resources
 287 associated with a message catalog which previously was opened by a call
 288 to @code{catopen}.  If the resources can be successfully freed the
 289 function returns @code{0}.  Otherwise it returns @code{@minus{}1} and the
 290 global variable @var{errno} is set.  Errors can occur if the catalog
 291 descriptor @var{catalog_desc} is not valid in which case @var{errno} is
 292 set to @code{EBADF}.
 293 @end deftypefun
 294
 295
 296 @node The message catalog files
 297 @subsection  Format of the message catalog files
 298
 299 The only reasonable way to translate all the messages of a program and
 300 store the result in a message catalog file which can be read by the
 301 @code{catopen} function is to give all the message text to the
 302 translator and let her/him translate them all.  I.e., we must have a
 303 file with entries which associate the set/message tuple with a specific
 304 translation.  This file format is specified in the X/Open standard and
 305 is as follows:
 306
 307 @itemize @bullet
 308 @item
 309 Lines containing only whitespace characters or empty lines are ignored.
 310
 311 @item
 312 Lines which contain as the first non-whitespace character a @code{$}
 313 followed by a whitespace character are comments and are also ignored.
 314
 315 @item
 316 If a line contains as the first non-whitespace characters the sequence
 317 @code{$set} followed by a whitespace character an additional argument
 318 is required to follow.  This argument can either be:
 319
 320 @itemize @minus
 321 @item
 322 a number.  In this case the value of this number determines the set
 323 to which the following messages are added.
 324
 325 @item
 326 an identifier consisting of alphanumeric characters plus the underscore
 327 character.  In this case the set is automatically assigned a number.
 328 This value is one added to the largest set number which so far appeared.
 329
 330 How to use the symbolic names is explained in section @ref{Common Usage}.
 331
 332 It is an error if a symbol name appears more than once.  All following
 333 messages are placed in a set with this number.
 334 @end itemize
 335
 336 @item
 337 If a line contains as the first non-whitespace characters the sequence
 338 @code{$delset} followed by a whitespace character an additional argument
 339 is required to follow.  This argument can either be:
 340
 341 @itemize @minus
 342 @item
 343 a number.  In this case the value of this number determines the set
 344 which will be deleted.
 345
 346 @item
 347 an identifier consisting of alphanumeric characters plus the underscore
 348 character.  This symbolic identifier must match a name for a set which
 349 previously was defined.  It is an error if the name is unknown.
 350 @end itemize
 351
 352 In both cases all messages in the specified set will be removed.  They
 353 will not appear in the output.  But if this set is selected again later
 354 with a @code{$set} command, messages can be added again and these
 355 messages will appear in the output.
 356
 357 @item
 358 If a line contains after leading whitespace the sequence
 359 @code{$quote}, the quoting character used for this input file is
 360 changed to the first non-whitespace character following the
 361 @code{$quote}.  If no non-whitespace character is present before the
 362 line ends quoting is disabled.
 363
 364 By default no quoting character is used.  In this mode strings are
 365 terminated with the first unescaped line break.  If there is a
 366 @code{$quote} sequence present newlines need not be escaped.  Instead a
 367 string is terminated with the first unescaped appearance of the quote
 368 character.
 369
 370 A common usage of this feature would be to set the quote character to
 371 @code{"}.  Then any appearance of the @code{"} in the strings must
 372 be escaped using the backslash (i.e., @code{\"} must be written).
 373
 374 @item
 375 Any other line must start with a number or an alphanumeric identifier
 376 (with the underscore character included).  The following characters
 377 (starting after the first whitespace character) will form the string
 378 which gets associated with the currently selected set and the message
 379 number represented by the number or identifier respectively.
 380
 381 If the start of the line is a number the message number is obvious.  It
 382 is an error if the same message number already appeared for this set.
 383
 384 If the leading token was an identifier the message number gets
 385 automatically assigned.  The value is the current maximum message
 386 number for this set plus one.  It is an error if the identifier was
 387 already used for a message in this set.  It is ok to reuse the
 388 identifier for a message in another set.  How to use the symbolic
 389 identifiers will be explained below (@pxref{Common Usage}).  There is
 390 one limitation with the identifier: it must not be @code{Set}.  The
 391 reason will be explained below.
 392
 393 The text of the messages can contain escape characters.  The usual bunch
 394 of characters known from the @w{ISO C} language are recognized
 395 (@code{\n}, @code{\t}, @code{\v}, @code{\b}, @code{\r}, @code{\f},
 396 @code{\\}, and @code{\@var{nnn}}, where @var{nnn} is the octal coding of
 397 a character code).
 398 @end itemize
 399
 400 @strong{Important:} The handling of identifiers instead of numbers for
 401 the set and messages is a GNU extension.  Systems strictly following the
 402 X/Open specification do not have this feature.  An example for a message
 403 catalog file is this:
 404
 405 @smallexample
 406 $ This is a leading comment.
 407 $quote "
 408
 409 $set SetOne
 410 1 Message with ID 1.
 411 two "   Message with ID \"two\", which gets the value 2 assigned"
 412
 413 $set SetTwo
 414 $ Since the last set got the number 1 assigned this set has number 2.
 415 4000 "The numbers can be arbitrary, they need not start at one."
 416 @end smallexample
 417
 418 This small example shows various aspects:
 419 @itemize @bullet
 420 @item
 421 Lines 1 and 9 are comments since they start with @code{$} followed by
 422 whitespace.
 423 @item
 424 The quoting character is set to @code{"}.  Otherwise the quotes in the
 425 message definition would have to be omitted and in this case the
 426 message with the identifier @code{two} would lose its leading whitespace.
 427 @item
 428 Mixing numbered messages with message having symbolic names is no
 429 problem and the numbering happens automatically.
 430 @end itemize
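
To make the line-oriented rules of the source format more concrete, here
is a rough sketch of how a reader of such files might classify each
input line.  It is not the parser used by @code{gencat}; the names
@code{line_kind} and @code{classify_catalog_line} are hypothetical, and
continuation lines and quoting are deliberately ignored.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical classification of one line of a catalog source file.  */
enum line_kind
{
  LINE_BLANK,    /* empty or whitespace only: ignored */
  LINE_COMMENT,  /* "$" followed by whitespace: ignored */
  LINE_SET,      /* "$set ..." */
  LINE_DELSET,   /* "$delset ..." */
  LINE_QUOTE,    /* "$quote ..." */
  LINE_MESSAGE   /* anything else: number or identifier plus text */
};

/* Nonzero if LINE starts with WORD followed by whitespace or by the
   end of the line.  */
static int
has_directive (const char *line, const char *word)
{
  size_t len = strlen (word);
  return (strncmp (line, word, len) == 0
          && (line[len] == '\0' || isspace ((unsigned char) line[len])));
}

enum line_kind
classify_catalog_line (const char *line)
{
  /* Leading whitespace never matters for the classification.  */
  while (isspace ((unsigned char) *line))
    ++line;
  if (*line == '\0')
    return LINE_BLANK;
  if (has_directive (line, "$"))
    return LINE_COMMENT;
  if (has_directive (line, "$set"))
    return LINE_SET;
  if (has_directive (line, "$delset"))
    return LINE_DELSET;
  if (has_directive (line, "$quote"))
    return LINE_QUOTE;
  return LINE_MESSAGE;
}
```

Applied to the example file above, the sketch reports lines 1 and 9 as
comments, line 2 as a @code{$quote} directive, lines 4 and 8 as
@code{$set} directives, and lines 5, 6, and 10 as messages.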
 431
 432
 433 While this file format is pretty easy it is not the best possible for
 434 use in a running program.  The @code{catopen} function would have to
 435 parse the file and handle syntactic errors gracefully.  This is not so
 436 easy and the whole process is pretty slow.  Therefore the @code{catgets}
 437 functions expect the data in another more compact and ready-to-use file
 438 format.  There is a special program @code{gencat} which is explained in
 439 detail in the next section.
 440
 441 Files in this other format are not human readable.  To be easy to use by
 442 programs it is a binary file.  But the format is byte order independent
 443 so translation files can be shared by systems of arbitrary architecture
 444 (as long as they use the GNU C Library).
 445
 446 Details about the binary file format are not important to know since
 447 these files are always created by the @code{gencat} program.  The
 448 sources of the GNU C Library also provide the sources for the
 449 @code{gencat} program and so the interested reader can look through
 450 these source files to learn about the file format.
 451
 452
 453 @node The gencat program
 454 @subsection Generate Message Catalog Files
 455
 456 @cindex gencat
 457 The @code{gencat} program is specified in the X/Open standard and the
 458 GNU implementation follows this specification and so can process
 459 all correctly formed input files.  Additionally some extensions are
 460 implemented which help to work in a more reasonable way with the
 461 @code{catgets} functions.
 462
 463 The @code{gencat} program can be invoked in two ways:
 464
 465 @example
 466 `gencat [@var{Option}]@dots{} [@var{Output-File} [@var{Input-File}]@dots{}]`
 467 @end example
 468
 469 This is the interface defined in the X/Open standard.  If no
 470 @var{Input-File} parameter is given input will be read from standard
 471 input.  Multiple input files will be read as if they are concatenated.
 472 If @var{Output-File} is also missing, the output will be written to
 473 standard output.  To provide the interface one is used to from other
 474 programs a second interface is provided.
 475
 476 @smallexample
 477 `gencat [@var{Option}]@dots{} -o @var{Output-File} [@var{Input-File}]@dots{}`
 478 @end smallexample
 479
 480 The option @samp{-o} is used to specify the output file and all file
 481 arguments are used as input files.
 482
 483 Beside this one can use @file{-} or @file{/dev/stdin} for
 484 @var{Input-File} to denote the standard input.  Correspondingly one can
 485 use @file{-} and @file{/dev/stdout} for @var{Output-File} to denote
 486 standard output.  Using @file{-} as a file name is allowed in X/Open
 487 while using the device names is a GNU extension.
 488
 489 The @code{gencat} program works by concatenating all input files and
 490 then @strong{merging} the resulting collection of message sets with a
 491 possibly existing output file.  This is done by removing all messages
 492 with set/message number tuples matching any of the generated messages
 493 from the output file and then adding all the new messages.
 494 Regenerating a catalog file while ignoring the old contents therefore
 495 requires removing the output file if it exists.  If the output is
 496 written to standard output no merging takes place.
 497
 498 @noindent
 499 The following table shows the options understood by the @code{gencat}
 500 program.  The X/Open standard does not specify any option for the
 501 program so all of these are GNU extensions.
 502
 503 @table @samp
 504 @item -V
 505 @itemx --version
 506 Print the version information and exit.
 507 @item -h
 508 @itemx --help
 509 Print a usage message listing all available options, then exit successfully.
 510 @item --new
 511 Never merge the new messages from the input files with the old content
 512 of the output file.  The old content of the output file is discarded.
 513 @item -H
 514 @itemx --header=name
 515 This option is used to emit the symbolic names given to sets and
 516 messages in the input files for use in the program.  Details about how
 517 to use this are given in the next section.  The @var{name} parameter to
 518 this option specifies the name of the output file.  It will contain a
 519 number of C preprocessor @code{#define}s to associate a name with a
 520 number.
 521
 522 Please note that the generated file only contains the symbols from the
 523 input files.  If the output is merged with the previous content of the
 524 output file the possibly existing symbols from the file(s) which
 525 generated the old output files are not in the generated header file.
 526 @end table
 527
 528
 529 @node Common Usage
 530 @subsection How to use the @code{catgets} interface
 531
 532 The @code{catgets} functions can be used in two different ways:  by
 533 following the X/Open specs slavishly and not relying on the extensions,
 534 or by using the GNU extensions.  We will take a look at the former
 535 method first to understand the benefits of extensions.
 536
 537 @subsubsection Not using symbolic names
 538
 539 Since the X/Open format of the message catalog files does not allow
 540 symbol names we have to work with numbers all the time.  When we start
 541 writing a program we have to replace all appearances of translatable
 542 strings with something like
 543
 544 @smallexample
 545 catgets (catdesc, set, msg, "string")
 546 @end smallexample
 547
 548 @noindent
 549 @var{catdesc} is retrieved from a call to @code{catopen} which is
 550 normally done once at the program start.  The @code{"string"} is the
 551 string we want to translate.  The problems start with the set and
 552 message numbers.
 553
 554 In a bigger program several programmers usually work at the same time on
 555 the program and so coordinating the number allocation is crucial.
 556 Though no two different strings must be indexed by the same tuple of
 557 numbers it is highly desirable to reuse the numbers for equal strings
 558 with equal translations (please note that there might be strings which
 559 are equal in one language but have different translations due to
 560 different contexts).
 561
 562 The allocation process can be relaxed a bit by using different set
 563 numbers for different parts of the program.  So the number of developers
 564 who have to coordinate the allocation can be reduced.  But still lists
 565 must be kept to track the allocation and errors can easily happen.  These errors
 566 cannot be discovered by the compiler or the @code{catgets} functions.
 567 Only the user of the program might see wrong messages printed.  In the
 568 worst cases the messages are so irritating that they cannot be
 569 recognized as wrong.  Think about the translations for @code{"true"} and
 570 @code{"false"} being exchanged.  This could result in a disaster.
 571
 572
 573 @subsubsection Using symbolic names
 574
 575 The problems mentioned in the last section derive from the fact that:
 576
 577 @enumerate
 578 @item
 579 the numbers are allocated once and due to the possibly frequent use of
 580 them it is difficult to change a number later.
 581 @item
 582 the numbers do not allow one to guess anything about the string and
 583 therefore collisions can easily happen.
 584 @end enumerate
 585
 586 By constantly using symbolic names and by providing a method which maps
 587 the string content to a symbolic name (however this will happen) one can
 588 prevent both problems above.  The cost of this is that the programmer
 589 has to write a complete message catalog file while s/he is writing the
 590 program itself.
 591
 592 This is necessary since the symbolic names must be mapped to numbers
 593 before the program sources can be compiled.  In the last section it was
 594 described how to generate a header containing the mapping of the names.
 595 E.g., for the example message file given in the last section we could
 596 call the @code{gencat} program as follows (assume @file{ex.msg} contains
 597 the sources).
 598
 599 @smallexample
 600 gencat -H ex.h -o ex.cat ex.msg
 601 @end smallexample
 602
 603 @noindent
 604 This generates a header file with the following content:
 605
 606 @smallexample
 607 #define SetTwoSet 0x2   /* ex.msg:8 */
 608
 609 #define SetOneSet 0x1   /* ex.msg:4 */
 610 #define SetOnetwo 0x2   /* ex.msg:6 */
 611 @end smallexample
 612
 613 As can be seen the various symbols given in the source file are mangled
 614 to generate unique identifiers and these identifiers get numbers
 615 assigned.  Reading the source file and knowing about the rules will
 616 allow to predict the content of the header file (it is deterministic)
 617 but this is not necessary.  The @code{gencat} program can take care of
 618 everything.  All the programmer has to do is to put the generated header
 619 file in the dependency list of the source files of her/his project and
 620 to add a rule to regenerate the header if any of the input files
 621 changes.
 622
 623 One word about the symbol mangling.  Every symbol consists of two parts:
 624 the name of the message set plus the name of the message or the special
 625 string @code{Set}.  So @code{SetOnetwo} means this macro can be used to
 626 access the translation with identifier @code{two} in the message set
 627 @code{SetOne}.
 628
 629 The other names denote the names of the message sets.  The special
 630 string @code{Set} is used in the place of the message identifier.
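
The mangling rule amounts to a plain concatenation, which the following
sketch spells out.  The function @code{mangle_symbol} is hypothetical
and exists only to illustrate the naming scheme; it is not an interface
of @code{gencat} or of the library.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: build the macro name for message MSG_NAME in set
   SET_NAME.  Passing NULL for MSG_NAME yields the name of the macro for
   the set itself.  Returns 0 on success, -1 if OUT is too small.  */
int
mangle_symbol (const char *set_name, const char *msg_name,
               char *out, size_t outlen)
{
  /* The set name comes first, followed by the message identifier or
     the special string "Set".  */
  int n = snprintf (out, outlen, "%s%s", set_name,
                    msg_name != NULL ? msg_name : "Set");
  return (n >= 0 && (size_t) n < outlen) ? 0 : -1;
}
```

The sketch also shows why a message identifier must not be @code{Set}:
such a message would be mangled to the same name as the macro for its
set.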
 631
 632 If in the code the second string of the set @code{SetOne} is used the C
 633 code should look like this:
 634
 635 @smallexample
 636 catgets (catdesc, SetOneSet, SetOnetwo,
 637          "   Message with ID \"two\", which gets the value 2 assigned")
 638 @end smallexample
 639
 640 Writing the call this way makes it possible to change the message number
 641 and even the set number without requiring any change in the C source
 642 code.  (The text of the string is normally not the same; this is only
 643 for this example.)
 644
 645
 646 @subsubsection How does this allow one to develop
 647
 648 To illustrate the usual way to work with the symbolic version numbers
 649 here is a little example.  Assume we want to write the very complex and
 650 famous greeting program.  We start by writing the code as usual:
 651
 652 @smallexample
 653 #include <stdio.h>
 654 int
 655 main (void)
 656 @{
 657   printf ("Hello, world!\n");
 658   return 0;
 659 @}
 660 @end smallexample
 661
 662 Now we want to internationalize the message and therefore replace the
 663 message with whatever the user wants.
 664
 665 @smallexample
 666 #include <nl_types.h>
 667 #include <stdio.h>
 668 #include "msgnrs.h"
 669 int
 670 main (void)
 671 @{
 672   nl_catd catdesc = catopen ("hello.cat", NL_CAT_LOCALE);
 673   printf (catgets (catdesc, MainSet, MainHello,
 674                    "Hello, world!\n"));
 675   catclose (catdesc);
 676   return 0;
 677 @}
 678 @end smallexample
 679
 680 We see how the catalog object is opened and the returned descriptor used
 681 in the other function calls.  It is not really necessary to check for
 682 failure of any of the functions since even in these situations the
 683 functions will behave reasonably.  They will simply return the
 684 untranslated default string.
 685
 686 What remains unspecified here are the constants @code{MainSet} and
 687 @code{MainHello}.  These are the symbolic names describing the
 688 message.  To get the actual definitions which match the information in
 689 the catalog file we have to create the message catalog source file and
 690 process it using the @code{gencat} program.
 691
 692 @smallexample
 693 $ Messages for the famous greeting program.
 694 $quote "
 695
 696 $set Main
 697 Hello "Hallo, Welt!\n"
 698 @end smallexample
 699
 700 Now we can start building the program (assume the message catalog source
 701 file is named @file{hello.msg} and the program source file @file{hello.c}):
 702
 703 @smallexample
 704 @cartouche
 705 % gencat -H msgnrs.h -o hello.cat hello.msg
 706 % cat msgnrs.h
 707 #define MainSet 0x1     /* hello.msg:4 */
 708 #define MainHello 0x1   /* hello.msg:5 */
 709 % gcc -o hello hello.c -I.
 710 % cp hello.cat /usr/share/locale/de/LC_MESSAGES
 711 % echo $LC_ALL
 712 de
 713 % ./hello
 714 Hallo, Welt!
 715 %
 716 @end cartouche
 717 @end smallexample
 718
 719 The call of the @code{gencat} program creates the missing header file
 720 @file{msgnrs.h} as well as the message catalog binary.  The former is
used in the compilation of @file{hello.c} while the latter is placed in a
 722 directory in which the @code{catopen} function will try to locate it.
 723 Please check the @code{LC_ALL} environment variable and the default path
 724 for @code{catopen} presented in the description above.
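
The remark that failure checks are optional can be made concrete.  The
following is a minimal sketch and not part of the greeting program; the
helper name @code{fetch_message} and its buffer interface are
assumptions made for illustration.  It reports whether @code{catopen}
succeeded, uses the default string otherwise, and copies the text so
that it stays valid after @code{catclose}.

```c
#include <nl_types.h>
#include <stdio.h>

/* Hypothetical helper: copy message MSG of set SET from the catalog
   CATNAME into BUF (SIZE bytes).  FALLBACK is used when the catalog or
   the message is unavailable.  Returns 1 if the catalog could be
   opened and 0 otherwise.  */
int
fetch_message (const char *catname, int set, int msg,
               const char *fallback, char *buf, size_t size)
{
  nl_catd catdesc = catopen (catname, NL_CAT_LOCALE);
  int opened = catdesc != (nl_catd) -1;

  /* With a valid descriptor catgets returns either the translation or
     FALLBACK; without one we use FALLBACK directly.  */
  const char *text = opened ? catgets (catdesc, set, msg, fallback)
                            : fallback;

  /* Copy before closing: the catalog's strings belong to the catalog.  */
  snprintf (buf, size, "%s", text);

  if (opened)
    catclose (catdesc);
  return opened;
}
```

A caller that only wants to print the message can ignore the return
value; a caller that wants to warn about a missing catalog can test it.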
 725
 726
 727 @node The Uniforum approach
 728 @section The Uniforum approach to Message Translation
 729
Sun Microsystems tried to standardize a different approach to message
translation in the Uniforum group.  There never was a real standard
defined but still the interface was used in Sun's operating systems.
Since this approach fits better in the development process of free
software it is also used throughout the GNU project and the GNU
@file{gettext} package provides support for this outside the GNU C
Library.
 737
 738 The code of the @file{libintl} from GNU @file{gettext} is the same as
 739 the code in the GNU C Library.  So the documentation in the GNU
 740 @file{gettext} manual is also valid for the functionality here.  The
 741 following text will describe the library functions in detail.  But the
 742 numerous helper programs are not described in this manual.  Instead
 743 people should read the GNU @file{gettext} manual
 744 (@pxref{Top,,GNU gettext utilities,gettext,Native Language Support Library and Tools}).
 745 We will only give a short overview.
 746
Though the @code{catgets} functions are available by default on more
systems, the @code{gettext} interface is at least as portable as the
former.  The GNU @file{gettext} package can be used wherever the
functions are not available.
 751
 752
 753 @menu
 754 * Message catalogs with gettext::  The @code{gettext} family of functions.
 755 * Helper programs for gettext::    Programs to handle message catalogs
 756                                     for @code{gettext}.
 757 @end menu
 758
 759
 760 @node Message catalogs with gettext
 761 @subsection The @code{gettext} family of functions
 762
The paradigm underlying the @code{gettext} approach to message
translation is different from that of the @code{catgets} functions, but
the basic functionality is equivalent.  There are functions of the
following categories:
 767
 768 @menu
 769 * Translation with gettext::    What has to be done to translate a message.
 770 * Locating gettext catalog::    How to determine which catalog to be used.
 771 * Advanced gettext functions::  Additional functions for more complicated
 772                                  situations.
 773 * Using gettextized software::  The possibilities of the user to influence
 774                                  the way @code{gettext} works.
 775 @end menu
 776
 777 @node Translation with gettext
 778 @subsubsection What has to be done to translate a message?
 779
 780 The @code{gettext} functions have a very simple interface.  The most
 781 basic function just takes the string which shall be translated as the
 782 argument and it returns the translation.  This is fundamentally
 783 different from the @code{catgets} approach where an extra key is
 784 necessary and the original string is only used for the error case.
 785
If the string which has to be translated is the only argument this of
course means the string itself is the key.  I.e., the translation will
be selected based on the original string.  The message catalogs must
therefore contain the original strings plus one translation for each
such string.  The task of the @code{gettext} function is to compare the
argument string with the available strings in the catalog and return the
appropriate translation.  Of course this process is optimized so that
it is not more expensive than an access using an atomic key like in
@code{catgets}.
 795
 796 The @code{gettext} approach has some advantages but also some
 797 disadvantages.  Please see the GNU @file{gettext} manual for a detailed
 798 discussion of the pros and cons.
 799
 800 All the definitions and declarations for @code{gettext} can be found in
 801 the @file{libintl.h} header file.  On systems where these functions are
 802 not part of the C library they can be found in a separate library named
 803 @file{libintl.a} (or accordingly different for shared libraries).
 804
 805 @comment libintl.h
 806 @comment GNU
 807 @deftypefun {char *} gettext (const char *@var{msgid})
 808 The @code{gettext} function searches the currently selected message
 809 catalogs for a string which is equal to @var{msgid}.  If there is such a
 810 string available it is returned.  Otherwise the argument string
 811 @var{msgid} is returned.
 812
Please note that although the return value is @code{char *} the
 814 returned string must not be changed.  This broken type results from the
 815 history of the function and does not reflect the way the function should
 816 be used.
 817
 818 Please note that above we wrote ``message catalogs'' (plural).  This is
 819 a speciality of the GNU implementation of these functions and we will
 820 say more about this when we talk about the ways message catalogs are
 821 selected (@pxref{Locating gettext catalog}).
 822
 823 The @code{gettext} function does not modify the value of the global
 824 @var{errno} variable.  This is necessary to make it possible to write
 825 something like
 826
 827 @smallexample
 828   printf (gettext ("Operation failed: %m\n"));
 829 @end smallexample
 830
Here the @var{errno} value is used in the @code{printf} function while
processing the @code{%m} format element, and if the @code{gettext}
function changed this value (it is called before @code{printf} is
called) we would get a wrong message.
 835
So there is no easy way to detect a missing message catalog besides
comparing the argument string with the result.  But it is normally the
task of the user to react to missing catalogs.  The program cannot guess
when a message catalog is really necessary since a user who speaks the
language the program was developed in does not need any translation.
 841 @end deftypefun
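
The two properties described above, the return of @var{msgid} itself
when nothing is found and the untouched @var{errno}, can be sketched as
follows.  This is only an illustration under stated assumptions: the
helper names @code{has_translation} and @code{format_failure} are
hypothetical, and @code{strerror} is used in place of the GNU-specific
@code{%m} so that the sketch does not depend on @code{printf}
extensions.

```c
#include <errno.h>
#include <libintl.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return nonzero if the currently selected
   catalogs provide a translation for MSGID.  It compares pointers,
   relying on gettext returning its argument when nothing is found.  */
int
has_translation (const char *msgid)
{
  return gettext (msgid) != msgid;
}

/* Hypothetical helper: format a translated report for the current
   errno value into BUF.  gettext runs first; because it leaves errno
   alone, strerror still sees the original error afterwards.  */
int
format_failure (char *buf, size_t size)
{
  const char *format = gettext ("Operation failed: %s\n");
  return snprintf (buf, size, format, strerror (errno));
}
```

Whether a program should react to a missing translation at all is a
separate question; as noted above, this is normally left to the user.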
 842
 843 The remaining two functions to access the message catalog add some
 844 functionality to select a message catalog which is not the default one.
 845 This is important if parts of the program are developed independently.
 846 Every part can have its own message catalog and all of them can be used
 847 at the same time.  The C library itself is an example: internally it
 848 uses the @code{gettext} functions but since it must not depend on a
 849 currently selected default message catalog it must specify all ambiguous
 850 information.
 851
 852 @comment libintl.h
 853 @comment GNU
 854 @deftypefun {char *} dgettext (const char *@var{domainname}, const char *@var{msgid})
The @code{dgettext} function acts just like the @code{gettext}
 856 function.  It only takes an additional first argument @var{domainname}
 857 which guides the selection of the message catalogs which are searched
 858 for the translation.  If the @var{domainname} parameter is the null
 859 pointer the @code{dgettext} function is exactly equivalent to
 860 @code{gettext} since the default value for the domain name is used.
 861
 862 As for @code{gettext} the return value type is @code{char *} which is an
 863 anachronism.  The returned string must never be modified.
 864 @end deftypefun
 865
 866 @comment libintl.h
 867 @comment GNU
 868 @deftypefun {char *} dcgettext (const char *@var{domainname}, const char *@var{msgid}, int @var{category})
The @code{dcgettext} function adds another argument to those which
@code{dgettext} takes.  This argument @var{category} specifies the last
piece of information needed to locate the message catalog.  I.e., the
domain name and the locale category exactly specify which message
catalog has to be used (relative to a given directory, see below).
 874
 875 The @code{dgettext} function can be expressed in terms of
 876 @code{dcgettext} by using
 877
 878 @smallexample
 879 dcgettext (domain, string, LC_MESSAGES)
 880 @end smallexample
 881
 882 @noindent
 883 instead of
 884
 885 @smallexample
 886 dgettext (domain, string)
 887 @end smallexample
 888
 889 This also shows which values are expected for the third parameter.  One
 890 has to use the available selectors for the categories available in
 891 @file{locale.h}.  Normally the available values are @code{LC_CTYPE},
 892 @code{LC_COLLATE}, @code{LC_MESSAGES}, @code{LC_MONETARY},
@code{LC_NUMERIC}, and @code{LC_TIME}.  Please note that @code{LC_ALL}
must not be used and even though the names might suggest this, there is
no relation to the environment variables of this name.
 896
The @code{dcgettext} function is only implemented for compatibility with
other systems which have @code{gettext} functions.  There is not really
any situation where it is necessary (or useful) to use a value other
than @code{LC_MESSAGES} for the @var{category} parameter.  We are
dealing with messages here and any other choice can only be confusing.
 902
 903 As for @code{gettext} the return value type is @code{char *} which is an
 904 anachronism.  The returned string must never be modified.
 905 @end deftypefun
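
For an independently developed program part, the domain-specific
functions are typically wrapped once.  The following sketch assumes a
hypothetical library with the domain name @code{example-mylib}; both
wrapper names are invented for illustration.  The second wrapper spells
out the equivalence between @code{dgettext} and @code{dcgettext} with
@code{LC_MESSAGES} described above.

```c
#include <libintl.h>
#include <locale.h>

/* Hypothetical domain name of an independently developed library.  */
#define MYLIB_DOMAIN "example-mylib"

/* Look up MSGID in the library's own domain, independent of whatever
   default domain the main program selected.  */
const char *
mylib_message (const char *msgid)
{
  return dgettext (MYLIB_DOMAIN, msgid);
}

/* The same lookup written with dcgettext; LC_MESSAGES is the category
   dgettext uses implicitly.  */
const char *
mylib_message_explicit (const char *msgid)
{
  return dcgettext (MYLIB_DOMAIN, msgid, LC_MESSAGES);
}
```

Since the returned strings must never be modified, the wrappers return
@code{const char *} rather than repeating the historical @code{char *}.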
 906
When using the three functions above in a program, the @var{msgid}
argument is frequently a constant string.  So it is worthwhile to
optimize this case.  Thinking briefly about this one will realize that
as long as no new message catalog is loaded the translation of a message
will not change.  I.e., the algorithm to determine the translation is
deterministic.
 913
This is exactly what the optimizations implemented in the
@file{libintl.h} header use.  Whenever a program is compiled with
the GNU C compiler, optimization is selected, and the @var{msgid}
argument to @code{gettext}, @code{dgettext} or @code{dcgettext} is a
constant string, the actual function call will only be done the first
time the message is used and thereafter only if a new message catalog
was loaded and so the result of the translation lookup might be
different.  See the @file{libintl.h} header file for details.  For the
user it is only important to know that the result is always the same,
independent of the compiler or compiler options in use.
 924
 925
 926 @node Locating gettext catalog
 927 @subsubsection How to determine which catalog to be used
 928
The functions to retrieve the translations for a given message have a
remarkably simple interface.  But to still give the user of the program
the opportunity to select exactly the translation s/he wants, and to
give the programmer the possibility to influence the way the catalog
files are located, there is a quite complicated underlying mechanism
which controls all this.  The code is complicated, the use is easy.
 936
 937 Basically we have two different tasks to perform which can also be
 938 performed by the @code{catgets} functions:
 939
 940 @enumerate
 941 @item
Locate the set of message catalogs.  There are a number of files for
different languages which all belong to the package.  Usually they
are all stored in the filesystem below a certain directory.

There can be arbitrarily many packages installed and they can follow
different guidelines for the placement of their files.
 948
 949 @item
 950 Relative to the location specified by the package the actual translation
 951 files must be searched, based on the wishes of the user.  I.e., for each
 952 language the user selects the program should be able to locate the
 953 appropriate file.
 954 @end enumerate
 955
 956 This is the functionality required by the specifications for
 957 @code{gettext} and this is also what the @code{catgets} functions are
 958 able to do.  But there are some problems unresolved:
 959
 960 @itemize @bullet
 961 @item
 962 The language to be used can be specified in several different ways.
There is no generally accepted standard for this and the user always
expects the program to understand what s/he means.  E.g., to select the
 965 German translation one could write @code{de}, @code{german}, or
 966 @code{deutsch} and the program should always react the same.
 967
 968 @item
 969 Sometimes the specification of the user is too detailed.  If s/he, e.g.,
 970 specifies @code{de_DE.ISO-8859-1} which means German, spoken in Germany,
 971 coded using the @w{ISO 8859-1} character set there is the possibility
 972 that a message catalog matching this exactly is not available.  But
 973 there could be a catalog matching @code{de} and if the character set
 974 used on the machine is always @w{ISO 8859-1} there is no reason why this
latter message catalog should not be used.  (We call this @dfn{message
 976 inheritance}.)
 977
 978 @item
 979 If a catalog for a wanted language is not available it is not always the
 980 second best choice to fall back on the language of the developer and
 981 simply not translate any message.  Instead a user might be better able
 982 to read the messages in another language and so the user of the program
should be able to define a precedence order of languages.
 984 @end itemize
 985
 986 We can divide the configuration actions in two parts: the one is
 987 performed by the programmer, the other by the user.  We will start with
 988 the functions the programmer can use since the user configuration will
 989 be based on this.
 990
As the descriptions of the functions in the last section already
mention, separate sets of messages can be selected by a @dfn{domain
name}.  This is a simple string which should be unique for each program
part that uses a separate domain.  It is possible to use arbitrarily
many domains in one program at the same time.  E.g., the GNU C Library
itself uses a domain named @code{libc} while the program using the C
Library could use a domain named @code{foo}.  The important point is
that at any time exactly one domain is active.  This is controlled with
the following function.
1000
1001 @comment libintl.h
1002 @comment GNU
1003 @deftypefun {char *} textdomain (const char *@var{domainname})
1004 The @code{textdomain} function sets the default domain, which is used in
1005 all future @code{gettext} calls, to @var{domainname}.  Please note that
1006 @code{dgettext} and @code{dcgettext} calls are not influenced if the
1007 @var{domainname} parameter of these functions is not the null pointer.
1008
1009 Before the first call to @code{textdomain} the default domain is
1010 @code{messages}.  This is the name specified in the specification of
1011 the @code{gettext} API.  This name is as good as any other name.  No
1012 program should ever really use a domain with this name since this can
1013 only lead to problems.
1014
1015 The function returns the value which is from now on taken as the default
domain.  If the system ran out of memory the returned value is
1017 @code{NULL} and the global variable @var{errno} is set to @code{ENOMEM}.
1018 Despite the return value type being @code{char *} the return string must
1019 not be changed.  It is allocated internally by the @code{textdomain}
1020 function.
1021
1022 If the @var{domainname} parameter is the null pointer no new default
1023 domain is set.  Instead the currently selected default domain is
1024 returned.
1025
1026 If the @var{domainname} parameter is the empty string the default domain
1027 is reset to its initial value, the domain with the name @code{messages}.
Using this possibility is questionable since the domain @code{messages}
really never should be used.
1030 @end deftypefun
1031
1032 @comment libintl.h
1033 @comment GNU
1034 @deftypefun {char *} bindtextdomain (const char *@var{domainname}, const char *@var{dirname})
1035 The @code{bindtextdomain} function can be used to specify the directory
1036 which contains the message catalogs for domain @var{domainname} for the
different languages.  To be precise, this is the directory where the
1038 hierarchy of directories is expected.  Details are explained below.
1039
For the programmer it is important to note that the translations which
come with the program have to be placed in a directory hierarchy
starting at, say, @file{/foo/bar}.  Then the program should make a
@code{bindtextdomain} call to bind the domain for the current program to
this directory.  This ensures that the catalogs are found.  A correctly
running program does not depend on the user setting an environment
variable.
1047
The @code{bindtextdomain} function can be used several times and if the
@var{domainname} argument is different the previously bound domains
will not be overwritten.

If the program which wishes to use @code{bindtextdomain} at some point
uses the @code{chdir} function to change the current working
directory, it is important that the @var{dirname} string is an
absolute pathname.  Otherwise the addressed directory might vary over
time.
1057
1058 If the @var{dirname} parameter is the null pointer @code{bindtextdomain}
1059 returns the currently selected directory for the domain with the name
1060 @var{domainname}.
1061
The @code{bindtextdomain} function returns a pointer to a string
containing the name of the selected directory.  The string is
allocated internally in the function and must not be changed by the
user.  If the system ran out of memory during the execution of
@code{bindtextdomain} the return value is @code{NULL} and the global
variable @var{errno} is set accordingly.
1068 @end deftypefun
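
Taken together, the programmer's part of the configuration usually
amounts to a few calls at program startup.  The following is a minimal
sketch; the function name @code{init_translations} is hypothetical, and
the domain @code{foo} and directory @file{/foo/bar} in the usage note
are simply the placeholders used in the text above.

```c
#include <libintl.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical startup helper: select the user's locale, bind DOMAIN to
   the absolute directory CATALOG_DIR, and make DOMAIN the default.
   Returns 0 on success and -1 if either libintl call failed (both
   report failure with a null pointer and set errno).  */
int
init_translations (const char *domain, const char *catalog_dir)
{
  /* If this fails the "C" locale stays in effect, which only means
     that messages remain untranslated.  */
  setlocale (LC_ALL, "");

  if (bindtextdomain (domain, catalog_dir) == NULL)
    return -1;
  if (textdomain (domain) == NULL)
    return -1;
  return 0;
}
```

A program would call @code{init_translations ("foo", "/foo/bar")} once
before the first @code{gettext} call.  Passing an absolute directory
keeps the binding valid even if the program later calls @code{chdir}.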
1069
1070
1071 @node Advanced gettext functions
1072 @subsubsection Additional functions for more complicated situations
1073
The functions of the @code{gettext} family described so far (and all the
@code{catgets} functions as well) have one problem in the real world
which has been neglected completely in all existing approaches: the
handling of plural forms.
1078
1079 Looking through Unix source code before the time anybody thought about
1080 internationalization (and, sadly, even afterwards) one can often find
1081 code similar to the following:
1082
1083 @smallexample
1084    printf ("%d file%s deleted", n, n == 1 ? "" : "s");
1085 @end smallexample
1086
1087 @noindent
After the first complaints from people internationalizing the code
people either completely avoided formulations like this or used strings
like @code{"file(s)"}.  Both look unnatural and should be avoided.  First
attempts to solve the problem correctly looked like this:
1092
1093 @smallexample
1094    if (n == 1)
1095      printf ("%d file deleted", n);
1096    else
1097      printf ("%d files deleted", n);
1098 @end smallexample
1099
But this does not solve the problem.  It helps languages where the
plural form of a noun is not simply constructed by adding an `s' but
that is all.  Once again people fell into the trap of believing the
rules their language is using are universal.  But the handling of plural
forms differs widely between the language families.  Two things can
differ between language families (and even within them):
1106
1107 @itemize @bullet
1108 @item
The way plural forms are built differs.  This is a problem with
languages which have many irregularities.  German, for instance, is a
drastic case.  Though English and German are part of the same language
family (Germanic), the almost regular forming of plural noun forms
(appending an `s') is hardly found in German.
1114
1115 @item
The number of plural forms differs.  This is somewhat surprising for
those who only have experience with Romance and Germanic languages
since here the number is the same (there are two).

But other language families have only one form or many forms.  More
information on this follows in a separate section.
1122 @end itemize
1123
1124 The consequence of this is that application writers should not try to
1125 solve the problem in their code.  This would be localization since it is
1126 only usable for certain, hardcoded language environments.  Instead the
1127 extended @code{gettext} interface should be used.
1128
These extra functions take two strings and a numerical argument instead
of the one key string.  The idea behind this is that, using the
numerical argument and the first string as a key, the implementation
can select the right plural form using rules specified by the
translator.  The two string arguments are then used to provide a return
value in case no message catalog is found (similar to the normal
@code{gettext} behavior).  In this case the rules for Germanic languages
are used and it is assumed that the first string argument is the
singular form, the second the plural form.
1138
This has the consequence that programs without language catalogs can
display the correct strings only if the program itself is written using
a Germanic language.  This is a limitation but since the GNU C Library
(as well as the GNU @code{gettext} package) is written as part of the
GNU project and the coding standards for the GNU project require
programs to be written in English, this solution nevertheless fulfills
its purpose.
1146
1147 @comment libintl.h
1148 @comment GNU
1149 @deftypefun {char *} ngettext (const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n})
1150 The @code{ngettext} function is similar to the @code{gettext} function
1151 as it finds the message catalogs in the same way.  But it takes two
1152 extra arguments.  The @var{msgid1} parameter must contain the singular
1153 form of the string to be converted.  It is also used as the key for the
1154 search in the catalog.  The @var{msgid2} parameter is the plural form.
The parameter @var{n} is used to determine the plural form.  If no
message catalog is found @var{msgid1} is returned if @code{n == 1},
otherwise @var{msgid2}.
1158
An example for the use of this function is:
1160
1161 @smallexample
1162   printf (ngettext ("%d file removed", "%d files removed", n), n);
1163 @end smallexample
1164
1165 Please note that the numeric value @var{n} has to be passed to the
1166 @code{printf} function as well.  It is not sufficient to pass it only to
1167 @code{ngettext}.
1168 @end deftypefun
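
Because @code{ngettext} takes an @code{unsigned long int}, it can be
convenient to keep the count in that type and format it with
@code{%lu}.  The following sketch shows this variant; the helper name
@code{format_removed} is hypothetical.  It passes @var{n} twice, once to
select the form and once to fill in the number.

```c
#include <libintl.h>
#include <stdio.h>

/* Hypothetical helper: write a translated "N file(s) removed" message
   into BUF.  N is used by ngettext to select the form and by snprintf
   to print the number.  */
int
format_removed (char *buf, size_t size, unsigned long int n)
{
  const char *format = ngettext ("%lu file removed",
                                 "%lu files removed", n);
  return snprintf (buf, size, format, n);
}
```

Without a message catalog the Germanic fallback rule applies, so only a
count of exactly one selects the singular string.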
1169
1170 @comment libintl.h
1171 @comment GNU
1172 @deftypefun {char *} dngettext (const char *@var{domain}, const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n})
The @code{dngettext} function is similar to the @code{dgettext} function
in the way the message catalog is selected.  The difference is that it
takes two extra parameters to provide the correct plural form.  These
two parameters are handled in the same way @code{ngettext} handles them.
1177 @end deftypefun
1178
1179 @comment libintl.h
1180 @comment GNU
1181 @deftypefun {char *} dcngettext (const char *@var{domain}, const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n}, int @var{category})
The @code{dcngettext} function is similar to the @code{dcgettext}
function in the way the message catalog is selected.  The difference is
that it takes two extra parameters to provide the correct plural form.
These two parameters are handled in the same way @code{ngettext} handles
them.
1186 @end deftypefun
1187
1188 @subsubheading The problem of plural forms
1189
A description of the problem can be found at the beginning of the last
section.  Now there is the question of how to solve it.  Without the
input of linguists (which was not available) it was not possible to
determine whether there are only a few different rules by which plural
forms are selected or whether the number can increase with every new
supported language.
1196
Therefore the solution implemented is to allow the translator to specify
the rules of how to select the plural form.  Since the formula varies
with every language this is the only viable solution except for
hardcoding the information in the code (which still would require the
possibility of extensions to not prevent the use of new languages).  The
details are explained in the GNU @code{gettext} manual.  Here only a
bit of information is provided.
1204
The information about the plural form selection has to be stored in the
header entry (the one with the empty @code{msgid} string).  There should
be something like:
1208
1209 @smallexample
1210   nplurals=2; plural=n == 1 ? 0 : 1
1211 @end smallexample
1212
The @code{nplurals} value must be a decimal number which specifies how
many different plural forms exist for this language.  The string
following @code{plural} is an expression which is using the C language
syntax.  Exceptions are that no negative numbers are allowed, numbers
must be decimal, and the only variable allowed is @code{n}.  This
expression will be evaluated whenever one of the functions
@code{ngettext}, @code{dngettext}, or @code{dcngettext} is called.  The
numeric value passed to these functions is then substituted for all uses
of the variable @code{n} in the expression.  The resulting value then
must be greater than or equal to zero and smaller than the value given
as the value of @code{nplurals}.
1224
1225 @noindent
The following rules are known at this point.  The languages are listed
with their families.  But this does not necessarily mean the information
can be generalized for the whole family (as can be easily seen in the
table below).@footnote{Additions are welcome.  Send appropriate
information to @email{bug-glibc-manual@@gnu.org}.}
1231
1232 @table @asis
1233 @item Only one form:
Some languages only require one single form.  There is no distinction
between the singular and plural form.  An appropriate header entry
would look like this:
1237
1238 @smallexample
1239 nplurals=1; plural=0
1240 @end smallexample
1241
1242 @noindent
1243 Languages with this property include:
1244
1245 @table @asis
1246 @item Finno-Ugric family
1247 Hungarian
1248 @item Asian family
1249 Japanese
1250 @item Turkic/Altaic family
1251 Turkish
1252 @end table
1253
1254 @item Two forms, singular used for one only
This is the form used in most existing programs since it is what English
is using.  A header entry would look like this:
1257
1258 @smallexample
1259 nplurals=2; plural=n != 1
1260 @end smallexample
1261
(Note: this uses the feature of C expressions that boolean expressions
have the value zero or one.)
1264
1265 @noindent
1266 Languages with this property include:
1267
1268 @table @asis
1269 @item Germanic family
1270 Danish, Dutch, English, German, Norwegian, Swedish
1271 @item Finno-Ugric family
1272 Finnish
1273 @item Latin/Greek family
1274 Greek
1275 @item Semitic family
1276 Hebrew
1277 @item Romance family
1278 Italian, Spanish
1279 @item Artificial
1280 Esperanto
1281 @end table
1282
1283 @item Two forms, singular used for zero and one
1284 Exceptional case in the language family.  The header entry would be:
1285
1286 @smallexample
1287 nplurals=2; plural=n>1
1288 @end smallexample
1289
1290 @noindent
1291 Languages with this property include:
1292
1293 @table @asis
@item Romance family
1295 French
1296 @end table
1297
1298 @item Three forms, special cases for one and two
1299 The header entry would be:
1300
1301 @smallexample
1302 nplurals=3; plural=n==1 ? 0 : n==2 ? 1 : 2
1303 @end smallexample
1304
1305 @noindent
1306 Languages with this property include:
1307
1308 @table @asis
1309 @item Celtic
1310 Gaeilge
1311 @end table
1312
1313 @item Three forms, special case for one and all numbers ending in 2, 3, or 4
1314 The header entry would look like this:
1315
1316 @smallexample
1317 nplurals=3; plural=n==1 ? 0 : n%10>=2 && n%10<=4 ? 1 : 2
1318 @end smallexample
1319
1320 @noindent
1321 Languages with this property include:
1322
1323 @table @asis
1324 @item Slavic family
1325 Russian
1326 @end table
1327
1328 @item Three forms, special case for one and some numbers ending in 2, 3, or 4
1329 The header entry would look like this:
1330
1331 @smallexample
1332 nplurals=3; plural=n==1 ? 0 : \
1333   n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2
1334 @end smallexample
1335
1336 (Continuation in the next line is possible.)
1337
1338 @noindent
1339 Languages with this property include:
1340
1341 @table @asis
1342 @item Slavic family
1343 Polish
1344 @end table
1345
1346 @item Four forms, special case for one and all numbers ending in 2, 3, or 4
1347 The header entry would look like this:
1348
1349 @smallexample
nplurals=4; plural=n==1 ? 0 : n%10==2 ? 1 : n%10==3 || n%10==4 ? 2 : 3
1351 @end smallexample
1352
1353 @noindent
1354 Languages with this property include:
1355
1356 @table @asis
1357 @item Slavic family
1358 Slovenian
1359 @end table
1360 @end table
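
Since the @code{plural} expressions use C syntax, they can be tried out
directly in C.  The following sketch transcribes three of the rules
from the table above; the function names are hypothetical and the code
only illustrates the expressions, it is not how the library evaluates
them.  The last function checks the requirement that every result is
smaller than @code{nplurals}.

```c
/* Rule "nplurals=2; plural=n != 1" (e.g., English).  */
unsigned long int
plural_english (unsigned long int n)
{
  return n != 1;
}

/* Rule "nplurals=2; plural=n>1" (French).  */
unsigned long int
plural_french (unsigned long int n)
{
  return n > 1;
}

/* Rule with nplurals=3 listed for Polish; parentheses are added only
   for readability.  */
unsigned long int
plural_polish (unsigned long int n)
{
  return n == 1 ? 0
         : (n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20))
           ? 1 : 2;
}

/* Return 1 if RULE yields an index below NPLURALS for every count from
   0 to LIMIT, and 0 otherwise.  */
int
rule_stays_in_range (unsigned long int (*rule) (unsigned long int),
                     unsigned long int nplurals, unsigned long int limit)
{
  for (unsigned long int n = 0; n <= limit; n++)
    if (rule (n) >= nplurals)
      return 0;
  return 1;
}
```

For example, the Polish rule maps 22 to the second form but 12 to the
third, which is the ``some numbers ending in 2, 3, or 4'' distinction
named in the table.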
1361
1362
1363 @node Using gettextized software
1364 @subsubsection User influence on @code{gettext}
1365
The last sections described what the programmer can do to
internationalize the messages of the program.  But it is finally up to
the user to select the messages s/he wants to see.  S/He must understand
them.
1370
The POSIX locale model uses the environment variables @code{LC_COLLATE},
@code{LC_CTYPE}, @code{LC_MESSAGES}, @code{LC_MONETARY},
@code{LC_NUMERIC}, and @code{LC_TIME} to select the locale which is to
be used.  This way the user can influence lots of functions.  As we
mentioned above the @code{gettext} functions also take advantage of
this.
1376
1377 To understand how this happens it is necessary to take a look at the
1378 various components of the filename which gets computed to locate a
1379 message catalog.  It is composed as follows:
1380
1381 @smallexample
1382 @var{dir_name}/@var{locale}/LC_@var{category}/@var{domain_name}.mo
1383 @end smallexample
1384
1385 The default value for @var{dir_name} is system specific.  It is computed
1386 from the value given as the prefix while configuring the C library.
1387 This value normally is @file{/usr} or @file{/}.  For the former the
1388 complete @var{dir_name} is:
1389
1390 @smallexample
1391 /usr/share/locale
1392 @end smallexample
1393
We can use @file{/usr/share} since the @file{.mo} files containing the
message catalogs are system independent, so all systems can use the same
files.  If the program executed the @code{bindtextdomain} function for
the message domain that is currently handled, the @var{dir_name}
component is exactly the value which was given to the function as
the second parameter.  I.e., @code{bindtextdomain} allows overriding
the only system dependent and fixed value to make it possible to
address files anywhere in the filesystem.
1402
1403 The @var{category} is the name of the locale category which was selected
1404 in the program code.  For @code{gettext} and @code{dgettext} this is
1405 always @code{LC_MESSAGES}, for @code{dcgettext} this is selected by the
value of the third parameter.  As said above, using a category other
than @code{LC_MESSAGES} should be avoided.
1408
The @var{locale} component is computed based on the category used.  Just
as for the @code{setlocale} function, this is where the user's selection
comes into play.  Some environment variables are examined in a fixed order
and the first environment variable that is set determines the result of
the lookup process.  In detail, for the category @code{LC_xxx} the
following variables are examined in this order:
1415
1416 @table @code
1417 @item LANGUAGE
1418 @item LC_ALL
1419 @item LC_xxx
1420 @item LANG
1421 @end table
1422
This looks very familiar.  With the exception of the @code{LANGUAGE}
environment variable this is exactly the lookup order the
@code{setlocale} function uses.  But why introduce the @code{LANGUAGE}
variable?
1427
The reason is that the syntax of the values the @code{LANGUAGE} variable
can have is different from what is expected by the @code{setlocale}
function.  If we set @code{LC_ALL} to a value following the extended
syntax, the @code{setlocale} function would no longer be able to use the
value of this variable.  An additional variable removes this
problem, and it also lets us select the language independently of the
locale setting, which sometimes is useful.
1435
While for the @code{LC_xxx} variables the value should consist of
exactly one specification of a locale, the @code{LANGUAGE} variable's
value can consist of a colon-separated list of locale names.  The
attentive reader will realize that this is the way we manage to
implement one of our additional demands above: we want to be able to
specify an ordered list of languages.
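
The lookup order can be pictured with the following sketch.  It is only
an illustration and not the code used by the library.  The function name
@code{lookup_locale_name} is made up for this example, variables which
are set to the empty string are skipped like unset ones (an assumption
of this sketch), and a colon-separated @code{LANGUAGE} value is returned
as a whole without being split.

@smallexample
#include <stdio.h>
#include <stdlib.h>

/* Return the value of the first variable which is set and not empty,
   examined in the order LANGUAGE, LC_ALL, CATEGORY_VAR, LANG.
   Return NULL if none of them is usable.  Hypothetical helper.  */
static const char *
lookup_locale_name (const char *category_var)
@{
  const char *names[] = @{ "LANGUAGE", "LC_ALL", category_var, "LANG" @};
  size_t i;

  for (i = 0; i < sizeof names / sizeof names[0]; ++i)
    @{
      const char *val = getenv (names[i]);
      if (val != NULL && val[0] != '\0')
        return val;
    @}
  return NULL;
@}

int
main (void)
@{
  const char *val = lookup_locale_name ("LC_MESSAGES");
  printf ("%s\n", val != NULL ? val : "(none set)");
  return 0;
@}
@end smallexample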
1442
1443 Back to the constructed filename we have only one component missing.
1444 The @var{domain_name} part is the name which was either registered using
1445 the @code{textdomain} function or which was given to @code{dgettext} or
1446 @code{dcgettext} as the first parameter.  Now it becomes obvious that a
1447 good choice for the domain name in the program code is a string which is
1448 closely related to the program/package name.  E.g., for the GNU C
1449 Library the domain name is @code{libc}.
1450
1451 @noindent
A little piece of example code should show how the programmer is supposed
to work:
1454
1455 @smallexample
1456 @{
1457   textdomain ("test-package");
1458   bindtextdomain ("test-package", "/usr/local/share/locale");
  puts (gettext ("Hello, world!"));
1460 @}
1461 @end smallexample
1462
1463 At the program start the default domain is @code{messages}.  The
1464 @code{textdomain} call changes this to @code{test-package}.  The
1465 @code{bindtextdomain} call specifies that the message catalogs for the
1466 domain @code{test-package} can be found below the directory
1467 @file{/usr/local/share/locale}.
1468
If the user now sets the variable @code{LANGUAGE} in her/his environment
to @code{de}, the @code{gettext} function will try to use the
translations from the file
1472
1473 @smallexample
1474 /usr/local/share/locale/de/LC_MESSAGES/test-package.mo
1475 @end smallexample
1476
1477 From the above descriptions it should be clear which component of this
1478 filename is determined by which source.
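
To make the composition concrete, here is a minimal sketch which builds
the same filename from its four components.  The function
@code{catalog_path} is a hypothetical helper written for this example.
It is not part of the library, which performs this step internally.

@smallexample
#include <stdio.h>

/* Compose DIR_NAME/LOCALE/CATEGORY/DOMAIN_NAME.mo in BUF.  Return 0
   on success and -1 if the result does not fit into LEN bytes.
   Hypothetical helper, for illustration only.  */
static int
catalog_path (char *buf, size_t len, const char *dir_name,
              const char *locale, const char *category,
              const char *domain_name)
@{
  int n = snprintf (buf, len, "%s/%s/%s/%s.mo",
                    dir_name, locale, category, domain_name);
  return (n < 0 || (size_t) n >= len) ? -1 : 0;
@}

int
main (void)
@{
  char path[256];

  /* The values are those of the example above.  */
  if (catalog_path (path, sizeof path, "/usr/local/share/locale",
                    "de", "LC_MESSAGES", "test-package") == 0)
    puts (path);
  return 0;
@}
@end smallexample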
1479
In the above example we assumed that the @code{LANGUAGE} environment
variable is set to @code{de}.  This might be an appropriate selection, but
what happens if the user wants to use @code{LC_ALL} because of the wider
usability and here the required value is @code{de_DE.ISO-8859-1}?  We
already mentioned above that a situation like this is not infrequent.
E.g., a person might prefer reading a dialect and, if this is not
available, fall back on the standard language.
1487
The @code{gettext} functions know about situations like this and can
handle them gracefully.  The functions recognize the format of the value
of the environment variable.  They can split the value into different
pieces and, by leaving out one or the other part, construct new
values.  This happens of course in a predictable way.  To understand
this one must know the format of the environment variable value.  There
are two more or less standardized forms:
1495
1496 @table @emph
1497 @item X/Open Format
1498 @code{language[_territory[.codeset]][@@modifier]}
1499
1500 @item CEN Format (European Community Standard)
1501 @code{language[_territory][+audience][+special][,[sponsor][_revision]]}
1502 @end table
1503
The functions will automatically recognize which format is used.  To
obtain less specific locale names, parts are stripped off in the order
of the following list:
1507
1508 @enumerate
1509 @item
1510 @code{revision}
1511 @item
1512 @code{sponsor}
1513 @item
1514 @code{special}
1515 @item
1516 @code{codeset}
1517 @item
1518 @code{normalized codeset}
1519 @item
1520 @code{territory}
1521 @item
1522 @code{audience}/@code{modifier}
1523 @end enumerate
1524
From the last entry one can see that the @code{modifier}
field in the X/Open format and the @code{audience} field in the CEN
format have the same meaning.  Besides, one can see that the
@code{language} field for obvious reasons will never be dropped.
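
Before any part can be stripped, the value has to be split into its
fields.  The following sketch does this for the X/Open format only.  The
structure @code{locale_parts}, the function @code{split_xopen_name}, and
the fixed field size of 32 bytes are assumptions made for this example.
The library's own parser is different and also handles the CEN format.

@smallexample
#include <stdio.h>
#include <string.h>

/* Fields of language[_territory[.codeset]][@@modifier].  The fixed
   size is an arbitrary choice for this sketch.  */
struct locale_parts
@{
  char language[32];
  char territory[32];
  char codeset[32];
  char modifier[32];
@};

/* Copy the LEN bytes at SRC to DST, truncating to DSTLEN - 1 bytes.  */
static void
copy_part (char *dst, size_t dstlen, const char *src, size_t len)
@{
  if (len >= dstlen)
    len = dstlen - 1;
  memcpy (dst, src, len);
  dst[len] = '\0';
@}

/* Split NAME into its fields.  Missing fields become empty strings.  */
static void
split_xopen_name (const char *name, struct locale_parts *p)
@{
  const char *s = name;
  size_t n = strcspn (s, "_.@@");

  memset (p, 0, sizeof *p);
  copy_part (p->language, sizeof p->language, s, n);
  s += n;
  if (*s == '_')
    @{
      ++s;
      n = strcspn (s, ".@@");
      copy_part (p->territory, sizeof p->territory, s, n);
      s += n;
    @}
  if (*s == '.')
    @{
      ++s;
      n = strcspn (s, "@@");
      copy_part (p->codeset, sizeof p->codeset, s, n);
      s += n;
    @}
  if (*s == '@@')
    copy_part (p->modifier, sizeof p->modifier, s + 1, strlen (s + 1));
@}

int
main (void)
@{
  struct locale_parts p;

  split_xopen_name ("de_DE.ISO-8859-1", &p);
  printf ("language=%s territory=%s codeset=%s modifier=%s\n",
          p.language, p.territory, p.codeset, p.modifier);
  return 0;
@}
@end smallexample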
1529
The only new thing is the @code{normalized codeset} entry.  This is
another goodie which is introduced to help reduce the chaos which
derives from the inability of people to standardize the names of
character sets.  Instead of @w{ISO-8859-1} one can often see @w{8859-1},
@w{88591}, @w{iso8859-1}, or @w{iso_8859-1}.  The @code{normalized
codeset} value is generated from the user-provided character set name by
applying the following rules:
1537
1538 @enumerate
1539 @item
Remove all characters besides digits and letters.
1541 @item
1542 Fold letters to lowercase.
1543 @item
If the resulting string contains only digits, prepend the string @code{"iso"}.
1545 @end enumerate
1546
1547 @noindent
So all of the above names will be normalized to @code{iso88591}.  This
allows the program user to choose the locale name much more freely.
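
The three rules can be sketched in a few lines of C.  The function
@code{normalize_codeset} below is a hypothetical helper written for this
example and is not the routine used inside the library.  It assumes
character set names consisting of ASCII characters.

@smallexample
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Store the normalized form of CODESET in BUF, which has room for
   LEN bytes.  Return 0 on success and -1 if BUF is too small.
   Hypothetical helper, for illustration only.  */
static int
normalize_codeset (char *buf, size_t len, const char *codeset)
@{
  const char *p;
  size_t n = 0;
  int only_digits = 1;

  if (len == 0)
    return -1;

  /* Rule 3 depends on whether any letter survives rule 1.  */
  for (p = codeset; *p != '\0'; ++p)
    if (isalpha ((unsigned char) *p))
      only_digits = 0;

  if (only_digits)
    @{
      if (len < 4)
        return -1;
      memcpy (buf, "iso", 3);
      n = 3;
    @}

  /* Rules 1 and 2: keep digits and letters, fold to lowercase.  */
  for (p = codeset; *p != '\0'; ++p)
    if (isalnum ((unsigned char) *p))
      @{
        if (n + 1 >= len)
          return -1;
        buf[n++] = (char) tolower ((unsigned char) *p);
      @}

  buf[n] = '\0';
  return 0;
@}

int
main (void)
@{
  const char *names[] = @{ "ISO-8859-1", "8859-1", "88591",
                          "iso8859-1", "iso_8859-1" @};
  char buf[32];
  size_t i;

  for (i = 0; i < sizeof names / sizeof names[0]; ++i)
    if (normalize_codeset (buf, sizeof buf, names[i]) == 0)
      printf ("%s -> %s\n", names[i], buf);
  return 0;
@}
@end smallexample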
1550
1551 Even this extended functionality still does not help to solve the
1552 problem that completely different names can be used to denote the same
1553 locale (e.g., @code{de} and @code{german}).  To be of help in this
1554 situation the locale implementation and also the @code{gettext}
1555 functions know about aliases.
1556
1557 The file @file{/usr/share/locale/locale.alias} (replace @file{/usr} with
1558 whatever prefix you used for configuring the C library) contains a
1559 mapping of alternative names to more regular names.  The system manager
1560 is free to add new entries to fill her/his own needs.  The selected
1561 locale from the environment is compared with the entries in the first
1562 column of this file ignoring the case.  If they match the value of the
1563 second column is used instead for the further handling.
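
The effect of the alias file can be illustrated with a small in-memory
table.  This is a sketch only: the function @code{expand_alias} and the
two table entries are invented for this example, while the real entries
are read from @file{locale.alias}.  The comparison uses the POSIX
function @code{strcasecmp} to ignore case.

@smallexample
#include <stdio.h>
#include <strings.h>

struct alias
@{
  const char *name;     /* First column: the alternative name.  */
  const char *value;    /* Second column: the regular name.  */
@};

/* Sample entries, assumed for this example only.  */
static const struct alias aliases[] =
@{
  @{ "german", "de_DE.ISO-8859-1" @},
  @{ "swedish", "sv_SE.ISO-8859-1" @}
@};

/* Return the replacement for NAME if it matches an alias, ignoring
   case.  Otherwise return NAME itself.  */
static const char *
expand_alias (const char *name)
@{
  size_t i;

  for (i = 0; i < sizeof aliases / sizeof aliases[0]; ++i)
    if (strcasecmp (name, aliases[i].name) == 0)
      return aliases[i].value;
  return name;
@}

int
main (void)
@{
  printf ("%s\n", expand_alias ("German"));
  printf ("%s\n", expand_alias ("de"));
  return 0;
@}
@end smallexample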
1564
1565 In the description of the format of the environment variables we already
1566 mentioned the character set as a factor in the selection of the message
1567 catalog.  In fact, only catalogs which contain text written using the
1568 character set of the system/program can be used (directly; there will
come a solution for this some day).  This means for the user that s/he
will always have to take care of this.  If in the collection of the
message catalogs there are files for the same language but coded using
different character sets, the user has to be careful.
1573
1574
1575 @node Helper programs for gettext
1576 @subsection Programs to handle message catalogs for @code{gettext}
1577
1578 The GNU C Library does not contain the source code for the programs to
1579 handle message catalogs for the @code{gettext} functions.  As part of
1580 the GNU project the GNU gettext package contains everything the
1581 developer needs.  The functionality provided by the tools in this
1582 package by far exceeds the abilities of the @code{gencat} program
1583 described above for the @code{catgets} functions.
1584
1585 There is a program @code{msgfmt} which is the equivalent program to the
1586 @code{gencat} program.  It generates from the human-readable and
1587 -editable form of the message catalog a binary file which can be used by
1588 the @code{gettext} functions.  But there are several more programs
1589 available.
1590
The @code{xgettext} program can be used to automatically extract the
translatable messages from a source file.  I.e., the programmer need not
take care of the translations and the list of messages which have to be
translated.  S/He will simply wrap the translatable strings in calls to
@code{gettext} et al.@: and the rest will be done by @code{xgettext}.  This
program has a lot of options which help to customize the output or
help it to understand the input better.
1598
Other programs help to manage the development cycle when new messages appear
in the source files or when a new translation of the messages appears.
Here it should only be noted that using all the tools in GNU gettext it
is possible to @emph{completely} automate the handling of message
catalogs.  Besides marking the translatable strings in the source code and
generating the translations the developers do not have anything to do
themselves.