Document locale-specific changes to `sort',

author Jim Meyering <jim@meyering.net>

Sat, 22 May 1999 12:52:41 +0000 (12:52 +0000)

committer Jim Meyering <jim@meyering.net>

Sat, 22 May 1999 12:52:41 +0000 (12:52 +0000)
diff --git a/doc/textutils.texi b/doc/textutils.texi

index d1fdaff..044dac0 100644
--- a/doc/textutils.texi
+++ b/doc/textutils.texi
@@ -44,7 +44,7 @@ START-INFO-DIR-ENTRY
  * tsort: (textutils)tsort invocation.           Topological sort.
  * tr: (textutils)tr invocation.                 Translate characters.
  * unexpand: (textutils)unexpand invocation.     Convert spaces to tabs.
-* uniq: (textutils)uniq invocation.             Uniqify files.
+* uniq: (textutils)uniq invocation.             Uniquify files.
  * wc: (textutils)wc invocation.                 Byte, word, and line counts.
  END-INFO-DIR-ENTRY
  @end format
@@ -161,7 +161,7 @@ Summarizing files
  Operating on sorted files
  
  * sort invocation::             Sort text files.
-* uniq invocation::             Uniqify files.
+* uniq invocation::             Uniquify files.
  * comm invocation::             Compare two sorted files line by line.
  * ptx invocation::              Produce a permuted index of file contents.
  * tsort invocation::            Topological sort.
@@ -672,7 +672,7 @@ Output at most @var{bytes} bytes of the input.  Prefixes and suffixes on
  @opindex --strings
  @cindex string constants, outputting
  Instead of the normal output, output only @dfn{string constants}: at
-least @var{n} (3 by default) consecutive ASCII graphic characters,
+least @var{n} (3 by default) consecutive @sc{ASCII} graphic characters,
  followed by a null (zero) byte.
  
  @item -t @var{type}
@@ -687,14 +687,14 @@ of each output line using each of the data types that you specified,
  in the order that you specified.
  
  Adding a trailing ``z'' to any type specification appends a display
-of the ASCII character representation of the printable characters
+of the @sc{ASCII} character representation of the printable characters
  to the output line generated by the type specification.
  
  @table @samp
  @item a
  named character,
  @item c
-ASCII character or backslash escape,
+@sc{ASCII} character or backslash escape,
  @item d
  signed decimal,
  @item f
@@ -779,7 +779,7 @@ Output as octal bytes.  Equivalent to @samp{-toC}.
  
  @item -c
  @opindex -c
-Output as ASCII characters or backslash escapes.  Equivalent to
+Output as @sc{ASCII} characters or backslash escapes.  Equivalent to
  @samp{-tc}.
  
  @item -d
@@ -1998,7 +1998,7 @@ These commands work with (or produce) sorted files.
  
  @menu
  * sort invocation::             Sort text files.
-* uniq invocation::             Uniqify files.
+* uniq invocation::             Uniquify files.
  * comm invocation::             Compare two sorted files line by line.
  * ptx invocation::              Produce a permuted index of file contents.
  * tsort invocation::            Topological sort.
@@ -2043,18 +2043,21 @@ works.
  
  @end table
  
+@vindex LC_COLLATE
  A pair of lines is compared as follows: if any key fields have been
  specified, @code{sort} compares each pair of fields, in the order
  specified on the command line, according to the associated ordering
  options, until a difference is found or no fields are left.
+Unless otherwise specified, all comparisons use the character
+collating sequence specified by the @env{LC_COLLATE} locale.
  
  If any of the global options @samp{Mbdfinr} are given but no key fields
  are specified, @code{sort} compares the entire lines according to the
  global options.
  
  Finally, as a last resort when all keys compare equal (or if no
-ordering options were specified at all), @code{sort} compares the lines
-byte by byte in machine collating sequence.  The last resort comparison
+ordering options were specified at all), @code{sort} compares the entire
+lines.  The last resort comparison
  honors the @samp{-r} global option.  The @samp{-s} (stable) option
  disables this last-resort comparison so that lines in which all fields
  compare equal are left in their original relative order.  If no fields
@@ -2063,7 +2066,10 @@ or global options are specified, @samp{-s} has no effect.
  GNU @code{sort} (as specified for all GNU utilities) has no limits on
  input line length or restrictions on bytes allowed within lines.  In
  addition, if the final byte of an input file is not a newline, GNU
-@code{sort} silently supplies one.
+@code{sort} silently supplies one.  A line's trailing newline is part of
+the line for comparison purposes; for example, with no options in an
+@sc{ASCII} locale, a line starting with a tab sorts before an empty line
+because tab precedes newline in the @sc{ASCII} collating sequence.
  
  Upon any error, @code{sort} exits with a status of @samp{2}.
  
@@ -2073,11 +2079,14 @@ value as the directory for temporary files instead of @file{/tmp}.  The
  @samp{-T @var{tempdir}} option in turn overrides the environment
  variable.
  
+@vindex LC_CTYPE
  The following options affect the ordering of output lines.  They may be
  specified globally or as part of a specific key field.  If no key
  fields are specified, global options apply to comparison of entire
  lines; otherwise the global options are inherited by key fields that do
-not specify any special options of their own.
+not specify any special options of their own.  The @samp{-b}, @samp{-d},
+@samp{-f} and @samp{-i} options classify characters according to
+the @env{LC_CTYPE} locale.
  
  @table @samp
  
@@ -2102,40 +2111,59 @@ sorting so that, for example, @samp{b} and @samp{B} sort as equal.
  @item -g
  @opindex -g
  @cindex general numeric sort
-Sort numerically, but use strtod(3) to arrive at the numeric values.
+Sort numerically, using the standard C function @code{strtod} to convert
+a prefix of each line to a double-precision floating point number.
  This allows floating point numbers to be specified in scientific notation,
-like @code{1.0e-34} and @code{10e100}.  Use this option only if there
-is no alternative;  it is much slower than @samp{-n} and numbers with
-too many significant digits will be compared as if they had been
-truncated.  In addition, numbers outside the range of representable
-double precision floating point numbers are treated as if they were
-zeroes; overflow and underflow are not reported.
+like @code{1.0e-34} and @code{10e100}.
+Do not report overflow, underflow, or conversion errors.
+Use the following collating sequence:
+
+@itemize @bullet
+@item
+Lines that do not start with numbers (all considered to be equal).
+@item
+NaNs (``Not a Number'' values, in IEEE floating point arithmetic)
+in a consistent but machine-dependent order.
+@item
+Minus infinity.
+@item
+Finite numbers in ascending numeric order (with @math{-0} and @math{+0} equal).
+@item
+Plus infinity.
+@end itemize
+
+Use this option only if there is no alternative; it is much slower than
+@samp{-n} and it can lose information when converting to floating point.
  
  @item -i
  @opindex -i
  @cindex unprintable characters, ignoring
-Ignore characters outside the printable ASCII range 040-0176 octal
-(inclusive) when sorting.
+Ignore unprintable characters.
  
  @item -M
  @opindex -M
  @cindex months, sorting by
+@vindex LC_TIME
  An initial string, consisting of any amount of whitespace, followed
-by three letters abbreviating a month name, is folded to UPPER case and
+by a month name abbreviation, is folded to UPPER case and
  compared in the order @samp{JAN} < @samp{FEB} < @dots{} < @samp{DEC}.
-Invalid names compare low to valid names.
+Invalid names compare low to valid names.  The @env{LC_TIME} locale
+determines the month spellings.
  
  @item -n
  @opindex -n
  @cindex numeric sort
+@vindex LC_NUMERIC
  Sort numerically: the number begins each line; specifically, it consists
  of optional whitespace, an optional @samp{-} sign, and zero or more
-digits, optionally followed by a decimal point and zero or more digits.
+digits possibly separated by thousands separators, optionally followed
+by a radix character and zero or more digits.  The @env{LC_NUMERIC}
+locale specifies the radix character and thousands separator.
  
  @code{sort -n} uses what might be considered an unconventional method
  to compare strings representing floating point numbers.  Rather than
  first converting each string to the C @code{double} type and then
-comparing those values, sort aligns the decimal points in the two
+comparing those values, sort aligns the radix characters in the two
  strings and compares the strings a character at a time.  One benefit
  of using this approach is its speed.  In practice this is much more
  efficient than performing the two corresponding string-to-double (or even
@@ -2180,7 +2208,7 @@ following.
  
  @item -u
  @opindex -u
-@cindex uniqifying output
+@cindex uniquifying output
  For the default case or the @samp{-m} option, only output the first
  of a sequence of lines that compare equal.  For the @samp{-c} option,
  check that no pair of consecutive lines compares equal.
@@ -2199,7 +2227,7 @@ See below for more examples.
  @opindex -z
  @cindex sort zero-terminated lines
  Treat the input as a set of lines, each terminated by a zero byte (@sc{ASCII}
-@sc{NUL} (Null) character) instead of a @sc{ASCII} @sc{LF} (Line Feed.)
+@sc{NUL} (Null) character) instead of an @sc{ASCII} @sc{LF} (Line Feed).
  This option can be useful in conjunction with @samp{perl -0} or
  @samp{find -print0} and @samp{xargs -0} which do the same in order to
  reliably handle arbitrary pathnames (even those which contain Line Feed
@@ -2342,10 +2370,10 @@ sort -t : -b -k 5,5 -k 3,3n /etc/passwd
  
  
  @node uniq invocation
-@section @code{uniq}: Uniqify files
+@section @code{uniq}: Uniquify files
  
  @pindex uniq
-@cindex uniqify files
+@cindex uniquify files
  
  @code{uniq} writes the unique lines in the given @file{input}, or
  standard input if nothing is given or for an @var{input} name of
@@ -2618,7 +2646,7 @@ As it is setup now, the program assumes that the input file is coded
  using 8-bit ISO 8859-1 code, also known as Latin-1 character set,
  @emph{unless} if it is compiled for MS-DOS, in which case it uses the
  character set of the IBM-PC.  (GNU @code{ptx} is not known to work on
-smaller MS-DOS machines anymore.)  Compared to 7-bit ASCII, the set of
+smaller MS-DOS machines anymore.)  Compared to 7-bit @sc{ASCII}, the set of
  characters which are letters is then different, this fact alters the
  behaviour of regular expression matching.  Thus, the default regular
  expression for a keyword allows foreign or diacriticized letters.
@@ -2907,7 +2935,7 @@ sequence @code{^\@{ @}} and @code{~\@{ @}} respectively.  Other
  diacriticized characters of the underlying character set produce an
  appropriate @TeX{} sequence as far as possible.  The other non-graphical
  characters, like newline and tab, and all others characters which are
-not part of ASCII, are merely changed to exactly one space, with no
+not part of @sc{ASCII}, are merely changed to exactly one space, with no
  special attempt to compress consecutive spaces.  Let me know how to
  improve this special character processing for @TeX{}.
  
@@ -3842,8 +3870,8 @@ yourself using when setting up fancy data plumbing. The @code{sort}
  command reads and sorts each file named on the command line.  It then
  merges the sorted data and writes it to standard output.  It will read
  standard input if no files are given on the command line (thus
-making it into a filter).  The sort is based on the machine collating
-sequence (@sc{ASCII}) or based on  user-supplied ordering criteria.
+making it into a filter).  The sort is based on the character collating
+sequence or based on user-supplied ordering criteria.
  
  
  @node The uniq command
@@ -4019,7 +4047,7 @@ $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' | ...
  The second @code{tr} command operates on the complement of the listed
  characters, which are all the letters, the digits, the underscore, and
  the blank.  The @samp{\012} represents the newline character; it has to
-be left alone.  (The ASCII TAB character should also be included for
+be left alone.  (The @sc{ASCII} tab character should also be included for
  good measure in a production script.)
  
  At this point, we have data consisting of words separated by blank space.
@@ -4065,7 +4093,7 @@ with the help of two more @code{sort} options:
  
  @table @samp
  @item -n
-do a numeric sort, not an ASCII one
+do a numeric sort, not a textual one
  
  @item -r
  reverse the order of the sort