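.TH MAWK 1 "Dec 22 1994" "Version 1.2" "USER COMMANDS"
.\" strings used throughout the page
.ds ex \fIexpr\fR
.SH NAME
mawk \- pattern scanning and text processing language
.SH SYNOPSIS
.B mawk
[\-\fBW
.IR option ]
[\-\fBF
.IR value ]
[\-\fBv
.IR var=value ]
[\-\|\-] 'program text' [file ...]
.br
.B mawk
[\-\fBW
.IR option ]
[\-\fBF
.IR value ]
[\-\fBv
.IR var=value ]
[\-\fBf
.IR program-file ]
[\-\|\-] [file ...]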
.SH DESCRIPTION
.B Mawk
is an interpreter for the AWK Programming Language.
The AWK language
is useful for manipulation of data files,
text retrieval and processing,
and for prototyping and experimenting with algorithms.
.B Mawk
is a \fInew awk\fR meaning it implements the AWK language as
defined in Aho, Kernighan and Weinberger,
.I "The AWK Programming Language,"
Addison-Wesley Publishing, 1988. (Hereafter referred to as
the AWK book.)
.B Mawk
conforms to the Posix 1003.2
definition of the AWK language
which contains a few features not described in the AWK
book, and
.B mawk
provides a small number of extensions.
.PP
An AWK program is a sequence of \fIpattern {action}\fR pairs and
function definitions.
Short programs are entered on the command line
usually enclosed in ' ' to avoid shell
interpretation.
Longer programs can be read in from a
file with the \-f option.
Data input is read from the list of files on
the command line or from standard input when the list is empty.
The input is broken into records as determined by the
record separator variable, \fBRS\fR. Initially,
.B RS
= "\en" and records are synonymous with lines.
Each record is compared against each
.I pattern
and if it matches, the program text for
.I "{action}"
is executed.
.SH OPTIONS
.TP \w'\-\fBW'u+\w'\fRsprintf=\fInum\fR'u+2n
\-\fBF \fIvalue\fR
sets the field separator, \fBFS\fR, to
.IR value .
.TP
\-\fBf \fIfile\fR
Program text is read from \fIfile\fR instead of from the
command line. Multiple
.B \-f
options are allowed.
.TP
\-\fBv \fIvar=value\fR
assigns
.I value
to program variable
.IR var .
.TP
\-\|\-
indicates the unambiguous end of options.
.PP
The above options will be available with any Posix compatible
implementation of AWK, and implementation specific options are
prefaced with
.BR \-W .
.B Mawk
provides six:
.TP \w'\-\fBW'u+\w'\fRsprintf=\fInum\fR'u+2n
\-\fBW \fRversion
.B mawk
writes its version and copyright
to stdout and compiled limits to
stderr and exits 0.
.TP
\-\fBW \fRdump
writes an assembler-like listing of the internal
representation of the program to stdout and exits 0
(on successful compilation).
.TP
\-\fBW \fRinteractive
sets unbuffered writes to stdout and line buffered reads from stdin.
Records from stdin are lines regardless of the value of
.BR RS .
.TP
\-\fBW \fRexec \fIfile\fR
Program text is read from
.I file
and this is the last option. Useful on systems that support the
.B #!
"magic number" convention for executable scripts.
.TP
\-\fBW \fRsprintf=\fInum\fR
adjusts the size of
.BR mawk 's
internal sprintf buffer to
.I num
bytes. More than rare use of this option indicates
.B mawk
should be recompiled.
.TP
\-\fBW \fRposix_space
forces
.B mawk
not to consider '\en' to be space.
.PP
The short forms
.BR \-W [vdiesp]
are recognized and on some systems \fB\-W\fRe is mandatory to avoid
command line length limitations.
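.SH "THE AWK LANGUAGE"
.SS "\fB1. Program structure"
An AWK program is a sequence of
.I "pattern {action}"
pairs and
function definitions.
.PP
A pattern can be an expression, a regular expression, a
.B BEGIN
or
.B END
pattern, or a range of the form
.RS
.nf
expression , expression
.fi
.RE
.PP
One, but not both,
of \fIpattern {action}\fR can be omitted. If
.I {action}
is omitted it is implicitly { print }. If
.I pattern
is omitted, then it is implicitly matched.
.B BEGIN
and
.B END
patterns require an action.
.PP
Statements are terminated by newlines, semi-colons or both.
Groups of statements such as
actions or loop bodies are blocked via { ... } as in C. The
last statement in a block doesn't need a terminator. Blank lines
have no meaning; an empty statement is terminated with a
semi-colon. Long statements
can be continued with a backslash, \e\|. A statement can be broken
without a backslash after a comma, left brace, &&, ||,
.BR do ,
.BR else ,
the right parenthesis of an
.BR if ,
.B while
or
.B for
statement, and the
right parenthesis of a function definition.
A comment starts with # and extends to, but does not include
the end of line.
.PP
The following statements control program flow inside blocks.
.RS
.nf
\fBif\fR ( \*(ex ) \fIstatement\fR
\fBif\fR ( \*(ex ) \fIstatement\fR \fBelse\fR \fIstatement\fR
\fBwhile\fR ( \*(ex ) \fIstatement\fR
\fBdo\fR \fIstatement\fR \fBwhile\fR ( \*(ex )
\fBfor\fR ( \fIopt_expr\fR ; \fIopt_expr\fR ; \fIopt_expr\fR ) \fIstatement\fR
\fBfor\fR ( \fIvar \fBin \fIarray\fR ) \fIstatement\fR
\fBcontinue\fR
\fBbreak\fR
.fi
.RE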
.SS "\fB2. Data types, conversion and comparison"
There are two basic data types, numeric and string.
Numeric constants can be integer like \-2,
decimal like 1.08, or in scientific notation like
\-1.1e4 or .28E\-3. All numbers are represented internally and all
computations are done in floating point arithmetic.
So for example, the expression
0.2e2 == 20
is true and true is represented as 1.0.
.PP
String constants are enclosed in double quotes.
.RS
.nf
"This is a string with a newline at the end.\en"
.fi
.RE
Strings can be continued across a line by escaping (\e) the newline.
The following escape sequences are recognized.
.RS
.nf
\e"      "
\e\e      \e
\e/      /
\ea      alert, ascii 7
\eb      backspace, ascii 8
\et      tab, ascii 9
\en      newline, ascii 10
\ev      vertical tab, ascii 11
\ef      formfeed, ascii 12
\er      carriage return, ascii 13
\eddd    1, 2 or 3 octal digits for ascii ddd
\exhh    1 or 2 hex digits for ascii hh
.fi
.RE
.PP
If you escape any other character \ec, you get \ec, i.e.,
.B mawk
ignores the escape.
.PP
There are really three basic data types; the third is
.I "number and string"
which has both a numeric value and a string value
at the same time.
User defined variables come into existence when first referenced
and are initialized to
.IR null ,
a number and string value which has numeric value 0 and string value
"".
Non-trivial number and string typed data come from input
and are typically stored in fields. (See section 4).
.PP
The type of an expression is determined by its context and automatic
type conversion occurs if needed. For example, to evaluate the
statements
.RS
.nf
y = x + 2 ; z = x "hello"
.fi
.RE
.PP
The value stored in variable y will be typed numeric.
If x is not numeric,
the value read from x is converted to numeric before it is added to
2 and stored in y. The value stored in variable z will be typed
string, and the value of x will be converted to string if necessary
and concatenated with "hello". (Of course, the value and type
stored in x is not changed by any conversions.)
A string expression is converted to numeric using its longest
numeric prefix as with
.IR atof (3).
A numeric expression is converted to string by replacing
\*(ex
with
.BR sprintf(CONVFMT ,
\*(ex), unless \*(ex
can be represented on the host machine as an exact integer then
it is converted to \fBsprintf\fR("%d", \*(ex).
.B Sprintf()
is an AWK built-in that duplicates the functionality of
.IR sprintf (3),
and
.B CONVFMT
is a built-in variable used for internal conversion
from number to string and initialized to "%.6g".
Explicit type conversions can be forced,
\*(ex ""
is string and
\*(ex+0
is numeric.
.PP
To evaluate,
\*(ex\d1\u \fBrel-op \*(ex\d2\u,
if both operands are numeric or number and string then the comparison
is numeric; if both operands are string the comparison is string;
if one operand is string, the non-string operand is converted and
the comparison is string. The result is numeric, 1 or 0.
.PP
In boolean contexts such as,
\fBif\fR ( \*(ex ) \fIstatement\fR,
a string expression evaluates true if and only if it is not the
empty string "";
numeric values if and only if not numerically zero.
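.PP
The conversion and comparison rules above can be tried from the shell.
The following is a minimal sketch, not part of the original page; it assumes an
.B awk
command on the path that is mawk or another Posix awk, and the shell variable names are only illustrative.

```shell
# A string is converted through its longest numeric prefix ("3x" -> 3),
# and a number is converted to string when it is concatenated.
conv=$(awk 'BEGIN { x = "3x"; y = x + 2; z = x "hello"; print y, z }')
echo "$conv"

# Two numeric constants compare as numbers; if either operand is a
# string constant, the comparison is done on strings.
cmp=$(awk 'BEGIN { a = (10 < 9); b = ("10" < "9"); c = (2 < "10"); print a, b, c }')
echo "$cmp"

# In a boolean context the string "0" is true (it is not empty),
# while the number 0 is false.
bool=$(awk 'BEGIN { s = ("0") ? "true" : "false"; n = (0) ? "true" : "false"; print s, n }')
echo "$bool"
```

Under those assumptions the three commands should print "5 3xhello", "0 1 0" and "true false".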
.SS "\fB3. Regular expressions"
In the AWK language, records, fields and strings are often
tested for matching a
.IR "regular expression" .
Regular expressions are enclosed in slashes, and
.RS
.nf
\*(ex ~ /\fIr\fR/
.fi
.RE
is an AWK expression that evaluates to 1 if \*(ex "matches"
.IR r ,
which means a substring of \*(ex is in the set of strings
defined by
.IR r .
With no match the expression evaluates to 0; replacing
~ with the "not match" operator, !~ , reverses the meaning.
As pattern-action pairs,
.RS
.nf
/\fIr\fR/ { \fIaction\fR }   and   \fB$0\fR ~ /\fIr\fR/ { \fIaction\fR }
.fi
.RE
are the same,
and for each input record that matches
.IR r ,
.I action
is executed.
In fact, /\fIr\fR/ is an AWK expression that is
equivalent to (\fB$0\fR ~ /\fIr\fR/) anywhere except when on the
right side of a match operator or passed as an argument to
a built-in function that expects a regular expression
argument.
.PP
AWK uses extended regular expressions as with
.IR egrep (1).
The regular expression metacharacters, i.e., those with special
meaning in regular expressions are
.RS
.nf
\e ^ $ . [ ] | ( ) * + ?
.fi
.RE
.PP
Regular expressions are built up from characters as follows:
.RS
.TP \w'[^c\d1\uc\d2\uc\d3\u...]'u+1n
c
matches any non-metacharacter
c.
.TP
\ec
matches a character defined by the same escape sequences used
in string constants or the literal
character c if
\ec
is not an escape sequence.
.TP
\&.
matches any character (including newline).
.TP
^
matches the front of a string.
.TP
$
matches the back of a string.
.TP
[c\d1\uc\d2\uc\d3\u...]
matches any character in the class
c\d1\uc\d2\uc\d3\u... . An interval of characters is denoted
c\d1\u\-c\d2\u inside a class [...].
.TP
[^c\d1\uc\d2\uc\d3\u...]
matches any character not in the class
c\d1\uc\d2\uc\d3\u...
.RE
.PP
Regular expressions are built up from other regular expressions
as follows:
.RS
.TP \w'[^c\d1\uc\d2\uc\d3\u...]'u+1n
\fIr\fR\d1\u\fIr\fR\d2\u
matches
\fIr\fR\d1\u
followed immediately by
\fIr\fR\d2\u
(concatenation).
.TP
\fIr\fR\d1\u | \fIr\fR\d2\u
matches
\fIr\fR\d1\u
or
\fIr\fR\d2\u
(alternation).
.TP
\fIr\fR*
matches \fIr\fR repeated zero or more times.
.TP
\fIr\fR+
matches \fIr\fR repeated one or more times.
.TP
\fIr\fR?
matches \fIr\fR zero or once.
.TP
(\fIr\fR)
matches \fIr\fR, providing grouping.
.RE
.PP
The increasing precedence of operators is alternation,
concatenation and unary (*, + or ?).
.PP
For example,
.RS
.nf
/^[_a\-zA\-Z][_a\-zA\-Z0\-9]*$/   and
/^[\-+]?([0\-9]+\e\|.?|\e\|.[0\-9])[0\-9]*([eE][\-+]?[0\-9]+)?$/
.fi
.RE
are matched by AWK identifiers and AWK numeric constants
respectively. Note that . has to be escaped to be
recognized as a decimal point, and that metacharacters are not
special inside character classes.
.PP
Any expression can be used on the right hand side of the ~ or !~
operators or
passed to a built-in that expects
a regular expression.
If needed, it is converted to string, and then interpreted
as a regular expression. For example,
.RS
.nf
BEGIN { identifier = "[_a\-zA\-Z][_a\-zA\-Z0\-9]*" }

$0 ~ "^" identifier
.fi
.RE
prints all lines that start with an AWK identifier.
.PP
.B Mawk
recognizes the empty regular expression, //\|, which matches the
empty string and hence is matched by any string at the front,
back and between every character. For example,
.RS
.nf
echo abc | mawk '{ gsub(//, "X") ; print }'
XaXbXcX
.fi
.RE
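.PP
As an illustration of a regular expression built at run time, the sketch below feeds three made-up lines to the identifier pattern described above.
It assumes an
.B awk
command that is mawk or another Posix awk; the input lines and the shell variable name are hypothetical.

```shell
# Only lines that start with an AWK identifier are printed.
# The pattern is a string expression, so it is converted to a
# regular expression when the match operator is evaluated.
ident_lines=$(printf 'foo = 1\n9lives\n_bar\n' | awk '
BEGIN { identifier = "[_a-zA-Z][_a-zA-Z0-9]*" }
$0 ~ "^" identifier        # no action, so matching records are printed
')
echo "$ident_lines"
```

The line "9lives" starts with a digit and is therefore not printed; the other two lines are.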
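.SS "\fB4. Records and fields"
Records are read in one at a time, and stored in the
.I field
variable
.BR $0 .
The record is split into
.I fields
which are stored in
.BR $1 ,
.BR $2 ", ...,"
.BR $NF .
The built-in variable
.B NF
is set to the number of fields,
and
.B NR
and
.B FNR
are incremented by 1.
Fields above
.B $NF
are set to "".
.PP
Assignment to
.B $0
causes the fields and
.B NF
to be recomputed.
Assignment to
.B NF
or to a field
causes
.B $0
to be reconstructed by
concatenating the fields separated by
.BR OFS .
Assignment to a field with index greater than
.BR NF ,
increases
.B NF
and causes
.B $0
to be reconstructed.
.PP
Data input stored in fields
is string, unless the entire field has numeric
form and then the type is number and string.
For example,
.RS
.nf
echo 24 24E |
mawk '{ print($1>100, $1>"100", $2>100, $2>"100") }'
0 1 1 1
.fi
.RE
.B $0
and
.B $2
are string and
.B $1
is number and string. The first comparison is numeric,
the second is string, the third is string
(100 is converted to "100"),
and the last is string.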
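.PP
The effect of assigning to fields can be sketched as follows.
This is an illustration under the assumption that
.B awk
is mawk or another Posix awk; the input records are made up.

```shell
# Assigning to $2 rebuilds $0 with OFS; assigning to $5 (beyond NF)
# raises NF to 5 and creates an empty $4.
rebuilt=$(echo 'a b c' | awk '{ n = NF; $2 = "X"; print n ": " $0; $5 = "e"; print NF ": " $0 }')
echo "$rebuilt"

# A field that looks numeric is number and string: it compares
# numerically with a number and as a string with a string constant.
types=$(echo '24 24E' | awk '{ a = ($1 > 100); b = ($1 > "100"); print a, b }')
echo "$types"
```

The first command should print "3: a X c" and then "5: a X c  e" (two spaces, because the new fourth field is empty); the second should print "0 1".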
.SS "\fB5. Expressions and operators"
.PP
The expression syntax is
similar to C. Primary expressions are numeric constants,
string constants, variables, fields, arrays and function calls.
The identifier
for a variable, array or function can be a sequence of
letters, digits and underscores, that does
not start with a digit.
Variables are not declared; they exist when first referenced and
are initialized to
.IR null .
.PP
New
expressions are composed with the following operators in
order of increasing precedence.
.PP
.RS
.nf
.vs +2p \" open up a little
\fIassignment\fR          =  +=  \-=  *=  /=  %=  ^=
\fIconditional\fR         ?  :
\fIlogical or\fR          ||
\fIlogical and\fR         &&
\fIarray membership\fR    \fBin\fR
\fImatching\fR            ~  !~
\fIrelational\fR          <  >  <=  >=  ==  !=
\fIconcatenation\fR       (no explicit operator)
\fIadd ops\fR             +  \-
\fImul ops\fR             *  /  %
\fIunary\fR               +  \-
\fIlogical not\fR         !
\fIexponentiation\fR      ^
\fIinc and dec\fR         ++  \-\|\-  (both post and pre)
\fIfield\fR               $
.vs
.fi
.RE
.PP
Assignment, conditional and exponentiation associate right to
left; the other operators associate left to right. Any
expression can be parenthesized.
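.PP
A few consequences of the precedence and associativity rules above, as a small sketch that assumes
.B awk
is mawk or another Posix awk.

```shell
# 2 ^ 3 ^ 2 groups as 2 ^ (3 ^ 2) because exponentiation associates right to left.
# 1 " " 2 + 3 concatenates 1, " " and (2 + 3) because addition binds
# tighter than concatenation.
# x = y = 4 assigns 4 to y and then to x because assignment associates right to left.
prec=$(awk 'BEGIN { a = 2 ^ 3 ^ 2; c = 1 " " 2 + 3; x = y = 4; printf "%d|%s|%d\n", a, c, x + y }')
echo "$prec"
```

Under those assumptions the command should print "512|1 5|8".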
.SS "\fB6. Arrays"
.ds ae \fIarray\fR[\fIexpr\fR]
Awk provides one-dimensional arrays. Array elements are expressed
as \*(ae.
.I Expr
is internally converted to string type, so, for example,
A[1] and A["1"] are the same element and the actual
index is "1".
Arrays indexed by strings are called associative arrays.
Initially an array is empty; elements exist when first accessed.
An expression,
\fIexpr\fB in\fI array\fR
evaluates to 1 if \*(ae exists, else to 0.
.PP
There is a form of the
.B for
statement that loops over each index of an array.
.RS
.nf
\fBfor\fR ( \fIvar\fB in \fIarray \fR) \fIstatement\fR
.fi
.RE
sets
.I var
to each index of
.I array
and executes
.IR statement .
The order that
.I var
traverses the indices of
.I array
is not defined.
.PP
The statement,
.B delete
\*(ae,
causes \*(ae not to exist.
.B Mawk
supports an extension,
.B delete
.IR array ,
which deletes all elements of
.IR array .
.PP
Multidimensional arrays are synthesized with concatenation using
the built-in variable
.BR SUBSEP .
\fIarray\fR[\fIexpr\fR\d1\u,\|\fIexpr\fR\d2\u]
is equivalent to
\fIarray\fR[\fIexpr\fR\d1\u \fBSUBSEP \fIexpr\fR\d2\u].
Testing for a multidimensional element uses a parenthesized index,
such as
.RS
.nf
if ( (i, j) in A ) print A[i, j]
.fi
.RE
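.PP
The array rules above can be exercised with the following sketch.
It assumes
.B awk
is mawk or another Posix awk; the array and variable names are illustrative only.

```shell
arrays=$(awk 'BEGIN {
    A[1] = "one"                 # the index 1 is stored as the string "1"
    same = (A["1"] == "one")
    A["x", "y"] = 42             # stored under the index "x" SUBSEP "y"
    multi = (("x", "y") in A)
    direct = A["x" SUBSEP "y"]
    missing = ("z" in A)         # a membership test does not create A["z"]
    n = 0
    for (k in A) n++             # order of traversal is not defined
    delete A[1]
    gone = !(1 in A)
    print same, multi, direct, missing, n, gone
}')
echo "$arrays"
```

The expected output is "1 1 42 0 2 1": two elements exist when the loop runs, because testing "z" with
.B in
did not create an element.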
.SS "\fB7. Built-in variables\fR"
.PP
The following variables are built-in and initialized before program
execution.
.RS
.TP \w'\fBCONVFMT'u+2n
.B ARGC
number of command line arguments.
.TP
.B ARGV
array of command line arguments, 0..ARGC-1.
.TP
.B CONVFMT
format for internal conversion of numbers to string,
initially = "%.6g".
.TP
.B ENVIRON
array indexed by environment variables. An environment string,
\fIvar=value\fR is stored as
\fBENVIRON\fR[\fIvar\fR] =
.IR value .
.TP
.B FILENAME
name of the current input file.
.TP
.B FNR
current record number in
.BR FILENAME .
.TP
.B FS
splits records into fields as a regular expression.
.TP
.B NF
number of fields in the current record.
.TP
.B NR
current record number in the total input stream.
.TP
.B OFMT
format for printing numbers; initially = "%.6g".
.TP
.B OFS
inserted between fields on output, initially = " ".
.TP
.B ORS
terminates each record on output, initially = "\en".
.TP
.B RLENGTH
length set by the last call to the built-in function,
.BR match() .
.TP
.B RS
input record separator, initially = "\en".
.TP
.B RSTART
index set by the last call to
.BR match() .
.TP
.B SUBSEP
used to build multiple array subscripts, initially = "\e034".
.RE
.SS "\fB8. Built-in functions"
String functions
.RS
.TP
gsub(\fIr,s,t\fR)  gsub(\fIr,s\fR)
Global substitution, every match of regular expression
.I r
in variable
.I t
is replaced by string
.IR s .
The number of replacements is returned.
If
.I t
is omitted,
.B $0
is used. An & in the replacement string
.I s
is replaced by the matched substring of
.IR t .
\e& and \e\e put literal & and \e, respectively,
in the replacement string.
.TP
index(\fIs,t\fR)
If
.I t
is a substring of
.IR s ,
then the position where
.I t
starts is returned, else 0 is returned.
The first character of
.I s
is in position 1.
.TP
length(\fIs\fR)
Returns the length of string
.IR s .
.TP
match(\fIs,r\fR)
Returns the index of the first longest match of regular expression
.I r
in string
.IR s .
Returns 0 if no match.
As a side effect,
.B RSTART
is set to the return value.
.B RLENGTH
is set to the length of the match or \-1 if no match. If the
empty string is matched,
.B RLENGTH
is set to 0, and 1 is returned if the match is at the front, and
length(\fIs\fR)+1 is returned if the match is at the back.
.TP
split(\fIs,A,r\fR)  split(\fIs,A\fR)
String
.I s
is split into fields by regular expression
.I r
and the fields are loaded into array
.IR A .
The number of fields
is returned. See section 11 below for more detail.
If
.I r
is omitted,
.B FS
is used.
.TP
sprintf(\fIformat,expr-list\fR)
Returns a string constructed from
.I expr-list
according to
.IR format .
See the description of printf() below.
.TP
sub(\fIr,s,t\fR)  sub(\fIr,s\fR)
Single substitution, same as gsub() except at most one substitution.
.TP
substr(\fIs,i,n\fR)  substr(\fIs,i\fR)
Returns the substring of string
.IR s ,
starting at index
.IR i ,
of length
.IR n .
If
.I n
is omitted, the suffix of
.IR s ,
starting at
.I i
is returned.
.TP
tolower(\fIs\fR)
Returns a copy of
.I s
with all upper case characters converted to lower case.
.TP
toupper(\fIs\fR)
Returns a copy of
.I s
with all lower case characters converted to upper case.
.RE
.PP
Arithmetic functions
.RS
.nf
atan2(\fIy,x\fR)     Arctan of \fIy\fR/\fIx\fR between \-\(*p and \(*p.
cos(\fIx\fR)         Cosine function, \fIx\fR in radians.
exp(\fIx\fR)         Exponential function.
int(\fIx\fR)         Returns \fIx\fR truncated towards zero.
log(\fIx\fR)         Natural logarithm.
rand()         Returns a random number between zero and one.
sin(\fIx\fR)         Sine function, \fIx\fR in radians.
sqrt(\fIx\fR)        Returns square root of \fIx\fR.
.fi
.TP
srand(\fIexpr\fR)  srand()
Seeds the random number generator, using the clock if
.I expr
is omitted, and returns the value of the previous seed.
.B Mawk
seeds the random number generator from the clock at startup
so there is no real need to call srand(). Srand(\fIexpr\fR)
is useful for repeating pseudo random sequences.
.RE
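.PP
The following sketch exercises several of the built-in functions described above.
It assumes
.B awk
is mawk or another Posix awk; the strings and variable names are made up for illustration.

```shell
builtins=$(awk 'BEGIN {
    s = "hello world"
    n = gsub(/o/, "[&]", s)          # & stands for the matched text
    print n, s
    print index("banana", "nan")
    m = match("foobar", /o+/)        # sets RSTART and RLENGTH as a side effect
    print m, RSTART, RLENGTH
    print substr("hello", 2, 3), toupper("abc"), length("abc")
    k = split("a:b:c", parts, ":")
    print k, parts[3]
    print int(-3.7), sqrt(16)
}')
echo "$builtins"
```

The six output lines should be "2 hell[o] w[o]rld", "3", "2 2 2", "ell ABC 3", "3 c" and "-3 4".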
.SS "\fB9. Input and output"
There are two output statements,
.B print
and
.BR printf .
.RS
.TP
print
writes
.B "$0 ORS"
to standard output.
.TP
print \*(ex\d1\u, \*(ex\d2\u, ..., \*(ex\dn\u
writes
\*(ex\d1\u \fBOFS \*(ex\d2\u \fBOFS\fR ... \*(ex\dn\u
.B ORS
to standard output. Numeric expressions are converted to
string with
.BR OFMT .
.TP
printf \fIformat, expr-list\fR
duplicates the printf C library function writing to standard output.
The complete ANSI C format specifications are recognized with
conversions %c, %d, %e, %E, %f, %g, %G,
%i, %o, %s, %u, %x, %X and %%,
and conversion qualifiers h and l.
.RE
.PP
The argument list to print or printf can optionally be enclosed in
parentheses.
Print formats numbers using
.B OFMT
or "%d" for exact integers.
"%c" with a numeric argument prints the corresponding 8 bit
character, with a string argument it prints the first character of
the string.
The output of print and printf can be redirected to a file or
command by appending >
.IR file ,
>>
.I file
or
|
.I command
to the end of the print statement.
Redirection opens
.I file
or
.I command
only once, subsequent redirections append to the already open stream.
By convention,
.B mawk
associates the filename "/dev/stderr" with stderr which allows
print and printf to be redirected to stderr.
.B Mawk
also associates "\-" and "/dev/stdout" with stdin and stdout which
allows these streams to be passed to functions.
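.PP
The input function
.B getline
has the following variations.
.RS
.TP
getline
reads into
.BR $0 ,
updates the fields,
.BR NF ,
.B NR
and
.BR FNR .
.TP
getline < \fIfile\fR
reads into
.B $0
from \fIfile\fR,
updates the fields and
.BR NF .
.TP
getline \fIvar\fR
reads the next record into
.IR var ,
updates
.B NR
and
.BR FNR .
.TP
getline \fIvar\fR < \fIfile\fR
reads the next record of
.I file
into
.IR var .
.TP
\fIcommand\fR | getline
pipes a record from
.I command
into
.B $0
and updates the fields and
.BR NF .
.TP
\fIcommand\fR | getline \fIvar\fR
pipes a record from
.I command
into
.IR var .
.RE
.PP
Getline returns 0 on end-of-file, \-1 on error, otherwise 1.
.PP
Commands on the end of pipes are executed by /bin/sh.
.PP
The function \fBclose\fR(\*(ex) closes the file or pipe
associated with
.IR expr .
Close returns 0 if
.I expr
is an open file,
the exit status if
.I expr
is a piped command, and \-1 otherwise.
Close is used to reread a file or command, make sure the other
end of an output pipe is finished or conserve file resources.
.PP
The function \fBfflush\fR(\*(ex) flushes the output file or pipe
associated with
.IR expr .
Fflush returns 0 if
.I expr
is an open output stream else \-1.
Fflush without an argument flushes stdout.
Fflush with an empty argument ("") flushes all open output.
.PP
The function
\fBsystem\fR(\fIexpr\fR)
uses
/bin/sh
to execute
.I expr
and returns the exit status of the command
.IR expr .
Changes made to the
.B ENVIRON
array are not passed to commands executed with
.B system
or pipes.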
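.PP
Reading from a command with getline, rereading it after close(), and formatting with printf can be sketched as below.
The sketch assumes
.B awk
is mawk or another Posix awk and that a Posix shell runs the piped command; the command string and variable names are illustrative.

```shell
io=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    # getline returns 1 for each record and 0 at end of file
    while ((cmd | getline line) > 0) { n++; last = line }
    close(cmd)                       # closing lets the command be run again
    cmd | getline first              # rereads from the start of the output
    close(cmd)
    printf "%d %s %s\n", n, last, first
    printf "%5.2f|%-4s|%d\n", 3.14159, "ab", 42
}')
echo "$io"
```

The first output line should be "2 two one"; the second shows the width and alignment conversions, " 3.14|ab  |42".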
.SS "\fB10. User defined functions"
The syntax for a user defined function is
.RS
.nf
\fBfunction\fR name( \fIargs\fR ) { \fIstatements\fR }
.fi
.RE
The function body can contain a return statement
.RS
.nf
\fBreturn\fI opt_expr\fR
.fi
.RE
A return statement is not required.
Function calls may be nested or recursive.
Functions are passed expressions by value
and arrays by reference.
Extra arguments serve as local variables
and are initialized to
.IR null .
For example, csplit(\fIs,\|A\fR) puts each character of
.I s
into array
.I A
and returns the length of
.IR s .
.RS
.nf
function csplit(s, A,    n, i)
{
  n = length(s)
  for( i = 1 ; i <= n ; i++ ) A[i] = substr(s, i, 1)
  return n
}
.fi
.RE
Putting extra space between passed arguments and local
variables is conventional.
Functions can be referenced before they are defined, but the
function name and the '(' of the arguments must touch to
avoid confusion with concatenation.
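.PP
The calling conventions above can be sketched with two small functions.
The function names fact and fill are hypothetical, and the sketch assumes
.B awk
is mawk or another Posix awk.

```shell
funcs=$(awk '
# Recursive call; n is passed by value.
function fact(n) { return n <= 1 ? 1 : n * fact(n - 1) }
# A is passed by reference, n by value; i is a local variable.
function fill(A, n,    i) { for (i = 1; i <= n; i++) A[i] = i * i; n = 0 }
BEGIN {
    k = 3; i = "global"
    fill(sq, k)
    # k and the global i are unchanged; the array sq was filled in place
    print fact(5), sq[3], k, i
}')
echo "$funcs"
```

The expected output is "120 9 3 global".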
.SS "\fB11. Splitting strings, records and files"
Awk programs use the same algorithm to
split strings into arrays with split(), and records into fields
on
.BR FS .
.B Mawk
uses essentially the same algorithm to split files into
records on
.BR RS .
.PP
Split(\fIexpr,\|A,\|sep\fR) works as follows:
.RS
.TP
(1)
If
.I sep
is omitted, it is replaced by
.BR FS .
.I Sep
can be an expression or regular expression. If it is an
expression of non-string type, it is converted to string.
.TP
(2)
If
.I sep
= " " (a single space),
then <SPACE> is trimmed from the front and back of
.IR expr ,
and
.I sep
becomes <SPACE>.
.B Mawk
defines <SPACE> as the regular expression
/[\ \et\en]+/.
Otherwise
.I sep
is treated as a regular expression, except that meta-characters
are ignored for a string of length 1,
e.g.,
split(x, A, "*") and split(x, A, /\e*/) are the same.
.TP
(3)
If \*(ex is not string, it is converted to string.
If \*(ex is then the empty string "", split() returns 0
and
.I A
is set empty.
Otherwise,
all non-overlapping, non-null and longest matches of
.I sep
in
.IR expr ,
separate
.I expr
into fields which are loaded into
.IR A .
The fields are placed in
A[1], A[2], ..., A[n] and split() returns n, the number
of fields which is the number
of matches plus one.
Data placed in
.I A
that looks numeric is typed number and string.
.RE
.PP
Splitting records into fields works the same except the
pieces are loaded into
.BR $1 ,
\fB$2\fR,...,
.BR $NF .
If
.B $0
is empty,
.B NF
is set to 0 and all
.B $i
to "".
.PP
.B Mawk
splits files into records by the same algorithm, but with the
slight difference that
.B RS
is really a terminator instead of a separator.
(\fBORS\fR is really a terminator too).
.RS
.PP
E.g., if
.B FS
= ":+" and
.B $0
= "a::b:" , then
.B NF
= 3 and
.B $1
= "a",
.B $2
= "b" and
.B $3
= "", but
if "a::b:" is the contents of an input file and
.B RS
= ":+", then
there are two records "a" and "b".
.RE
.PP
.B RS
= " " is not special.
.PP
If
.B FS
= "", then
.B mawk
breaks the record into individual characters, and, similarly,
split(\fIs,A,\fR"") places the individual characters of
.I s
into
.IR A .
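.PP
The portable parts of the split() rules can be checked with the sketch below.
It assumes
.B awk
is mawk or another Posix awk; splitting on the empty string is left out because, as noted elsewhere on this page, that behavior is not portable.

```shell
splits=$(awk 'BEGIN {
    n1 = split("  a  b ", A)             # sep omitted: FS = " " trims blanks
    n2 = split("a:b::c", B, ":")         # an empty field sits between the colons
    n3 = split("a1b22c", C, /[0-9]+/)    # separator given as a regular expression
    n4 = split("", D, ":")               # the empty string yields no fields
    n5 = split("a*b", E, "*")            # a one-character string is not a regexp
    print n1, A[1] A[2]
    print n2, "[" B[3] "]"
    print n3, C[3]
    print n4, n5
}')
echo "$splits"
```

The four output lines should be "2 ab", "4 []", "3 c" and "0 2".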
.SS "\fB12. Multi-line records"
Since
.B mawk
interprets
.B RS
as a regular expression, multi-line
records are easy. Setting
.B RS
= "\en\en+", makes one or more blank
lines separate records. If
.B FS
= " " (the default), then single
newlines, by the rules for <SPACE> above, become space and
single newlines are field separators.
.RS
.PP
For example, if a file is "a\ b\enc\en\en",
.B RS
= "\en\en+" and
.B FS
= "\ ", then there is one record "a\ b\enc" with three
fields "a", "b" and "c". Changing
.B FS
= "\en", gives two
fields "a b" and "c"; changing
.B FS
= "", gives one field
identical to the record.
.RE
.PP
If you want lines with spaces or tabs to be considered blank,
set
.B RS
= "\en([\ \et]*\en)+".
For compatibility with other awks, setting
.B RS
= "" has the same
effect as if blank lines are stripped from the
front and back of files and then records are determined as if
.B RS
= "\en\en+".
Posix requires that "\en" always separates fields when
.B RS
= "" regardless of the value of
.BR FS .
.B Mawk
does not support this convention, because defining
"\en" as <SPACE> makes it unnecessary.
.PP
Most of the time when you change
.B RS
for multi-line records, you
will also want to change
.B ORS
to "\en\en" so the record spacing is preserved on output.
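.PP
A small sketch of blank-line separated records follows.
It uses the portable setting RS = "" and the default FS, under which both mawk and a Posix awk treat newlines inside a record as field separators; the input text is made up and
.B awk
is assumed to be one of those implementations.

```shell
# Two records separated by blank lines; the first spans two lines.
paragraphs=$(printf 'a b\nc\n\n\nd e\n' |
    awk 'BEGIN { RS = "" } { print NR ": " NF " fields, last = " $NF }')
echo "$paragraphs"
```

The expected output is "1: 3 fields, last = c" followed by "2: 2 fields, last = e".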
.SS "\fB13. Program execution"
This section describes the order of program execution.
First
.B ARGC
is set to the total number of command line arguments passed to
the execution phase of the program.
.B ARGV[0]
is set to the name of the AWK interpreter and
\fBARGV[1]\fR ...
.B ARGV[ARGC-1]
holds the remaining command line arguments exclusive of
options and program source.
For example with
.RS
.nf
mawk \-f prog v=1 A t=hello B
.fi
.RE
.B ARGC
= 5 with
.B ARGV[0]
= "mawk",
.B ARGV[1]
= "v=1",
.B ARGV[2]
= "A",
.B ARGV[3]
= "t=hello" and
.B ARGV[4]
= "B".
.PP
Next, each
.B BEGIN
block is executed in order.
If the program consists
entirely of
.B BEGIN
blocks, then execution terminates, else
an input stream is opened and execution continues.
If
.B ARGC
equals 1,
the input stream is set to stdin,
else the command line arguments
.BR ARGV[1] " ..."
.B ARGV[ARGC-1]
are examined for a file argument.
.PP
The command line arguments divide into three sets:
file arguments, assignment arguments and empty strings "".
An assignment has the form
\fIvar\fR=\fIstring\fR.
When an
.B ARGV[i]
is examined as a possible file argument,
if it is empty it is skipped;
if it is an assignment argument, the assignment to
.I var
takes place and
.B i
skips to the next argument;
else
.B ARGV[i]
is opened for input.
If it fails to open, execution terminates with exit code 2.
If no command line argument is a file argument, then input
comes from stdin.
Getline in a
.B BEGIN
action opens input. "\-" as a file argument denotes stdin.
.PP
Once an input stream is open, each input record is tested
against each
.IR pattern ,
and if it matches, the associated
.I action
is executed.
An expression pattern matches if it is boolean true (see
the end of section 2).
A
.B BEGIN
pattern matches before any input has been read, and
an
.B END
pattern matches after all input has been read.
A range pattern,
\fIexpr\fR1,\|\fIexpr\fR2 ,
matches every record between the match of
.IR expr 1
and the match of
.IR expr 2
inclusively.
.PP
When end of file occurs on the input stream, the remaining
command line arguments are examined for a file argument, and
if there is one it is opened, else the
.B END
.I pattern
is considered matched
and all
.B END
.I actions
are executed.
.PP
In the example, the assignment
v=1
takes place after the
.B BEGIN
.I actions
are executed, and
the data placed in
v
is typed number and string.
Input is then read from file A.
On end of file A,
t
is set to the string "hello",
and B is opened for input.
On end of file B, the
.B END
.I actions
are executed.
.PP
Program flow at the
.I "pattern {action}"
level can be changed with the
.RS
.nf
\fBnext\fR
\fBexit \fIopt_expr\fR
.fi
.RE
statements.
A
.B next
statement
causes the next input record to be read and pattern testing
to restart with the first
.I "pattern {action}"
pair in the program.
An
.B exit
statement
causes immediate execution of the
.B END
actions or program termination if there are none or
if the
.B exit
occurs in an
.B END
action.
The
.I opt_expr
sets the exit value of the program unless overridden by
a later
.B exit
or subsequent error.
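.PP
The order of execution described in this section can be observed with the sketch below.
It assumes
.B awk
is mawk or another Posix awk; the input lines and the variable v are made up for illustration.

```shell
# v=1 is an assignment argument. There is no file argument, so input
# comes from stdin, and the assignment happens after the BEGIN action.
order=$(printf 'x\ny\n' | awk '
BEGIN     { print "begin:[" v "]" }    # v is still unset here
          { print v ":" $0 }           # runs for every record
$0 == "x" { next }                     # skip the remaining rules for x
          { seen = seen $0 }
END       { print "end:" NR ":" seen }
' v=1)
echo "$order"
```

The expected output is "begin:[]", "1:x", "1:y" and "end:2:y": the record x never reaches the rule that appends to seen, because
.B next
restarts pattern testing with the following record.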
.SH EXAMPLES
.nf
1. emulate cat.

     { print }

2. emulate wc.

     { chars += length($0) + 1  # add one for the \en
       words += NF
     }

     END{ print NR, words, chars }

3. count the number of unique "real words".

     BEGIN { FS = "[^A-Za-z]+" }

     { for(i = 1 ; i <= NF ; i++)  word[$i] = "" }

     END { delete word[""]
           for ( i in word )  cnt++
           print cnt
     }

.fi
4. sum the second field of
every record based on the first field.
.nf

     $1 ~ /credit\||\|gain/ { sum += $2 }
     $1 ~ /debit\||\|loss/  { sum \-= $2 }

     END { print sum }

5. sort a file, comparing as string

     { line[NR] = $0 "" }  # make sure of comparison type
                           # in case some lines look numeric

     END {  isort(line, NR)
       for(i = 1 ; i <= NR ; i++) print line[i]
     }

     #insertion sort of A[1..n]
     function isort( A, n,    i, j, hold)
     {
       for( i = 2 ; i <= n ; i++)
       {
         hold = A[j = i]
         while ( A[j\-1] > hold )
         { j\-\|\- ; A[j+1] = A[j] }
         A[j] = hold
       }
       # sentinel A[0] = "" will be created if needed
     }

.fi
.SH "COMPATIBILITY ISSUES"
The Posix 1003.2(draft 11.3) definition of the AWK language
is AWK as described in the AWK book with a few extensions
that appeared in SystemVR4 nawk. The extensions are:
.PP
.RS
New functions: toupper() and tolower().
.PP
New variables: ENVIRON[\|] and CONVFMT.
.PP
ANSI C conversion specifications for printf() and sprintf().
.PP
New command options: \-v var=value, multiple \-f options and
implementation options as arguments to \-W.
.RE
.PP
Posix AWK is oriented to operate on files a line at
a time.
.B RS
can be changed from "\en" to another single character,
but it
is hard to find any use for this \(em there are no
examples in the AWK book.
By convention, \fBRS\fR = "", makes one or more blank lines
separate records, allowing multi-line records. When
\fBRS\fR = "", "\en" is always a field separator
regardless of the value in
.BR FS .
.PP
.BR Mawk ,
on the other hand,
allows
.B RS
to be a regular expression.
When "\en" appears in records, it is treated as space, and
.B FS
always determines fields.
.PP
Removing the line at a time paradigm can make some programs
simpler and can
often improve performance. For example,
redoing example 3 from above,
.RS
.nf
BEGIN { RS = "[^A-Za-z]+" }

{ word[ $0 ] = "" }

END { delete word[ "" ]
  for( i in word ) cnt++
  print cnt
}
.fi
.RE
counts the number of unique words by making each word a record.
On moderate size files,
.B mawk
executes twice as fast, because of the simplified inner loop.
.PP
The following program replaces each comment by a single space in
a C program file,
.RS
.nf
BEGIN {
  RS = "/\|\e*([^*]\||\|\e*+[^/*])*\e*+/"
     # comment is record separator
  ORS = " "
  getline hold
}

{ print hold ; hold = $0 }

END { printf "%s" , hold }
.fi
.RE
Buffering one record is needed to avoid terminating the last
record with a space.
.PP
With
.BR mawk ,
the following are all equivalent,
.RS
.nf
x ~ /a\e+b/    x ~ "a\e+b"    x ~ "a\e\e+b"
.fi
.RE
The strings get scanned twice, once as string and once as
regular expression. On the string scan,
.B mawk
ignores the escape on non-escape characters while the AWK
book advocates
.I \ec
be recognized as
.I c
which necessitates the double escaping of meta-characters in
strings.
Posix explicitly declines to define the behavior which passively
forces programs that must run under a variety of awks to use
the more portable but less readable, double escape.
.PP
Posix AWK does not recognize "/dev/std{out,err}" or \ex hex escape
sequences in strings. Unlike ANSI C,
.B mawk
limits the number of digits that follows \ex to two as the current
implementation only supports 8 bit characters.
The built-in
.B fflush
first appeared in a recent (1993) AT&T awk released to netlib, and is
not part of the posix standard. Aggregate deletion with
.B delete
.I array
is not part of the posix standard.
.PP
Posix explicitly leaves the behavior of
.B FS
= "" undefined, and mentions splitting the record into characters as
a possible interpretation, but currently this use is not portable
across implementations.
.PP
Finally, here is how
.B mawk
handles exceptional cases not discussed in the
AWK book or the Posix draft. It is unsafe to assume
consistency across awks and safe to skip to
the next section.
.PP
.RS
substr(s, i, n) returns the characters of s in the intersection
of the closed interval [1, length(s)] and the half-open interval
[i, i+n). When this intersection is empty, the empty string is
returned; so substr("ABC", 1, 0) = "" and
substr("ABC", \-4, 6) = "A".
.PP
Every string, including the empty string, matches the empty string
at the
front so, s ~ // and s ~ "", are always 1 as is match(s, //) and
match(s, ""). The last two set
.B RLENGTH
to 0.
.PP
index(s, t) is always the same as match(s, t1) where t1 is the
same as t with metacharacters escaped. Hence consistency
with match requires that
index(s, "") always returns 1.
Also the condition, index(s,t) != 0 if and only if t is a substring
of s, requires index("","") = 1.
.PP
If getline encounters end of file, getline var, leaves var
unchanged. Similarly, on entry to the
.B END
actions,
.BR $0 ,
the fields and
.B NF
have their value unaltered from the last record.
.RE
.SH "SEE ALSO"
Aho, Kernighan and Weinberger,
.IR "The AWK Programming Language" ,
Addison-Wesley Publishing, 1988, (the AWK book),
defines the language, opening with a tutorial
and advancing to many interesting programs that delve into
issues of software design and analysis relevant to programming
in any language.
.PP
.IR "The GAWK Manual" ,
The Free Software Foundation, 1991, is a tutorial
and language reference
that does not attempt the depth of the AWK book
and assumes the reader may be a novice programmer.
The section on AWK arrays is excellent. It also
discusses Posix requirements for AWK.
.SH BUGS
.B Mawk
cannot handle ascii NUL \e0 in the source or data files. You
can output NUL using printf with %c, and any other 8 bit
character is acceptable input.
.PP
.B Mawk
implements printf() and sprintf() using the C library functions,
printf and sprintf, so full ANSI compatibility requires an ANSI
C library. In practice this means the h conversion qualifier may
not be available. Also
.B mawk
inherits any bugs or limitations of the library functions.
.PP
Implementors of the AWK language have shown a consistent lack
of imagination when naming their programs.
.SH AUTHOR
Mike Brennan (brennan@whidbey.com).