MAWK(1)                    USER COMMANDS                    MAWK(1)

NAME
mawk - pattern scanning and text processing language

SYNOPSIS
mawk [-W option] [-F value] [-v var=value] [--] 'program text' [file ...]
mawk [-W option] [-F value] [-v var=value] [-f program-file] [--] [file ...]

DESCRIPTION
mawk is an interpreter for the AWK Programming Language. The AWK
language is useful for manipulation of data files, text retrieval
and processing, and for prototyping and experimenting with
algorithms. mawk is a new awk, meaning it implements the AWK
language as defined in Aho, Kernighan and Weinberger, The AWK
Programming Language, Addison-Wesley Publishing, 1988. (Hereafter
referred to as the AWK book.) mawk conforms to the Posix 1003.2
(draft 11.3) definition of the AWK language, which contains a few
features not described in the AWK book, and mawk provides a small
number of extensions.

An AWK program is a sequence of pattern {action} pairs and function
definitions. Short programs are entered on the command line, usually
enclosed in ' ' to avoid shell interpretation. Longer programs can
be read in from a file with the -f option. Data input is read from
the list of files on the command line or from standard input when
the list is empty. The input is broken into records as determined by
the record separator variable, RS. Initially, RS = "\n" and records
are synonymous with lines. Each record is compared against each
pattern and if it matches, the program text for {action} is
executed.

OPTIONS
     -F value        sets the field separator, FS, to value.

     -f file         Program text is read from file instead of from
                     the command line. Multiple -f options are
                     allowed.

     -v var=value    assigns value to program variable var.

     --              indicates the unambiguous end of options.
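
As a rough illustration of -F and -v working together, the sketch below filters colon-separated input. The sample input and the variable name min are invented for the example, and the command is run as awk so it works under any Posix awk; mawk accepts the same options.

```sh
# -F: makes ":" the field separator; -v min=1 sets the program
# variable min before the program starts.
out=$(printf 'root:x:0\nbin:x:1\n' | awk -F: -v min=1 '$3 >= min { print $1 }')
printf '%s\n' "$out"
```

Only the second input line has a third field that is at least min, so only its first field is printed.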

The above options will be available with any Posix compatible
implementation of AWK, and implementation specific options are
prefaced with -W. mawk provides six:

     -W version      mawk writes its version and copyright to stdout
                     and compiled limits to stderr and exits 0.

     -W dump         writes an assembler like listing of the
                     internal representation of the program to
                     stdout and exits 0 (on successful compilation).

     -W interactive  sets unbuffered writes to stdout and line
                     buffered reads from stdin. Records from stdin
                     are lines regardless of the value of RS.

     -W exec file    Program text is read from file and this is the
                     last option. Useful on systems that support the
                     #! "magic number" convention for executable
                     scripts.

     -W sprintf=num  adjusts the size of mawk's internal sprintf
                     buffer to num bytes. More than rare use of this
                     option indicates mawk should be recompiled.

     -W posix_space  forces mawk not to consider '\n' to be space.

The short forms -W[vdiesp] are recognized and on some systems -We is
mandatory to avoid command line length limitations.

Version 1.2                 Last change: Dec 22 1994

THE AWK LANGUAGE
1. Program structure
An AWK program is a sequence of pattern {action} pairs and user
function definitions.

A pattern can be:
     BEGIN
     END
     expression
     expression , expression

One, but not both, of pattern {action} can be omitted. If {action}
is omitted it is implicitly { print }. If pattern is omitted, then
it is implicitly matched. BEGIN and END patterns require an action.
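
The two omission rules can be seen side by side in a minimal sketch. The three input words and the shell variable names are made up for the illustration, and awk stands in for mawk.

```sh
# Pattern only: the omitted action defaults to { print }.
matched=$(printf 'alpha\nbeta\ngamma\n' | awk '/^b/')
# Action only: the omitted pattern matches every record.
count=$(printf 'alpha\nbeta\ngamma\n' | awk '{ n++ } END { print n }')
printf '%s %s\n' "$matched" "$count"
```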

Statements are terminated by newlines, semi-colons or both. Groups
of statements such as actions or loop bodies are blocked via { ... }
as in C. The last statement in a block doesn't need a terminator.
Blank lines have no meaning; an empty statement is terminated with a
semi-colon. Long statements can be continued with a backslash, \. A
statement can be broken without a backslash after a comma, left
brace, &&, ||, do, else, the right parenthesis of an if, while or
for statement, and the right parenthesis of a function definition. A
comment starts with # and extends to, but does not include the end
of line.

The following statements control program flow inside blocks.

     if ( expr ) statement

     if ( expr ) statement else statement

     while ( expr ) statement

     do statement while ( expr )

     for ( opt_expr ; opt_expr ; opt_expr ) statement

     for ( var in array ) statement

     continue

     break

2. Data types, conversion and comparison
There are two basic data types, numeric and string. Numeric
constants can be integer like -2, decimal like 1.08, or in
scientific notation like -1.1e4 or .28E-3. All numbers are
represented internally and all computations are done in floating
point arithmetic. So for example, the expression 0.2e2 == 20 is true
and true is represented as 1.0.

String constants are enclosed in double quotes.

     "This is a string with a newline at the end.\n"

Strings can be continued across a line by escaping (\) the newline.
The following escape sequences are recognized.

     \"        "
     \\        \
     \/        /
     \b        backspace, ascii 8
     \t        tab, ascii 9
     \n        newline, ascii 10
     \v        vertical tab, ascii 11
     \f        formfeed, ascii 12
     \r        carriage return, ascii 13
     \ddd      1, 2 or 3 octal digits for ascii ddd
     \xhh      1 or 2 hex digits for ascii hh

If you escape any other character \c, you get \c, i.e., mawk ignores
the escape.

There are really three basic data types; the third is number and
string which has both a numeric value and a string value at the same
time. User defined variables come into existence when first
referenced and are initialized to null, a number and string value
which has numeric value 0 and string value "". Non-trivial number
and string typed data come from input and are typically stored in
fields. (See section 4).

The type of an expression is determined by its context and automatic
type conversion occurs if needed. For example, when evaluating the
statements

     y = x + 2 ;  z = x  "hello"

the value stored in variable y will be typed numeric. If x is not
numeric, the value read from x is converted to numeric before it is
added to 2 and stored in y. The value stored in variable z will be
typed string, and the value of x will be converted to string if
necessary and concatenated with "hello". (Of course, the value and
type stored in x are not changed by any conversions.) A string
expression is converted to numeric using its longest numeric prefix
as with atof(3). A numeric expression is converted to string by
replacing expr with sprintf(CONVFMT, expr), unless expr can be
represented on the host machine as an exact integer, in which case
it is converted to sprintf("%d", expr). Sprintf() is an AWK built-in
that duplicates the functionality of sprintf(3), and CONVFMT is a
built-in variable used for internal conversion from number to string
and initialized to "%.6g". Explicit type conversions can be forced,
expr "" is string and expr+0 is numeric.
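
The conversion rules above can be checked with a small sketch. The variable names and sample values are chosen only for illustration, and awk is used in place of mawk so the block runs under any Posix awk.

```sh
out=$(awk 'BEGIN {
    x = "3.0abc"
    y = x + 2          # numeric context: the numeric prefix 3.0 is used
    z = x "hello"      # string context: x is used unchanged
    pi = 3.14159265
    s = pi ""          # forced to string through CONVFMT ("%.6g")
    t = 17 ""          # exact integers are converted with "%d"
    print y; print z; print s; print t
}')
printf '%s\n' "$out"
```

The four lines printed are 5, 3.0abchello, 3.14159 and 17.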

To evaluate expr1 rel-op expr2: if both operands are numeric or
number and string then the comparison is numeric; if both operands
are string the comparison is string; if one operand is string, the
non-string operand is converted and the comparison is string. The
result is numeric, 1 or 0.

In boolean contexts such as if ( expr ) statement, a string
expression evaluates true if and only if it is not the empty string
""; numeric values if and only if not numerically zero.

3. Regular expressions
In the AWK language, records, fields and strings are often tested
for matching a regular expression. Regular expressions are enclosed
in slashes, and

     expr ~ /r/

is an AWK expression that evaluates to 1 if expr "matches" r, which
means a substring of expr is in the set of strings defined by r.
With no match the expression evaluates to 0; replacing ~ with the
"not match" operator, !~ , reverses the meaning. As pattern-action
pairs,

     /r/ { action }   and   $0 ~ /r/ { action }

are the same, and for each input record that matches r, action is
executed. In fact, /r/ is an AWK expression that is equivalent to
($0 ~ /r/) anywhere except when on the right side of a match
operator or passed as an argument to a built-in function that
expects a regular expression argument.

AWK uses extended regular expressions as with egrep(1). The regular
expression metacharacters, i.e., those with special meaning in
regular expressions are

     ^ $ . [ ] | ( ) * + ?

Regular expressions are built up from characters as follows:

     c             matches any non-metacharacter c.

     \c            matches a character defined by the same escape
                   sequences used in string constants or the literal
                   character c if \c is not an escape sequence.

     .             matches any character (including newline).

     ^             matches the front of a string.

     $             matches the back of a string.

     [c1c2c3...]   matches any character in the class c1c2c3... . An
                   interval of characters is denoted c1-c2 inside a
                   class [...].

     [^c1c2c3...]  matches any character not in the class c1c2c3...

Regular expressions are built up from other regular expressions as
follows:

     r1r2          matches r1 followed immediately by r2
                   (concatenation).

     r1 | r2       matches r1 or r2 (alternation).

     r*            matches r repeated zero or more times.

     r+            matches r repeated one or more times.

     r?            matches r zero or once.

     (r)           matches r, providing grouping.

The increasing precedence of operators is alternation, concatenation
and unary (*, + or ?).

For example,

     /^[_a-zA-Z][_a-zA-Z0-9]*$/  and
     /^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/

are matched by AWK identifiers and AWK numeric constants
respectively. Note that . has to be escaped to be recognized as a
decimal point, and that metacharacters are not special inside
character classes.
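
A short sketch of the identifier pattern used as a bare pattern with the default print action. The four input lines are invented test data, and awk is used in place of mawk.

```sh
# Only lines that are, in their entirety, an AWK identifier are printed.
out=$(printf 'x1\n9lives\n_tmp\nfoo-bar\n' | awk '/^[_a-zA-Z][_a-zA-Z0-9]*$/')
printf '%s\n' "$out"
```

Here 9lives fails because it starts with a digit, and foo-bar fails because - is not in the second class.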

Any expression can be used on the right hand side of the ~ or !~
operators or passed to a built-in that expects a regular expression.
If needed, it is converted to string, and then interpreted as a
regular expression. For example,

     BEGIN { identifier = "[_a-zA-Z][_a-zA-Z0-9]*" }

     $0 ~ "^" identifier

prints all lines that start with an AWK identifier.

mawk recognizes the empty regular expression, //, which matches the
empty string and hence is matched by any string at the front, back
and between every character. For example,

     echo abc | mawk '{ gsub(//, "X") ; print }'
     XaXbXcX

4. Records and fields
Records are read in one at a time, and stored in the field variable
$0. The record is split into fields which are stored in $1, $2, ...,
$NF. The built-in variable NF is set to the number of fields, and NR
and FNR are incremented by 1. Fields above $NF are set to "".

Assignment to $0 causes the fields and NF to be recomputed.
Assignment to NF or to a field causes $0 to be reconstructed by
concatenating the $i's separated by OFS. Assignment to a field with
index greater than NF increases NF and causes $0 to be
reconstructed.
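
A minimal sketch of the reconstruction rule, assuming a three-field input line and an OFS of "-" chosen only so the joins are visible; awk is used in place of mawk.

```sh
# Assigning $5 raises NF to 5, creates an empty $4, and rebuilds $0
# from the fields joined with OFS.
out=$(echo 'a b c' | awk '{ OFS = "-"; $5 = "e"; print NF; print }')
printf '%s\n' "$out"
```

The output is 5 followed by a-b-c--e; the doubled separator marks the empty fourth field.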

Data input stored in fields is string, unless the entire field has
numeric form and then the type is number and string. For example,

     echo 24 24E |
     mawk '{ print($1>100, $1>"100", $2>100, $2>"100") }'
     0 1 1 1

$0 and $2 are string and $1 is number and string. The first
comparison is numeric, the second is string, the third is string
(100 is converted to "100"), and the last is string.

5. Expressions and operators
The expression syntax is similar to C. Primary expressions are
numeric constants, string constants, variables, fields, arrays and
function calls. The identifier for a variable, array or function can
be a sequence of letters, digits and underscores, that does not
start with a digit. Variables are not declared; they exist when
first referenced and are initialized to null.

New expressions are composed with the following operators in order
of increasing precedence.

     assignment          =  +=  -=  *=  /=  %=  ^=
     conditional         ?  :
     logical or          ||
     logical and         &&
     array membership    in
     matching            ~  !~
     relational          <  >  <=  >=  ==  !=
     concatenation       (no explicit operator)
     add ops             +  -
     mul ops             *  /  %
     unary               +  -
     logical not         !
     exponentiation      ^
     inc and dec         ++ -- (both post and pre)
     field               $

Assignment, conditional and exponentiation associate right to left;
the other operators associate left to right. Any expression can be
parenthesized.

6. Arrays
Awk provides one-dimensional arrays. Array elements are expressed as
array[expr]. Expr is internally converted to string type, so, for
example, A[1] and A["1"] are the same element and the actual index
is "1". Arrays indexed by strings are called associative arrays.
Initially an array is empty; elements exist when first accessed. An
expression, expr in array evaluates to 1 if array[expr] exists, else
to 0.

There is a form of the for statement that loops over each index of
an array.

     for ( var in array ) statement

sets var to each index of array and executes statement. The order
that var traverses the indices of array is not defined.

The statement, delete array[expr], causes array[expr] not to exist.
mawk supports an extension, delete array, which deletes all elements
of array.

Multidimensional arrays are synthesized with concatenation using the
built-in variable SUBSEP. array[expr1,expr2] is equivalent to
array[expr1 SUBSEP expr2]. Testing for a multidimensional element
uses a parenthesized index, such as

     if ( (i, j) in A )  print A[i, j]
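
The following sketch exercises a two-subscript index, the in test, the for loop over indices and delete. The array names A and part are arbitrary, and awk is used in place of mawk.

```sh
out=$(awk 'BEGIN {
    A[1, 2] = "x"                       # stored under the index 1 SUBSEP 2
    if ((1, 2) in A) print "found"
    if (!((2, 1) in A)) print "absent"  # the in test does not create A[2,1]
    for (k in A) {                      # k is the combined string index
        split(k, part, SUBSEP)
        print part[1], part[2]
    }
    delete A[1, 2]
    n = 0
    for (k in A) n++
    print n                             # no elements remain
}')
printf '%s\n' "$out"
```

Splitting the loop index on SUBSEP is one way to recover the individual subscripts.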

7. Builtin-variables
The following variables are built-in and initialized before program
execution.

     ARGC      number of command line arguments.

     ARGV      array of command line arguments, 0..ARGC-1.

     CONVFMT   format for internal conversion of numbers to string,
               initially = "%.6g".

     ENVIRON   array indexed by environment variables. An
               environment string, var=value is stored as
               ENVIRON[var] = value.

     FILENAME  name of the current input file.

     FNR       current record number in FILENAME.

     FS        splits records into fields as a regular expression.

     NF        number of fields in the current record.

     NR        current record number in the total input stream.

     OFMT      format for printing numbers; initially = "%.6g".

     OFS       inserted between fields on output, initially = " ".

     ORS       terminates each record on output, initially = "\n".

     RLENGTH   length set by the last call to the built-in function,
               match().

     RS        input record separator, initially = "\n".

     RSTART    index set by the last call to match().

     SUBSEP    used to build multiple array subscripts, initially =
               "\034".

8. Built-in functions
String functions

     gsub(r,s,t)  gsub(r,s)
          Global substitution, every match of regular expression r
          in variable t is replaced by string s. The number of
          replacements is returned. If t is omitted, $0 is used. An
          & in the replacement string s is replaced by the matched
          substring of t. \& and \\ put literal & and \,
          respectively, in the replacement string.

     index(s,t)
          If t is a substring of s, then the position where t starts
          is returned, else 0 is returned. The first character of s
          is in position 1.

     length(s)
          Returns the length of string s.

     match(s,r)
          Returns the index of the first longest match of regular
          expression r in string s. Returns 0 if no match. As a side
          effect, RSTART is set to the return value. RLENGTH is set
          to the length of the match or -1 if no match. If the empty
          string is matched, RLENGTH is set to 0, and 1 is returned
          if the match is at the front, and length(s)+1 is returned
          if the match is at the back.

     split(s,A,r)  split(s,A)
          String s is split into fields by regular expression r and
          the fields are loaded into array A. The number of fields
          is returned. See section 11 below for more detail. If r is
          omitted, FS is used.

     sprintf(format,expr-list)
          Returns a string constructed from expr-list according to
          format. See the description of printf() below.

     sub(r,s,t)  sub(r,s)
          Single substitution, same as gsub() except at most one
          substitution.

     substr(s,i,n)  substr(s,i)
          Returns the substring of string s, starting at index i, of
          length n. If n is omitted, the suffix of s, starting at i
          is returned.

     tolower(s)
          Returns a copy of s with all upper case characters
          converted to lower case.

     toupper(s)
          Returns a copy of s with all lower case characters
          converted to upper case.
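
Several of the string functions can be tried together in one sketch. The sample string and the replacement texts are arbitrary, and awk is used in place of mawk.

```sh
out=$(awk 'BEGIN {
    s = "hello world"
    n = gsub(/o/, "0", s)        # replaces every "o"; n is the count
    print n, s
    print index(s, "w0"), length(s)
    if (match(s, /l+/))          # sets RSTART and RLENGTH as a side effect
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    sub(/l+/, "[&]", s)          # & stands for the matched text
    print toupper(s)
}')
printf '%s\n' "$out"
```

The lines printed are "2 hell0 w0rld", "7 11", "3 2 ll" and "HE[LL]0 W0RLD".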

Arithmetic functions

     atan2(y,x)   Arctan of y/x between -pi and pi.

     cos(x)       Cosine function, x in radians.

     exp(x)       Exponential function.

     int(x)       Returns x truncated towards zero.

     log(x)       Natural logarithm.

     rand()       Returns a random number between zero and one.

     sin(x)       Sine function, x in radians.

     sqrt(x)      Returns square root of x.

     srand(expr)  srand()
          Seeds the random number generator, using the clock if expr
          is omitted, and returns the value of the previous seed.
          mawk seeds the random number generator from the clock at
          startup so there is no real need to call srand().
          Srand(expr) is useful for repeating pseudo random
          sequences.
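
A small sketch of repeating a pseudo random sequence by reseeding. The seed 42 is arbitrary, and awk is used in place of mawk.

```sh
out=$(awk 'BEGIN {
    srand(42); a = rand()
    srand(42); b = rand()        # same seed, so the same first value
    print ((a == b) ? "same" : "different")
}')
printf '%s\n' "$out"
```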

9. Input and output
There are two output statements, print and printf.

     print
          writes $0 ORS to standard output.

     print expr1, expr2, ..., exprn
          writes expr1 OFS expr2 OFS ... exprn ORS to standard
          output. Numeric expressions are converted to string with
          OFMT.

     printf format, expr-list
          duplicates the printf C library function writing to
          standard output. The complete ANSI C format specifications
          are recognized with conversions %c, %d, %e, %E, %f, %g,
          %G, %i, %o, %s, %u, %x, %X and %%, and conversion
          qualifiers h and l.

The argument list to print or printf can optionally be enclosed in
parentheses. Print formats numbers using OFMT or "%d" for exact
integers. "%c" with a numeric argument prints the corresponding 8
bit character, with a string argument it prints the first character
of the string. The output of print and printf can be redirected to a
file or command by appending > file, >> file or | command to the end
of the print statement. Redirection opens file or command only once,
subsequent redirections append to the already open stream. By
convention, mawk associates the filename "/dev/stderr" with stderr
which allows print and printf to be redirected to stderr. mawk also
associates "-" and "/dev/stdout" with stdin and stdout which allows
these streams to be passed to functions.

The input function getline has the following variations.

     getline
          reads into $0, updates the fields, NF, NR and FNR.

     getline < file
          reads into $0 from file, updates the fields and NF.

     getline var
          reads the next record into var, updates NR and FNR.

     getline var < file
          reads the next record of file into var.

     command | getline
          pipes a record from command into $0 and updates the fields
          and NF.

     command | getline var
          pipes a record from command into var.

Getline returns 0 on end-of-file, -1 on error, otherwise 1.

Commands on the end of pipes are executed by /bin/sh.

The function close(expr) closes the file or pipe associated with
expr. Close returns 0 if expr is an open file, the exit status if
expr is a piped command, and -1 otherwise. Close is used to reread a
file or command, make sure the other end of an output pipe is
finished or conserve file resources.
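
As an illustration of command | getline var together with close(), the sketch below reads a command's output twice. The command string and variable names are invented for the example, and awk is used in place of mawk.

```sh
out=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0) n++  # 1 per record, 0 at end of input
    close(cmd)                            # lets cmd be run again from the start
    cmd | getline first                   # first record of the fresh run
    close(cmd)
    print n, first
}')
printf '%s\n' "$out"
```

Comparing the getline result with > 0 keeps the loop from spinning if the command fails and getline returns -1.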

The function fflush(expr) flushes the output file or pipe associated
with expr. Fflush returns 0 if expr is an open output stream else
-1. Fflush without an argument flushes stdout. Fflush with an empty
argument ("") flushes all open output.

The function system(expr) uses /bin/sh to execute expr and returns
the exit status of the command expr. Changes made to the ENVIRON
array are not passed to commands executed with system or pipes.

10. User defined functions
The syntax for a user defined function is

     function name( args ) { statements }

The function body can contain a return statement

     return opt_expr

A return statement is not required. Function calls may be nested or
recursive. Functions are passed expressions by value and arrays by
reference. Extra arguments serve as local variables and are
initialized to null. For example, csplit(s,A) puts each character of
s into array A and returns the length of s.

     function csplit(s, A,    n, i)
     {
       n = length(s)
       for( i = 1 ; i <= n ; i++ ) A[i] = substr(s, i, 1)
       return n
     }

Putting extra space between passed arguments and local variables is
conventional. Functions can be referenced before they are defined,
but the function name and the '(' of the arguments must touch to
avoid confusion with concatenation.
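
A brief sketch of recursion, an array passed by reference, and an extra argument used as a local variable. The function names fact and fill and the array T are hypothetical, and awk is used in place of mawk.

```sh
out=$(awk '
function fact(n) { return (n <= 1) ? 1 : n * fact(n - 1) }
function fill(A, n,    i) { for (i = 1; i <= n; i++) A[i] = fact(i) }
BEGIN {
    i = "untouched"      # global i; the local i inside fill() is separate
    fill(T, 4)           # T is passed by reference and filled in place
    print T[4], i
}')
printf '%s\n' "$out"
```

The output is "24 untouched": the array changed in the caller, while the global i did not.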

11. Splitting strings, records and files
Awk programs use the same algorithm to split strings into arrays
with split(), and records into fields on FS. mawk uses essentially
the same algorithm to split files into records on RS.

Split(expr,A,sep) works as follows:

     (1)  If sep is omitted, it is replaced by FS. Sep can be an
          expression or regular expression. If it is an expression
          of non-string type, it is converted to string.

     (2)  If sep = " " (a single space), then <SPACE> is trimmed
          from the front and back of expr, and sep becomes <SPACE>.
          mawk defines <SPACE> as the regular expression /[ \t\n]+/.
          Otherwise sep is treated as a regular expression, except
          that meta-characters are ignored for a string of length 1,
          e.g., split(x, A, "*") and split(x, A, /\*/) are the same.

     (3)  If expr is not string, it is converted to string. If expr
          is then the empty string "", split() returns 0 and A is
          set empty. Otherwise, all non-overlapping, non-null and
          longest matches of sep in expr, separate expr into fields
          which are loaded into A. The fields are placed in A[1],
          A[2], ..., A[n] and split() returns n, the number of
          fields which is the number of matches plus one. Data
          placed in A that looks numeric is typed number and string.

Splitting records into fields works the same except the pieces are
loaded into $1, $2,..., $NF. If $0 is empty, NF is set to 0 and all
$i to "".

mawk splits files into records by the same algorithm, but with the
slight difference that RS is really a terminator instead of a
separator. (ORS is really a terminator too).

     E.g., if FS = ":+" and $0 = "a::b:" , then NF = 3 and $1 = "a",
     $2 = "b" and $3 = "", but if "a::b:" is the contents of an
     input file and RS = ":+", then there are two records "a" and
     "b".

RS = " " is not special.

If FS = "", then mawk breaks the record into individual characters,
and, similarly, split(s,A,"") places the individual characters of s
into A.
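
The three separator cases of split() can be compared in a short sketch. The input strings and array names are made up, and awk is used in place of mawk.

```sh
out=$(awk 'BEGIN {
    n = split("  a  b ", A, " ")   # " " trims, then splits on runs of space
    print n, A[1], A[2]
    m = split("a::b:", B, /:+/)    # the trailing match leaves an empty field
    print m, "[" B[3] "]"
    k = split("a*b", C, "*")       # a one-character string is taken literally
    print k, C[1], C[2]
}')
printf '%s\n' "$out"
```

The output is "2 a b", "3 []" and "2 a b".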

12. Multi-line records
Since mawk interprets RS as a regular expression, multi-line records
are easy. Setting RS = "\n\n+", makes one or more blank lines
separate records. If FS = " " (the default), then single newlines,
by the rules for <SPACE> above, become space and single newlines are
field separators.

     For example, if a file is "a b\nc\n\n", RS = "\n\n+" and
     FS = " ", then there is one record "a b\nc" with three fields
     "a", "b" and "c". Changing FS = "\n", gives two fields "a b"
     and "c"; changing FS = "", by the rule at the end of section
     11, gives one field for each character of the record.

If you want lines with spaces or tabs to be considered blank, set
RS = "\n([ \t]*\n)+". For compatibility with other awks, setting
RS = "" has the same effect as if blank lines are stripped from the
front and back of files and then records are determined as if
RS = "\n\n+". Posix requires that "\n" always separates fields when
RS = "" regardless of the value of FS. mawk does not support this
convention, because defining "\n" as <SPACE> makes it unnecessary.

Most of the time when you change RS for multi-line records, you will
also want to change ORS to "\n\n" so the record spacing is preserved
on output.
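
A minimal sketch of blank-line separated records using RS = "" and the default FS. The input text is invented, and awk is used in place of mawk.

```sh
# Two records separated by blank lines; with the default FS the
# newline inside the first record acts as a field separator.
out=$(printf 'a b\nc\n\n\nd\n' | awk 'BEGIN { RS = "" } { print NR ": " NF }')
printf '%s\n' "$out"
```

The first record "a b\nc" has three fields and the second record "d" has one.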

13. Program execution
This section describes the order of program execution. First ARGC is
set to the total number of command line arguments passed to the
execution phase of the program. ARGV[0] is set to the name of the
AWK interpreter and ARGV[1] ... ARGV[ARGC-1] holds the remaining
command line arguments exclusive of options and program source. For
example with

     mawk -f prog v=1 A t=hello B

ARGC = 5 with ARGV[0] = "mawk", ARGV[1] = "v=1", ARGV[2] = "A",
ARGV[3] = "t=hello" and ARGV[4] = "B".

Next, each BEGIN block is executed in order. If the program consists
entirely of BEGIN blocks, then execution terminates, else an input
stream is opened and execution continues. If ARGC equals 1, the
input stream is set to stdin, else the command line arguments
ARGV[1] ... ARGV[ARGC-1] are examined for a file argument.

The command line arguments divide into three sets: file arguments,
assignment arguments and empty strings "". An assignment has the
form var=string. When an ARGV[i] is examined as a possible file
argument, if it is empty it is skipped; if it is an assignment
argument, the assignment to var takes place and i skips to the next
argument; else ARGV[i] is opened for input. If it fails to open,
execution terminates with exit code 2. If no command line argument
is a file argument, then input comes from stdin. Getline in a BEGIN
action opens input. "-" as a file argument denotes stdin.

Once an input stream is open, each input record is tested against
each pattern, and if it matches, the associated action is executed.
An expression pattern matches if it is boolean true (see the end of
section 2). A BEGIN pattern matches before any input has been read,
and an END pattern matches after all input has been read. A range
pattern, expr1,expr2 , matches every record between the match of
expr1 and the match of expr2 inclusively.

When end of file occurs on the input stream, the remaining command
line arguments are examined for a file argument, and if there is one
it is opened, else the END pattern is considered matched and all END
actions are executed.

In the example, the assignment v=1 takes place after the BEGIN
actions are executed, and the data placed in v is typed number and
string. Input is then read from file A. On end of file A, t is set
to the string "hello", and B is opened for input. On end of file B,
the END actions are executed.
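
The timing of a command line assignment can be observed with a small sketch that reads from stdin instead of files A and B. The messages printed are invented, and awk is used in place of mawk.

```sh
# v=1 is an assignment argument: it is applied after the BEGIN actions
# run and before the first record is read from stdin.
out=$(echo data | awk 'BEGIN { print "BEGIN sees [" v "]" }
                       { print "record sees [" v "]" }' v=1)
printf '%s\n' "$out"
```

The BEGIN line shows an empty v, while the record line shows 1.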

Program flow at the pattern {action} level can be changed
with the

     next
     exit  opt_expr

statements. A next statement causes the next input record
to be read and pattern testing to restart with the first
pattern {action} pair in the program. An exit statement
causes immediate execution of the END actions or program
termination if there are none or if the exit occurs in an
END action. The opt_expr sets the exit value of the program
unless overridden by a later exit or subsequent error.
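
A minimal sketch of next and exit follows, assuming an awk on
the PATH named awk. The input words, the messages and the exit
value 3 are arbitrary choices for illustration.

```shell
status=0
out=$(printf 'skip\nkeep\nstop\nnever\n' | awk '
    /skip/ { next }        # go on to the next record at once
    /stop/ { exit 3 }      # run the END actions, then exit with 3
    { print "saw " $0 }
    END { print "END ran" }
') || status=$?

printf '%s\n' "$out"
printf 'exit value: %s\n' "$status"
```

The record "never" is not read, the END action still runs, and
the exit value set by exit 3 is what the shell sees.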

EXAMPLES
1. emulate cat.

     { print }

2. emulate wc.

     { chars += length($0) + 1  # add one for the \n
       words += NF
     }

     END{ print NR, words, chars }

3. count the number of unique "real words".

     BEGIN { FS = "[^A-Za-z]+" }

     { for(i = 1 ; i <= NF ; i++)  word[$i] = "" }

     END { delete word[""]
           for ( i in word )  cnt++
           print cnt
     }

4. sum the second field of every record based on the first
field.

     $1 ~ /credit|gain/ { sum += $2 }
     $1 ~ /debit|loss/  { sum -= $2 }

     END { print sum }

5. sort a file, comparing as string

     { line[NR] = $0 "" }  # make sure of comparison type
                           # in case some lines look numeric

     END {  isort(line, NR)
            for(i = 1 ; i <= NR ; i++) print line[i]
     }

     #insertion sort of A[1..n]
     function isort( A, n,   i, j, hold)
     {
       for( i = 2 ; i <= n ; i++)
       {
         hold = A[j = i]
         while ( A[j-1] > hold )
         { j-- ; A[j+1] = A[j] }
         A[j] = hold
       }
       # sentinel A[0] = "" will be created if needed
     }
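
The effect of comparing as string in example 5 shows up on
input that looks numeric. The following sketch reruns the sort
on three made-up lines, assuming an awk on the PATH named awk.

```shell
out=$(printf '10\n9\n2\n' | awk '
    { line[NR] = $0 "" }          # concatenating "" forces string type
    END { isort(line, NR)
          for (i = 1; i <= NR; i++) print line[i] }

    # insertion sort of A[1..n], as in example 5
    function isort(A, n,    i, j, hold) {
        for (i = 2; i <= n; i++) {
            hold = A[j = i]
            while (A[j-1] > hold) { j--; A[j+1] = A[j] }
            A[j] = hold
        }
    }
')
printf '%s\n' "$out"
```

As strings, "10" sorts before "2" and "2" before "9", which is
not the numeric order.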

COMPATIBILITY ISSUES
The Posix 1003.2 (draft 11.3) definition of the AWK language
is AWK as described in the AWK book with a few extensions
that appeared in SystemVR4 nawk. The extensions are:

     New functions: toupper() and tolower().

     New variables: ENVIRON[] and CONVFMT.

     ANSI C conversion specifications for printf() and
     sprintf().

     New command options: -v var=value, multiple -f options
     and implementation options as arguments to -W.
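
Several of these extensions can be exercised in one short
sketch, assuming an awk on the PATH named awk. The environment
variable GREETING, the variable name and the sample number are
hypothetical and exist only for this illustration.

```shell
# -v assigns before BEGIN, ENVIRON[] exposes the environment,
# toupper() folds case, and CONVFMT controls how a non-integer
# number is converted to a string.
out=$(GREETING=hello awk -v name=world 'BEGIN {
    CONVFMT = "%.2g"
    x = 3.14159
    s = x ""                      # number to string uses CONVFMT
    print toupper(ENVIRON["GREETING"] " " name), s
}')
printf '%s\n' "$out"
```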

Posix AWK is oriented to operate on files a line at a time.
RS can be changed from "\n" to another single character, but
it is hard to find any use for this - there are no examples
in the AWK book. By convention, RS = "" makes one or more
blank lines separate records, allowing multi-line records.
When RS = "", "\n" is always a field separator regardless of
the value in FS.
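
The RS = "" convention can be sketched as follows, assuming an
awk on the PATH named awk. The sample words are made up, and FS
is left at its default so that the result does not depend on
how a given awk treats "\n" when FS has been changed.

```shell
# Two records separated by blank lines; the first spans two lines.
out=$(printf 'alpha beta\ngamma\n\n\ndelta\n' | awk '
    BEGIN { RS = "" }
    { print NR ": NF=" NF ", last=" $NF }
')
printf '%s\n' "$out"
```

The first record holds three fields even though it covers two
input lines.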

mawk, on the other hand, allows RS to be a regular
expression. When "\n" appears in records, it is treated as
space, and FS always determines fields.

Removing the line at a time paradigm can make some programs
simpler and can often improve performance. For example,
redoing example 3 from above,

     BEGIN { RS = "[^A-Za-z]+" }

     { word[ $0 ] = "" }

     END { delete  word[ "" ]
           for( i in word )  cnt++
           print cnt
     }

counts the number of unique words by making each word a
record. On moderate size files, mawk executes twice as
fast, because of the simplified inner loop.

The following program replaces each comment by a single
space in a C program file,

     BEGIN {
       RS = "/\*([^*]|\*+[^/*])*\*+/"
            # comment is record separator
       ORS = " "
       getline  hold
     }

     { print hold ; hold = $0 }

     END { printf "%s" , hold }

Buffering one record is needed to avoid terminating the last
record with a space.

With mawk, the following are all equivalent,

     x ~ /a\+b/    x ~ "a\+b"     x ~ "a\\+b"

The strings get scanned twice, once as string and once as
regular expression. On the string scan, mawk ignores the
escape on non-escape characters, while the AWK book advocates
that \c be recognized as c, which necessitates the double
escaping of meta-characters in strings. Posix explicitly
declines to define the behavior, which passively forces
programs that must run under a variety of awks to use the
more portable, but less readable, double escape.
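
The two forms that do not depend on how the string scan treats
\+ can be compared in a short sketch, assuming an awk on the
PATH named awk. The input lines and labels are made up.

```shell
# Both patterns look for a literal "+" between "a" and "b".
out=$(printf 'a+b\naab\n' | awk '
    $0 ~ /a\+b/   { print "regex literal: " $0 }
    $0 ~ "a\\+b"  { print "double escape: " $0 }
')
printf '%s\n' "$out"
```

Only the line a+b is matched, once by each pattern. The single
escape form "a\+b" is left out on purpose, since its meaning is
the part that varies between awks.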

Posix AWK does not recognize "/dev/std{out,err}" or \x hex
escape sequences in strings. Unlike ANSI C, mawk limits the
number of digits that follow \x to two as the current
implementation only supports 8 bit characters. The built-in
fflush first appeared in a recent (1993) AT&T awk released
to netlib, and is not part of the posix standard. Aggregate
deletion with delete array is not part of the posix
standard.

Posix explicitly leaves the behavior of FS = "" undefined,
and mentions splitting the record into characters as a
possible interpretation, but currently this use is not
portable across implementations.

Finally, here is how mawk handles exceptional cases not
discussed in the AWK book or the Posix draft. It is unsafe to
assume consistency across awks and safe to skip to the next
section.

     substr(s, i, n) returns the characters of s in the
     intersection of the closed interval [1, length(s)] and
     the half-open interval [i, i+n). When this
     intersection is empty, the empty string is returned; so
     substr("ABC", 1, 0) = "" and substr("ABC", -4, 6) =
     "A".
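
The interval rule can be written out explicitly. The sketch
below is not mawk's implementation; it is a model of the rule
for integer i and n that calls substr() only with in-range
arguments, so it should behave the same under any awk. The
function name clipped_substr is hypothetical, and an awk on the
PATH named awk is assumed.

```shell
out=$(awk '
    # Characters of s in [1, length(s)] intersected with [i, i+n).
    function clipped_substr(s, i, n,    lo, hi) {
        lo = (i > 1) ? i : 1
        hi = (i + n - 1 < length(s)) ? i + n - 1 : length(s)
        if (hi < lo) return ""        # empty intersection
        return substr(s, lo, hi - lo + 1)
    }
    BEGIN {
        print "[" clipped_substr("ABC", 1, 0) "]"
        print "[" clipped_substr("ABC", -4, 6) "]"
        print "[" clipped_substr("ABC", 2, 10) "]"
    }
')
printf '%s\n' "$out"
```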

     Every string, including the empty string, matches the
     empty string at the front so, s ~ // and s ~ "", are
     always 1 as is match(s, //) and match(s, ""). The last
     two set RLENGTH to 0.

     index(s, t) is always the same as match(s, t1) where t1
     is the same as t with metacharacters escaped. Hence
     consistency with match requires that index(s, "")
     always returns 1. Also the condition, index(s,t) != 0
     if and only if t is a substring of s, requires
     index("","") = 1.

     If getline encounters end of file, getline var, leaves
     var unchanged. Similarly, on entry to the END actions,
     $0, the fields and NF have their value unaltered from
     the last record.
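
The getline case can be checked with a short sketch, assuming
an awk on the PATH named awk and a readable /dev/null. The
variable names and the string "unchanged" are made up.

```shell
# /dev/null gives end of file on the first read, so getline
# returns 0 and should leave v as it was.
out=$(awk 'BEGIN {
    v = "unchanged"
    status = (getline v < "/dev/null")
    print status, v
}')
printf '%s\n' "$out"
```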

SEE ALSO
egrep(1)

Aho, Kernighan and Weinberger, The AWK Programming Language,
Addison-Wesley Publishing, 1988, (the AWK book), defines the
language, opening with a tutorial and advancing to many
interesting programs that delve into issues of software
design and analysis relevant to programming in any language.

The GAWK Manual, The Free Software Foundation, 1991, is a
tutorial and language reference that does not attempt the
depth of the AWK book and assumes the reader may be a novice
programmer. The section on AWK arrays is excellent. It also
discusses Posix requirements for AWK.

BUGS
mawk cannot handle ascii NUL \0 in the source or data files.
You can output NUL using printf with %c, and any other 8 bit
character is acceptable input.
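
Writing a NUL with printf and %c can be sketched as follows,
assuming an awk on the PATH named awk. The byte count is taken
with wc, and tr removes the padding that some versions of wc
add.

```shell
# A numeric argument to %c is output as the character with
# that code, so 0 should produce a single NUL byte.
nbytes=$(awk 'BEGIN { printf "%c", 0 }' | wc -c | tr -d ' ')
printf 'bytes written: %s\n' "$nbytes"
```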

mawk implements printf() and sprintf() using the C library
functions, printf and sprintf, so full ANSI compatibility
requires an ANSI C library. In practice this means the h
conversion qualifier may not be available. Also mawk
inherits any bugs or limitations of the library functions.

Implementors of the AWK language have shown a consistent
lack of imagination when naming their programs.

AUTHOR
Mike Brennan (brennan@whidbey.com).