Scripts and Utilities -- Awk lecture

This file: http://www.cs.utk.edu/~plank/plank/classes/cs494/494/notes/Awk/lecture.html

Lecture links: http://www.cs.utk.edu/~plank/plank/classes/cs494/494/notes/Awk/links.html

Awk

Awk is a ``pattern scanning and processing language'' which is useful for writing quick and dirty programs that don't have to be compiled. The calling syntax of awk is like sed:

UNIX> awk program [ file ]

UNIX> awk -f program-file [ file ]

Like sed, awk can work on standard input or on a file. As with a shell script, if you start an awk program with

#!/bin/awk -f

then you can execute the program directly from the shell.

Most systems also have nawk, which stands for ``new awk.'' Nawk has many more features than awk and is generally more useful. I am just going to cover awk, but you should check out nawk too on your own time. Nawk has some nice things, like a random number generator, that awk doesn't have.

Program syntax of awk

awk programs are composed of ``pattern-action'' statements of the form:

pattern { action }

What such a statement does is apply the action to all lines that match the pattern. If there is no pattern, then it applies the action to all lines. If there is no action, then the default action is to copy the line to standard output. Patterns can be regular expressions enclosed in slashes (they can be more than that, but for now, just assume that they are regular expressions).

So, for example, the program awkgrep works just like ``grep Jim''.

UNIX> cat awkgrep
#!/bin/awk -f

/Jim/
UNIX> cat input
Which of these lines doesn't belong:

Bill Clinton
George Bush
Ronald Reagan
Jimmy Carter
Sylvester Stallone
UNIX> awkgrep input
Jimmy Carter
UNIX> awkgrep < input
Jimmy Carter
UNIX>
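
The three forms of a pattern-action statement described above can be compared side by side. This is a minimal sketch, not one of the lecture's files; the variable names and the three-line sample (borrowed from the input file) are only for illustration.

```shell
# Three lines borrowed from the lecture's input file.
data='Bill Clinton
Jimmy Carter
George Bush'

# Pattern only: the default action prints each matching line.
pattern_only=$(printf '%s\n' "$data" | awk '/Jim/')

# Action only: the action runs on every line.
action_only=$(printf '%s\n' "$data" | awk '{ print "line:", $0 }')

# Pattern and action: the action runs only on matching lines.
both=$(printf '%s\n' "$data" | awk '/Bush/ { print "found:", $0 }')

printf '%s\n' "$pattern_only" "$action_only" "$both"
```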

Basic actions, fields

Actions basically look like C programs. There are some big differences, but for the most part, you can do most basic things that you can do in C.

Awk breaks up each line into fields, which are basically whitespace-separated words. You can get at word i by specifying $i. The variable NF contains the number of words on the line. The variable $0 is the line itself.

So, to print out the first and last words on each line, you can do:

UNIX> cat input
Which of these lines doesn't belong:

Bill Clinton
George Bush
Ronald Reagan
Jimmy Carter
Sylvester Stallone
UNIX> awk '{ print $1, $NF }' input
Which belong:
 
Bill Clinton
George Bush
Ronald Reagan
Jimmy Carter
Sylvester Stallone
UNIX>
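
Since NF is an ordinary number, it can also be used in arithmetic inside a field reference. The sketch below is an illustration rather than part of the lecture; the sample line and the variable name summary are made up.

```shell
# Made-up sample line with four whitespace-separated fields.
line='Davis Love III -3'

# NF is the field count, $2 the second field, $(NF-1) the next-to-last one.
summary=$(printf '%s\n' "$line" |
  awk '{ print "fields=" NF, "second=" $2, "next-to-last=" $(NF-1) }')

printf '%s\n' "$summary"
```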

An alternative awkgrep prints out $0 when it finds the pattern:

UNIX> cat awkgrep2
#!/bin/awk -f

/Jim/ { print $0 }
UNIX> awkgrep2 input
Jimmy Carter
UNIX>

Awk has a printf just like C's. You don't have to use parentheses when you call it (although you can if you'd like). Unlike print, printf does not add a newline unless you ask for one. So, for example, awkrev reverses the words on each line of a file:

UNIX> cat awkrev
#!/bin/awk -f

{ for (i = NF; i > 0; i-- ) printf "%s ", $i
  printf "\n" }
UNIX> awkrev input
belong: doesn't lines these of Which

Clinton Bill
Bush George
Reagan Ronald
Carter Jimmy
Stallone Sylvester
UNIX>

A few things that you'll notice about awkrev: Actions can be multiline. You don't need semicolons to separate lines like in C. However, you can put multiple commands on one line and separate them with semicolons as in C. And you can group commands into blocks with curly braces as in C. If you want a command to span two lines (this often happens with complex printf statements), you need to end the first line with a backslash.
Also, you'll notice that awkrev didn't declare the variable i. Awk just figured out that it's an integer.
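
The points about semicolons and backslashes can be seen in a small sketch. It is not from the lecture; the input line and the variable names n, first and out are made up for illustration.

```shell
# Two statements share the first line, separated by a semicolon.
# The long printf is continued onto the next line with a trailing backslash.
out=$(printf 'a b c\n' | awk '{ n = NF; first = $1
  printf "%d fields, first is %s, last is %s\n", \
         n, first, $NF }')

printf '%s\n' "$out"
```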

Type casting

Awk lets you convert variables from one type to another on the fly. For example, to convert an integer to a string, you simply use it as a string. String construction can be done with concatenation, which is often very convenient. These principles are used in awkcast:

UNIX> echo "4 Jim" | awkcast
Word 1: as a number: 4, as a string: 4.  0 appended: number: 40, string 40
Word 2: as a number: 0, as a string: Jim.  0 appended: number: 0, string Jim0
UNIX>

Casting a string to an integer gives it its atoi() value.
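
The source of awkcast is not reproduced in these notes, so the following is only a guess at the idea behind it, not the lecture's script. It assumes the usual awk idioms: adding 0 forces a numeric value, and concatenation forces a string.

```shell
# Hypothetical illustration of casting; this is not the awkcast script.
out=$(echo "4 Jim" | awk '{
  for (i = 1; i <= NF; i++)
    print "Word " i ": number " ($i + 0) ", with 0 appended: " ($i "0")
}')

printf '%s\n' "$out"
```

The word Jim has no leading digits, so its numeric value is 0, while appending "0" to it as a string gives Jim0.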

BEGIN and END

There are two special patterns, BEGIN and END, which cause the corresponding actions to be executed before and after any lines are processed, respectively. For example, the following program (awkwc) counts the number of lines and words in the input file.

UNIX> cat awkwc
#!/bin/awk -f

BEGIN { nl = 0; nw = 0 }
{ nl++ ; nw += NF }
END { print "Lines:", nl, "words:", nw }
UNIX> awkwc awkwc
Lines: 5 words: 26
UNIX> wc awkwc
       5      26     103 awkwc
UNIX>
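
The same BEGIN / per-line / END shape works for any running total. Here is a sketch that averages a column of numbers; the three scores and the variable name avg are made up for illustration.

```shell
# BEGIN initializes, the pattern-less action accumulates, END reports.
avg=$(printf '%s\n' 70 68 72 | awk '
BEGIN { sum = 0; n = 0 }
      { sum += $1; n++ }
END   { if (n > 0) printf "n=%d mean=%.1f\n", n, sum / n }')

printf '%s\n' "$avg"
```

The test on n in the END action avoids a division by zero when the input is empty.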

next and exit

Awk tries to apply each statement to each line. Unlike sed, there is no ``hold space.'' Instead, each statement is processed on the original version of each line. Two special commands in awk are next and exit. Next stops processing the current input line and goes directly to the next one, skipping all the remaining statements. Exit makes awk stop reading input and exit immediately (an END action, if there is one, is still executed).
Here are some simple examples. awkpo prints out only the odd numbered lines (note that this is an awkward way to do this, but it works):

UNIX> cat awkpo
#!/bin/awk -f

BEGIN { ln=0 }
{ ln++
  if (ln%2 == 0) next
  print $0
}
UNIX> cat -n input
     1  Which of these lines doesn't belong:
     2
     3  Bill Clinton
     4  George Bush
     5  Ronald Reagan
     6  Jimmy Carter
     7  Sylvester Stallone
UNIX> cat -n input | awkpo
     1  Which of these lines doesn't belong:
     3  Bill Clinton
     5  Ronald Reagan
     7  Sylvester Stallone
UNIX>

awkptR prints out all lines until it reaches a line with a capital R:

UNIX> cat awkptR
#!/bin/awk -f

/R/ { exit }
{ print $0 }
UNIX> awkptR input
Which of these lines doesn't belong:

Bill Clinton
George Bush
UNIX>
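
Next and exit are often combined in one program. The following sketch is not from the lecture: it assumes a made-up input convention in which lines starting with # are comments and a line starting with STOP ends the data.

```shell
# next skips comment lines; exit stops reading at the STOP marker.
out=$(printf '%s\n' '# header' one two STOP three | awk '
/^#/    { next }
/^STOP/ { exit }
        { n++; print n, $0 }
END     { print "kept", n }')

printf '%s\n' "$out"
```

The line after STOP is never read, but the END action still runs after the exit and reports the count.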

Arrays
Arrays in awk are a little odd. First, you don't have to malloc() any storage -- just use it and there it is. Second, arrays can have any indices -- integers, floating point numbers or strings. This is called ``associative'' indexing, and can be very convenient. You cannot have multi-dimensional arrays or arrays of arrays though. To simulate multidimensional arrays, you can just concatenate the indices.
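
Before the larger example, here is a small sketch of both ideas: string indices, and two indices concatenated into one key. The array names rounds and score and the one-record-per-line layout are assumptions made for illustration; the three scores are taken from the tournament output later in this lecture.

```shell
# Each record is: golfer tournament score.
out=$(printf '%s\n' 'Els USOPEN -4' 'Els KEMPER missed' 'Singh KEMPER 0' | awk '
{ rounds[$1]++
  score[$1 " " $2] = $3 }
END { print "Els entries:", rounds["Els"]
      print "Els at USOPEN:", score["Els USOPEN"]
      print "Singh at USOPEN: [" score["Singh USOPEN"] "]" }')

printf '%s\n' "$out"
```

The last line prints empty brackets, because an array entry that was never set has the null string as its value.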
Take a look at awkgolf. This is typical of quick-and-dirty awk programs that you sometimes write to look at data. This one processes golf scores. Suppose you have some score files, as in the files usopen, masters, kemper and memorial. These files first have the name of the tournament in all caps, and then scores for a bunch of golfers. Suppose you'd like to see all the golfers with scores for each tournament in a readable form. This is what awkgolf does. Let's break it into its four parts.
The first part is the BEGIN line:
BEGIN { nt = 0 ; np = 0 }
This simply initializes two variables: nt is the number of tournaments, and np is the number of players.
The next line looks a little cryptic:
/^[A-Z]*$/ { this = $0; tourn[nt] = $0 ; nt++; next }
This only works on lines that are all capital letters. These are the lines that identify tournaments. On these lines, it does the following:

Sets the this variable to be the tournament name.
Puts the tournament's name into the tourn array.
Increments the nt variable.
Skips the rest of the program and goes onto the next line.

The next part works on all lines that contain the pattern '--'. These are the lines with golfers' scores:
/--/ { golfer = $1
       for (i = 2; $i != "--" ; i++) golfer = golfer" "$i
       if (isgolfer[golfer] != "yes") {
         isgolfer[golfer] = "yes"
         g[np] = golfer
         np++;
       }
       score[golfer" "this] = $(i+1)
     }
The first two lines of this action set the golfer variable to be the golfer's name. Note that you can do string comparison in awk using standard boolean operators, unlike in C where you would have to use strcmp().
The next 5 lines use awk's associative arrays: The array isgolfer is checked to see if it contains the string ``yes'' under the golfer's name. If so, we have processed this golfer before. If not, we set the golfer's entry in isgolfer to ``yes,'' set the np-th entry of the array g to be the golfer, and increment np.
Finally, we set the golfer's score for the tournament in the score array. Note that we don't use double-indirection. Instead, we simply concatenate the golfer's name and the tournament's name, and use that as the index for the array.
The last part of the program does the final formatting:
END { printf("%-25s", " ");
      for (j = 0; j < nt; j++) printf("%9s", tourn[j])
      printf("\n")
      for (i = 0; i < np; i++) {
        printf("%-25s", g[i])
        for (j = 0; j < nt; j++) printf("%9s", score[g[i]" "tourn[j]])
        printf("\n")
      }
    }
The first three lines print out 25 spaces, and then the names of the tournaments as held in the tourn array. Then we loop through each golfer, printing the golfer's name, padded to 25 characters, and then his score in each tournament. Note that if the golfer didn't play in the tournament, that entry of the score array will be the null string. This is quite convenient, because we don't have to test whether the golfer played the tournament -- we can just use awk's default values.
OK, let's try awkgolf:

UNIX> awkgolf kemper     # Note that the output is only sorted because it's
                         # sorted in the input file
KEMPER
Justin Leonard -10
Greg Norman -7
Nick Faldo -7
Nick Price -7
Loren Roberts -6
Jay Haas -5
Paul Stankowski -5
Lee Janzen -4
Phil Mickelson -4
Davis Love III -3
Tom Lehman 0
Vijay Singh 0
Kirk Triplett 1
Steve Jones 2
Mark O'Meara 5
Don Pooley missed
Ernie Els missed
Fred Couples missed
Hal Sutton missed
Jesper Parnevik missed
Scott McCarron missed
Steve Stricker missed
UNIX> cat masters usopen kemper memorial | awkgolf
MASTERS USOPEN KEMPER MEMORIAL
Tiger Woods 281 6 5
Tommy Tolles 283 2 -11
Tom Watson 284 16 0
Paul Stankowski 285 6 -5 -3
Fred Couples 286 13 missed
Davis Love III 286 5 -3 -7
Justin Leonard 286 9 -10 0
Steve Elkington 287 7
Tom Lehman 287 -2 0 -3
Ernie Els 288 -4 missed -1
Vijay Singh 288 21 0 -14
Jesper Parnevik 289 11 missed -4
Lee Westwood 291 6
Nick Price 291 6 -7
Lee Janzen 292 13 -4 -11
Jim Furyk 293 2 -12
Mark O'Meara 294 9 5 -2
Scott McCarron 294 3 missed missed
Scott Hoch 298 3 -11
Jumbo Ozaki 300 missed
Frank Nobilo 303 9 -10
Bob Tway missed 2 -7
Brad Faxon missed 17 2
David Duval missed 11 -5
Greg Norman missed missed -7 -12
Loren Roberts missed 4 -6
Nick Faldo missed 11 -7
Phil Mickelson missed 10 -4
Steve Jones missed 15 2 3
Steve Stricker missed 9 missed -1
Jay Haas 2 -5 -4
Billy Andrade 4 -7
Hal Sutton 6 missed -1
Kirk Triplett 1 -2
Don Pooley missed -4
UNIX>

File indirection

You can specify that the output of print and printf go to a file with indirection. For example, to copy standard input to the file f1 you could do:

UNIX> awk '{print $0 > "f1"}' < input
UNIX> cat f1
Which of these lines doesn't belong:

Bill Clinton
George Bush
Ronald Reagan
Jimmy Carter
Sylvester Stallone
UNIX>
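
The file name after > is an ordinary string expression, so it can be built from the input. The sketch below is an illustration, not part of the lecture: it splits made-up two-field records into one file per first field, working in a temporary directory created with mktemp. The file name expression is parenthesized, which avoids parsing differences between awk implementations.

```shell
# Work in a scratch directory so no existing files are overwritten.
dir=$(mktemp -d)
(
  cd "$dir" || exit 1
  # Field 1 chooses the output file; field 2 is what gets written.
  printf '%s\n' 'a 1' 'b 2' 'a 3' |
    awk '{ print $2 > ($1 ".txt") }'
)
a_contents=$(cat "$dir/a.txt")
b_contents=$(cat "$dir/b.txt")
printf 'a.txt:\n%s\nb.txt:\n%s\n' "$a_contents" "$b_contents"
rm -r "$dir"
```

Within one run of awk, > truncates a file only the first time it is opened, so both records for a end up in a.txt.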

Awk without standard input
Sometimes you just want to write a program that doesn't use standard input. To do this, you just write the whole program as a BEGIN statement, exiting at the end.
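
A minimal sketch of such a program, printing a small table of squares; the variable name table and the choice of output are only for illustration.

```shell
# The whole program is one BEGIN action, so awk never reads any input.
table=$(awk 'BEGIN { for (i = 1; i <= 3; i++) print i, i * i; exit }')

printf '%s\n' "$table"
```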

Multiline awk programs in the Bourne shell

The Bourne shell lets you define multiline strings simply by putting newlines in the string (within single or double quotes, of course). This means that you can embed simple multiline awk scripts in a sh program without having to use cumbersome backslashes, or intermediate files. For example, shwc works just like awkwc, but works as a shell script rather than an awk program.

UNIX> shwc awkwc
Lines: 5 words: 26
UNIX> shwc < awkwc
Lines: 5 words: 26
UNIX> shwc awkwc awkwc
usage: shwc [ file ]
UNIX>
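
The source of shwc is not reproduced in these notes. The following is a guess at how such a script could be written, shown as a shell function so that it can be tried in one piece. Sending the usage message to standard error and returning status 1 are assumptions, not something the transcript above confirms.

```shell
# Hypothetical reconstruction of shwc; the lecture's actual script may differ.
shwc() {
  if [ $# -gt 1 ]; then
    echo "usage: shwc [ file ]" >&2
    return 1
  fi
  # "$@" is the file name if one was given, and nothing otherwise,
  # in which case awk reads standard input.
  awk 'BEGIN { nl = 0; nw = 0 }
             { nl++ ; nw += NF }
       END   { print "Lines:", nl, "words:", nw }' "$@"
}

printf 'one two\nthree\n' | shwc
```

The awk program is a single-quoted string that spans three lines, so no backslashes or temporary files are needed.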

Awk's limitations

Awk is useful for simple data processing. It is not useful when things get more complex, for a few reasons. First, if your data file is huge, you'll do better to write a C program (using, for example, the fields library from CS302/360), because it will be more efficient, sometimes by a factor of 60 or more. Second, once you start writing procedure calls in awk, it seems to me you may as well be writing C code. Third, you will often find awk's lack of double indirection and its limited string processing cumbersome and inefficient.
Awk is not a good language for string processing. Irritatingly, it doesn't let you get at string elements with array operations. For example, the following will fail:

UNIX> cat sp.awk
{ s = $1 ; s[0] = 'a' ; print s }
UNIX> awk -f sp.awk input
awk: syntax error near line 1
awk: illegal statement near line 1
UNIX>

Of course, sed is ideal for string processing, so often you can get what you want with a combination of sed and awk.
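
When a single character has to be changed inside awk itself, the substr() function together with concatenation can stand in for the array-style assignment that fails above. This is a sketch under the assumption that your awk provides substr(), which the common implementations do; the variable name out is made up.

```shell
# substr(s, 2) is everything from the second character on
# (awk strings are indexed from 1), so this replaces the first character.
out=$(echo "Jimmy Carter" | awk '{ s = $1; s = "a" substr(s, 2); print s }')

printf '%s\n' "$out"
```

This is clumsier than s[0] = 'a' would be in C, which is the point the paragraph above makes.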
Nawk has much more built into it than awk, and accepts awk as a subset, so if you want to do something in awk but can't, check out nawk. I'm not a big nawk user, so I won't give you a big sell on nawk, but you should look at the man page.
