UNIX> awk program [ file ]
or
UNIX> awk -f program-file [ file ]
Like sed, awk can work on standard input or on a file. Like the shell, if you start an awk program with
#!/bin/awk -f
then you can execute the program directly from the shell.
Most systems also have nawk, which stands for ``new awk.'' Nawk has many more features than awk and is generally more useful. I am just going to cover awk, but you should check out nawk too on your own time. Nawk has some nice things, like a random number generator, that awk doesn't have.
Awk programs are composed of one or more statements of the form:
pattern { action }
What such a statement does is apply the action to all lines that match
the pattern. If there is no pattern, then it applies the action to
all lines. If there is no action, then the default action is to copy the
line to standard output. Patterns can be regular expressions enclosed
in slashes (they can be more than that, but for now, just assume that they
are regular expressions).
So, for example, the program awkgrep works just like ``grep Jim''.
UNIX> cat awkgrep
#!/bin/awk -f
/Jim/
UNIX> cat input
Which of these lines doesn't belong:
Bill Clinton
George Bush
Ronald Reagan
Jimmy Carter
Sylvester Stallone
UNIX> awkgrep input
Jimmy Carter
UNIX> awkgrep < input
Jimmy Carter
UNIX>
Awk breaks up each line into fields, which are basically whitespace-separated words. You can get at word i by specifying $i. The variable NF contains the number of words on the line. The variable $0 is the line itself.
So, to print out the first and last words on each line, you can do:
UNIX> cat input
Which of these lines doesn't belong:
Bill Clinton
George Bush
Ronald Reagan
Jimmy Carter
Sylvester Stallone
UNIX> awk '{ print $1, $NF }' input
Which belong:
Bill Clinton
George Bush
Ronald Reagan
Jimmy Carter
Sylvester Stallone
UNIX>
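Here is one more small sketch of fields. The function name field_info is made up for this example. It uses the fact that $ takes any expression, so $(NF-1) is the next-to-last word, and it uses a pattern that is a comparison rather than a regular expression, so that lines with fewer than two words are skipped.

```shell
# field_info is a hypothetical name, used only for this illustration.
# For each line with at least two words, print the word count
# and the next-to-last word.
field_info() {
    awk 'NF >= 2 { print NF, $(NF-1) }' "$@"
}

printf 'Which of these lines\nBill Clinton\n' | field_info
# prints:
# 4 these
# 2 Bill
```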
An alternative awkgrep prints out $0 when it finds the
pattern:
UNIX> cat awkgrep2
#!/bin/awk -f
/Jim/ { print $0 }
UNIX> awkgrep2 input
Jimmy Carter
UNIX>
Awk has a printf just like C. You don't have to use
parentheses when you call it (although you can if you'd like). Unlike
print, printf will not print a newline if you don't want
it to. So, for example, awkrev reverses
the order of the words on each line of a file:
UNIX> cat awkrev
#!/bin/awk -f
{ for (i = NF; i > 0; i-- ) printf "%s ", $i
printf "\n" }
UNIX> awkrev input
belong: doesn't lines these of Which
Clinton Bill
Bush George
Reagan Ronald
Carter Jimmy
Stallone Sylvester
UNIX>
A few things that you'll notice about awkrev: Actions can be
multiline. Unlike in C, you don't need a semicolon at the end of each line.
However, you can put multiple commands on one line and separate them
with semicolons, and you can group commands with curly
braces as in C. If you want a single command to span two lines (this often happens
with complex printf statements), you need to end the first line with
a backslash.
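As a rough sketch of these rules (show_first is a made-up name), the following action puts two commands on one line separated by a semicolon, and then uses a backslash to continue a string concatenation onto a second line. Without the backslash, awk would treat the newline as the end of the assignment.

```shell
# show_first is a hypothetical name, used only for this illustration.
show_first() {
    awk '{
        n = NF; first = $1
        msg = n " words, first is " \
              first
        print msg
    }' "$@"
}

printf 'Jimmy Carter\n' | show_first
# prints: 2 words, first is Jimmy
```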
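Also, you'll notice that awkrev didn't declare the variable i. Awk variables are never declared and have no fixed type: awk treats a value as a number or as a string depending on how it is used, so it just figured out that i is a number. The same goes for fields, as the program awkcast shows: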
UNIX> echo "4 Jim" | awkcast
Word 1: as a number: 4, as a string: 4.
0 appended: number: 40, string 40
Word 2: as a number: 0, as a string: Jim.
0 appended: number: 0, string Jim0
UNIX>
Casting a string to an integer gives it its atoi() value.
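The source of awkcast isn't reproduced above, so the following is only my guess at a program that behaves similarly, not the original. The name cast_demo and the exact layout are assumptions. It forces a numeric value by adding 0 and forces a string by concatenating.

```shell
# cast_demo is a hypothetical stand-in for awkcast; the real source is not shown.
cast_demo() {
    awk '{
        for (i = 1; i <= NF; i++) {
            s = $i "0"      # concatenation: the result is a string
            printf "Word %d: as a number: %d, as a string: %s.\n", i, $i + 0, $i
            printf "  0 appended: number: %d, string %s\n", s + 0, s
        }
    }' "$@"
}

echo "4 Jim" | cast_demo
```

A string that doesn't start with a number, like Jim or Jim0, has the numeric value 0.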
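Awk has two special patterns, BEGIN and END. The action for BEGIN is executed before any input is read, and the action for END is executed after all the input has been processed. For example, awkwc counts lines and words the way wc does:
UNIX> cat awkwc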
#!/bin/awk -f
BEGIN { nl = 0; nw = 0 }
{ nl++ ; nw += NF }
END { print "Lines:", nl, "words:", nw }
UNIX> awkwc awkwc
Lines: 5 words: 26
UNIX> wc awkwc
5 26 103 awkwc
UNIX>
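The same BEGIN/END shape works for any running total. As a sketch (avg_first is a made-up name, and it assumes the first word of every line is a number), this averages the first column of its input:

```shell
# avg_first is a hypothetical name, used only for this illustration.
# Assumes the first field of each input line is numeric.
avg_first() {
    awk 'BEGIN { sum = 0; n = 0 }
         { sum += $1; n++ }
         END { if (n > 0) printf "%.2f\n", sum / n }' "$@"
}

printf '1\n2\n6\n' | avg_first
# prints: 3.00
```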
Here are some simple examples. awkpo prints out only the odd-numbered lines (note that this is an awkward way to do this, but it works):
UNIX> cat awkpo
#!/bin/awk -f
BEGIN { ln=0 }
{ ln++
if (ln%2 == 0) next
print $0
}
UNIX> cat -n input
1 Which of these lines doesn't belong:
2
3 Bill Clinton
4 George Bush
5 Ronald Reagan
6 Jimmy Carter
7 Sylvester Stallone
UNIX> cat -n input | awkpo
1 Which of these lines doesn't belong:
3 Bill Clinton
5 Ronald Reagan
7 Sylvester Stallone
UNIX>
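A less awkward way, sketched here under the made-up name odd_lines, is to use awk's built-in variable NR, which holds the number of the current input line. The pattern is a comparison with no action, so the default action (print the line) applies.

```shell
# odd_lines is a hypothetical name, used only for this illustration.
# NR is the current line number; with no action, matching lines are printed.
odd_lines() {
    awk 'NR % 2 == 1' "$@"
}

printf 'one\ntwo\nthree\n' | odd_lines
# prints:
# one
# three
```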
awkptR prints out all lines until it reaches
a line with a capital R:
UNIX> cat awkptR
#!/bin/awk -f
/R/ { exit }
{ print $0 }
UNIX> awkptR input
Which of these lines doesn't belong:
Bill Clinton
George Bush
UNIX>
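One thing worth knowing about exit: when it is called from an ordinary action, awk stops reading input but still runs the END action. The sketch below (until_R is a made-up name) is a variation on awkptR that also reports how many lines it printed.

```shell
# until_R is a hypothetical name, used only for this illustration.
# exit stops the input loop, but the END action still runs afterwards.
until_R() {
    awk '/R/ { exit }
         { n++; print }
         END { print n + 0, "lines printed" }' "$@"
}

printf 'Bill Clinton\nGeorge Bush\nRonald Reagan\nJimmy Carter\n' | until_R
# prints:
# Bill Clinton
# George Bush
# 2 lines printed
```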
Take a look at awkgolf. This is typical of quick-and-dirty awk programs that you sometimes write to look at data. This one processes golf scores. Suppose you have some score files, as in the files usopen, masters, kemper and memorial. These files first have the name of the tournament in all caps, and then scores for a bunch of golfers. Suppose you'd like to see all the golfers with scores for each tournament in a readable form. This is what awkgolf does. Let's break it into its four parts.
The first part is the BEGIN line:
BEGIN { nt = 0 ; np = 0 }
This simply initializes two variables: nt is the number of
tournaments, and np is the number of players.
The next line looks a little cryptic:
/^[A-Z]*$/ { this = $0; tourn[nt] = $0 ; nt++; next }
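This only works on lines that are all capital letters. These are the
lines that identify tournaments. On these lines, it saves the tournament name in the variable this, stores it as entry nt of the tourn array, increments nt, and then uses next to skip the remaining statements and move on to the next input line.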
The next part works on all lines that contain the pattern '--'. These are the lines with golfers' scores:
/--/ { golfer = $1
       for (i = 2; $i != "--" ; i++) golfer = golfer" "$i
       if (isgolfer[golfer] != "yes") {
         isgolfer[golfer] = "yes"
         g[np] = golfer
         np++;
       }
       score[golfer" "this] = $(i+1)
     }
The first two lines of this action set the golfer variable to be
the golfer's name. Note that you can do string comparison in awk
using standard boolean operators, unlike in C where you would have to use
strcmp().
The next 5 lines use awk's associative arrays: The array isgolfer is checked to see if it contains the string ``yes'' under the golfer's name. If so, we have processed this golfer before. If not, we set the golfer's entry in isgolfer to ``yes,'' set the np-th entry of the array g to be the golfer, and increment np.
Finally, we set the golfer's score for the tournament in the score array. Note that we don't use double-indirection. Instead, we simply concatenate the golfer's name and the tournament's name, and use that as the index for the array.
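To see the concatenated-index trick in isolation, here is a stripped-down sketch. The function name score_table and the three-word input format (golfer, tournament, score) are made up for the example and are not the format of the real score files.

```shell
# score_table is a hypothetical name; the input format here is invented:
#   golfer tournament score
# The two-part key is just the two strings joined by a space.
score_table() {
    awk '{ score[$1 " " $2] = $3 }
         END {
             print "Els/USOPEN:", score["Els USOPEN"]
             print "Els/KEMPER:", score["Els KEMPER"]
         }' "$@"
}

printf 'Els USOPEN -4\nEls KEMPER missed\nHaas USOPEN 2\n' | score_table
# prints:
# Els/USOPEN: -4
# Els/KEMPER: missed
```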
The last part of the program does the final formatting:
END { printf("%-25s", " ");
      for (j = 0; j < nt; j++) printf("%9s", tourn[j])
      printf("\n")
      for (i = 0; i < np; i++) {
        printf("%-25s", g[i])
        for (j = 0; j < nt; j++) printf("%9s", score[g[i]" "tourn[j]])
        printf("\n")
      }
    }
The first three lines print out 25 spaces, and then the names of the
tournaments as held in the tourn array. Then we loop through
each golfer, and print the golfer's name, padded to 25 characters,
and then his score in each tournament. Note that if the golfer
didn't play in the tournament, that entry of the score array will
be the null string. This is quite convenient, because we don't have to
test for whether the golfer played the tournament -- we can just use
awk's default values.
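A quick sketch of those default values (defaults_demo, the array index and the variable count are all made up for the example): an array entry or variable that was never assigned is the null string when used as a string and 0 when used as a number.

```shell
# defaults_demo is a hypothetical name, used only for this illustration.
# Nothing is ever assigned here, so both values are awk defaults.
defaults_demo() {
    awk 'BEGIN {
        printf "[%9s]\n", score["Nobody USOPEN"]   # null string, padded to 9
        printf "[%d]\n", count                     # 0 when used as a number
    }'
}

defaults_demo
```

The first line of output is nine spaces between brackets, which is exactly the blank column you see in the awkgolf output for a tournament the golfer skipped. The second line is [0].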
OK, let's try awkgolf:
UNIX> awkgolf kemper # Note that the output is only sorted because it's
# sorted in the input file
KEMPER
Justin Leonard -10
Greg Norman -7
Nick Faldo -7
Nick Price -7
Loren Roberts -6
Jay Haas -5
Paul Stankowski -5
Lee Janzen -4
Phil Mickelson -4
Davis Love III -3
Tom Lehman 0
Vijay Singh 0
Kirk Triplett 1
Steve Jones 2
Mark O'Meara 5
Don Pooley missed
Ernie Els missed
Fred Couples missed
Hal Sutton missed
Jesper Parnevik missed
Scott McCarron missed
Steve Stricker missed
UNIX> cat masters usopen kemper memorial | awkgolf
MASTERS USOPEN KEMPER MEMORIAL
Tiger Woods 281 6 5
Tommy Tolles 283 2 -11
Tom Watson 284 16 0
Paul Stankowski 285 6 -5 -3
Fred Couples 286 13 missed
Davis Love III 286 5 -3 -7
Justin Leonard 286 9 -10 0
Steve Elkington 287 7
Tom Lehman 287 -2 0 -3
Ernie Els 288 -4 missed -1
Vijay Singh 288 21 0 -14
Jesper Parnevik 289 11 missed -4
Lee Westwood 291 6
Nick Price 291 6 -7
Lee Janzen 292 13 -4 -11
Jim Furyk 293 2 -12
Mark O'Meara 294 9 5 -2
Scott McCarron 294 3 missed missed
Scott Hoch 298 3 -11
Jumbo Ozaki 300 missed
Frank Nobilo 303 9 -10
Bob Tway missed 2 -7
Brad Faxon missed 17 2
David Duval missed 11 -5
Greg Norman missed missed -7 -12
Loren Roberts missed 4 -6
Nick Faldo missed 11 -7
Phil Mickelson missed 10 -4
Steve Jones missed 15 2 3
Steve Stricker missed 9 missed -1
Jay Haas 2 -5 -4
Billy Andrade 4 -7
Hal Sutton 6 missed -1
Kirk Triplett 1 -2
Don Pooley missed -4
UNIX>
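Awk's print and printf can send their output to a file with >, as in the shell. The file name is a string, so it goes in double quotes:
UNIX> awk '{print $0 > "f1"}' < input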
UNIX> cat f1
Which of these lines doesn't belong:
Bill Clinton
George Bush
Ronald Reagan
Jimmy Carter
Sylvester Stallone
UNIX>
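Since each print can name its own file, one pass over the input can write several files. The sketch below is only an illustration: the function name split_jim and the output file names jim.out and other.out are made up. Within one run, awk opens each file the first time it is written and keeps adding to it after that.

```shell
# split_jim is a hypothetical name; jim.out and other.out are invented
# output file names. $1 is the input file. Lines matching Jim go to one
# file and everything else goes to the other.
split_jim() {
    awk '/Jim/ { print > "jim.out"; next }
         { print > "other.out" }' "$1"
}

# Example use (writes jim.out and other.out in the current directory):
#   split_jim input
```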
UNIX> shwc awkwc
Lines: 5 words: 26
UNIX> shwc < awkwc
Lines: 5 words: 26
UNIX> shwc awkwc awkwc
usage: shwc [ file ]
UNIX>
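The source of shwc isn't shown above. Judging from the transcript, it behaves like a shell wrapper around the awkwc program that checks its arguments first. The following is my own reconstruction under that assumption, written as a shell function rather than a script, and it is not the original. In particular, sending the usage message to standard error is a guess.

```shell
# Hypothetical reconstruction of shwc as a shell function; the real
# script is not shown in these notes.
shwc() {
    if [ $# -gt 1 ]; then
        echo "usage: shwc [ file ]" >&2
        return 1
    fi
    # With no argument, "$@" expands to nothing and awk reads standard input.
    awk 'BEGIN { nl = 0; nw = 0 }
         { nl++; nw += NF }
         END { print "Lines:", nl, "words:", nw }' "$@"
}

printf 'a b\nc\n' | shwc
# prints: Lines: 2 words: 3
```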
Awk is not a good language for string processing. Irritatingly, it doesn't let you get at string elements with array operations. I.e. the following will fail:
UNIX> cat sp.awk
{ s = $1 ; s[0] = 'a' ; print s }
UNIX> awk -f sp.awk input
awk: syntax error near line 1
awk: illegal statement near line 1
UNIX>
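You can still get the effect with awk's substr() function and string concatenation, although it is clumsier. As a sketch (set_first_char is a made-up name), this replaces the first character of the first word with an a, which is what sp.awk was trying to do:

```shell
# set_first_char is a hypothetical name, used only for this illustration.
# substr(s, 2) is everything in s from the second character onward.
set_first_char() {
    awk '{ s = $1; s = "a" substr(s, 2); print s }' "$@"
}

printf 'Bill Clinton\n' | set_first_char
# prints: aill
```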
Of course, sed is ideal for string processing, so often you can
get what you want with a combination of sed and awk.
Nawk has much more built into it than awk, and accepts awk as a subset, so if you're wanting to do things in awk but can't, check out nawk. I'm not a big nawk user, so I won't give you a big sell on nawk, but you should look at the man page.