Scripts and Utilities -- Awk lecture


  • Jim Plank
  • Directory: /home/cs494/notes/Awk
  • This file: http://www.cs.utk.edu/~plank/plank/classes/cs494/494/notes/Awk/lecture.html
  • Lecture links: http://www.cs.utk.edu/~plank/plank/classes/cs494/494/notes/Awk/links.html
  • Email questions and answers

    Awk

    Awk is a ``pattern scanning and processing language'' which is useful for writing quick and dirty programs that don't have to be compiled. The calling syntax of awk is like sed:
    UNIX> awk program [ file ]
    
    or
    UNIX> awk -f program-file [ file ]
    
    Like sed, awk can work on standard input or on a file. Like the shell, if you start an awk program with
    #!/bin/awk -f
    
    then you can execute the program directly from the shell.

    Most systems also have nawk, which stands for ``new awk.'' Nawk has many more features than awk and is generally more useful. I am just going to cover awk, but you should check out nawk too on your own time. Nawk has some nice things, like a random number generator, that awk doesn't have.


    Program syntax of awk

    awk programs are composed of ``pattern-action'' statements of the form:
    pattern { action }
    
    What such a statement does is apply the action to all lines that match the pattern. If there is no pattern, then it applies the action to all lines. If there is no action, then the default action is to copy the line to standard output. Patterns can be regular expressions enclosed in slashes (they can be more than that, but for now, just assume that they are regular expressions).

    So, for example, the program awkgrep works just like ``grep Jim''.

    UNIX> cat awkgrep
    #!/bin/awk -f
    
    /Jim/
    UNIX> cat input
    Which of these lines doesn't belong:
    
    Bill Clinton
    George Bush
    Ronald Reagan
    Jimmy Carter
    Sylvester Stallone
    UNIX> awkgrep input
    Jimmy Carter
    UNIX> awkgrep < input
    Jimmy Carter
    UNIX> 
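    The three forms of a pattern-action statement can be tried side by side. This is a minimal sketch rather than one of the lecture's files: the three-line sample input and the shell variable names are made up for illustration, and awk is assumed to be on your PATH.

```shell
# Sample input, made up for this sketch.
data='Jimmy Carter
Ronald Reagan
Jim Furyk'

# Pattern only: matching lines are copied to standard output.
pattern_only=$(printf '%s\n' "$data" | awk '/Jim/')

# Action only: the action runs on every line.
action_only=$(printf '%s\n' "$data" | awk '{ print "line:", $0 }')

# Pattern and action: the action runs only on matching lines.
both=$(printf '%s\n' "$data" | awk '/Jim/ { print "found:", $0 }')

printf '%s\n' "$pattern_only" "$action_only" "$both"
```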
    

    Basic actions, fields

    Actions basically look like C programs. There are some big differences, but for the most part, you can do most basic things that you can do in C.

    Awk breaks up each line into fields, which are basically whitespace-separated words. You can get at word i by specifying $i. The variable NF contains the number of words on the line. The variable $0 is the line itself.

    So, to print out the first and last words on each line, you can do:

    UNIX> cat input
    Which of these lines doesn't belong:
    
    Bill Clinton
    George Bush
    Ronald Reagan
    Jimmy Carter
    Sylvester Stallone
    UNIX> awk '{ print $1, $NF }' input
    Which belong:
     
    Bill Clinton
    George Bush
    Ronald Reagan
    Jimmy Carter
    Sylvester Stallone
    UNIX> 
    
    An alternative awkgrep prints out $0 when it finds the pattern:
    UNIX> cat awkgrep2
    #!/bin/awk -f
    
    /Jim/ { print $0 }
    UNIX> awkgrep2 input
    Jimmy Carter
    UNIX> 
    
    Awk has a printf just like C. You don't have to use parentheses when you call it (although you can if you'd like). Unlike print, printf does not print a newline unless you ask for one. So, for example, awkrev reverses the words on each line of a file:
    UNIX> cat awkrev
    #!/bin/awk -f
    
            { for (i = NF; i > 0; i-- ) printf "%s ", $i
              printf "\n" }
    UNIX> awkrev input
    belong: doesn't lines these of Which 
    
    Clinton Bill 
    Bush George 
    Reagan Ronald 
    Carter Jimmy 
    Stallone Sylvester 
    UNIX> 
    
    A few things that you'll notice about awkrev: Actions can be multiline. Unlike in C, you don't need semicolons to end statements; a newline does that. However, you can put multiple commands on one line and separate them with semicolons, and you can group commands with curly braces as in C. If you want a command to span two lines (this often happens with complex printf statements), you need to end the first line with a backslash.

    Also, you'll notice that awkrev didn't declare the variable i. Awk creates variables on first use, and it treats i as a number here because that is how it is used.
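    These syntax points fit into one small action. The following is only a sketch: the two-line input and the variable names n and out are made up for illustration.

```shell
out=$(printf 'a b c\nd e\n' | awk '
# n and i are never declared; they exist as soon as they are used.
# Several statements share the first line, separated by semicolons.
{ n = 0; for (i = 1; i <= NF; i++) { n++ }
  # The backslash at the end of the next line continues the printf.
  printf "%d fields, first=%s last=%s\n", \
         n, $1, $NF
}')
printf '%s\n' "$out"
```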

    Type casting

    Awk lets you convert variables from one type to another on the fly. For example, to convert an integer to a string, you simply use it as a string. String construction can be done with concatenation, which is often very convenient. These principles are used in awkcast:
    UNIX> echo "4 Jim" | awkcast
    Word 1: as a number: 4, as a string: 4.
             0 appended: number: 40, string 40
    Word 2: as a number: 0, as a string: Jim.
             0 appended: number: 0, string Jim0
    UNIX> 
    
    Casting a string to an integer gives it its atoi() value: the number at the start of the string, or 0 if the string doesn't start with one.
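    The source of awkcast is not shown above. One way to get output of that shape is sketched below; this is a guess at a program with the same behavior, not the actual awkcast file, and the variable names s and out are made up.

```shell
out=$(echo "4 Jim" | awk '
{ for (i = 1; i <= NF; i++) {
    # The same field is used once as a number (%d) and once as a string (%s).
    printf "Word %d: as a number: %d, as a string: %s.\n", i, $i, $i
    # Concatenation builds a string; adding 0 forces it back to a number.
    s = $i "0"
    printf "         0 appended: number: %d, string %s\n", s + 0, s
  }
}')
printf '%s\n' "$out"
```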

    BEGIN and END

    There are two special patterns, BEGIN and END, which cause the corresponding actions to be executed before and after any lines are processed respectively. Therefore, the following program (awkwc) counts the number of lines and words in the input file.
    UNIX> cat awkwc
    #!/bin/awk -f
    
    BEGIN   { nl = 0; nw = 0 }
            { nl++ ; nw += NF }
    END     { print "Lines:", nl, "words:", nw } 
    UNIX> awkwc awkwc
    Lines: 5 words: 26
    UNIX> wc awkwc
           5      26     103 awkwc
    UNIX> 
    

    next and exit

    Awk tries to apply each statement to each line. Unlike sed, there is no ``hold space.'' Instead, each statement is processed on the original version of each line. Two special commands in awk are next and exit. Next tells awk to stop processing the current input line and go directly to the next one, skipping all the rest of the statements. Exit tells awk to stop processing input immediately; an END action, if there is one, still runs.

    Here are some simple examples. awkpo prints out only the odd numbered lines (note that this is an awkward way to do this, but it works):

    UNIX> cat awkpo
    #!/bin/awk -f
    
    BEGIN   { ln=0 }
            { ln++
              if (ln%2 == 0) next
              print $0
            }
    UNIX> cat -n input
         1  Which of these lines doesn't belong:
         2
         3  Bill Clinton
         4  George Bush
         5  Ronald Reagan
         6  Jimmy Carter
         7  Sylvester Stallone
    UNIX> cat -n input | awkpo
         1  Which of these lines doesn't belong:
         3  Bill Clinton
         5  Ronald Reagan
         7  Sylvester Stallone
    UNIX> 
    
    awkptR prints out all lines until it reaches a line with a capital R:
    UNIX> cat awkptR
    #!/bin/awk -f
    
    /R/     { exit }
            { print $0 }
    UNIX> awkptR input
    Which of these lines doesn't belong:
    
    Bill Clinton
    George Bush
    UNIX> 
    

    Arrays

    Arrays in awk are a little odd. First, you don't have to malloc() any storage -- just use it and there it is. Second, arrays can have any indices -- integers, floating point numbers or strings. This is called ``associative'' indexing, and can be very convenient. You cannot have multi-dimensional arrays or arrays of arrays though. To simulate multidimensional arrays, you can just concatenate the indices.
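    A small sketch of both ideas, string indices and concatenated indices standing in for two dimensions. The one-line-per-score input format (tournament, player, score) is made up for this example and is not the format of the score files used below.

```shell
out=$(printf '%s\n' 'MASTERS Woods 281' 'USOPEN Woods 6' 'USOPEN Lehman -2' | awk '
{ played[$2]++                 # indexed by a string; no allocation needed
  score[$2 " " $1] = $3 }      # "player tournament" simulates two dimensions
END { print "Woods played", played["Woods"], "tournaments"
      print "Woods at USOPEN:", score["Woods USOPEN"]
      print "Lehman at USOPEN:", score["Lehman USOPEN"] }')
printf '%s\n' "$out"
```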

    Take a look at awkgolf. This is typical of quick-and-dirty awk programs that you sometimes write to look at data. This one processes golf scores. Suppose you have some score files, as in the files usopen, masters, kemper and memorial. These files first have the name of the tournament in all caps, and then scores for a bunch of golfers. Suppose you'd like to see all the golfers with scores for each tournament in a readable form. This is what awkgolf does. Let's break it into its four parts.

    The first part is the BEGIN line:

    BEGIN { nt = 0 ; np = 0 }
    
    This simply initializes two variables: nt is the number of tournaments, and np is the number of players.

    The next line looks a little cryptic:

    /^[A-Z]*$/ { this = $0; tourn[nt] = $0 ; nt++; next }
    
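    This only works on lines that consist entirely of capital letters. These are the lines that identify tournaments. (The pattern also matches an empty line, since * allows zero letters.) On these lines, it does the following:

      • Sets the variable this to the name of the tournament.
      • Stores the name in the nt-th entry of the array tourn.
      • Increments nt.
      • Skips the remaining statements for this line with next.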

    The next part works on all lines that contain the pattern '--'. These are the lines with golfers' scores:

    /--/    { golfer = $1
              for (i = 2; $i != "--" ; i++) golfer = golfer" "$i
              if (isgolfer[golfer] != "yes") {
                isgolfer[golfer] = "yes"
                g[np] = golfer
                np++;
              }
              score[golfer" "this] = $(i+1)
            }
    
    The first two lines of this action set the golfer variable to be the golfer's name. Note that you can do string comparison in awk using the standard comparison operators, unlike in C where you would have to use strcmp().

    The next five lines use awk's associative arrays: The array isgolfer is checked to see if it contains the string ``yes'' under the golfer's name. If so, we have processed this golfer before. If not, we set the golfer's entry in isgolfer to ``yes,'' set the np-th entry of the array g to be the golfer, and increment np.

    Finally, we set the golfer's score for the tournament in the score array. Note that we don't use double-indirection. Instead, we simply concatenate the golfer's name and the tournament's name, and use that as the index for the array.

    The last part of the program does the final formatting:

    END     { printf("%-25s", " ");
              for (j = 0; j < nt; j++) printf("%9s", tourn[j])
              printf("\n")
    
              for (i = 0; i < np; i++) {
                printf("%-25s", g[i])
                for (j = 0; j < nt; j++) printf("%9s", score[g[i]" "tourn[j]])
                printf("\n")
              }
            }
    
    The first three lines print out 25 spaces, and then the names of the tournaments as held in the tourn array. Then we loop through each golfer, and print the golfer's name, padded to 25 characters, and then his score in each tournament. Note that if the golfer didn't play in the tournament, that entry of the score array will be the null string. This is quite convenient, because we don't have to test for whether the golfer played the tournament -- we can just use awk's default values.

    OK, let's try awkgolf:

    UNIX> awkgolf kemper    # Note that the output is only sorted because it's
                            # sorted in the input file
                                KEMPER
    Justin Leonard                 -10
    Greg Norman                     -7
    Nick Faldo                      -7
    Nick Price                      -7
    Loren Roberts                   -6
    Jay Haas                        -5
    Paul Stankowski                 -5
    Lee Janzen                      -4
    Phil Mickelson                  -4
    Davis Love III                  -3
    Tom Lehman                       0
    Vijay Singh                      0
    Kirk Triplett                    1
    Steve Jones                      2
    Mark O'Meara                     5
    Don Pooley                  missed
    Ernie Els                   missed
    Fred Couples                missed
    Hal Sutton                  missed
    Jesper Parnevik             missed
    Scott McCarron              missed
    Steve Stricker              missed
    UNIX> cat masters usopen kemper memorial | awkgolf
                               MASTERS   USOPEN   KEMPER MEMORIAL
    Tiger Woods                    281        6                 5
    Tommy Tolles                   283        2               -11
    Tom Watson                     284       16                 0
    Paul Stankowski                285        6       -5       -3
    Fred Couples                   286       13   missed         
    Davis Love III                 286        5       -3       -7
    Justin Leonard                 286        9      -10        0
    Steve Elkington                287        7                  
    Tom Lehman                     287       -2        0       -3
    Ernie Els                      288       -4   missed       -1
    Vijay Singh                    288       21        0      -14
    Jesper Parnevik                289       11   missed       -4
    Lee Westwood                   291        6                  
    Nick Price                     291        6       -7         
    Lee Janzen                     292       13       -4      -11
    Jim Furyk                      293        2               -12
    Mark O'Meara                   294        9        5       -2
    Scott McCarron                 294        3   missed   missed
    Scott Hoch                     298        3               -11
    Jumbo Ozaki                    300   missed                  
    Frank Nobilo                   303        9               -10
    Bob Tway                    missed        2                -7
    Brad Faxon                  missed       17                 2
    David Duval                 missed       11                -5
    Greg Norman                 missed   missed       -7      -12
    Loren Roberts               missed        4       -6         
    Nick Faldo                  missed       11       -7         
    Phil Mickelson              missed       10       -4         
    Steve Jones                 missed       15        2        3
    Steve Stricker              missed        9   missed       -1
    Jay Haas                                  2       -5       -4
    Billy Andrade                             4                -7
    Hal Sutton                                6   missed       -1
    Kirk Triplett                                      1       -2
    Don Pooley                                    missed       -4
    UNIX> 
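    To experiment without the score files, the four parts of awkgolf can be assembled and fed a few lines directly. This is a sketch under one assumption: the score-line layout (name, then --, then score) is inferred from the action above, since the files themselves are not shown here. The six input lines are a small subset made up from the table above.

```shell
out=$(printf '%s\n' \
  'MASTERS' 'Tiger Woods -- 281' 'Tom Lehman -- 287' \
  'KEMPER' 'Tom Lehman -- 0' 'Justin Leonard -- -10' | awk '
BEGIN { nt = 0 ; np = 0 }
/^[A-Z]*$/ { this = $0; tourn[nt] = $0 ; nt++; next }
/--/    { golfer = $1
          for (i = 2; $i != "--" ; i++) golfer = golfer" "$i
          if (isgolfer[golfer] != "yes") {
            isgolfer[golfer] = "yes"
            g[np] = golfer
            np++;
          }
          score[golfer" "this] = $(i+1)
        }
END     { printf("%-25s", " ");
          for (j = 0; j < nt; j++) printf("%9s", tourn[j])
          printf("\n")
          for (i = 0; i < np; i++) {
            printf("%-25s", g[i])
            # A score that was never set prints as an empty, padded column.
            for (j = 0; j < nt; j++) printf("%9s", score[g[i]" "tourn[j]])
            printf("\n")
          }
        }')
printf '%s\n' "$out"
```

    Tiger Woods has no KEMPER entry and Justin Leonard has no MASTERS entry, so those columns come out blank without any explicit test.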
    

    File redirection

    You can send the output of print and printf to a file with redirection. For example, to copy standard input to the file f1 you could do:
    UNIX> awk '{print $0 > "f1"}' < input
    UNIX> cat f1
    Which of these lines doesn't belong:
    
    Bill Clinton
    George Bush
    Ronald Reagan
    Jimmy Carter
    Sylvester Stallone
    UNIX> 
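    Different statements can write to different files. In the sketch below the file names jim.txt and others.txt, the scratch directory, and the three input lines are all made up for illustration.

```shell
dir=$(mktemp -d)     # scratch directory so the current directory stays clean
( cd "$dir" &&
  printf '%s\n' 'Bill Clinton' 'Jimmy Carter' 'George Bush' |
  awk '/Jim/ { print $0 > "jim.txt"; next }
             { print $0 > "others.txt" }' )
cat "$dir/jim.txt"
cat "$dir/others.txt"
```

    Each file is opened once per run of awk, so the second line written to others.txt is added after the first rather than overwriting it.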
    

    Awk without standard input

    Sometimes you just want to write a program that doesn't use standard input. To do this, you just write the whole program as a BEGIN statement, exiting at the end.
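    For instance, a program that prints a short table of squares needs no input at all. This is a minimal sketch; the loop bounds and the output wording are arbitrary.

```shell
out=$(awk 'BEGIN { for (i = 1; i <= 5; i++) printf "%d squared is %d\n", i, i*i
                   exit }')
printf '%s\n' "$out"
```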

    Multiline awk programs in the Bourne shell

    The Bourne shell lets you define multiline strings simply by putting newlines in the string (within single or double quotes, of course). This means that you can embed simple multiline awk scripts in a sh program without having to use cumbersome backslashes, or intermediate files. For example, shwc works just like awkwc, but works as a shell script rather than an awk program.
    UNIX> shwc awkwc
    Lines: 5 words: 26
    UNIX> shwc < awkwc
    Lines: 5 words: 26
    UNIX> shwc awkwc awkwc
    usage: shwc [ file ]
    UNIX> 
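    The source of shwc is not shown above. A sketch with the same behavior follows, written as a shell function named shwc_sketch so it can be pasted into a shell; it is a stand-in, not the actual shwc file. Sending the usage message to standard error is an assumption.

```shell
shwc_sketch() {
  if [ $# -gt 1 ]; then
    echo "usage: shwc [ file ]" >&2
    return 1
  fi
  # The awk program is a single quoted string that spans three lines.
  # With no argument "$@" expands to nothing and awk reads standard input.
  awk 'BEGIN   { nl = 0; nw = 0 }
               { nl++ ; nw += NF }
       END     { print "Lines:", nl, "words:", nw }' "$@"
}

printf 'a b\nc\n' | shwc_sketch
```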
    

    Awk's limitations

    Awk is useful for simple data processing. It is less useful when things get more complex, for a few reasons. First, if your data file is huge, you'll do better to write a C program (using for example the fields library from CS302/360) because it will be more efficient, sometimes by a factor of 60 or more. Second, once you start writing procedure calls in awk, it seems to me you may as well be writing C code. Third, you will often find awk's lack of double indirection and its string processing cumbersome and inefficient.

    Awk is not a good language for string processing. Irritatingly, it doesn't let you get at string elements with array operations. I.e. the following will fail:

    UNIX> cat sp.awk
            { s = $1 ; s[0] = 'a' ; print s }
    UNIX> awk -f sp.awk input
    awk: syntax error near line 1
    awk: illegal statement near line 1
    UNIX> 
    
    Of course, sed is ideal for string processing, so often you can get what you want with a combination of sed and awk.
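    Within awk itself, the built-in substr() function gives a workaround: rebuild the string from the new character and the rest of the old one. A minimal sketch, with made-up input:

```shell
# substr(s, 2) is everything from the second character on.
out=$(echo "Jimmy Carter" | awk '{ s = $1; s = "a" substr(s, 2); print s }')
printf '%s\n' "$out"
```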

    Nawk has much more built into it than awk, and accepts awk as a subset, so if you want to do something in awk but can't, check out nawk. I'm not a big nawk user, so I won't give you a big sell on nawk, but you should look at the man page.