Regular Expressions


  • Brad Vander Zanden
  • This lecture is extensively copied and adapted from Dr. Plank's Grep lecture in his original scripts & utilities course
    These notes introduce regular expressions. They are useful in two quite different contexts:

    1. Scripting languages provide them as a way of specifying patterns to be matched in documents.

    2. Compiler generators provide them as a way of specifying the acceptable "tokens" (called lexemes) in a language.

    Different languages use different syntax to specify regular expressions. These notes use the Unix system syntax, since a wide variety of Unix tools use it and Perl uses it as well. Unfortunately, Python and Emacs use a somewhat different syntax, so what you learn here does not carry over directly to those two tools.


    grep

    We will use the Unix utility grep to illustrate how regular expressions can be used for pattern matching. The name grep comes from the ed command g/re/p, which stands for ``globally search for a regular expression and print''. Its syntax is
    UNIX> grep pattern [ files ]
    
    If you don't specify files on the command line, grep reads standard input. It prints all lines in the specified files that contain the pattern. If you specify more than one file on the command line, it prefixes each line with the name of the file that it came from. Examples:
    UNIX> grep penny md
    He will get but a penny a day
    UNIX> grep penny < md
    He will get but a penny a day
    UNIX> grep all md sth
    sth:Our shadow's taller than our souls
    sth:There walks a lady we all know
    sth:Why all that glitters is not gold
    sth:These lyrics are all old as mold
    UNIX> 
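The file-name prefix rule is easy to check with a short script. This is only a minimal sketch: it recreates md in a scratch directory from the listing that appears later in these notes, and proverb is a made-up second file whose only job is to give grep two file arguments.

```shell
# Work in a scratch directory so nothing real is touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# md as listed in these notes; proverb is a hypothetical second file.
printf '%s\n' 'See Saw, Margery Daw' '   Johnny will have a new Master' \
              'He will get but a penny a day' '   Because this poem is a disaster!' > md
printf '%s\n' 'a penny saved' 'is a penny earned' > proverb

one_file=$(grep penny md)            # one file: no file-name prefix
from_stdin=$(grep penny < md)        # standard input: same output
two_files=$(grep penny md proverb)   # two files: every line gets a "name:" prefix

printf '%s\n' "$one_file" "$from_stdin" "$two_files"
```

The first two results are identical, and only the third carries the md: and proverb: prefixes.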
    
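    The pattern is a ``regular expression.'' While they're not exactly the same as regular expressions in something like CS380, they're pretty close. I'll borrow from the grep and ed man pages to define regular expressions. An ordinary character matches itself. A list of characters in square brackets, such as [abc], matches any single one of those characters. A range such as [a-c] may be used inside the brackets, and if the list begins with ^, as in [^abc], it matches any single character that is not in the list.

    Ok, so this means that if you want to grep for any digit, you can use [0-9]. If you want all lower case letters, use [a-z], and for all lower and upper case letters, use [a-zA-Z]. It's always best to use single quotes when you're specifying patterns, so that the shell passes them to grep untouched. Here are some examples: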
    UNIX> cat greptest
    Jim Plank
    This string contains no numbers
    This string does though (1)
    -9.00
    G0 V0LS
    UNIX> grep '[Gg]' greptest
    This string contains no numbers
    This string does though (1)
    G0 V0LS
    UNIX> grep '[0-9]' greptest
    This string does though (1)
    -9.00
    G0 V0LS
    UNIX> grep '[A-Z]' greptest
    Jim Plank
    This string contains no numbers
    This string does though (1)
    G0 V0LS
    UNIX> grep '[^A-Za-z ]' greptest
    This string does though (1)
    -9.00
    G0 V0LS
    UNIX> 
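One thing that trips people up is that [^0-9] does not mean "lines without digits"; it means "lines that contain at least one character that is not a digit". The sketch below contrasts the two, using the standard -c flag (print a count of matching lines) and -v flag (select the lines that do not match). The scratch copy of greptest is recreated from the listing above.

```shell
# Recreate greptest from the listing in these notes, in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'Jim Plank' 'This string contains no numbers' \
              'This string does though (1)' '-9.00' 'G0 V0LS' > "$dir/greptest"

digits=$(grep -c '[0-9]' "$dir/greptest")        # lines containing at least one digit
no_digits=$(grep -c -v '[0-9]' "$dir/greptest")  # -v inverts: lines with no digit at all
non_digit=$(grep -c '[^0-9]' "$dir/greptest")    # lines containing at least one non-digit

echo "$digits $no_digits $non_digit"             # prints: 3 2 5
```

Every line of greptest contains some non-digit character (even -9.00 has the minus sign and the period), so the last count is 5, not 2.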
    
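    A period matches any single character. A ^ at the start of a pattern matches the beginning of a line, and a $ at the end of a pattern matches the end of a line. So, to grep for lines with exactly 9 characters (note that the newline doesn't count), do: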
    UNIX> grep '^.........$' greptest 
    Jim Plank
    UNIX> 
    
    To grep for lines with at least 9 characters, do:
    UNIX> grep '.........' greptest 
    Jim Plank
    This string contains no numbers
    This string does though (1)
    UNIX>
    
    To grep for lines that end with two digits, do:
    UNIX> grep '[0-9][0-9]$' greptest
    -9.00
    UNIX>
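Note that ^ means two different things depending on where it appears: at the start of a pattern it anchors the match to the beginning of the line, and as the first character inside brackets it negates the list. Here is a small sketch that uses both at once, again on a scratch copy of greptest. Setting LC_ALL=C is my own precaution, since ranges like A-Z are only guaranteed to mean the ASCII upper case letters in that locale.

```shell
# Letter ranges such as A-Z are only well defined in the C locale.
export LC_ALL=C

dir=$(mktemp -d)
printf '%s\n' 'Jim Plank' 'This string contains no numbers' \
              'This string does though (1)' '-9.00' 'G0 V0LS' > "$dir/greptest"

starts_this=$(grep -c '^This' "$dir/greptest")      # lines that begin with "This"
starts_nonupper=$(grep '^[^A-Z]' "$dir/greptest")   # first character is not upper case
ends_upper=$(grep '[A-Z]$' "$dir/greptest")         # last character is upper case

printf '%s\n' "$starts_this" "$starts_nonupper" "$ends_upper"
```

The three results are 2, then -9.00, then G0 V0LS.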
    
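    The sequences \< and \> match the beginning and the end of a word, respectively. Examples: (don't forget the quotes when using the greater-than and less-than signs, or the shell will treat them as I/O redirection).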
    UNIX> grep all sth
    Our shadow's taller than our souls
    There walks a lady we all know
    Why all that glitters is not gold
    These lyrics are all old as mold
    UNIX> grep '\<.ll\>' sth
    There walks a lady we all know
    Why all that glitters is not gold
    These lyrics are all old as mold
    UNIX> grep 'dow\>' sth
    Our shadow's taller than our souls
    UNIX> grep '\<.\>' sth
    Our shadow's taller than our souls       (matching the s in "shadow's")
    There walks a lady we all know
    UNIX> 
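To see what the word anchors buy you, compare a plain substring search with a whole-word search. A caveat on this sketch: \< and \> are not part of the POSIX definition of basic regular expressions, and neither is the -w flag, although GNU grep supports both, so treat it as something to try on your own system rather than a guarantee. The file sample holds just two of the sth lines shown above.

```shell
dir=$(mktemp -d)
printf '%s\n' "Our shadow's taller than our souls" \
              'There walks a lady we all know' > "$dir/sample"

substring=$(grep -c 'all' "$dir/sample")        # "taller" contains "all", so both lines match
whole_word=$(grep -c '\<all\>' "$dir/sample")   # only the line with the word "all"
with_flag=$(grep -c -w 'all' "$dir/sample")     # -w asks for a whole-word match as well

echo "$substring $whole_word $with_flag"        # prints: 2 1 1
```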
    
    A single-character expression followed by * matches zero or more occurrences of that expression. Note the ``zero'': the following will match all lines, even though none have Z's:
    UNIX> grep 'Z*' md
    See Saw, Margery Daw
       Johnny will have a new Master
    He will get but a penny a day
       Because this poem is a disaster!
    UNIX> 
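If what you really want is "one or more", the usual trick in this syntax is to write the character once and then star a second copy. A minimal sketch on a scratch copy of md:

```shell
dir=$(mktemp -d)
printf '%s\n' 'See Saw, Margery Daw' '   Johnny will have a new Master' \
              'He will get but a penny a day' '   Because this poem is a disaster!' > "$dir/md"

zero_or_more=$(grep -c 'Z*' "$dir/md")   # zero Z's is fine, so every line matches

# grep exits with status 1 when nothing matches; "|| true" keeps that from
# aborting a script that runs under "set -e".
one_or_more=$(grep -c 'ZZ*' "$dir/md" || true)   # needs at least one Z

echo "$zero_or_more $one_or_more"        # prints: 4 0
```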
    
    Here are some more examples. The first greps for two words separated by a space (actually, since * can match zero characters, it matches any line that contains exactly one space, so it will also match a line that is just a single space, or a line with a word only before or only after the space). The second greps for a period followed by any number of zeros, and then the end of line. The last greps for any line with two zeros somewhere.
    UNIX> grep '^[^ ]* [^ ]*$' greptest
    Jim Plank
    G0 V0LS
    UNIX> grep '\.0*$' greptest
    -9.00
    UNIX> grep '0.*0' greptest
    -9.00
    G0 V0LS
    UNIX> 
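To insist on a real word on each side of the space, each [^ ]* can be given a mandatory first character. The sketch below runs both versions against a made-up file, words, whose lines are chosen to hit the edge cases: a leading space, a trailing space, a lone space, and two spaces.

```shell
dir=$(mktemp -d)
# Hypothetical test lines: only the first has exactly two words and one space.
printf '%s\n' 'Jim Plank' ' leading' 'trailing ' ' ' 'three word line' > "$dir/words"

loose=$(grep -c '^[^ ]* [^ ]*$' "$dir/words")         # any line with exactly one space
strict=$(grep '^[^ ][^ ]* [^ ][^ ]*$' "$dir/words")   # non-empty word on both sides

printf '%s\n' "$loose" "$strict"
```

The loose pattern matches four of the five lines; the strict one matches only Jim Plank.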
    
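    An expression followed by \{m\} matches exactly m occurrences of it, \{m,\} matches m or more occurrences, and \{m,n\} matches between m and n occurrences. So, some more examples. The first is equivalent to grepping for 0.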
    UNIX> grep '0\{1\}' greptest
    -9.00
    G0 V0LS
    
    This is equivalent to grepping for 000*:
    UNIX> grep '0\{2,\}' greptest
    -9.00
    
    Here we grep for 5-letter words containing just lower case letters, then 5-letter words, then words of at least 5 letters:
    UNIX> grep '\<[a-z]\{5\}\>' greptest
    UNIX> grep '\<[A-Za-z]\{5\}\>' greptest
    Jim Plank
    UNIX> grep '\<[A-Za-z]\{5,\}\>' greptest
    Jim Plank
    This string contains no numbers
    This string does though (1)
    UNIX> 
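Intervals also tidy up the earlier line-length patterns, where the periods had to be counted by hand. A quick sketch on a scratch copy of greptest:

```shell
dir=$(mktemp -d)
printf '%s\n' 'Jim Plank' 'This string contains no numbers' \
              'This string does though (1)' '-9.00' 'G0 V0LS' > "$dir/greptest"

nine_dots=$(grep '^.........$' "$dir/greptest")   # nine periods, counted by hand
interval=$(grep '^.\{9\}$' "$dir/greptest")       # same pattern written with an interval
five_to_seven=$(grep '^.\{5,7\}$' "$dir/greptest")  # lines of 5 to 7 characters

printf '%s\n' "$nine_dots" "$interval" "$five_to_seven"
```

Both nine-character patterns print Jim Plank, and the 5-to-7 pattern prints -9.00 and G0 V0LS.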
    
    If you want to make sure that grep prints out the file name of the file that the line comes from, include /dev/null on the command line. Then you'll have at least two files on the command line, and grep will be sure to print the file name:
    UNIX> grep '\<.ld\>' sth
    These lyrics are all old as mold
    UNIX> grep '\<.ld\>' sth /dev/null
    sth:These lyrics are all old as mold
    UNIX> 
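Where this trick earns its keep is in a loop (or a find -exec) that hands grep one file at a time: without /dev/null you would get matching lines with no hint of where they came from. A minimal sketch, in which notes is a made-up file added so that the loop has more than one file to visit:

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1

printf '%s\n' 'These lyrics are all old as mold' > sth
printf '%s\n' 'old habits' > notes    # hypothetical second file

# Each grep call sees only one real file, but /dev/null makes it two
# arguments, so the file name is printed anyway.
hits=$(for f in notes sth; do grep 'old' "$f" /dev/null; done)

echo "$hits"
```

Some versions of grep, GNU grep among them, also have a -H flag for this purpose, but the /dev/null trick works wherever grep does.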
    
    grep can do far more than this -- you need to read the man page to figure it all out. You should also read about egrep and fgrep, which on current systems are spelled grep -E and grep -F.
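As a small taste of those two variants, here is a sketch using the standard -E (extended regular expressions) and -F (fixed strings) flags. The sample file borrows three greptest lines and adds a made-up fourth line, 1000, so that the last two commands give different answers.

```shell
dir=$(mktemp -d)
printf '%s\n' 'Jim Plank' '-9.00' 'G0 V0LS' '1000' > "$dir/sample"   # "1000" is hypothetical

basic=$(grep '0\{2,\}' "$dir/sample")          # basic syntax: braces need backslashes
extended=$(grep -E '0{2,}' "$dir/sample")      # extended syntax: they do not
either=$(grep -E 'Plank|V0LS' "$dir/sample")   # extended syntax adds | for alternation

regex_dot=$(grep -c '.00' "$dir/sample")       # . is a wildcard: matches -9.00 and 1000
fixed_dot=$(grep -c -F '.00' "$dir/sample")    # -F takes the pattern literally: only -9.00

printf '%s\n' "$basic" "$extended" "$either" "$regex_dot $fixed_dot"
```

The first two commands print the same two lines (-9.00 and 1000), the third prints Jim Plank and G0 V0LS, and the counts are 2 and 1.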