Scripts and Utilities -- Cat/Sort/At/Grep/Find lecture


  • Jim Plank
  • Directory: /home/cs494/notes/Cat
  • This file: http://www.cs.utk.edu/~plank/plank/classes/cs494/494/notes/Cat/lecture.html
  • Lecture links: http://www.cs.utk.edu/~plank/plank/classes/cs494/494/notes/Cat/links.html
  • Email questions and answers
    This lecture is a bit of a hodgepodge. It will cover cat, sort, at, grep and find.

    cat

    Cat is probably the most basic Unix command. If no input files are specified, cat copies standard input to standard output. If input files are specified, then cat copies the input files, in order, to standard output. Thus, cat is ideal for concatenating two files into a third:
    UNIX> cat f1
    And she's buying
    UNIX> cat f2
    a stairway to heah ven
    UNIX> cat f1 f2
    And she's buying
    a stairway to heah ven
    UNIX> cat f1 f2 > f3
    UNIX> cat f3
    And she's buying
    a stairway to heah ven
    UNIX> 
    
    Remember, when using either the Bourne shell or csh, you cannot redirect standard output to a file that you're also using as input. Why? Because both shells create (and truncate) the output file before running the command. Thus, you'll lose the input file before the command gets executed:
    UNIX> cat f3
    And she's buying
    a stairway to heah ven
    UNIX> cat f3 > f3
    cat: input f3 is output
    UNIX> cat f3
    UNIX> cat f1 f2 > f3
    UNIX> cat f3
    And she's buying
    a stairway to heah ven
    UNIX> cat < f3 > f3
    cat: input - is output
    UNIX> cat f3
    UNIX> 
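
    One common way around this, sketched here, is to send the output to a temporary file and move it over the original only if the command succeeds. The scratch directory and the name f3.tmp are made up for the illustration:

```shell
# Work in a scratch directory so that nothing real gets clobbered.
dir=$(mktemp -d)
printf "And she's buying\n" > "$dir/f1"
printf "a stairway to heah ven\n" > "$dir/f3"

# Prepend f1 to f3. Writing straight to f3 would truncate it first,
# so write to a temporary file and rename it once cat has succeeded.
cat "$dir/f1" "$dir/f3" > "$dir/f3.tmp" && mv "$dir/f3.tmp" "$dir/f3"

cat "$dir/f3"
```

    The && matters: if cat fails, the original f3 is left alone.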
    

    cat options

    There are a few command line arguments to cat that make it a more useful command than it may first appear. Read the man page for a full description. First, -ve displays non-printing characters, and a $ for each newline, which sometimes tells you surprising things:
    UNIX> cat sth
    And when we wind on down the road
    Our shadow's taller than our souls
    There walks a lady we all know
    Who shines bright lights and wants to know
    Why all that glitters is not gold
    These lyrics are all old as mold
    UNIX> cat -ve sth
    And when we wind on down the road$
    Our shadow's taller than our souls$
    There walks a lady we all know$
    Who shines subliminal message^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^Hbright lights and wants to know                   $
    Why all that glitters is not gold$
    These lyrics are all old as mold$
    UNIX> 
    
    Second, -n prepends the line number to each line. E.g.
    UNIX> cat md
    See Saw, Margery Daw
       Johnny will have a new Master
    He will get but a penny a day
       Because this poem is a disaster!
    UNIX> cat -n md
         1  See Saw, Margery Daw
         2     Johnny will have a new Master
         3  He will get but a penny a day
         4     Because this poem is a disaster!
    UNIX> 
    
    Third, if you intermix - with the filenames, cat will use standard input when it reaches the -:
    UNIX> echo "This is a tiresome song" | cat f1 - f2
    And she's buying
    This is a tiresome song
    a stairway to heah ven
    UNIX> 
    
    Irritatingly, even cat has different implementations on different machines. Thus, -e works in one way on one machine, and another on another. For example, try -e on SunOS and Solaris machines. Or -s. This is very frustrating if you are trying to write portable shell scripts. My philosophy on this is to not use options which are not portable. This means that before you write a shell script that you are going to send to someone you should test it out on many machines and operating systems. A good way to break a shell script that runs on a BSD machine (like SunOS or Ultrix) is to try it on a System V machine (like Solaris or HPUX). It is frustrating, but you must do it.

    Sort

    Like cat, sort is a straightforward program: it sorts files. If you give it no command line options, it sorts lexicographically by lines. Note that initial spaces are included, which is sometimes not what you want:
    UNIX> cat md
    See Saw, Margery Daw
       Johnny will have a new Master
    He will get but a penny a day
       Because this poem is a disaster!
    UNIX> sort md
       Because this poem is a disaster!
       Johnny will have a new Master
    He will get but a penny a day
    See Saw, Margery Daw
    UNIX> 
    
    Use -r to sort in reverse order, and -n to sort numerically rather than lexicographically:
    UNIX> cat gstats
    9   8  8.99  6.99 Steve Elkington
    14 10 12.28  7.19 Brad "Chicken Neck" Faxon
    1   1 10.87 10.87 Tom Kalinowski
    12 10 10.13  7.16 Tom Lehman
    7   6  9.46  6.87 Greg Norman
    14 11 10.02  5.94 Jesper Parnevik
    9   9  5.10  5.10 Nick Price
    10 10  5.63  5.63 Tiger Woods
    UNIX> sort gstats
    1   1 10.87 10.87 Tom Kalinowski
    10 10  5.63  5.63 Tiger Woods
    12 10 10.13  7.16 Tom Lehman
    14 10 12.28  7.19 Brad "Chicken Neck" Faxon
    14 11 10.02  5.94 Jesper Parnevik
    7   6  9.46  6.87 Greg Norman
    9   8  8.99  6.99 Steve Elkington
    9   9  5.10  5.10 Nick Price
    UNIX> sort -n gstats
    1   1 10.87 10.87 Tom Kalinowski
    7   6  9.46  6.87 Greg Norman
    9   8  8.99  6.99 Steve Elkington
    9   9  5.10  5.10 Nick Price
    10 10  5.63  5.63 Tiger Woods
    12 10 10.13  7.16 Tom Lehman
    14 10 12.28  7.19 Brad "Chicken Neck" Faxon
    14 11 10.02  5.94 Jesper Parnevik
    UNIX> sort -r gstats
    9   9  5.10  5.10 Nick Price
    9   8  8.99  6.99 Steve Elkington
    7   6  9.46  6.87 Greg Norman
    14 11 10.02  5.94 Jesper Parnevik
    14 10 12.28  7.19 Brad "Chicken Neck" Faxon
    12 10 10.13  7.16 Tom Lehman
    10 10  5.63  5.63 Tiger Woods
    1   1 10.87 10.87 Tom Kalinowski
    UNIX> sort -nr gstats
    14 11 10.02  5.94 Jesper Parnevik
    14 10 12.28  7.19 Brad "Chicken Neck" Faxon
    12 10 10.13  7.16 Tom Lehman
    10 10  5.63  5.63 Tiger Woods
    9   9  5.10  5.10 Nick Price
    9   8  8.99  6.99 Steve Elkington
    7   6  9.46  6.87 Greg Norman
    1   1 10.87 10.87 Tom Kalinowski
    UNIX> 
    
    Sort lets you break up the input into "fields", and sort on a particular field. The default delimiter for fields is white space. You specify the sorting field with +n, where n is the number of the field. Sort is zero-indexed, so the first field is field zero. Thus, if you want to sort (numerically) using the third column of gstats, you do:
    UNIX> sort -n +2 gstats
    9   9  5.10  5.10 Nick Price
    10 10  5.63  5.63 Tiger Woods
    9   8  8.99  6.99 Steve Elkington
    7   6  9.46  6.87 Greg Norman
    14 11 10.02  5.94 Jesper Parnevik
    12 10 10.13  7.16 Tom Lehman
    1   1 10.87 10.87 Tom Kalinowski
    14 10 12.28  7.19 Brad "Chicken Neck" Faxon
    UNIX> 
    
    And if you want to sort by the golfers' first names, you do:
    UNIX> sort +4 gstats
    14 10 12.28  7.19 Brad "Chicken Neck" Faxon
    7   6  9.46  6.87 Greg Norman
    14 11 10.02  5.94 Jesper Parnevik
    9   9  5.10  5.10 Nick Price
    9   8  8.99  6.99 Steve Elkington
    10 10  5.63  5.63 Tiger Woods
    1   1 10.87 10.87 Tom Kalinowski
    12 10 10.13  7.16 Tom Lehman
    UNIX> 
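
    Many current versions of sort no longer accept the +n form. The POSIX way to name a sort field is -k, which counts fields from one rather than zero. Here is a sketch of the two sorts above in that form, assuming a POSIX sort. The scratch directory is made up, and LC_ALL=C is there only to make the ordering predictable:

```shell
dir=$(mktemp -d)
cat > "$dir/gstats" <<'EOF'
9   8  8.99  6.99 Steve Elkington
14 10 12.28  7.19 Brad "Chicken Neck" Faxon
1   1 10.87 10.87 Tom Kalinowski
12 10 10.13  7.16 Tom Lehman
7   6  9.46  6.87 Greg Norman
14 11 10.02  5.94 Jesper Parnevik
9   9  5.10  5.10 Nick Price
10 10  5.63  5.63 Tiger Woods
EOF

# Like "sort -n +2": numeric sort with the key starting at the third field.
by_third=$(LC_ALL=C sort -n -k 3 "$dir/gstats")

# Like "sort +4": the key starts at the fifth field (the first name).
# -b skips the blanks that separate the field from the one before it.
by_name=$(LC_ALL=C sort -b -k 5 "$dir/gstats")

printf '%s\n' "$by_third" | head -n 1
printf '%s\n' "$by_name" | head -n 1
```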
    
    When sorting lexicographically, sort includes white space, which seems odd. To have it ignore leading white space in a field, you use the -b option. Thus, to sort a file lexicographically, and ignore leading white space, you do:
    UNIX> sort -b +0 md
       Because this poem is a disaster!
    He will get but a penny a day
       Johnny will have a new Master
    See Saw, Margery Daw
    UNIX> 
    
    The last two important options of sort are -u, which strips out duplicates, and -f, which ignores the distinction between upper and lower case. (The -i option is different: it ignores non-printing characters.) As always, read the man page for more info and more options.
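
    A small sketch of -u, alone and combined with -f (the POSIX option that folds lower case into upper case for comparisons). The word list is made up for the illustration:

```shell
words=$(printf 'pear\nApple\napple\nPear\npear\n')

# -u keeps one copy of each distinct line; "pear" appears twice.
uniq_count=$(printf '%s\n' "$words" | LC_ALL=C sort -u | wc -l | tr -d ' ')

# With -f as well, Apple/apple and Pear/pear compare equal,
# so -u keeps only one line from each pair.
folded_count=$(printf '%s\n' "$words" | LC_ALL=C sort -f -u | wc -l | tr -d ' ')

echo "$uniq_count distinct lines, $folded_count when case is ignored"
```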

    At

    At is a command that lets you submit a job to be executed sometime in the future. The syntax is:
    at [ -s ] time [date] [ script ]
    
    The -s says to treat the script as a Bourne shell rather than csh script. If you don't specify a script file, then it uses standard input.

    When the proper time comes about, the operating system (specifically, the cron daemon) will execute a process under your ownership that first cd's to the directory in which you made the at call, and then runs the script. If there is output to the job, the OS will email you the output.

    At is a great program. The main problem with it is that there is no real standard at syntax. Some OS's don't support the -s command line argument. Some let you specify relative times (like "at now + 1 hour"). Some don't. However, most of the time it doesn't matter. If you give at a time without a date, it assumes that you mean the nearest date with that time.

    For example, suppose it is 4:00 PM, on June 17, 1997. The following at commands will send mail to our department head at 2:00 AM on June 18. (Don't actually do this -- if so, you are responsible for the outcome, not me):

    UNIX> cat wardmail
    Hi Bob,
    
    See how late I'm working?  I think I deserve a raise!!!
    
    Dr. J
    UNIX> at 2am
    at> mail -s 'an idea' ward < wardmail
    at> 
    warning: commands will be executed using /bin/csh
    job 866613600.a at Wed Jun 18 02:00:00 1997
    UNIX> at 2am June 18
    at> mail -s 'an idea' ward < wardmail
    at> 
    warning: commands will be executed using /bin/csh
    job 866613601.a at Wed Jun 18 02:00:00 1997
    UNIX> echo "mail -s 'an idea' ward < wardmail" | at 2am June 18
    warning: commands will be executed using /bin/csh
    job 866613602.a at Wed Jun 18 02:00:00 1997
    UNIX>
    
    To see what jobs you have currently queued up, do atq or at -l:
    UNIX> atq
     Rank     Execution Date     Owner     Job         Queue   Job Name
      1st   Jun 18, 1997 02:00   jplank  866613600.a     a     stdin
      2nd   Jun 18, 1997 02:00   jplank  866613601.a     a     stdin
      3rd   Jun 18, 1997 02:00   jplank  866613602.a     a     stdin
    UNIX> at -l
    866613600.a     Wed Jun 18 02:00:00 1997
    866613601.a     Wed Jun 18 02:00:01 1997
    866613602.a     Wed Jun 18 02:00:02 1997
    UNIX> 
    
    And if you have a change of heart and want to remove the jobs, use at -r or atrm. Note that not all systems support all of these -- sometimes you have to hunt around to figure out how at and related commands work. As always, read the man page.
    UNIX> at -r 866613600.a
    UNIX> at -l
    866613601.a     Wed Jun 18 02:00:01 1997
    866613602.a     Wed Jun 18 02:00:02 1997
    UNIX> atrm 866613601.a 866613602.a
    866613601.a: removed
    866613602.a: removed
    UNIX> at -l
    
    If at -s does not work on your system and you want to run a Bourne shell script, do
    UNIX> echo 'sh scriptname' | at time
    

    grep

    Grep stands for ``global regular expression print'' (from the ed command g/re/p). Its syntax is
    UNIX> grep pattern [ files ]
    
    If you don't specify files on the command line, then it will use standard input. It prints out all lines in the specified files that contain the pattern. If you specified more than one file on the command line, then it will prepend the line with the file that it came from. Examples:
    UNIX> grep penny md
    He will get but a penny a day
    UNIX> grep penny < md
    He will get but a penny a day
    UNIX> grep all md sth
    sth:Our shadow's taller than our souls
    sth:There walks a lady we all know
    sth:Why all that glitters is not gold
    sth:These lyrics are all old as mold
    UNIX> 
    
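    The pattern is a ``regular expression.'' While they're not exactly the same as regular expressions in something like CS380, they're pretty close. Borrowing from the grep and ed man pages, the basic pieces are: an ordinary character matches itself; a period (.) matches any single character; a list of characters in square brackets matches any one character in the list, and a range such as 0-9 may be used inside the brackets; if the first character after the [ is ^, the brackets match any one character not in the list; a ^ at the start of the pattern matches the beginning of a line, and a $ at the end of the pattern matches the end of a line.

    Ok, so this means that if you want to grep for any digit, you can use [0-9]. If you want all lower case letters, use [a-z], and all lower and upper case letters, use [a-zA-Z]. It's always best to use single quotes when you're specifying patterns. Here are some examples: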
    UNIX> cat greptest
    Jim Plank
    This string contains no numbers
    This string does though (1)
    -9.00
    G0 V0LS
    UNIX> grep '[Gg]' greptest
    This string contains no numbers
    This string does though (1)
    G0 V0LS
    UNIX> grep '[0-9]' greptest
    This string does though (1)
    -9.00
    G0 V0LS
    UNIX> grep '[A-Z]' greptest
    Jim Plank
    This string contains no numbers
    This string does though (1)
    G0 V0LS
    UNIX> grep '[^A-Za-z ]' greptest
    This string does though (1)
    -9.00
    G0 V0LS
    UNIX> 
    
    So, to grep for lines with exactly 9 characters, (note the newline doesn't count) do:
    UNIX> grep '^.........$' greptest 
    Jim Plank
    UNIX> 
    
    To grep for lines with at least 9 characters, do:
    UNIX> grep '.........' greptest 
    Jim Plank
    This string contains no numbers
    This string does though (1)
    UNIX>
    
    To grep for lines that end with two numbers, do:
    UNIX> grep '[0-9][0-9]$' greptest
    -9.00
    UNIX>
    
    The sequences \< and \> match the beginning and the end of a word, respectively. Examples (don't forget the quotes when using the greater-than and less-than signs):
    UNIX> grep all sth
    Our shadow's taller than our souls
    There walks a lady we all know
    Why all that glitters is not gold
    These lyrics are all old as mold
    UNIX> grep '\<.ll\>' sth
    There walks a lady we all know
    Why all that glitters is not gold
    These lyrics are all old as mold
    UNIX> grep 'dow\>' sth
    Our shadow's taller than our souls
    UNIX> grep '\<.\>' sth
    Our shadow's taller than our souls       (matching the s in "shadow's")
    There walks a lady we all know
    UNIX> 
    
    A * after a character matches zero or more occurrences of that character. Note that it matches zero or more. So, the following will match all lines, even though none have Z's:
    UNIX> grep 'Z*' md
    See Saw, Margery Daw
       Johnny will have a new Master
    He will get but a penny a day
       Because this poem is a disaster!
    UNIX> 
    
    Here are some more examples. The first greps for two words separated by a space (actually, since * can match zero, this will also match a single space, or a word before or following a single space). The second greps for a period followed by any number of zeros, and then the end of line. The last greps for any line with two zeros somewhere.
    UNIX> grep '^[^ ]* [^ ]*$' greptest
    Jim Plank
    G0 V0LS
    UNIX> grep '\.0*$' greptest
    -9.00
    UNIX> grep '0.*0' greptest
    -9.00
    G0 V0LS
    UNIX> 
    
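    A character followed by \{m\} matches exactly m occurrences of it, \{m,\} matches m or more, and \{m,n\} matches between m and n. So, some more examples. The first is equivalent to grepping for 0.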
    UNIX> grep '0\{1\}' greptest
    -9.00
    G0 V0LS
    
    This is equivalent to grepping for 000*:
    UNIX> grep '0\{2,\}' greptest
    -9.00
    
    Here we grep for 5-letter words containing just lower case letters, then 5-letter words, then words of at least 5 letters:
    UNIX> grep '\<[a-z]\{5\}\>' greptest
    UNIX> grep '\<[A-Za-z]\{5\}\>' greptest
    Jim Plank
    UNIX> grep '\<[A-Za-z]\{5,\}\>' greptest
    Jim Plank
    This string contains no numbers
    This string does though (1)
    UNIX> 
    
    If you want to make sure that grep prints out the file name of the file that the line comes from, include /dev/null on the command line. Then you'll have at least two files on the command line, and grep will be sure to print the file name:
    UNIX> grep '\<.ld\>' sth
    These lyrics are all old as mold
    UNIX> grep '\<.ld\>' sth /dev/null
    sth:These lyrics are all old as mold
    UNIX> 
    
    grep can do far more than this -- you need to read the man page to figure it all out. Also you should read about egrep and fgrep.
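
    On current systems egrep and fgrep are usually spelled grep -E (extended regular expressions, where the braces need no backslashes) and grep -F (fixed strings, where no character is special). A sketch using the greptest file from above, recreated in a made-up scratch directory:

```shell
dir=$(mktemp -d)
cat > "$dir/greptest" <<'EOF'
Jim Plank
This string contains no numbers
This string does though (1)
-9.00
G0 V0LS
EOF

# Basic regular expression: the braces need backslashes.
basic=$(grep '0\{2,\}' "$dir/greptest")

# Extended regular expression: the same pattern without the backslashes.
extended=$(grep -E '0{2,}' "$dir/greptest")

# Fixed string: the period is a literal period, not "any character".
fixed=$(grep -F '.0' "$dir/greptest")

# As a regular expression, '.0' means any character followed by a 0,
# so it also matches the "G0" in the last line. -c counts matching lines.
regex_count=$(grep -c '.0' "$dir/greptest")

echo "basic: $basic"
echo "extended: $extended"
echo "fixed: $fixed"
echo "lines matching .0 as a regular expression: $regex_count"
```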

    find

    Find is a command that does recursive directory traversal. It is most useful when you need to do one of three things:
    1. Find a file with a specific name.
    2. Grep through all of your files to find a specific word or pattern in one.
    3. Remove a bunch of files with specific names in all your directories.
    I'll go over these three examples. Read the man page to figure out how to do other cool things with find.

    First, to find files with a specific name in all directories reachable from directory x, do:

    find x -name name -print
    
    For example, to find all .c files reachable from your home directory, do:
    UNIX> find $HOME -name '*.c' -print
    /mahogany/homes/plank/src/jgraph/work/draw.c
    /mahogany/homes/plank/src/jgraph/work/edit.c
    /mahogany/homes/plank/src/jgraph/work/exit.c
    /mahogany/homes/plank/src/jgraph/work/jgraph.c
    /mahogany/homes/plank/src/jgraph/work/libmalloc.c
    /mahogany/homes/plank/src/jgraph/work/list.c
    /mahogany/homes/plank/src/jgraph/work/printline.c
    /mahogany/homes/plank/src/jgraph/work/prio_list.c
    /mahogany/homes/plank/src/jgraph/work/process.c
    /mahogany/homes/plank/src/jgraph/work/show.c
    ...
    UNIX>
    
    Note that you should put the '*' in single quotes. Also, note that find will not traverse '..', nor will it traverse soft links to other directories. Sometimes this is a drag, but it's pretty much for the best. Suppose you wanted to find all the core files reachable from the current directory. Then you do:
    UNIX> find . -name core -print
    
    Note that find matches file names using the shell's wildcarding, and not using regular expressions. In other words, if you want to find all your files with two-letter names, do:
    UNIX> find $HOME -name '??' -print
    
    If you want to print all files reachable from your home directory, do
    UNIX> find $HOME -print
    
    Using find to find filenames is something I do so much that I have a shell script called jf that finds all files reachable from the current directory whose names contain a given string:
    UNIX> jf html
    find . -name '*html*' -print
    ./bin/dehtml
    ./src/jgraph/jgraph-help.html
    ...
    UNIX> jf README
    find . -name '*README*' -print
    ./bin/README
    ./src/jgraph/work/README
    ./src/jgraph/README
    ./src/rb/README
    ./src/rb/new/README
    ./src/noweb/README
    ./src/bnk/README
    ...
    UNIX>
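
    The jf script itself is not shown in these notes. Judging only from the output above, it echoes the find command and then runs it, so a guess at it, written as a Bourne shell function, might look like this. The scratch directory and the files in it are made up for the demonstration:

```shell
# Hypothetical reconstruction of jf: print the find command, then run it.
jf() {
    echo "find . -name '*$1*' -print"
    find . -name "*$1*" -print
}

# Try it out on a small made-up directory tree.
dir=$(mktemp -d)
mkdir -p "$dir/src/rb"
touch "$dir/src/rb/README" "$dir/src/rb/main.c"

found=$(cd "$dir" && jf README)
printf '%s\n' "$found"
```

    The double quotes around "*$1*" let the shell substitute the argument while still keeping it from expanding the asterisks.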
    
    The second use of find is to grep through all files to find a certain string. For example, suppose I want to find if the word "schizoid" exists in any files reachable from /home/cs494. Then I'd do:
    UNIX> find /home/cs494 -type f -exec grep schizoid {} \;
    Twenty first century schizoid man.
    # Find the word "schizoid" in /home/cs494
    find /home/cs494 -type f -exec grep schizoid {} \;
    find /home/cs494 -type f -exec grep schizoid {} /dev/null \;
    UNIX>
    
    Great, so it exists. Here's how you find out what file it's in:
    UNIX> find /home/cs494 -type f -exec grep schizoid {} /dev/null \;
    /home/cs494/notes/Cat/more_bad_lyrics/schizoid-man:Twenty first century schizoid man.
    /home/cs494/notes/Cat/find3:# Find the word "schizoid" in /home/cs494
    /home/cs494/notes/Cat/find3:find /home/cs494 -type f -exec grep schizoid {} \;
    /home/cs494/notes/Cat/find3:find /home/cs494 -type f -exec grep schizoid {} /dev/null \;
    UNIX> 
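
    If only the file names matter, grep -l prints the name of each matching file once, and ending the -exec with + instead of \; lets find hand many files to a single grep. Both are in POSIX, though very old systems may lack the + form. A sketch in a scratch directory, with file names made up for the demonstration:

```shell
dir=$(mktemp -d)
mkdir -p "$dir/lyrics"
echo 'Twenty first century schizoid man.' > "$dir/lyrics/schizoid-man"
echo 'And she is buying a stairway' > "$dir/lyrics/stairway"

# -l lists the files that contain the word instead of the matching lines.
# "{} +" collects as many file names as will fit onto one grep command line.
matches=$(find "$dir" -type f -exec grep -l schizoid {} +)

printf '%s\n' "$matches"
```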
    
    You'll find that you end up doing things like this more than you'd imagine, especially if you're kind of forgetful, like I am.

    Lastly, suppose that you want to remove all of your core files. Then you do (I put the -i in so that you'll be prompted -- leave it off if you don't want to be prompted):

    UNIX> find $HOME -name core -exec rm -i {} \;
    rm: remove /mahogany/homes/plank/src/ohhell/core? y
    rm: remove /mahogany/homes/plank/src/jgammon/core?  y
    ...
    etc
    
    There's a lot going on here, but this is the simple way to do it. You should read the man page for find to figure out exactly what's going on. Some people find find to be quite confusing, especially the -exec part, where you have to put a backslash before the semi-colon. Give it a read.
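
    To see what -exec is doing, it can help to replace rm with echo first. In the sketch below, find runs the command once per matching file, substituting the file's path for {}. The \; marks the end of the command, and the backslash keeps the shell from treating the semicolon as its own command separator. The directory names are made up, and -type f is added so that only regular files are touched:

```shell
dir=$(mktemp -d)
mkdir -p "$dir/ohhell" "$dir/jgammon"
touch "$dir/ohhell/core" "$dir/jgammon/core" "$dir/jgammon/main.c"

# Dry run: print the rm command that would be run for each core file.
dry_run=$(find "$dir" -name core -type f -exec echo rm {} \;)
printf '%s\n' "$dry_run"

# Real run, without -i so that it does not prompt.
find "$dir" -name core -type f -exec rm {} \;

# Only main.c should be left.
remaining=$(find "$dir" -type f -print)
printf '%s\n' "$remaining"
```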