Scripts and Utilities -- Perl lecture


  • Jim Plank
  • Directory: /home/cs494/notes/Perl
  • This file: http://www.cs.utk.edu/~plank/plank/classes/cs494/494/notes/Perl/lecture.html
  • Lecture links: http://www.cs.utk.edu/~plank/plank/classes/cs494/494/notes/Perl/links.html
  • Email questions and answers

    Perl

    Perl stands for ``practical extraction and report language.'' It's yet another portable language that is useful for writing quick and dirty programs. I am not an experienced Perl hacker, but I have written enough Perl to have formed an opinion. Here it is:

    Plusses: Viewed in its best light, Perl is a language that encapsulates the best features of the shell, sed, grep, awk, tr, C and Cobol. If you are familiar with these tools, you can write very powerful Perl programs rather quickly and easily. In particular, you can write programs that duplicate the functionality of shell scripts but are all in one file (indeed, they are all in one language) and thus are much more efficient.

    Minuses: Perl is a jumble. It contains many, many features from many languages and programs. It contains many differing constructs that implement the same functionality. For example, there are at least 5 ways to perform a one-line if statement. While this is good for the programmer, it is extremely bad for everyone but the programmer (and bad for the programmer who tries to read his own program in 6 months), and has led to Perl being called a ``write-only'' language. There are other minuses as well, but I won't go into them further. You can discover them for yourself. A colleague of mine (Norman Ramsey, at Virginia) responded to an email I sent him about perl, and his response is worth quoting in its entirety:

    The Bottom Line: My opinions aren't quite as strong as Dr. Ramsey's, but in general I am in agreement with him. The one plus that perl has going for it over sh/sed/awk is that you can do it all in one user process, thereby making it more efficient. There are things that I will do in perl in preference to using sh/sed/awk, and in preference to using C. However, in my opinion, its place is just that -- between sh/sed/awk and C -- and not as a substitute for what either does best.

    The debate as to which is better: Perl, Python or Icon (which I'm not teaching in this class) is a heated one. Python and Icon have better language design. Perl has the most familiar regular expression syntax. I won't get into it, but if you look, you can find all sorts of opinions. Of course, it's best to formulate your own opinions by learning all three....


    Perl help

    Unfortunately, perl is huge, so I'll only be able to give you a flavor of it. The perl manual is online in the form of the perl man pages, which are broken up into a number of subsections. Do ``man perl.'' Alternatively, see http://hill.ucs.ualberta.ca/Documentation/Info/by-node/perl-5.003/perl.html for an HTML-ized version of the man pages.

    There are two recommended books in case you want more: ``Learning Perl'' by Schwartz, and ``Programming Perl'' by Wall and Schwartz. Both are published by O'Reilly & Associates.


    Calling Syntax

    Like awk, perl works on a program. You can specify the program as the first argument to perl, or you can put the path of perl's executable on the first line of the perl program, preceded by #!. (You can also specify the program on the command line -- see the man page). My perl manual says that you can expect perl to be found in /usr/bin/perl, but in our department, it's in /usr/local/bin/perl. So much for portability.

    Simple perl programs

    Perl programs are more like C than awk -- you must request lines from standard input rather than getting them by default. Here's the canonical ``hello world'' program in perl:
    UNIX> cat hw.perl
    print "hello world\n";
    UNIX> perl hw.perl
    hello world
    UNIX> 
    
    So, like awk, there is a print statement. Unlike awk, you have to provide your own newline. Print will accept a bunch of comma-separated values, but unlike awk it prints them one after another with nothing in between. There are two scalar types in perl: strings and numbers. All numbers are floating point. Like awk, you can cast at will, and perl will understand.

    You can concatenate strings with the dot operator.

    Like C, perl programs are tokens separated by whitespace. I.e. commands can span lines. You must end all commands with semi-colons.

    UNIX> cat simp.perl
    
    print "Jim\n";
    print 1.55 . "\n";
    print "Jim" . " " . "Plank" . "\n";
    print (("5" + 6) . "\n");
    
    UNIX> perl simp.perl
    Jim
    1.55
    Jim Plank
    11
    UNIX> 
    

    Scalar variables

    Scalar variables must start with a dollar sign and do not need to be declared. You can insert variables into strings by using double-quotes in the style of the shell. Also, you can create strings with single quotes that treat dollar signs and backslashes like normal characters:
    UNIX> cat scalar.perl
    
    $i = 1;
    $j = "2";
    print "$i\n";
    print "$j\n";
    $k = $i + $j;
    print "$k\n";
    print $i . $j . "\n";
    print '$k\n'. "\n";
    
    UNIX> perl scalar.perl
    1
    2
    3
    12
    $k\n
    UNIX> 
    

    Booleans/undef

    In perl, undefined variables have the special value undef. This can be used in expressions, etc, and often makes life convenient. When you try to use undef as a string, you get "", and when you try to use it as a number, you get zero.

    Boolean expressions in perl are kind of odd. Undef is false, as is the null string, and anything that casts to a string containing a single zero. Everything else is true. Therefore, all numbers but zero are true, as are all strings but "" and "0".

    You compare numbers with the C comparative operators. You can use eq, ne, lt, gt, le, and ge to compare strings lexicographically.
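
    The rules above can be checked with a small sketch like the following. It is written in the style of the other examples in these notes; the variable names are made up for illustration, and $nothing is deliberately never assigned.

```perl
# $nothing is never assigned, so its value is undef.
$as_string = "[" . $nothing . "]";    # undef acts like "" in a string
$as_number = $nothing + 5;            # undef acts like 0 in arithmetic
print "$as_string $as_number\n";

# "0.0" is a non-empty string other than "0", so it is true,
# even though the number 0.0 is false.
if ("0.0") { $t1 = "true"; } else { $t1 = "false"; }
if (0.0)   { $t2 = "true"; } else { $t2 = "false"; }
if ("")    { $t3 = "true"; } else { $t3 = "false"; }
print "$t1 $t2 $t3\n";

# < compares as numbers, lt compares as strings.
if ("10" < "9")  { $numeric = "less"; } else { $numeric = "not less"; }
if ("10" lt "9") { $lexical = "less"; } else { $lexical = "not less"; }
print "$numeric, $lexical\n";
```

    This should print "[] 5", then "true false false", then "not less, less": as numbers 10 is not less than 9, but as strings "10" sorts before "9".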

    For/if/while

    For, if and while clauses work like their C counterparts, except the body of the clause must be enclosed in curly braces.

    There are many ways of doing if statements, but some of them are so odious that I won't divulge them. Read a perl manual.

    Instead of doing "else if" as in C, you should do "elsif". This is like elif in the Bourne shell.
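
    As a rough illustration of these three constructs (the variable names and the arithmetic are arbitrary, chosen only for the example):

```perl
# Classify the integers 1 through 5 with a C-style for loop.
$report = "";
for ($n = 1; $n <= 5; $n++) {
    if ($n % 2 == 0) {
        $kind = "even";
    } elsif ($n == 1) {          # elsif, not "else if"
        $kind = "one";
    } else {
        $kind = "odd";
    }
    $report = $report . "$n:$kind ";
}
print "$report\n";

# A while loop: keep doubling $x until it exceeds 100.
$x = 1;
while ($x <= 100) {
    $x = $x * 2;
}
print "$x\n";
```

    Note that the braces are required even though each body here is a single statement.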

    Standard input, files

    You can get a string from standard input by enclosing STDIN within angle brackets. EOF is denoted by undef. Therefore, stdin.perl copies standard input to standard output:
    UNIX> cat input
    D
    F
    C
    B
    E
    A
    UNIX> perl stdin.perl < input
    D
    F
    C
    B
    E
    A
    UNIX> 
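
    The listing of stdin.perl is not reproduced here. A minimal sketch of a program with the behavior described, which may differ from the actual file, would be:

```perl
# Read one line at a time. At end of file <STDIN> returns undef,
# which is false, so the loop stops.
while ($line = <STDIN>) {
    print $line;         # $line still ends with its newline
}
```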
    
    Note that you get the newline with the string. To get rid of the newline, use the chop() procedure, which modifies its argument to get rid of the last character.

    You can open a file for input and then use it like STDIN, above. Moreover, you can open a file for output and print to it. For example, catinput.perl copies the file input to the file output. It also shows use of chop():

    UNIX> perl catinput.perl
    UNIX> cat output
    D
    F
    C
    B
    E
    A
    UNIX> 
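
    The listing of catinput.perl is not reproduced here either. The sketch below shows the same idea under some assumptions: the file names demo_input and demo_output are placeholders, and the first few lines create the input file only so that the example can be run on its own.

```perl
# Create a small input file so the sketch is self-contained.
open(SETUP, ">demo_input") || die "cannot create demo_input: $!";
print SETUP "D\nF\nC\n";
close(SETUP);

# A bare file name opens for reading; a leading ">" opens for writing.
open(IN, "demo_input")    || die "cannot read demo_input: $!";
open(OUT, ">demo_output") || die "cannot write demo_output: $!";

$copied = 0;
while ($line = <IN>) {
    chop($line);               # remove the trailing newline...
    print OUT "$line\n";       # ...and put it back when printing
    $copied++;
}

close(IN);
close(OUT);
print "copied $copied lines\n";
```

    One caution: chop() removes the last character whatever it is, so a final line with no newline would lose a real character. Perl 5 also has chomp(), which removes only a trailing newline.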
    
    You can also open a file for append, print to pipes, read input from pipes, etc.

    Arrays

    Arrays are kind of like awk: just use them, and perl takes care of the rest. When you use an element of an array, you precede the expression with a dollar sign and put the index in square brackets. When you access the array as a whole, you precede it with the at sign (@). Array indices must be integers. Interestingly, you can copy an entire array by simply assigning one array to another. There is also the sort() operator, which sorts lexicographically, and returns the sorted array.

    For example, sort1.perl sorts standard input by reading it into an array and printing the sorted array.

    UNIX> cat input
    D
    F
    C
    B
    E
    A
    UNIX> perl sort1.perl < input
    A
    B
    C
    D
    E
    F
    UNIX> 
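
    The listing of sort1.perl is not reproduced here. A minimal sketch of the array operations it relies on, with a literal list standing in for the lines read from standard input, might look like this:

```perl
# A literal list stands in for the lines of the input file.
@lines = ("D\n", "F\n", "C\n", "B\n", "E\n", "A\n");

$third  = $lines[2];         # one element: dollar sign, index in brackets
@copy   = @lines;            # the whole array: at sign; this copies it
@sorted = sort(@copy);       # sort() returns a new, sorted array

# An array used where a number is expected gives its size.
for ($n = 0; $n < @sorted; $n++) {
    print $sorted[$n];
}
```

    The original array is left untouched: @lines still starts with "D" after the sort.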
    
    You can make this simpler: when <STDIN> is used where an array is expected, it returns all of the remaining lines at once, so you can simply print the sorted array. This is in sort2.perl:
    UNIX> perl sort2.perl < input
    A
    B
    C
    D
    E
    F
    UNIX> 
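
    A sketch of what such a program could look like follows. The actual sort2.perl may well be a single print statement; the intermediate array @sorted is used here only to make the two steps visible.

```perl
# Where an array is expected, <STDIN> returns every remaining line,
# so the whole input can be read and sorted in one statement.
@sorted = sort(<STDIN>);
print @sorted;
```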
    
    Other useful things you can do are to split a string into an array of its words (much like awk), and to use the subroutines push() and pop() to add and remove elements from the end of an array.

    You can get at the size of an array by using the array in a place where an integer is expected.

    The program reverse.perl uses push() and pop() to reverse a file, and the program revline.perl uses split() in a typical way, and the array size to reverse each line of a file.

    (The syntax of split() is split(pattern,string), where the pattern specifies how the space between words is delimited. split(/\s+/,string) means to use contiguous blocks of whitespace as the word delimiter).

    Of course, there is also a reverse operator which returns the reverse of an array, and this can be used to make the above programs simpler. See reverse2.perl and revline2.perl. The latter makes use of the foreach construct to iterate over all elements in an array. Does revline2.perl feel like it's approaching unreadability? I agree.

    UNIX> cat input2
    I am Sam
    I am Sam
    Sam I am
    That Sam I am, that Sam I am, I do not like that Sam I am!
    UNIX> perl reverse.perl < input2
    That Sam I am, that Sam I am, I do not like that Sam I am!
    Sam I am
    I am Sam
    I am Sam
    UNIX> perl reverse2.perl < input2
    That Sam I am, that Sam I am, I do not like that Sam I am!
    Sam I am
    I am Sam
    I am Sam
    UNIX> perl revline.perl < input2
    Sam am I 
    Sam am I 
    am I Sam 
    am! I Sam that like not do I am, I Sam that am, I Sam That 
    UNIX> perl revline2.perl < input2
    Sam am I 
    Sam am I 
    am I Sam 
    am! I Sam that like not do I am, I Sam that am, I Sam That 
    UNIX> 
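
    The listings of the four programs are not reproduced here. The sketch below is not any one of them; it just exercises split(), the array size, push(), pop(), reverse and foreach on a single string so that each can be seen in isolation.

```perl
$line = "That Sam I am";

@words = split(/\s+/, $line);     # ("That", "Sam", "I", "am")
$count = @words;                  # array used as a number: its size

# Reverse by popping from one array and pushing onto another.
@backwards = ();
while (@words > 0) {
    $w = pop(@words);             # removes and returns the last element
    push(@backwards, $w);         # appends to the end
}
print "$count words: @backwards\n";

# The same result with the reverse operator, printed with foreach.
foreach $w (reverse(split(/\s+/, $line))) {
    print "$w ";
}
print "\n";
```

    The foreach version prints a space after every word, including the last, which would account for trailing blanks like those in the revline output above.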
    

    Associative Arrays

    Like awk and python, perl has associative arrays. Again, set them by using them. When accessing a value, you precede it with a dollar sign and enclose the key in curly braces. When accessing the whole array, you precede it with a percent sign. The keys() function returns an array of the keys of the associative array. The values() function returns the values. Both of these return their keys/values in no particular order. So, for example, suppose you have a list of first names, last names, and phone numbers, and you want to print it sorted in the format: last name, first, phone number. Then you can do something like phone.perl. Note that perl does support printf.
    UNIX> cat input3
    Peyton Manning 423-vol-qb4u
    Phil Fulmer 423-vol-head
    Pat Summitt 423-lvl-head
    Joe Johnson 423-vol-prez
    Jim Plank 423-vol-peon
    UNIX> perl phone.perl < input3
        Fulmer,       Phil,                   423-vol-head
       Johnson,        Joe,                   423-vol-prez
       Manning,     Peyton,                   423-vol-qb4u
         Plank,        Jim,                   423-vol-peon
       Summitt,        Pat,                   423-lvl-head
    UNIX> 
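
    The listing of phone.perl is not reproduced here. One way to get this kind of output is sketched below, with some assumptions: a literal list stands in for the lines of input3, the array names %firstname and %number are invented, and the format string is a guess chosen to resemble the output above. It uses sprintf(), which builds the string that printf would print.

```perl
# A literal list stands in for the lines of the input file.
@lines = ("Peyton Manning 423-vol-qb4u",
          "Phil Fulmer 423-vol-head",
          "Jim Plank 423-vol-peon");

# Key both associative arrays by last name.
foreach $line (@lines) {
    ($first, $last, $phone) = split(/\s+/, $line);
    $firstname{$last} = $first;
    $number{$last} = $phone;
}

# keys() returns the last names in no particular order, so sort them.
$output = "";
foreach $last (sort(keys(%firstname))) {
    $output .= sprintf("%10s, %10s, %30s\n",
                       $last, $firstname{$last}, $number{$last});
}
print $output;
```

    Keying on the last name assumes that last names are unique; two people with the same last name would overwrite each other.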
    

    Listing files

    Perl lets you do directory listings with shell-style pattern matching. A simple example is ls.perl which lists the files in the current directory with the .perl extension:
    UNIX> perl ls.perl
    catinput.perl
    hw.perl
    ls.perl
    match.perl
    other.perl
    phone.perl
    reverse.perl
    reverse2.perl
    revline.perl
    revline2.perl
    scalar.perl
    simp.perl
    sort1.perl
    sort2.perl
    stdin.perl
    sub1.perl
    sub2.perl
    UNIX> 
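
    The listing of ls.perl is not reproduced here. A minimal sketch of the technique, which may differ from the actual file, puts the shell-style pattern inside angle brackets:

```perl
# <*.perl> expands like a shell wildcard and returns the names of
# the matching files in the current directory.
@files = <*.perl>;
foreach $f (@files) {
    print "$f\n";
}
```

    Perl 5 also lets you write the same thing as glob("*.perl").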
    

    Fancy string stuff

    There are many fancy things that you can do inside double quotes for string construction. I won't go into them here.

    Perl provides regular expression matching and substitution in a form that will be very familiar to users of sed/awk. The matching operator is =~ and is a boolean operator. Regular expressions are enclosed in slashes, and work pretty much like sed/awk. There are a few differences:

    • If you follow the RE with 'i', then it will ignore case.

    • If you follow a character with '+', it means one or more, and if you follow it with '?' it will match zero or one.

    • If you follow a character with '{n}', it will match exactly n occurrences. Similarly, '{n,m}' and '{n,}' have their sed-like meanings.

    • If you put '|' between two patterns, it will match either pattern.

    • You can use parentheses to have these operators work on patterns that are bigger than one character. The precedence is parens, then ``multipliers'' (*,+,?,{n,m}), then sequences/``anchoring'' (^,$), then ``alternation'' (|).

    • There are some character classes predefined:
      • Digits are '\d', and their complement (not digits) is '\D'.
      • Word characters ([a-zA-Z0-9_]) are '\w', and their complement is '\W'.
      • Whitespace is '\s', and the complement is '\S'.

    • '\b' matches the beginning or end of a word, and '\B' matches anything but the beginning or end of a word.
    Some examples (these are in match.perl).
    $i = "Jim";
    $j = "JjJjJjJj";
    $k = "Boom Boom, out go the lights!";
    
    $i =~ /Jim/;                   True
    $i =~ /J/;                     True
    $i =~ /j/;                     False
    $i =~ /j/i;                    True
    $i =~ /\w/;                    True
    $i =~ /\W/;                    False
    
    $j =~ /j*/;                    True -- matches anything
    $j =~ /j+/;                    True -- matches the first 'j'
    $j =~ /j?/;                    True -- matches zero 'j's (the null string) at the start
    $j =~ /j{2}/;                  False
    $j =~ /j{2}/i;                 True -- ignores case
    $j =~ /(Jj){3}/;               True -- matches the first six characters
    
    $k =~ /Jim|Boom/;              True -- matches Boom
    $k =~ /(Boom){2}/;             False -- there's a space between Booms
    $k =~ /(Boom ){2}/;            False -- the second Boom ends with a comma
    $k =~ /(Boom\W){2}/;           True 
    $k =~ /\bBoom\b/;              True -- shows word delimiters
    $k =~ /\bBoom.*the\b/;         True 
    $k =~ /\Bgo\B/;                False -- "go" is a whole word, so it has a boundary on each side
    $k =~ /\Bgh\B/;                True -- the "gh" is in the middle of "lights"
    
    Note that when you run match.perl, the falses are printed as null strings, not zeros.
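
    To see how a match is typically used in a program, here is a small sketch that is separate from match.perl. The variable names are invented for the example.

```perl
$k = "Boom Boom, out go the lights!";

# A match used as the condition of an if statement.
if ($k =~ /\bboom\b/i) { $found = "yes"; } else { $found = "no"; }

# A successful match yields 1; a failed one yields the null string.
$hit  = ($k =~ /(Boom\W){2}/);
$miss = ($k =~ /(Boom ){2}/);
print "found=$found hit=[$hit] miss=[$miss]\n";

# Count the words that contain the letter 'o', ignoring case.
$with_o = 0;
foreach $w (split(/\s+/, $k)) {
    if ($w =~ /o/i) { $with_o++; }
}
print "$with_o words contain an o\n";
```

    The first line of output should be "found=yes hit=[1] miss=[]", which shows the null string that a false match produces.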

    Regular Expression Substitution

    You can modify a string variable by applying a sed-like substitution. The operator is again =~, and the substitution is specified as
    s/pattern1/pattern2/
    
    So, for example, see sub1.perl:
    $j = "Jim Plank";
    
    $j =~ s/ .*/i Hendrix/;           Makes 'Jimi Hendrix'
    $j =~ s/i/I/g;                    Makes 'JImI HendrIx'
    $j =~ s/\b\w*\b/Dikembe/;         Makes 'Dikembe HendrIx'
    $j =~ s/(\b\w*\b)/Jimi "\1"/;     Makes 'Jimi "Dikembe" HendrIx'
    
    Unfortunately, you can't use =~ (or ~, for that matter) as an ordinary operator that returns the substituted string, the way + returns a sum. For example, this does not work:
    $k = $j ~ s/Jim/Jimi/;
    
    Which is a pity.
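
    The usual workaround is to copy the string first and substitute in the copy. A sketch, with variable names invented for the example:

```perl
$j = "Jim Plank";

# Copy first, then substitute in the copy; $j is left alone.
$k = $j;
$k =~ s/Jim/Jimi/;

# The same thing in one statement: the parentheses make the
# assignment happen first, and =~ then applies to $k2.
($k2 = $j) =~ s/Jim/Jimi/;

# s/// itself evaluates to the number of substitutions made.
$copy = $j;
$changes = ($copy =~ s/[aeiou]/*/g);

print "$j / $k / $k2 / $copy ($changes changes)\n";
```

    Perl 5.14 and later also accept an r flag on the substitution, as in $k = ($j =~ s/Jim/Jimi/r), which returns the modified copy and leaves $j unchanged. It is left out of the code above so that the sketch runs on older versions.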

    You'll note in the last substitution of sub1.perl, I used the parentheses as memory. This is analogous to \( and \) in sed, except you can use the memory even in the first pattern. For example, see sub2.perl:

    $j = "Jim Plank";
    $j =~ s/(\w*) (\w*)/\1 \1 \2/;      Makes 'Jim Jim Plank'
    
    $i = "I am the the man";
    $i =~ s/(\b\w+\b) \1/\1/;           Makes 'I am the man' -- figure it out!
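
    The memory is also available after a plain match, in the variables $1, $2, and so on, and the perl manual prefers $1 to \1 in the replacement part of a substitution. A short sketch of both, separate from sub2.perl:

```perl
$j = "Jim Plank";

# After a successful match, $1 and $2 hold what the first and
# second parenthesized groups matched.
if ($j =~ /(\w+) (\w+)/) {
    $swapped = "$2, $1";
}
print "$swapped\n";

# \1 is still used inside the pattern itself, but the replacement
# refers to the remembered text as $1.
$i = "I am the the man";
$i =~ s/(\b\w+\b) \1/$1/;
print "$i\n";
```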
    

    Other constructs that you should know about

    Look at other.perl. This contains code for opening a file for append, writing to a pipe, reading from a pipe and sorting numerically. Try it out.
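
    The listing of other.perl is not reproduced here. The sketch below touches the same four topics under some assumptions: it needs the Unix commands sort and echo, and the file name demo_log is a placeholder.

```perl
# Numeric sort: the block tells sort() how to compare two elements,
# $a and $b, and <=> compares them as numbers.
@nums = (10, 9, 100, 1);
@as_strings = sort(@nums);                 # lexicographic order
@as_numbers = sort { $a <=> $b } @nums;    # numeric order
print "@as_strings / @as_numbers\n";

# Append: ">>" adds to the end of the file instead of truncating it.
open(LOG, ">>demo_log") || die "cannot append to demo_log: $!";
print LOG "one more line\n";
close(LOG);

# Write to a pipe: a leading "|" sends what we print to a command.
open(SORTER, "| sort") || die "cannot start sort: $!";
print SORTER "pear\napple\n";
close(SORTER);                             # waits for sort to finish

# Read from a pipe: a trailing "|" lets us read a command's output.
open(ECHO, "echo hello from a pipe |") || die "cannot start echo: $!";
$from_pipe = <ECHO>;
close(ECHO);
print $from_pipe;
```

    The first line printed should be "1 10 100 9 / 1 9 10 100", which shows why the default sort is the wrong one for numbers.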

    Reading perl programs

    Perl lets you do lots more than what I've detailed. If you start reading random perl programs, you'll notice the use of defaults (e.g. $_) in procedures, substitutions, foreach clauses, etc. The best thing I can say is to read the manual before trying to read programs. I'm not a huge fan of these shortcuts, but perhaps I'm not the prototypical perl hacker.
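
    As a taste of what those defaults look like, the sketch below writes the same loop twice, once in full and once relying on $_. The names are made up for the example.

```perl
@names = ("Jim", "Joe", "Pat");

# Written out in full, with a named loop variable.
$long = "";
foreach $name (@names) {
    if ($name =~ /^J/) {
        $long = $long . "$name ";
    }
}

# The same loop relying on $_: foreach with no variable puts each
# element in $_, and a bare /pattern/ is matched against $_.
# The trailing "if" is one of the one-line if forms.
$short = "";
foreach (@names) {
    $short = $short . "$_ " if /^J/;
}

print "$long\n$short\n";
```

    Both loops build the string "Jim Joe ".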

    Command line arguments are in the @ARGV array.

    You can write to standard error by writing to STDERR.

    You can exit from a program with exit.
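
    These three pieces often appear together. The following sketch adds up its command line arguments; the program name sum.perl is hypothetical, and the pattern used to recognize a number is deliberately simple.

```perl
# Add up the command line arguments, e.g.:  perl sum.perl 1 2.5 3
$total = 0;
$bad = 0;
foreach $arg (@ARGV) {
    if ($arg =~ /^-?\d+(\.\d+)?$/) {
        $total = $total + $arg;
    } else {
        # Complaints go to standard error, not standard output.
        print STDERR "sum.perl: '$arg' is not a number\n";
        $bad++;
    }
}

if ($bad > 0) {
    exit(1);            # a non-zero exit status signals failure
}
print "$total\n";
```

    Run with no arguments, it simply prints 0.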

    More, more, more

    There is much more that you can do with perl. I have omitted procedure calls, but obviously they exist in the language. There is also support for networking. The best way to learn is to explore. Enjoy.