Scripts and Utilities -- Perl lecture

This file: http://www.cs.utk.edu/~plank/plank/classes/cs494/494/notes/Perl/lecture.html

Lecture links: http://www.cs.utk.edu/~plank/plank/classes/cs494/494/notes/Perl/links.html

Perl

Perl stands for ``practical extraction and report language.'' It's yet another portable language that is useful for writing quick and dirty programs. I am not an experienced Perl hacker, but I have written enough Perl to have formed an opinion. Here it is:

Plusses: Viewed in its best light, Perl is a language that encapsulates the best features of the shell, sed, grep, awk, tr, C and Cobol. If you are familiar with these tools, you can write very powerful Perl programs rather quickly and easily. In particular, you can write programs that duplicate the functionality of shell scripts but are all in one file (indeed, they are all in one language) and thus are much more efficient.

Minuses: Perl is a jumble. It contains many, many features from many languages and programs. It contains many differing constructs that implement the same functionality. For example, there are at least 5 ways to perform a one-line if statement. While this is good for the programmer, it is extremely bad for everyone but the programmer (and bad for the programmer who tries to read his own program in 6 months), and has led to Perl being called a ``write-only'' language. There are other minuses as well, but I won't go into them further. You can discover them for yourself. A colleague of mine (Norman Ramsey, at Virginia) responded to an email I sent him about perl, and his response is worth quoting in its entirety:

Perl is a brilliant mistake, of a kind being repeated over and over again today. It follows an unfortunate trend in computing documentation---we no longer explain how our programs or languages work, we just provide encyclopedic compendia of things you can do with them. Users don't have to understand anything; they just pattern match on these enormous documents until they see something that looks sort of like what they want, then they hack on it until it produces the right answer for some inputs, then they declare victory. (This is why some students can take 50 hours to complete a 2-hour homework assignment.) Perl accommodates this style perfectly.

So this is the mistake. The brilliance is in including so many things people want to do, and in a form that is almost familiar, so they can pattern-match more easily. In fact, the familiarity is really an illusion, and if you're going to program in perl, you can hack without understanding, or you can restrict yourself to a subset you can understand. But if you're going to do the latter, why not program in sh, awk, and sed to begin with?

I've never used apl, so I can't compare.

I have---the thing with apl is that although it is *all* weird, it is weird in a very consistent way. You don't have the illusion of familiarity. You do have a huge set of unreadable glyphs, but they come with a very small set of simple rules for decrypting them, the most important of which are the right-to-left scan rule and the fact that user-defined functions take at most two arguments.

I just got to a part in my manual where they advocate using && as an if statement.

I got past that. I gave up on perl the day I learned I couldn't write a function to return an open file handle (e.g., an open socket). Now I use it only when forced.

The Bottom Line: My opinions aren't quite as strong as Dr. Ramsey's, but in general I am in agreement with him. The one plus that perl has going for it over sh/sed/awk is that you can do it all in one user process, thereby making it more efficient. There are things that I will do in perl in preference to using sh/sed/awk, and in preference to using C. However, in my opinion, its place is just that -- between sh/sed/awk and C -- and not as a substitute for what either does best.

The debate as to which is better: Perl, Python or Icon (which I'm not teaching in this class) is a heated one. Python and Icon have better language design. Perl has the most familiar regular expression syntax. I won't get into it, but if you look, you can find all sorts of opinions. Of course, it's best to formulate your own opinions by learning all three....

Perl help

Unfortunately, perl is huge, so I'll be able to give you only a flavor of it. The perl manual is online in the form of the perl man pages, which are broken up into a number of subsections. Do ``man perl.'' Alternatively, see http://hill.ucs.ualberta.ca/Documentation/Info/by-node/perl-5.003/perl.html for an HTML-ized version of the man pages.

There are two recommended books in case you want more. The first is ``Learning Perl'' by Schwartz, and the second is ``Programming Perl'' by Schwartz and Wall. Both are published by O'Reilly & Associates.

Calling Syntax

Like awk, perl works on a program. You can specify the program as the first argument to perl, or you can put the path of perl's executable on the first line of the perl program, preceded by #!. (You can also specify the program on the command line -- see the man page). My perl manual says that you can expect perl to be found in /usr/bin/perl, but in our department, it's in /usr/local/bin/perl. So much for portability.

Simple perl programs

Perl programs are more like C than awk -- you must request lines from standard input rather than getting them by default. Here's the canonical ``hello world'' program in perl:

UNIX> cat hw.perl
print "hello world\n";
UNIX> perl hw.perl
hello world
UNIX>

So, like awk, there is a print statement. Unlike awk, you have to provide your own newline. You can give print a comma-separated list of values, but unlike awk it prints them back to back, with no spaces in between. There are two scalar types in perl: strings and numbers. Numbers are treated as floating point. Like awk, you can cast at will, and perl will understand.

You can concatenate strings with the dot operator.

Like C, perl programs are tokens separated by whitespace. I.e. commands can span lines. You must end all commands with semi-colons.

UNIX> cat simp.perl

print "Jim\n";
print 1.55 . "\n";
print "Jim" . " " . "Plank" . "\n";
print (("5" + 6) . "\n");

UNIX> perl simp.perl
Jim
1.55
Jim Plank
11
UNIX>

Scalar variables

Scalar variables must start with a dollar sign and do not need to be declared. You can insert variables into strings by using double-quotes in the style of the shell. Also, you can create strings with single quotes that treat dollar signs and backslashes like normal characters:

UNIX> cat scalar.perl

$i = 1;
$j = "2";
print "$i\n";
print "$j\n";
$k = $i + $j;
print "$k\n";
print $i . $j . "\n";
print '$k\n'. "\n";

UNIX> perl scalar.perl
1
2
3
12
$k\n
UNIX>

Booleans/undef

In perl, undefined variables have the special value undef. This can be used in expressions, etc, and often makes life convenient. When you try to use undef as a string, you get "", and when you try to use it as a number, you get zero.

Boolean expressions in perl are kind of odd. Undef is false, as is the null string, and anything that casts to a string containing a single zero. Everything else is true. Therefore, all numbers but zero are true, as are all strings but "" and "0".

You compare numbers with the C comparative operators. You can use eq, ne, lt, gt, le, and ge to compare strings lexicographically.
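The truth rules and the two families of comparison operators are easier to see side by side. The following is a small sketch, not one of the lecture's files; the helper name truth() and the use of ``use strict'' and ``my'' are assumptions made for the example.

```perl
use strict;
use warnings;

# Return the word "true" or "false" according to perl's truth rules.
sub truth {
    my ($value) = @_;
    return $value ? "true" : "false";
}

print "undef is ", truth(undef), "\n";   # false
print "''    is ", truth(""),    "\n";   # false
print "'0'   is ", truth("0"),   "\n";   # false
print "0.0   is ", truth(0.0),   "\n";   # false: the number 0.0 casts to "0"
print "'0.0' is ", truth("0.0"), "\n";   # true: a string other than "" and "0"
print "'00'  is ", truth("00"),  "\n";   # true

# Numeric and string comparisons can disagree on the same operands.
print "10 < 9     is ", truth(10 < 9),       "\n";   # false: compared as numbers
print "'10' lt '9' is ", truth("10" lt "9"), "\n";   # true: "1" sorts before "9"
```

The last two lines show why it matters which operator you pick: perl will not choose numeric or string comparison for you based on what the values look like.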

For/if/while

For, if and while clauses work like their C counterparts, except the body of the clause must be enclosed in curly braces.

There are many ways of doing if statements, but some of them are so odious that I won't divulge them. Read a perl manual.

Instead of doing "else if" as in C, you should do "elsif". This is like elif in the Bourne shell.
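A short sketch of the three constructs together, assuming nothing beyond what is described above. The sub name classify() and the variable names are made up for the example.

```perl
use strict;
use warnings;

# Classify a number with if / elsif / else; every body needs curly braces,
# even when it is a single statement.
sub classify {
    my ($n) = @_;
    if ($n < 0) {
        return "negative";
    } elsif ($n == 0) {
        return "zero";
    } else {
        return "positive";
    }
}

# C-style for loop.
for (my $i = -1; $i <= 1; $i++) {
    print "$i is ", classify($i), "\n";
}

# while loop: sum the integers 1 through 10.
my $sum = 0;
my $k   = 1;
while ($k <= 10) {
    $sum += $k;
    $k++;
}
print "sum = $sum\n";
```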

Standard input, files

You can get a string from standard input by enclosing STDIN within angle brackets. EOF is denoted by undef. Therefore, stdin.perl copies standard input to standard output:

UNIX> cat input
D
F
C
B
E
A
UNIX> perl stdin.perl < input
D
F
C
B
E
A
UNIX>
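The source of stdin.perl is not reproduced in these notes. A minimal sketch of a line-copying loop in the same spirit might look like the following. The sub name copy_lines() is an assumption, and the demo reads from an in-memory string rather than the terminal so that it runs on its own; passing \*STDIN as the first argument would read standard input instead.

```perl
use strict;
use warnings;

# Copy every line from one filehandle to another.
# Reading a line returns undef at EOF, which ends the loop.
sub copy_lines {
    my ($in, $out) = @_;
    my $count = 0;
    while (defined(my $line = <$in>)) {
        print $out $line;    # $line still carries its newline
        $count++;
    }
    return $count;
}

# Demo: treat a string as a file (an assumption made for self-containment).
my $sample = "D\nF\nC\n";
open(my $in, '<', \$sample) or die "cannot open in-memory file: $!";
copy_lines($in, \*STDOUT);
close($in);
```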

Note that you get the newline with the string. To get rid of the newline, use the chop() procedure, which modifies its argument to get rid of the last character.

You can open a file for input and then use it like STDIN, above. Moreover, you can open a file for output and print to it. For example, catinput.perl copies the file input to the file output. It also shows use of chop():

UNIX> perl catinput.perl
UNIX> cat output
D
F
C
B
E
A
UNIX>
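catinput.perl itself is not shown here. The sketch below illustrates the same idea under a few assumptions: the sub name copy_file() is invented, and the demo works in a scratch directory (via the standard File::Temp module) so that it does not overwrite a real file named output.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Copy $src to $dst line by line, removing and re-adding each newline.
sub copy_file {
    my ($src, $dst) = @_;
    open(my $in,  '<', $src) or die "cannot read $src: $!";
    open(my $out, '>', $dst) or die "cannot write $dst: $!";
    while (defined(my $line = <$in>)) {
        chop($line);               # drops the last character (the newline)
        print $out $line, "\n";    # so the newline must be supplied again
    }
    close($in);
    close($out) or die "cannot close $dst: $!";
}

# Demo in a scratch directory.
my $dir = tempdir(CLEANUP => 1);
open(my $fh, '>', "$dir/input") or die "cannot create input: $!";
print $fh "D\nF\nC\n";
close($fh);
copy_file("$dir/input", "$dir/output");
```

One caveat: chop() removes the last character whatever it is, so a final line with no trailing newline loses a real character. The chomp() function removes only a trailing newline and is the safer choice in that case.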

You can also open a file for append, print to pipes, read input from pipes, etc.

Arrays

Arrays are kind of like awk's: just use them, and perl takes care of the rest. When you use an element of an array, you precede the expression with a dollar sign and put the index in square brackets. When you access the array as a whole, you precede it with the at sign (@). Array indices must be integers. Interestingly, you can copy an entire array by simply assigning one array to another. There is also the sort() operator, which sorts lexicographically, and returns the sorted array.

For example, sort1.perl sorts standard input by reading it into an array and printing the sorted array.

UNIX> cat input
D
F
C
B
E
A
UNIX> perl sort1.perl < input
A
B
C
D
E
F
UNIX>
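sort1.perl is not reproduced here; the following sketch shows one way such a program could be put together. The sub name read_lines() is an assumption, and the input comes from an in-memory string so the example is self-contained.

```perl
use strict;
use warnings;

# Read all lines from a filehandle into an array, one element at a time.
sub read_lines {
    my ($in) = @_;
    my @lines;
    my $n = 0;
    while (defined(my $line = <$in>)) {
        $lines[$n] = $line;    # one element: dollar sign and [index]
        $n++;
    }
    return @lines;
}

my $sample = "D\nF\nC\nB\nE\nA\n";
open(my $in, '<', \$sample) or die "cannot open in-memory file: $!";
my @lines = read_lines($in);
close($in);

my @sorted = sort(@lines);    # whole array: at sign; sort() returns a new array
my @copy   = @sorted;         # assigning one array to another copies it
print @copy;
```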

You can make this simpler: The STDIN token may be treated as an array, so you can simply print the sorted array. This is in sort2.perl:

UNIX> perl sort2.perl < input
A
B
C
D
E
F
UNIX>
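A sketch of the shorter form, under the same assumptions as before (invented sub name, in-memory input). With STDIN itself, the whole program could plausibly shrink to the single statement print sort(<STDIN>);

```perl
use strict;
use warnings;

# Read every remaining line at once and return them sorted.
sub sorted_lines {
    my ($in) = @_;
    my @sorted = sort(<$in>);    # list context: the read yields all lines
    return @sorted;
}

my $sample = "D\nF\nC\nB\nE\nA\n";
open(my $in, '<', \$sample) or die "cannot open in-memory file: $!";
print sorted_lines($in);
close($in);
```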

Other useful things you can do are to split a string into an array of its words (much like awk), and to use the subroutines push() and pop() to add and remove elements from the end of an array.

You can get at the size of an array by using the array in a place where an integer is expected.

The program reverse.perl uses push() and pop() to reverse a file. The program revline.perl uses split() in a typical way, together with the array size, to reverse each line of a file.

(The syntax of split() is split(pattern,string), where the pattern specifies how the space between words is delimited. split(/\s+/,string) means to use contiguous blocks of whitespace as the word delimiter).

Of course, there is also a reverse operator which returns the reverse of an array, and this can be used to make the above programs simpler. See reverse2.perl and revline2.perl. The latter makes use of the foreach construct to iterate over all elements in an array. Does revline2.perl feel like it's approaching unreadability? I agree.

UNIX> cat input2
I am Sam
I am Sam
Sam I am
That Sam I am, that Sam I am, I do not like that Sam I am!
UNIX> perl reverse.perl < input2
That Sam I am, that Sam I am, I do not like that Sam I am!
Sam I am
I am Sam
I am Sam
UNIX> perl reverse2.perl < input2
That Sam I am, that Sam I am, I do not like that Sam I am!
Sam I am
I am Sam
I am Sam
UNIX> perl revline.perl < input2
Sam am I 
Sam am I 
am I Sam 
am! I Sam that like not do I am, I Sam that am, I Sam That 
UNIX> perl revline2.perl < input2
Sam am I 
Sam am I 
am I Sam 
am! I Sam that like not do I am, I Sam that am, I Sam That 
UNIX>
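The four programs above are not reproduced in these notes. The sketch below shows the two techniques they are described as using, with invented sub names reverse_lines() and reverse_words(). It uses join() to glue the words back together, so unlike the transcripts above it leaves no trailing space on each line.

```perl
use strict;
use warnings;

# Reverse a list of lines using push() and pop().
sub reverse_lines {
    my @stack = @_;
    my @out;
    while (@stack > 0) {            # an array used as a number gives its size
        push(@out, pop(@stack));    # pop takes from the end, push adds to the end
    }
    return @out;
}

# Reverse the words of one line using split() and the reverse operator.
sub reverse_words {
    my ($line) = @_;
    chomp($line);                   # remove the trailing newline, if any
    my @words = split(/\s+/, $line);
    return join(" ", reverse(@words));
}

my @lines = ("I am Sam\n", "Sam I am\n");
print reverse_lines(@lines);
foreach my $line (@lines) {
    print reverse_words($line), "\n";
}
```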

Associative Arrays

Like awk and python, perl has associative arrays. Again, set them by using them. When accessing a value, you precede it with a dollar sign and enclose the key in curly braces. When accessing the whole array, you precede it with a percent sign. The keys() function returns an array of the keys of the associative array. The values() function returns the values. Both return their results in no particular order. So, for example, suppose you have a list of first names, last names, and phone numbers, and you want to print it sorted in the format: last name, first, phone number. Then you can do something like phone.perl. Note that perl does support printf.

UNIX> cat input3
Peyton Manning 423-vol-qb4u
Phil Fulmer 423-vol-head
Pat Summitt 423-lvl-head
Joe Johnson 423-vol-prez
Jim Plank 423-vol-peon
UNIX> perl phone.perl < input3
    Fulmer,       Phil,                   423-vol-head
   Johnson,        Joe,                   423-vol-prez
   Manning,     Peyton,                   423-vol-qb4u
     Plank,        Jim,                   423-vol-peon
   Summitt,        Pat,                   423-lvl-head
UNIX>
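phone.perl is not shown, so the following is only a sketch of how such a program might work. The sub name format_phone_list() is an assumption, and the format string "%10s, %10s, %30s" is inferred from the column widths in the output above rather than taken from the original file.

```perl
use strict;
use warnings;

# Build associative arrays keyed by last name from "first last phone" lines,
# then return one formatted line per person, sorted by last name.
sub format_phone_list {
    my @lines = @_;
    my (%first, %phone);
    foreach my $line (@lines) {
        my ($f, $l, $p) = split(/\s+/, $line);
        $first{$l} = $f;    # one value: dollar sign and {key}
        $phone{$l} = $p;
    }
    my @out;
    foreach my $last (sort keys %first) {    # keys() has no useful order, so sort
        push(@out, sprintf("%10s, %10s, %30s\n",
                           $last, $first{$last}, $phone{$last}));
    }
    return @out;
}

print format_phone_list(
    "Peyton Manning 423-vol-qb4u",
    "Phil Fulmer 423-vol-head",
    "Jim Plank 423-vol-peon",
);
```

Keying on the last name keeps the sketch short, but two people with the same last name would overwrite each other; a real program would need a more careful key.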

Listing files

Perl lets you do directory listings with shell-style pattern matching. A simple example is ls.perl which lists the files in the current directory with the .perl extension:

UNIX> perl ls.perl
catinput.perl
hw.perl
ls.perl
match.perl
other.perl
phone.perl
reverse.perl
reverse2.perl
revline.perl
revline2.perl
scalar.perl
simp.perl
sort1.perl
sort2.perl
stdin.perl
sub1.perl
sub2.perl
UNIX>
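ls.perl is not reproduced here. A minimal sketch using perl's built-in glob() function, with an invented sub name list_files(), might look like this:

```perl
use strict;
use warnings;

# Return the names in the current directory that match a shell-style pattern.
sub list_files {
    my ($pattern) = @_;
    my @names  = glob($pattern);    # the same kind of expansion the shell does
    my @sorted = sort(@names);
    return @sorted;
}

foreach my $name (list_files("*.perl")) {
    print "$name\n";
}
```

If nothing matches the pattern, the list is empty and the loop simply prints nothing.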

Fancy string stuff

There are many fancy things that you can do inside double quotes for string construction. I won't go into them here.

Perl provides regular expression matching and substitution in a form that will be very familiar to users of sed/awk. The matching operator is =~, and it yields a boolean. Regular expressions are enclosed in slashes, and work pretty much as they do in sed/awk. There are a few differences:

If you follow the RE with 'i', then it will ignore case.
If you follow a character with '+', it means one or more, and if you follow it with '?' it will match zero or one.
If you follow a character with '{n}', it will match exactly n occurrences. Similarly, '{n,m}' and '{n,}' have their sed-like meanings.
If you put '|' between two patterns, it will match either pattern.
You can use parentheses to have these operators work on patterns that are bigger than one character. The precedence is parens, then ``multipliers'' (*,+,?,{n,m}), then sequences/``anchoring'' (^,$), then ``alternation'' (|).
There are some character classes predefined:
- Digits are '\d', and their complement (not digits) is '\D'.
- Word characters ([a-zA-Z0-9_]) are '\w', and their complement is '\W'.
- Whitespace is '\s', and the complement is '\S'.
'\b' matches at the beginning or end of a word, and '\B' matches anywhere that is not the beginning or end of a word.

Some examples (these are in match.perl).

$i = "Jim";
$j = "JjJjJjJj";
$k = "Boom Boom, out go the lights!";

$i =~ /Jim/;                   True
$i =~ /J/;                     True
$i =~ /j/;                     False
$i =~ /j/i;                    True
$i =~ /\w/;                    True
$i =~ /\W/;                    False

$j =~ /j*/;                    True -- matches anything
$j =~ /j+/;                    True -- matches the first 'j'
$j =~ /j?/;                    True -- matches the first 'j'
$j =~ /j{2}/;                  False
$j =~ /j{2}/i;                 True -- ignores case
$j =~ /(Jj){3}/;               True -- matches the first six characters

$k =~ /Jim|Boom/;              True -- matches Boom
$k =~ /(Boom){2}/;             False -- there's a space between Booms
$k =~ /(Boom ){2}/;            False -- the second Boom ends with a comma
$k =~ /(Boom\W){2}/;           True 
$k =~ /\bBoom\b/;              True -- shows word delimiters
$k =~ /\bBoom.*the\b/;         True 
$k =~ /\Bgo\B/;                False -- "go" is a whole word, so both ends are word boundaries
$k =~ /\Bgh\B/;                True -- the "gh" is in the middle of "lights"

Note that when you run match.perl, the falses are printed as null strings, not zeros.
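A small sketch of what that looks like in practice. The variable names $hit and $miss are made up for the example; the string is the $k from the examples above.

```perl
use strict;
use warnings;

my $k = "Boom Boom, out go the lights!";

# Assigned to a scalar, a match yields 1 on success and "" on failure.
my $hit  = ($k =~ /\bBoom\b/);
my $miss = ($k =~ /(Boom){2}/);

# The quotes make the empty string visible.
print "hit  = '$hit'\n";
print "miss = '$miss'\n";

# In an if, the match is used directly as the condition.
if ($k =~ /jim|boom/i) {
    print "matched, ignoring case\n";
}
```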

Regular Expression Substitution

You can modify a string variable by applying a sed-like substitution. The operator is again =~, and the substitution is specified as

s/pattern1/pattern2/

So, for example, see sub1.perl:

$j = "Jim Plank";

$j =~ s/ .*/i Hendrix/;           Makes 'Jimi Hendrix'
$j =~ s/i/I/g;                    Makes 'JImI HendrIx'
$j =~ s/\b\w*\b/Dikembe/;         Makes 'Dikembe HendrIx'
$j =~ s/(\b\w*\b)/Jimi "\1"/;     Makes 'Jimi "Dikembe" HendrIx'

Unfortunately, you can't use =~ (or ~, for that matter) as an ordinary operator that returns the substituted string, the way addition returns a sum. For example, this does not work:

$k = $j ~ s/Jim/Jimi/;

Which is a pity.
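A common workaround is to copy the string first and apply the substitution to the copy. The sketch below shows that idiom; it also shows the /r modifier, which returns the modified string instead of changing the variable but is only available in perl 5.14 and later, well after the perl these notes were written against.

```perl
use strict;
use warnings;

my $j = "Jim Plank";

# Copy first, then substitute on the copy; $j is left alone.
(my $k = $j) =~ s/Jim/Jimi/;

# perl 5.14 and later only: /r returns the result instead of modifying $j.
my $m = $j =~ s/Plank/Hendrix/r;

print "$j\n";
print "$k\n";
print "$m\n";
```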

You'll note that in the last substitution of sub1.perl, I used the parentheses as memory. This is analogous to \( and \) in sed, and you can use the memory even in the first pattern. For example, see sub2.perl:

$j = "Jim Plank";
$j =~ s/(\w*) (\w*)/\1 \1 \2/;      Makes 'Jim Jim Plank'

$i = "I am the the man";
$i =~ s/(\b\w+\b) \1/\1/;           Makes 'I am the man' -- figure it out!
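The remembered pieces are also available as the variables $1, $2, and so on, both after a successful match and in the replacement half of a substitution, where perl's documentation prefers them to \1 and \2. A short sketch, with the variable name $swapped invented for the example:

```perl
use strict;
use warnings;

my $j = "Jim Plank";

# After a successful match, $1 and $2 hold what the parentheses matched.
if ($j =~ /(\w+) (\w+)/) {
    print "first = $1, last = $2\n";
}

# In the replacement, $1 and $2 do the job of \1 and \2.
(my $swapped = $j) =~ s/(\w+) (\w+)/$2, $1/;
print "$swapped\n";

# Inside the pattern itself, the backreference is still written \1.
my $i = "I am the the man";
$i =~ s/\b(\w+) \1\b/$1/;
print "$i\n";
```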

Other constructs that you should know about

Look at other.perl. This contains code for opening a file for append, writing to a pipe, reading from a pipe and sorting numerically. Try it out.
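other.perl is not reproduced here. As a sketch of two of the things it is said to cover, the following shows numeric sorting with a comparison block and opening a file for append. The scratch file comes from the standard File::Temp module and is an assumption made so that the example cleans up after itself; pipes are left out because they depend on which commands are installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my @nums = (10, 9, 100, 1);

my @as_strings = sort @nums;                  # lexicographic: 1 10 100 9
my @as_numbers = sort { $a <=> $b } @nums;    # numeric: 1 9 10 100

print "@as_strings\n";
print "@as_numbers\n";

# Open for append with '>>': each open adds to the end of the file.
my ($tmp, $path) = tempfile(UNLINK => 1);
close($tmp);
foreach my $word ("first", "second") {
    open(my $log, '>>', $path) or die "cannot append to $path: $!";
    print $log "$word\n";
    close($log);
}
```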

Reading perl programs

Perl lets you do lots more than what I've detailed. If you start reading random perl programs, you'll notice the use of defaults (e.g. $_) in procedures, substitutions, foreach clauses, etc. The best thing I can say is to read the manual before trying to read programs. I'm not a huge fan of these shortcuts, but perhaps I'm not the prototypical perl hacker.

Command line arguments are in the @ARGV array.

You can write to standard error by writing to STDERR.

You can exit from a program with exit.

More, more, more

There is much more that you can do with perl. I have omitted procedure calls, but obviously they exist in the language. There is also support for networking. The best way to learn is to explore. Enjoy.