Scripts and Utilities -- Perl lecture, part 2



Perl, putting it together


Last time we talked about operations and operators on perl's data storage types, but I forgot to mention how perl makes boolean decisions.

Booleans/undef

In perl, undefined variables have the special value undef. This can be used in expressions, etc, and often makes life convenient. When you try to use undef as a string, you get "", and when you try to use it as a number, you get zero.

Boolean expressions in perl are kind of odd. Undef is false, as is the null string, and so is anything that converts to the one-character string "0". Everything else is true. Therefore, all numbers but zero are true, as are all strings but "" and "0". (Note that this makes the strings "0.0" and "00" true, even though they are zero when used as numbers.)
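
To make these rules concrete, here is a small sketch that runs a handful of values through perl's own truth test. The helper name truth() is made up for this illustration; it is not a perl builtin.

```perl
use strict;
use warnings;

# Hypothetical helper: report how perl judges a value in a boolean test.
sub truth {
    my ($value) = @_;
    return $value ? "true" : "false";
}

print "undef  is ", truth(undef), "\n";   # undef is false
print "\"\"     is ", truth(""),    "\n"; # the null string is false
print "\"0\"    is ", truth("0"),   "\n"; # the string "0" is false
print "0      is ", truth(0),     "\n";   # the number zero is false
print "0.0    is ", truth(0.0),   "\n";   # still the number zero
print "\"0.0\"  is ", truth("0.0"), "\n"; # a string other than "0": true
print "\"00\"   is ", truth("00"),  "\n"; # likewise true
print "-1     is ", truth(-1),    "\n";   # any nonzero number is true
```

The surprising cases are the last two strings: they are zero as numbers, but as strings they are neither "" nor "0", so they count as true.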

Simple perl programs

Perl programs are more like C than awk -- you must request lines from standard input rather than getting them by default. Here's the canonical "hello world" program in perl:
UNIX> cat hw.perl
print "hello world\n";
UNIX> perl hw.perl
hello world
UNIX> 
So, like awk, there is a print statement. Unlike awk, you have to provide your own newline. Print does accept a comma-separated list of values, but unlike awk it puts nothing between them by default, so the examples here build each line with the string concatenation operator (.) instead.

Like C, perl programs are tokens separated by whitespace (i.e. commands can span lines). You must end all commands with semicolons.

UNIX> cat simp.perl

print "Jim\n";
print 1.55 . "\n";
print "Jim" . " " . "Plank" . "\n";
print (("5" + 6) . "\n");

UNIX> perl simp.perl
Jim
1.55
Jim Plank
11
UNIX> 

Operations with Scalar Variables

UNIX> cat scalar.perl

$i = 1;
$j = "2";
print "$i\n";
print "$j\n";
$k = $i + $j;
print "$k\n";
print $i . $j . "\n";
print '$k\n'. "\n";

UNIX> perl scalar.perl
1
2
3
12
$k\n
UNIX> 

Fancy string stuff

Inside double quotes, perl recognizes many escape sequences in addition to those you might be used to from C. These can make your life simpler or more confusing depending on your point of view. If they make things confusing, just forget I mentioned them. Besides the 'normal' \n for newline and \t for tab we have:

Construct   Meaning
\cC         Any "control" character (here, CNTL-C)
\\          Backslash
\"          Double quote
\l          Lowercase next letter
\L          Lowercase all following letters until \E
\u          Uppercase next letter
\U          Uppercase all following letters until \E
\Q          Backslash-quote all non-alphanumerics until \E
\E          Terminate \L, \U, or \Q

See strings.perl for a sample. (You must run it with perl5.)
UNIX> perl5 strings.perl
This is a demonstration of "double quotes" and \ (backslashes)
this is about changing case OF TEXT inside Double quotes
And\ this\ is\ the\ \"backslash\-quote\"\ option\;\ which\ is\ weird\.\
UNIX> 
There are several others that don't usually seem necessary to me, but if you need one, it is probably available. Read the man pages or a book for more info. Also remember that double-quoted strings are what one book calls "variable interpolated", meaning that variables are replaced by their current values inside the strings, just like in the Bourne shell.
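
Since strings.perl is not reproduced here, the following is only a sketch of how a few of these constructs behave together with variable interpolation. The variable names and sample strings are mine, not taken from strings.perl.

```perl
use strict;
use warnings;

my $name = "plank";   # sample value, chosen for this illustration

my $greeting = "Hello, \u$name!";           # \u uppercases the next letter
my $shout    = "\Uhello $name\E, quietly";  # \U ... \E uppercases a span
my $lower    = "\LMIXED Case\E Text";       # \L ... \E lowercases a span
my $quoted   = "\Qa.b*c\E";                 # \Q ... \E backslashes . and *

print "$greeting\n";
print "$shout\n";
print "$lower\n";
print "$quoted\n";
```

Note that \U also applies to the interpolated value of $name, because the case change happens after the variable has been substituted.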

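Perl provides regular expression matching and substitution in a form very familiar from sed/awk. The matching operator is =~, and it yields a boolean result. Regular expressions are enclosed in slashes and work pretty much as they do in sed/awk. There are a few differences, which show up in the examples below: parentheses and braces group and count without backslashes, | separates alternatives, and there are shorthands such as \w (word character), \W (non-word character), \b (word boundary) and \B (not a word boundary).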

Some examples (these are in match.perl).
$i = "Jim";
$j = "JjJjJjJj";
$k = "Boom Boom, out go the lights!";

$i =~ /Jim/;                   True
$i =~ /J/;                     True
$i =~ /j/;                     False
$i =~ /j/i;                    True
$i =~ /\w/;                    True
$i =~ /\W/;                    False

$j =~ /j*/;                    True -- matches anything
$j =~ /j+/;                    True -- matches the first 'j'
$j =~ /j?/;                    True -- matches the empty string at the start
$j =~ /j{2}/;                  False
$j =~ /j{2}/i;                 True -- ignores case
$j =~ /(Jj){3}/;               True -- matches the first six characters

$k =~ /Jim|Boom/;              True -- matches Boom
$k =~ /(Boom){2}/;             False -- there's a space between Booms
$k =~ /(Boom ){2}/;            False -- the second Boom is followed by a comma
$k =~ /(Boom\W){2}/;           True 
$k =~ /\bBoom\b/;              True -- shows word delimiters
$k =~ /\bBoom.*the\b/;         True 
$k =~ /\Bgo\B/;                False -- because "go" is a word by itself
$k =~ /\Bgh\B/;                True -- the "gh" is in the middle of "lights"
Note that when you run match.perl, the false results are printed as null strings, not zeros.
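
The following sketch shows what a match hands back when you store its result, and uses parentheses to capture exactly what a pattern matched. It is not match.perl itself; the variable names $hit, $miss, $three and $opt are made up for this illustration.

```perl
use strict;
use warnings;

my $i = "Jim";
my $j = "JjJjJjJj";

# In a scalar assignment, a match yields 1 for true and "" for false.
my $hit  = ($i =~ /J/);
my $miss = ($i =~ /j/);
print "hit=[$hit] miss=[$miss]\n";

# In a list assignment, a match returns what the parentheses captured.
my ($three) = ($j =~ /((Jj){3})/);   # the outer parentheses come first
my ($opt)   = ($j =~ /(j?)/);        # j? is allowed to match nothing

print "(Jj){3} matched [$three]\n";
print "j? matched [$opt]\n";
```

The last line prints an empty pair of brackets: the string starts with a capital J, so j? succeeds immediately by matching zero characters.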

Regular Expression Substitution

You can modify a string variable by applying a sed-like substitution. The operator is again =~, and the substitution is specified as
s/pattern/replacement/
So, for example, see sub1.perl:
$j = "Jim Plank";

$j =~ s/ .*/i Hendrix/;           Makes 'Jimi Hendrix'
$j =~ s/i/I/g;                    Makes 'JImI HendrIx'
$j =~ s/\b\w*\b/Dikembe/;         Makes 'Dikembe HendrIx'
$j =~ s/(\b\w*\b)/Jimi "\1"/;     Makes 'Jimi "Dikembe" HendrIx'
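Unfortunately, you can't use =~ (or !~, for that matter) to apply a substitution the way you would apply, for example, addition, as in:
$k = $j =~ s/Jim/Jimi/;
This is legal, but it changes $j in place and sets $k to the number of substitutions made rather than to the new string. Which is a pity.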
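
There are ways around this. The sketch below shows what the assignment above really stores, then two ways to get a modified copy while leaving the original alone. The /r flag is only available in perl 5.14 and later; the variable names are made up for this illustration.

```perl
use strict;
use warnings;

# What the plain assignment stores: the number of substitutions.
my $j = "Jim Plank";
my $count = ($j =~ s/Jim/Jimi/);
print "count=$count j=$j\n";

# Workaround 1: copy first, then substitute in the copy.
my $orig = "Jim Plank";
(my $k = $orig) =~ s/Jim/Jimi/;
print "orig=$orig k=$k\n";

# Workaround 2 (perl 5.14 or later): /r returns the new string
# and leaves the left-hand side untouched.
my $r = ($orig =~ s/Jim/Jimi/r);
print "orig=$orig r=$r\n";
```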

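You'll note in the last substitution of sub1.perl, I used the parentheses as memory. This is analogous to \( and \) in sed, and as in sed you can refer back to the memory even in the first pattern. (In the replacement, perl also lets you write $1 for \1, and that is the form the perl documentation prefers.) For example, see sub2.perl: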

$j = "Jim Plank";
$j =~ s/(\w*) (\w*)/\1 \1 \2/;      Makes 'Jim Jim Plank'

$i = "I am the the man";
$i =~ s/(\b\w+\b) \1/\1/;           Makes 'I am the man' -- figure it out!
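
As a variation on the last example, here is a sketch that removes every doubled word in a string, not just the first, by adding the g flag. It uses \1 inside the pattern and $1 in the replacement. The sample sentence is made up.

```perl
use strict;
use warnings;

my $text = "I am the the man and and so on";   # made-up sample input

# \1 in the pattern must repeat whatever (\w+) matched;
# $1 in the replacement puts one copy of it back.
(my $fixed = $text) =~ s/\b(\w+) \1\b/$1/g;

print "$fixed\n";
```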

For/if/while

For, if and while clauses work like their C counterparts, except the body of the clause must be enclosed in curly braces.

There are many ways of doing if statements, but some of them are so odious that I won't divulge them. Read a perl manual.

Instead of doing "else if" as in C, you should do "elsif". This is like elif in the Bourne shell.
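
A minimal sketch of the three constructs, with the mandatory curly braces and elsif. The subroutine sign_name() and the numbers are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a number with if / elsif / else.
sub sign_name {
    my ($n) = @_;
    if ($n < 0) {
        return "negative";
    } elsif ($n == 0) {
        return "zero";
    } else {
        return "positive";
    }
}

# A C-style for loop: add up 1 through 5.
my $sum = 0;
for (my $i = 1; $i <= 5; $i++) {
    $sum += $i;
}

# A while loop: count how many times 5 can be halved before reaching 1.
my $n = 5;
my $steps = 0;
while ($n > 1) {
    $n = int($n / 2);
    $steps++;
}

print sign_name(-3), " ", sign_name(0), " ", sign_name(7), "\n";
print "sum=$sum steps=$steps\n";
```

Even a one-statement body needs its braces; writing "if ($n < 0) return;" as you might in C is a syntax error.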

Standard input, files

In perl, files are accessed using "filehandles" (their term). Three are provided: STDIN, STDOUT, and STDERR. To read a line from a filehandle, enclose it in angle brackets (< >). (Filehandles have no special character preceding them, as scalars, arrays and hashes do, so it is recommended that you always use all caps for filehandles to keep them from conflicting with any future reserved words.) At end of file, the read returns undef. The program stdin.perl uses this to copy standard input to standard output:
UNIX> cat input
D
F
C
B
E
A
UNIX> perl stdin.perl < input
D
F
C
B
E
A
UNIX> 
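Note that you get the newline with the string. To get rid of the newline, use the chop() procedure, which modifies its argument to get rid of the last character. Since chop() removes the last character whatever it is, perl5 also provides chomp(), which removes the last character only if it is a newline and is the safer choice.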
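
The text of stdin.perl is not reproduced here, so the following is only a sketch of what such a program might look like. Wrapping the loop in a subroutine called copy_lines() and passing the filehandles as arguments are my own choices, made so that the same loop works on any pair of filehandles.

```perl
use strict;
use warnings;

# Copy every line from one filehandle to another and return the
# number of lines copied. The newline is removed with chomp() and
# then put back, just to show where chomp() fits.
sub copy_lines {
    my ($in, $out) = @_;
    my $count = 0;
    while (defined(my $line = <$in>)) {   # the read gives undef at EOF
        chomp($line);
        print $out "$line\n";
        $count++;
    }
    return $count;
}

# \*STDIN and \*STDOUT pass the predefined filehandles to a subroutine.
copy_lines(\*STDIN, \*STDOUT);
```

Run as "perl thisfile.perl < input", this should echo the input unchanged.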

You can open a file for input and then use it like STDIN, above. Moreover, you can open a file for output and print to it. For example, catinput.perl copies the file input to the file output. It also shows use of chop():

UNIX> perl catinput.perl
UNIX> cat output
D
F
C
B
E
A
UNIX> 
Note that when the file output was opened, a > was included in the string. This tells perl to open the file as output. Had we wanted to append to the file we would have used >> instead. You can also print to pipes, read input from pipes, etc.
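
Again, catinput.perl itself is not shown here, so this is a sketch under my own assumptions: the file names come from the command line instead of being fixed as input and output, and the subroutine name copy_file() is made up. It also uses the three-argument form of open, which gives the > as a separate argument; the two-argument form described above would be written open(OUT, ">output").

```perl
use strict;
use warnings;

# Copy the file named $src to the file named $dst, line by line.
sub copy_file {
    my ($src, $dst) = @_;
    open(my $in,  '<', $src) or die "cannot read $src: $!";
    open(my $out, '>', $dst) or die "cannot write $dst: $!";   # '>>' would append
    while (defined(my $line = <$in>)) {
        chomp($line);              # strip the newline ...
        print $out "$line\n";      # ... and write it back explicitly
    }
    close($in);
    close($out) or die "cannot close $dst: $!";
    return;
}

# Only copy when two file names were given on the command line.
copy_file(@ARGV) if @ARGV == 2;
```

With this sketch, "perl thisfile.perl input output" would do what the text describes for catinput.perl.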

Arrays

We talked about arrays in the last lecture, and now we will look at some operations performed with them.

For example, sort1.perl sorts standard input by reading it into an array and printing the sorted array.

UNIX> cat input
D
F
C
B
E
A
UNIX> perl sort1.perl < input
A
B
C
D
E
F
UNIX> 
You can make this simpler: in a place where a list is expected, <STDIN> returns all of the remaining lines at once, so you can simply print the sorted result. This is in sort2.perl:
UNIX> perl sort2.perl < input
A
B
C
D
E
F
UNIX> 
Be careful with this type of usage: it reads the whole input into memory, so a huge input could cause memory problems.
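
Neither sort1.perl nor sort2.perl is reproduced here, so here is a sketch of the idea. The subroutine name sorted_lines() is made up; it reads a whole filehandle into an array and returns the lines in sorted order.

```perl
use strict;
use warnings;

# Read all lines from a filehandle into an array and return them sorted.
sub sorted_lines {
    my ($fh) = @_;
    my @lines = <$fh>;          # list context: every remaining line at once
    my @sorted = sort @lines;   # default sort compares as strings
    return @sorted;
}

print sorted_lines(\*STDIN);

# The short version described above would be the single statement:
#   print sort <STDIN>;
```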

Other useful things you can do are to split a string into an array of its words (much like awk), to use the subroutines push() and pop() to add and remove elements at the end of an array, and to use unshift() and shift() to add and remove elements at the front of the array.

You can get the size of an array by using the array in a place where a single (scalar) value is expected, for example by assigning it to a scalar variable.

The program reverse.perl uses push() and pop() to reverse a file, and the program revline.perl uses split() in a typical way, and the array size to reverse each line of a file.

(The syntax of split() is split(pattern,string), where the pattern specifies how the space between words is delimited. split(/\s+/,string) means to use contiguous blocks of whitespace as the word delimiter).

Of course, there is also a reverse operator which returns the reverse of an array, and this can be used to make the above programs simpler. See reverse2.perl and revline2.perl. The latter makes use of the foreach construct to iterate over all elements in an array. Does revline2.perl feel like it's approaching unreadability? I agree.
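
Here is a small sketch of these array operations on one line of text. It is not one of the programs named above; the variable names are made up.

```perl
use strict;
use warnings;

my @words = split(/\s+/, "Sam I am");   # ("Sam", "I", "am")
my $count = @words;                     # array used as a scalar: its size

push(@words, "today");                  # add at the end
my $last = pop(@words);                 # remove from the end

unshift(@words, "Yes");                 # add at the front
my $first = shift(@words);              # remove from the front

my $backwards = join(" ", reverse(@words));   # reverse the word order

print "count=$count last=$last first=$first\n";
print "$backwards\n";
```

After the push/pop and unshift/shift pairs the array is back to its original three words, so the last line prints them in reverse order.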

UNIX> cat input2
I am Sam
I am Sam
Sam I am
That Sam I am, that Sam I am, I do not like that Sam I am!
UNIX> perl reverse.perl < input2
That Sam I am, that Sam I am, I do not like that Sam I am!
Sam I am
I am Sam
I am Sam
UNIX> perl reverse2.perl < input2
That Sam I am, that Sam I am, I do not like that Sam I am!
Sam I am
I am Sam
I am Sam
UNIX> perl revline.perl < input2
Sam am I 
Sam am I 
am I Sam 
am! I Sam that like not do I am, I Sam that am, I Sam That 
UNIX> perl revline2.perl < input2
Sam am I 
Sam am I 
am I Sam 
am! I Sam that like not do I am, I Sam that am, I Sam That 
UNIX> 

Associative Arrays

Like awk and python, perl has associative arrays. Again, you create them simply by using them. When accessing a single value, you precede the array name with a dollar sign and enclose the key in curly braces. When accessing the whole array, you precede it with a percent sign. The keys() function returns an array of the keys of the associative array, and the values() function returns the values. Neither returns its results in any guaranteed order.

So, for example, suppose you have a list of first names, last names, and phone numbers, and you want to print it sorted in the format: last name, first name, phone number. Then you can do something like phone.perl. Note that perl does support printf.
UNIX> cat input3
Peyton Manning 423-vol-qb4u
Phil Fulmer 423-vol-head
Pat Summitt 423-lvl-head
Joe Johnson 423-vol-prez
Jim Plank 423-vol-peon
UNIX> perl phone.perl < input3
    Fulmer,       Phil,                   423-vol-head
   Johnson,        Joe,                   423-vol-prez
   Manning,     Peyton,                   423-vol-qb4u
     Plank,        Jim,                   423-vol-peon
   Summitt,        Pat,                   423-lvl-head
UNIX> 
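
The text of phone.perl is not reproduced here. The sketch below is one way to get that kind of output: it keeps two associative arrays keyed by last name and sorts the keys. To keep it self-contained it uses three records from input3 written directly in the program instead of reading standard input, and the format string is my guess from the column widths in the output above.

```perl
use strict;
use warnings;

# Three records from input3: first name, last name, phone number.
my @lines = (
    "Peyton Manning 423-vol-qb4u",
    "Phil Fulmer 423-vol-head",
    "Jim Plank 423-vol-peon",
);

my %first;   # last name -> first name
my %phone;   # last name -> phone number

foreach my $line (@lines) {
    my ($fn, $ln, $number) = split(/\s+/, $line);
    $first{$ln} = $fn;      # a single value: dollar sign and curly braces
    $phone{$ln} = $number;
}

# keys() promises no particular order, so sort the last names.
my @report;
foreach my $ln (sort keys %first) {
    push(@report, sprintf("%10s, %10s, %30s", $ln, $first{$ln}, $phone{$ln}));
}

foreach my $row (@report) {
    print "$row\n";
}
```

Keying on the last name assumes no two people share one; a second Plank would silently overwrite the first.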

Listing files

Perl lets you do directory listings with shell-style pattern matching. A simple example is ls.perl which lists the files in the current directory with the .perl extension:
UNIX> perl ls.perl
catinput.perl
hw.perl
ls.perl
match.perl
other.perl
phone.perl
reverse.perl
reverse2.perl
revline.perl
revline2.perl
scalar.perl
simp.perl
sort1.perl
sort2.perl
stdin.perl
sub1.perl
sub2.perl
UNIX> 
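
Since ls.perl is not reproduced here, this is a sketch of one way to produce such a listing, using perl's glob() function. The subroutine name perl_files() is made up for this illustration.

```perl
use strict;
use warnings;

# Return the names in the current directory that end in .perl, sorted.
sub perl_files {
    my @names  = glob("*.perl");   # shell-style pattern matching
    my @sorted = sort @names;
    return @sorted;
}

foreach my $name (perl_files()) {
    print "$name\n";
}
```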

Other constructs that you should know about

Look at other.perl. This contains code for opening a file for append, writing to a pipe, reading from a pipe and sorting numerically. Try it out.

The last part needs some explanation. The <=> operator, which the perl books call the spaceship operator, compares the two values $a and $b numerically, returning 1 if $a > $b, -1 if $a < $b, and 0 otherwise. (Don't worry about any old values of $a and $b; they are protected.) This lets sort order its input numerically instead of lexicographically.
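
A short sketch of the difference, using a made-up list of numbers rather than the data in other.perl:

```perl
use strict;
use warnings;

my @nums = (10, 9, 100, 1);   # made-up sample data

my @as_strings = sort @nums;                  # default: compares as strings
my @as_numbers = sort { $a <=> $b } @nums;    # ascending numeric order
my @descending = sort { $b <=> $a } @nums;    # swap $a and $b to reverse

print "@as_strings\n";
print "@as_numbers\n";
print "@descending\n";
```

The first line comes out as 1 10 100 9, because as strings "100" sorts before "9"; the second comes out as 1 9 10 100.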

Command line arguments are in the @ARGV array.

You can exit from a program with exit.

Reading perl programs

Perl lets you do lots more than what I've detailed. If you start reading random perl programs, you'll notice the use of defaults (e.g. $_) in procedures, substitutions, foreach clauses, etc. The best thing I can say is to read the manual before trying to read programs. I'm not a huge fan of many of these shortcuts because I find they tend to destroy readability, but you can make your own decisions.

More, more, more

There is much more that you can do with perl. I have omitted procedure calls, but obviously they exist in the language. There is also support for networking. The best way to learn is to explore. Enjoy.