Scripts and Utilities -- Perl lecture, part 2



Booleans/undef

In perl, undefined variables have the special value undef. This can be used in expressions, etc, and often makes life convenient. When you try to use undef as a string, you get "", and when you try to use it as a number, you get zero.

Boolean expressions in perl are kind of odd. Undef is false, as is the null string, and anything that casts to a string containing a single zero. Everything else is true. Therefore, all numbers but zero are true, as are all strings but "" and "0".

If you want to know whether a value is undef or a variable is undefined, you can use the defined function (e.g., defined($a)). If the value is anything but undef, defined will return 1. Otherwise it returns the false value, which prints as an empty string.
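Here is a small sketch of the rules above. The variable names are made up for illustration; the idea is just to run a handful of values through a truth test and through defined.

```perl
use strict;
use warnings;

my $x;                                   # never assigned, so its value is undef

# defined() distinguishes undef from other false values such as 0 and "".
my $x_defined    = defined($x) ? "yes" : "no";
my $zero_defined = defined(0)  ? "yes" : "no";

# Truth test over a handful of values; only undef, "", 0 and "0" are false.
my @results;
foreach my $value (undef, "", 0, "0", "0.0", "00", " ", 1, "abc") {
    my $shown = defined($value) ? "\"$value\"" : "undef";
    push @results, "$shown is " . ($value ? "true" : "false");
}

print "x defined? $x_defined\n";
print "0 defined? $zero_defined\n";
print "$_\n" foreach @results;
```

Note that the strings "0.0" and "00" come out true: only the string consisting of a single zero is false.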


Simple perl programs

Here's the canonical ``hello world'' program in perl:

 
UNIX> cat hw.perl
print "hello world\n";
UNIX> perl hw.perl
hello world
UNIX> 

So, like Python there is a print statement. Unlike Python, you have to provide your own newline. You can give print a comma-separated list of values, but it prints them one after another with nothing in between, so the examples here build a single string with the . (concatenation) operator instead.

Like C, perl programs are tokens separated by whitespace (i.e. commands can span lines). You must end all commands with semicolons.

 
UNIX> cat simp.perl
 
print "Jim\n";
print 1.55 . "\n";
print "Jim" . " " . "Plank" . "\n";
print (("5" + 6) . "\n");
 
UNIX> perl simp.perl
Jim
1.55
Jim Plank
11
UNIX> 


Operations with Scalar Variables

 
UNIX> cat scalar.perl
 
$i = 1;
$j = "2";
print "$i\n";
print "$j\n";
$k = $i + $j;
print "$k\n";
print $i . $j . "\n";
print '$k\n'. "\n";
 
UNIX> perl scalar.perl
1
2
3
12
$k\n
UNIX> 

Fancy string stuff

Inside double quotes there are many representations that are in addition to those you might be used to from C. These can make your life simpler or more confusing depending on your point of view. If they make things confusing, just forget I mentioned them. Besides the 'normal' \n for newline and \t for tab we have

Construct    Meaning
\cC          Any "control" character (here, CNTL-C)
\\           Backslash
\"           Double quote
\l           Lowercase next letter
\L           Lowercase all following letters until \E
\u           Uppercase next letter
\U           Uppercase all following letters until \E (like \L)
\Q           Backslash-quote all non-alphanumerics until \E
\E           Terminate \L, \U, or \Q

See strings.perl for a sample. (You must run it with perl5.)

 
UNIX> perl5 strings.perl
This is a demonstration of "double quotes" and \ (backslashes)
this is about changing case OF TEXT inside Double quotes
And\ this\ is\ the\ \"backslash\-quote\"\ option\;\ which\ is\ weird\.\
UNIX> 
 

There are several others that are not mentioned here. Read the man pages or a book for more info. Also remember that double-quoted strings are what one book calls "variable interpolated," meaning that variables are replaced by their current values inside the strings, just like in the Bourne shell.
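To make interpolation and the case-changing constructs concrete, here is a short sketch. It is not strings.perl; the variable names and the sample text are my own.

```perl
use strict;
use warnings;

my $name = "perl";

my $plain   = "I like $name";            # variable interpolation
my $literal = 'I like $name';            # single quotes: no interpolation
my $first   = "I like \u$name";          # \u uppercases the next character
my $shout   = "I like \U$name\E a lot";  # \U ... \E uppercases a whole span
my $quoted  = "\Qa.b*c\E";               # \Q backslashes non-alphanumerics

print "$plain\n$literal\n$first\n$shout\n$quoted\n";
```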

Perl provides regular expression matching and substitution in a form that will be very familiar from sed/awk. The matching operator is =~, and it is a boolean operator. Regular expressions are enclosed in slashes, and work pretty much as they do in sed/awk. There are a few differences:

·       If you follow a character with '+', it means one or more, and if you follow it with '?' it will match zero or one.

·       If you follow a character with '{n}', it will match exactly n occurrences. Similarly, '{n,m}' and '{n,}' have their sed-like meanings.

·       If you put '|' between two patterns, it will match either pattern.

·       You can use parentheses to have these operators work on patterns that are bigger than one character. The precedence is parens, then ``multipliers'' (*,+,?,{n,m}), then sequences/``anchoring'' (^,$), then ``alternation'' (|). We will look at a few examples later on.

·       There are some character classes that are predefined:

o      Digits are '\d', and their complement (not digits) is '\D'.

o      Word characters ([a-zA-Z0-9_]) are '\w', and their complement is '\W'.

o      Whitespace is '\s', and the complement is '\S'.

·       '\b' matches the beginning or end of a word, and '\B' matches anything but the beginning or end of a word.

Some examples (these are in match.perl).

 
$i = "Jim";
$j = "JjJjJjJj";
$k = "Boom Boom, out go the lights!";
 
$i =~ /Jim/;                   True
$i =~ /J/;                     True
$i =~ /j/;                     False
$i =~ /j/i;                    True
$i =~ /\w/;                    True
$i =~ /\W/;                    False
 
$j =~ /j*/;                    True -- matches anything
$j =~ /j+/;                    True -- matches the first 'j'
$j =~ /j?/;                    True -- matches the empty string at the start
$j =~ /j{2}/;                  False
$j =~ /j{2}/i;                 True -- ignores case
$j =~ /(Jj){3}/;               True -- matches the first six characters
 
$k =~ /Jim|Boom/;              True -- matches Boom
$k =~ /(Boom){2}/;             False -- there's a space between Booms
$k =~ /(Boom ){2}/;            False -- the second Boom ends with a comma
$k =~ /(Boom\W){2}/;           True 
$k =~ /\bBoom\b/;              True -- shows word delimiters
$k =~ /\bBoom.*the\b/;         True 
$k =~ /\Bgo\B/;                False -- false, because "go" is a word
$k =~ /\Bgh\B/;                True -- the "gh" is in the middle of "lights"

Note that when you run match.perl, the false results are printed as null strings, not zeros.
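A quick way to see this for yourself is to store the result of a match in a variable and print it. This is a sketch of my own, not part of match.perl:

```perl
use strict;
use warnings;

my $k = "Boom Boom, out go the lights!";

my $hit  = ($k =~ /\bBoom\b/);           # true: the match operator yields 1
my $miss = ($k =~ /\Bgo\B/);             # false: it yields the empty string

print "hit=[$hit] miss=[$miss]\n";
print "miss has length ", length($miss), "\n";
```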


Regular Expression Substitution

You can modify a string variable by applying a sed-like substitution. The operator is again “=~”, and the substitution is specified as

s/pattern/replacement/

So, for example, see sub1.perl:

 
$j = "Jim Plank";
 
$j =~ s/ .*/i Hendrix/;           Makes 'Jimi Hendrix'
$j =~ s/i/I/g;                    Makes 'JImI HendrIx'
$j =~ s/\b\w*\b/Dikembe/;         Makes 'Dikembe HendrIx'
$j =~ s/(\b\w*\b)/Jimi "\1"/;     Makes 'Jimi "Dikembe" HendrIx'
 

In addition, the operator “!~” is used for spotting a non-match.

 
$sentence = "The quick brown fox";
$sentence !~ /the/
 

The above expression is true because the string “the” does not appear in $sentence (the match is case-sensitive, so “The” does not count).

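Unfortunately, =~ (and !~, for that matter) does not return the modified string, so you can't use substitution as you would, for example, addition. The statement

$k = $j =~ s/Jim/Jimi/;

modifies $j and sets $k to the number of substitutions made, not to the new string. Which is a pity.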
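The sketch below shows what a statement like that actually does: the substitution changes the variable in place and hands back a count. It also shows one common workaround, which is to copy the string first and substitute on the copy. The names $original and $copy are just for illustration.

```perl
use strict;
use warnings;

my $j = "Jim Plank";

# s/// returns the number of substitutions made, not the new string.
my $count = ($j =~ s/Jim/Jimi/);         # $j is changed in place

# To leave the original alone, copy first and substitute on the copy.
my $original = "Jim Plank";
(my $copy = $original) =~ s/Jim/Jimi/;

print "$count $j\n";
print "$original -> $copy\n";
```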

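You'll note in the last substitution of sub1.perl, I used the parentheses as memory. This is analogous to \( and \) in sed, except you can use the memory even in the first pattern. (In the replacement part, current versions of perl prefer $1 to \1; the \1 form still works but draws a warning when warnings are enabled.) For example, see sub2.perl: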

 
$j = "Jim Plank";
$j =~ s/(\w*) (\w*)/\1 \1 \2/;      Makes 'Jim Jim Plank'
 
$i = "I am the the man";
$i =~ s/(\b\w+\b) \1/\1/;           Makes 'I am the man' -- figure it out!
 

Translation

In addition to the regular substitution operator, perl has a translation operator given by “tr”. The tr function allows character-by-character translation.

The following expression replaces each a with e, each b with d, and each c with f in the variable $sentence. The expression returns the number of substitutions made.

 
$sentence =~ tr/abc/edf/

Most of the special RE codes do not apply in the tr function. For example, the statement here counts the number of asterisks in the $sentence variable and stores that in the $count variable.

$count = ($sentence =~ tr/*/*/);
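Here is a small sketch of both uses of tr, with sample strings of my own choosing:

```perl
use strict;
use warnings;

# Character-by-character translation: a->e, b->d, c->f.
my $sentence = "a cab is back";
my $changed  = ($sentence =~ tr/abc/edf/);   # returns how many characters matched

# Counting: translating '*' to itself leaves the string alone.
my $stars       = "2 * 3 * 4";
my $stars_count = ($stars =~ tr/*/*/);

print "$sentence ($changed characters translated)\n";
print "$stars has $stars_count asterisks\n";
```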


For/if/while

For, if and while clauses work like their C counterparts, except the body of the clause must be enclosed in curly braces.

There are many ways of doing if statements, but some of them are so odious that I won't divulge them. Read a perl manual.

Instead of doing "else if" as in C, you should do "elsif". This is like elif in the Bourne shell.
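A minimal sketch of the three constructs, including elsif. The variables are made up for the example:

```perl
use strict;
use warnings;

# for and if/elsif/else: the braces are required even for one-line bodies.
my @labels;
for (my $i = -1; $i <= 1; $i++) {
    if ($i < 0) {
        push @labels, "$i is negative";
    } elsif ($i == 0) {
        push @labels, "$i is zero";
    } else {
        push @labels, "$i is positive";
    }
}

# while: count down from 3.
my $n = 3;
my $countdown = "";
while ($n > 0) {
    $countdown .= "$n ";
    $n--;
}
$countdown .= "liftoff";

print "$_\n" foreach @labels;
print "$countdown\n";
```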


Standard input, files

In perl, files are accessed using "filehandles" (their term). Three are predefined: STDIN, STDOUT, and STDERR. To read a line from a filehandle, enclose it in angle brackets (for example, <STDIN>). The empty form <> reads from the files named on the command line, or from STDIN if there are none. (Filehandles do not have a special character preceding them, as scalars, arrays, and hashes do, so it is recommended that you always use all caps for filehandles to keep them from conflicting with any future reserved words.) Reading at EOF returns undef. Therefore, stdin.perl copies standard input to standard output:

 
UNIX> cat input
D
F
C
B
E
A
UNIX> perl stdin.perl < input
D
F
C
B
E
A
UNIX> 

Note that you get the newline with the string. To get rid of the newline, use the chomp() procedure, which modifies its argument by removing a trailing newline if there is one. (The older chop() removes the last character regardless of what it is.)
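The source of stdin.perl is not shown here, so the following is only a sketch of the kind of loop such a program would use. To keep it self-contained it reads from an in-memory string instead of standard input (the $text and $in names are my own); with real input you would read from <STDIN> instead of <$in>.

```perl
use strict;
use warnings;

# Stand-in for standard input so the sketch runs on its own.
my $text = "D\nF\nC\n";
open(my $in, "<", \$text) or die "cannot open in-memory file: $!";

my @seen;
while (defined(my $line = <$in>)) {      # the read returns undef at end of file
    chomp($line);                        # strip the trailing newline, if any
    push @seen, $line;
    print "$line\n";                     # put the newline back when printing
}
close($in);
```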

You can open a file for input and then use it like STDIN, above. Moreover, you can open a file for output and print to it.

The open function opens a file for input (i.e. for reading). The first parameter is the filehandle, which allows Perl to refer to the file in the future. The second parameter is an expression denoting the filename. If the filename is given in quotes, it is taken literally, without shell expansion, so the expression '~/notes/todolist' will not be interpreted successfully. If you want to force shell expansion, use angle brackets: that is, use <~/notes/todolist> instead. The close function tells Perl to finish with that file.

In addition, the open statement can also specify a file for output and for appending as well as for input. To do this, prefix the filename with a “>” for output and a “>>” for appending:

open(INFO, $file);     # Open for input
open(INFO, ">$file");  # Open for output
open(INFO, ">>$file"); # Open for appending
open(INFO, "<$file");  # Also open for input

To print something to a file that has already been opened, use the print statement with an extra parameter. For example,

print INFO "This line goes to the file.\n";  # Writes line to file specified by filehandle INFO

You can return an open file handle from a subroutine and store it in a scalar variable. For example:

 
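sub openHandle {
  my ($filename) = @_;
  # Open the named file for output; a handle stored in a scalar can be returned
  open(my $fh, ">", $filename) or die "cannot open $filename: $!";
  return $fh;
}
 
$handle = openHandle($ARGV[0]);
print $handle "hi brad\n";
close $handle;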

Finally, open can be used to access the standard input (usually the keyboard) and standard output (usually the screen) respectively:

open(INFO, '-');       # Open standard input
open(INFO, '>-');      # Open standard output

For example, catinput.perl copies the file input to the file output. It also shows use of chomp():

UNIX> perl catinput.perl
UNIX> cat output
D
F
C
B
E
A
UNIX> 
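Since the source of catinput.perl is not shown, here is a sketch of how a copy like that could be written. It is not the actual program. It uses a scratch directory from the standard File::Temp module so that it does not touch your files, and it uses the three-argument form of open with handles stored in scalars, which is an alternative to the two-argument form shown above. All of the variable names are my own.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);         # scratch directory, removed at exit
my ($src, $dst) = ("$dir/input", "$dir/output");

# Create a small input file to copy.
open(my $make, ">", $src) or die "cannot write $src: $!";
print $make "D\nF\nC\n";
close($make);

# Copy it line by line, chomping and re-adding the newline.
open(my $in,  "<", $src) or die "cannot read $src: $!";
open(my $out, ">", $dst) or die "cannot write $dst: $!";
while (defined(my $line = <$in>)) {
    chomp($line);
    print $out "$line\n";
}
close($in);
close($out);

# Read the copy back to show that it matches.
open(my $check, "<", $dst) or die "cannot read $dst: $!";
my @copied = <$check>;
close($check);
print @copied;
```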


Arrays

We talked about arrays in the last lecture, and now we will look at some operations performed with them.

For example, sort1.perl sorts standard input by reading it into an array and printing the sorted array.

 
UNIX> cat input
D
F
C
B
E
A
UNIX> perl sort1.perl < input
A
B
C
D
E
F
UNIX> 

You can make this simpler: in a list context, <STDIN> returns all of the remaining lines as an array, so you can simply print the sorted array. This is in sort2.perl:

 
UNIX> perl sort2.perl < input
A
B
C
D
E
F
UNIX> 

Be careful with this type of usage: the entire input is held in memory at once, so a huge input can cause memory problems.
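The sort programs themselves are not listed here, so this is only a sketch of the idea: read every line into an array in one step, sort the array, and print it. As a stand-in for standard input it reads from an in-memory string; $text and $in are names I made up.

```perl
use strict;
use warnings;

# Stand-in for standard input so the sketch runs on its own.
my $text = "D\nF\nC\nB\nE\nA\n";
open(my $in, "<", \$text) or die "cannot open in-memory file: $!";

my @lines  = <$in>;                      # list context: every line at once
my @sorted = sort @lines;                # default sort is alphabetical
close($in);

print @sorted;                           # each element still ends in a newline
```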

A very useful function in Perl is split, which splits up a string and places it into an array. The function expects a regular expression and works on the $_ variable unless otherwise specified.

For example:

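$info = "Balajee, Patrick, Michael, Josh, Leaf";
 
@personal = split(/, /, $info);

This is the same as saying,

@personal = ("Balajee", "Patrick", "Michael", "Josh", "Leaf");

(The pattern includes the space after each comma; splitting on /,/ alone would leave a leading space on every name but the first.)

If the string is in the $_ variable, the call can be shortened to:

@personal = split(/, /);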

Note: A word can be split into characters, a sentence into words, and a paragraph into sentences, depending on the pattern you give.

Other useful subroutines are push() and pop(), which add and remove elements at the end of an array, and unshift() and shift(), which add and remove elements at the front of the array. You can get the size of an array by using the array in a place where a scalar, such as an integer, is expected.
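A short sketch of these four subroutines and of getting the array size. The @queue array is just an example:

```perl
use strict;
use warnings;

my @queue = ("b", "c");

push(@queue, "d");                       # add at the end:     b c d
unshift(@queue, "a");                    # add at the front:   a b c d

my $size = @queue;                       # array where a scalar is expected: its size

my $last  = pop(@queue);                 # remove from the end:   leaves a b c
my $first = shift(@queue);               # remove from the front: leaves b c

print "size was $size, removed $first and $last, left with @queue\n";
```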

The program reverse.perl uses push() and pop() to reverse a file, and the program revline.perl uses split() in a typical way, and the array size to reverse each line of a file.

Note: The syntax of split() is split(pattern, string), where the pattern specifies how the space between words is delimited.

Of course, there is also a reverse operator, which returns the reverse of an array, and this can be used to make the above programs simpler. See reverse2.perl and revline2.perl. The latter makes use of the foreach construct to iterate over all elements in an array. Does revline2.perl feel like it's approaching unreadability?

UNIX> cat input2
I am Sam
I am Sam
Sam I am
That Sam I am, that Sam I am, I do not like that Sam I am!
UNIX> perl reverse.perl < input2
That Sam I am, that Sam I am, I do not like that Sam I am!
Sam I am
I am Sam
I am Sam
UNIX> perl reverse2.perl < input2
That Sam I am, that Sam I am, I do not like that Sam I am!
Sam I am
I am Sam
I am Sam
UNIX> perl revline.perl < input2
Sam am I 
Sam am I 
am I Sam 
am! I Sam that like not do I am, I Sam that am, I Sam That 
UNIX> perl revline2.perl < input2
Sam am I 
Sam am I 
am I Sam 
am! I Sam that like not do I am, I Sam that am, I Sam That 
UNIX> 
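The four programs are not listed here, so the following is a sketch of the techniques rather than a copy of any of them. It reverses a list of lines once with push() and pop() and once with reverse, and it reverses the words within each line using split, reverse, and join. The sample lines and variable names are my own.

```perl
use strict;
use warnings;

my @lines = ("I am Sam", "Sam I am");

# Reverse the order of the lines the long way, with push() and pop().
my @stack = @lines;
my @popped;
while (@stack) {                         # true while the array is non-empty
    push @popped, pop(@stack);
}

# The same thing with the reverse operator.
my @backwards = reverse @lines;

# Reverse the words within each line.
my @flipped;
foreach my $line (@lines) {
    my @words = split(/ /, $line);
    push @flipped, join(" ", reverse @words);
}

print "$_\n" foreach @popped;
print "$_\n" foreach @flipped;
```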

 

This seems as good a place as any to plug in the foreach construct. To go through each element of an array or other list-like structure (such as the lines in a file), Perl uses the “foreach” structure.

This has the form

 
foreach $var (@array)          # Visit each item in turn under the name of $var
{
        print "$var\n";        # Print the item
}
 

The actions to be performed each time are enclosed in a block of curly braces. If the array is empty to start with, then the block of statements is never executed.

---

Associative Arrays

Like awk and python, perl has associative arrays. When accessing a value, you precede the array name with a dollar sign and enclose the key in curly braces ($array_name{"key"}). When referring to the whole array, you precede it with a percent sign (%array_name). The keys() function returns an array of the keys of the associative array, and the values() function returns the values. Both return their results in no particular order. So, for example, suppose you have a list of first names, last names, and phone numbers, and you want to print it sorted in the format: last name, first, phone number. Then you can do something like phone.perl. Note that perl does support printf.

 
UNIX> cat input3
Peyton Manning 423-vol-qb4u
Phil Fulmer 423-vol-head
Pat Summitt 423-lvl-head
Joe Johnson 423-vol-prez
Jim Plank 423-vol-peon
UNIX> perl phone.perl < input3
    Fulmer,       Phil,                   423-vol-head
   Johnson,        Joe,                   423-vol-prez
   Manning,     Peyton,                   423-vol-qb4u
     Plank,        Jim,                   423-vol-peon
   Summitt,        Pat,                   423-lvl-head
UNIX> 
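phone.perl itself is not listed, so here is a sketch of one way to get output like this with associative arrays. It is not the actual program: the records are hard-coded rather than read from standard input, the field widths are a guess, and it assumes that last names are unique, since they are used as keys.

```perl
use strict;
use warnings;

my @records = (
    "Peyton Manning 423-vol-qb4u",
    "Phil Fulmer 423-vol-head",
    "Jim Plank 423-vol-peon",
);

# Two associative arrays, both keyed by last name.
my (%first, %phone);
foreach my $record (@records) {
    my ($fn, $ln, $number) = split(/ /, $record);
    $first{$ln} = $fn;
    $phone{$ln} = $number;
}

# keys() comes back in no particular order, so sort before printing.
my @report;
foreach my $ln (sort keys %first) {
    push @report, sprintf("%10s, %10s, %15s", $ln, $first{$ln}, $phone{$ln});
}

print "$_\n" foreach @report;
```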


Subroutines

Perl allows the user to define their own functions, called subroutines. They may be placed anywhere in your program, but it's probably best to put them all in one place, either at the beginning of the file or at the end. A subroutine has the form

sub subroutine_name
{
        print "Well, Hullo there!!!\n Isn’t it absolutely peachy today?\n";
}

 

Note: We do not specify any parameters that we may want to pass to it.

 

The following are three different ways of calling the same subroutine:

 
subroutine_name();             # Call the subroutine (parentheses are needed if the call precedes the definition)
subroutine_name($_);           # Call it with a parameter
subroutine_name(1+2, $_);      # Call it with two parameters
 

In older Perl code you may see subroutine calls prefixed with an & character in front of the function name. It is no longer necessary to specify the & character.

---

Parameters

In the above case the parameters are ignored. When the subroutine is called, parameters are passed as a list in the special @_ list array variable. The following subroutine merely prints out the list that it was called with.

 
sub sub_printargs
{
        print "@_\n";
}
 
sub_printargs("balajee", "kannan");     # Example prints "balajee kannan"
sub_printargs("Balajee", "and", "Kannan"); # Prints "Balajee and Kannan"
 

Just like any other list array the individual elements of @_ can be accessed with the square bracket:

 
sub subroutine_list
{
        print "Your first argument was $_[0]\n";
        print "and $_[1] was your second \n";
}

 

Note: The array variable @_ is different from the $_ scalar variable. Also, the indexed scalars $_[0], $_[1], and so on have nothing to do with the scalar $_, which can also be used without fear of a clash.

 

---

Returning values

The result of a subroutine is always the last thing evaluated, unless an explicit return statement is used.

 

sub max
{
        if ($_[0] > $_[1])
        {
               $_[0];
        }
        else
        {
               $_[1];
        }
}
 
$biggest = max(37, 24);        # what is going on here?
print "$biggest";              # what is the value printed?

 

Note: The subroutine_list subroutine above also returns a value, in this case 1. This is because the last thing that subroutine did was a print statement and the result of a successful print statement is always 1.
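If relying on the last thing evaluated feels too subtle, the same subroutine can be written with named arguments and an explicit return. This is just a sketch; the name max_of is made up so that it does not clash with max above.

```perl
use strict;
use warnings;

sub max_of {
    my ($x, $y) = @_;                    # give the two arguments names
    return $x if $x > $y;                # an explicit return ends the subroutine here
    return $y;
}

my $biggest = max_of(37, 24);
print "$biggest\n";
```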

 

---

Local variables

Similar to C, perl allows you to differentiate between local and global variables. The @_ variable is local to the current subroutine, and so are $_[0], $_[1], $_[2], and so on. It is very useful to be able to limit a variable's scope to a single function. This can be done using the my operator:

 

sub sub_scopetest
{
        my($a, $b);                    # Make local variables
        ($a, $b) = ($_[0], $_[1]);     # Assign values
        $a = $a + $b;
        $b = $b * $a;
        printf "Value of a is %d and b is %d\n", $a, $b;
}
 
$a=23;
$b=45;
sub_scopetest($a, $b);         # Prints 68 and 3060 (the subroutine's own copies)
printf "Value of a is %d and b is %d\n", $a, $b;   # The globals are unchanged: 23 and 45
 

In older Perl code you may see the local operator used to declare local variables instead of the my operator. The local operator also makes a variable local, but it establishes dynamic, rather than static, scoping for the variable. Dynamic scoping means that when a function refers to a non-local variable, the variable's value is determined by searching back through the call stack for the most recent function that declared that variable, and using its value. In contrast, static scoping determines the variable's value by looking for the variable in the enclosing block or function in the program text, or at the top level if there is none. Most modern compiled languages, including C and Java, use static scoping.
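The difference is easiest to see when a subroutine that declares a variable calls another subroutine that refers to a global of the same name. In this sketch (all names are made up), show_level always refers to the global $level; local changes what it sees, and my does not.

```perl
use strict;
use warnings;

our $level = "global";                   # a global (package) variable

sub show_level {
    return $level;                       # refers to the global $level
}

sub with_local {
    local $level = "local";              # dynamic: callees see the new value
    return show_level();
}

sub with_my {
    my $level = "my";                    # static: visible only inside this block
    return show_level();
}

my $via_local = with_local();
my $via_my    = with_my();
my $after     = show_level();            # local's change is undone on return

print "local: $via_local, my: $via_my, afterwards: $after\n";
```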


Listing files

Perl lets you do directory listings with shell-style pattern matching. A simple example is ls.perl, which lists the files in the current directory with the .perl extension:

 
UNIX> perl ls.perl
catinput.perl
hw.perl
ls.perl
match.perl
other.perl
phone.perl
reverse.perl
reverse2.perl
revline.perl
revline2.perl
scalar.perl
simp.perl
sort1.perl
sort2.perl
stdin.perl
sub1.perl
sub2.perl
UNIX> 
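ls.perl is not listed here, so this is a sketch of how such a listing could be done, using the glob function (the angle-bracket form <*.perl> does the same job). What it prints depends on the directory you run it in.

```perl
use strict;
use warnings;

# Expand a shell-style pattern into the list of matching file names.
my @files = sort glob("*.perl");

print "$_\n" foreach @files;
```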


Sorting

Perl provides a handy function called sort for sorting lists. You have already seen the basic version, which performs alphabetical sorting of lists. However, suppose you wanted to sort a list of numbers. The following perl code would not produce the desired result:

 

@a = (1000, 100, 200, 10);
@a = sort @a;
print join(", ", @a);
print "\n";  # output is 10, 100, 1000, 200
 

You can tell Perl to perform numerical sorting using the so-called spaceship operator <=>:

@a = (1000, 100, 200, 10);
@a = sort {$a <=> $b} @a;
print join(", ", @a);
print "\n";  # output is 10, 100, 200, 1000

 

The <=> operator compares the two values $a and $b numerically, returning 1 if $a > $b, -1 if $a < $b, and 0 otherwise. (Don't worry about the old values of $a and $b; they are protected.) The cmp operator does the same thing for alphabetical sorting. For example, suppose I have strings of the form “firstname lastname phone” and I want to sort by lastname, with ties being broken by sorting firstname. The following code would do the trick:

 

@a = ("brad vanderzanden 269-8596", "joe camel 658-5869", "aaron camel 685-3969");
@a = sort { @one = split / /, $a;
            @two = split / /, $b;
            if ($one[1] eq $two[1]) {
                return $one[0] cmp $two[0];
            }            
            else {
                return $one[1] cmp $two[1];
            }
          } @a;
print join("\n", @a);
print "\n";
 

The output is:

 
aaron camel 685-3969
joe camel 658-5869
brad vanderzanden 269-8596


Other constructs that you should know about

Look at other.perl. This contains code for opening a file for append, writing to a pipe, reading from a pipe and sorting numerically. Try it out.

Command line arguments are in the @ARGV array.

You can exit from a program with exit.


Reading perl programs

Perl lets you do lots more than what I've detailed. If you start reading random perl programs, you'll notice the use of defaults (e.g. $_) in procedures, substitutions, foreach clauses, etc. The best thing I can say is to read the manual before trying to read programs. I'm not a huge fan of many of these shortcuts because I find it tends to destroy readability, but you make your own decisions.

More, more, more

There is much more that you can do with perl. I have omitted the object-oriented concepts in perl. There is also support for networking. The best way to learn is to explore. Enjoy.