Perl lecture 2


Boolean and Undef

In Perl, undefined variables have the special value undef. This can be used in expressions, etc, and often makes life convenient. When you try to use undef as a string, you get "", and when you try to use it as a number, you get zero.

Boolean expressions in Perl are kind of odd. Undef is false, as is the null string, and anything that casts to a string containing a single zero. Everything else is true. Therefore, all numbers but zero are true, as are all strings but "" and "0".
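
The rules above can be checked with a short script. This is only a sketch: the helper name truth is made up for the illustration, and it simply reports how Perl treats each value in a boolean test.

```perl
use strict;
use warnings;

# truth is a hypothetical helper: it reports how its argument
# behaves when used as a boolean.
sub truth { return $_[0] ? "true" : "false"; }

print 'undef  -> ', truth(undef), "\n";   # false
print '""     -> ', truth(""),    "\n";   # false
print '"0"    -> ', truth("0"),   "\n";   # false
print '0      -> ', truth(0),     "\n";   # false
print '0.0    -> ', truth(0.0),   "\n";   # false: the number 0.0 is zero
print '"0.0"  -> ', truth("0.0"), "\n";   # true: the string is not "0"
print '"00"   -> ', truth("00"),  "\n";   # true: the string is not "0"
print '-1     -> ', truth(-1),    "\n";   # true
```

The two surprising cases are the strings "0.0" and "00": they are numerically zero, but as strings they are neither "" nor "0", so they count as true.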



Arrays and Associative Arrays

We talked about arrays in the last lecture; now we will look at some operations on them. For example, sort1.perl sorts standard input by reading it into an array and printing the sorted array.

UNIX> cat sort1.perl
$i = 0;
while ($l = <STDIN>) { $a[$i++] = $l; }
print sort(@a);
UNIX>
UNIX> cat input
D
F
C
B
E
A
UNIX> perl sort1.perl < input
A
B
C
D
E
F
UNIX> 

You can make this simpler. In a list context, <STDIN> returns all of the remaining input lines as an array, so you can pass it straight to sort and print the result. That is,

UNIX> cat sort2.perl
print sort(<STDIN>);

UNIX> perl sort2.perl < input
A
B
C
D
E
F
UNIX> 

Be careful with this kind of usage: the whole input is read into memory at once, so a huge input can cause memory problems.

A very useful function in Perl is split, which splits a string into pieces and returns them as an array. The function expects a regular expression describing the separator, and works on the $_ variable unless a string is given.

For example,

$info = "Balajee, Patrick, Michael, Josh, Leaf";
 
@personal = split(/,\s*/, $info);
 
This is the same as saying
@personal = ("Balajee", "Patrick", "Michael", "Josh", "Leaf");

(Splitting on /,/ alone would leave a leading space on every name after the first.)

If the string is in the $_ variable, the call can be shortened to
@personal = split(/,\s*/);

Note: A word can be split into characters, a sentence split into words and a paragraph split into sentences, depending on its usefulness.
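
The note above can be tried directly. The following is a small sketch with made-up sample strings; it splits a word into characters with an empty pattern and a sentence into words on whitespace, and uses join only to make the pieces visible.

```perl
use strict;
use warnings;

my $word     = "Perl";
my $sentence = "The quick brown fox";

# An empty pattern splits between every character.
my @chars = split(//, $word);
print join("|", @chars), "\n";       # P|e|r|l

# Splitting on runs of whitespace yields the words.
my @words = split(/\s+/, $sentence);
print join("|", @words), "\n";       # The|quick|brown|fox
print scalar(@words), " words\n";    # 4 words
```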

Associative Arrays

Like awk and Python, Perl has associative arrays. To access a single value, precede the array name with a dollar sign and enclose the key in curly braces ($array_name{"key"}). To refer to the whole array, precede the name with a percent sign (%array_name). The keys() function returns an array of the keys of the associative array, and the values() function returns an array of its values; both return their results in no particular order. So, for example, suppose you have a list of first names, last names, and phone numbers, and you want to print it sorted by last name in the format: last name, first name, phone number. Then you can do something like phone.perl. Note that Perl does support printf.

UNIX> cat phone.perl
while ($l = <STDIN>) {
  @a = split(/\s+/, $l);
  $fn{$a[1]} = $a[0];
  $pn{$a[1]} = $a[2];
}

foreach $i (sort(keys(%fn))) {
  printf "%10s, %10s, %30s\n", $i, $fn{$i}, $pn{$i};
}

UNIX> cat input3
Peyton Manning 423-vol-qb4u
Phil Fulmer 423-vol-head
Pat Summitt 423-lvl-head
Joe Johnson 423-vol-prez
Jim Plank 423-vol-peon
UNIX> perl phone.perl < input3
    Fulmer,       Phil,                   423-vol-head
   Johnson,        Joe,                   423-vol-prez
   Manning,     Peyton,                   423-vol-qb4u
     Plank,        Jim,                   423-vol-peon
   Summitt,        Pat,                   423-lvl-head
UNIX> 
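
As a smaller illustration of keys() and values(), here is a sketch that builds an associative array in the program itself instead of reading it from a file. The entries are borrowed from input3; the => used in the initializer is just a more readable comma.

```perl
use strict;
use warnings;

# A small phone book keyed by last name.
my %pn = ("Plank" => "423-vol-peon", "Manning" => "423-vol-qb4u");
$pn{"Fulmer"} = "423-vol-head";          # add one more entry

# keys() comes back in no particular order, so sort before printing.
foreach my $name (sort(keys(%pn))) {
    print "$name => $pn{$name}\n";
}

# values() returns just the stored values.
my @numbers = sort(values(%pn));
print scalar(@numbers), " numbers, first is $numbers[0]\n";
```

The loop prints Fulmer, Manning, and Plank in that order, and the last line reports 3 numbers with 423-vol-head first.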



Regular Expression Substitution

You can modify a string variable by applying a sed-like substitution. The operator is again "=~", and the substitution is specified as
s/pattern/replacement/
Regular expressions are enclosed in slashes, and work pretty much like sed/awk. There are a few differences. For example, see match.perl:
UNIX> cat match.perl
$i = "Jim";
$j = "JjJjJjJj";
$k = "Boom Boom, out go the lights!";

$m = $i =~ /Jim/;    print "$m\n";
$m = $i =~ /J/;      print "$m\n";
$m = $i =~ /j/;      print "$m\n";
$m = $i =~ /j/i;     print "$m\n";
$m = $i =~ /\w/;     print "$m\n";
$m = $i =~ /\W/;     print "$m\n";

$m = $j =~ /j*/;     print "$m\n";
$m = $j =~ /j+/;     print "$m\n";
$m = $j =~ /j?/;     print "$m\n";
$m = $j =~ /j{2}/;   print "$m\n";
$m = $j =~ /j{2}/i;  print "$m\n";
$m = $j =~ /(Jj)+/;  print "$m\n";

$m = $k =~ /Jim|Boom/;          print "$m\n";
$m = $k =~ /(Boom){2}/;         print "$m\n";
$m = $k =~ /(Boom ){2}/;        print "$m\n";
$m = $k =~ /(Boom\W){2}/;       print "$m\n";
$m = $k =~ /\bBoom\b/;          print "$m\n";
$m = $k =~ /\bBoom.*the\b/;     print "$m\n";
$m = $k =~ /\Bgo\B/;            print "$m\n";
$m = $k =~ /\Bgh\B/;            print "$m\n";
UNIX>

$i =~ /Jim/;                   True
$i =~ /J/;                     True
$i =~ /j/;                     False
$i =~ /j/i;                    True
$i =~ /\w/;                    True
$i =~ /\W/;                    False
 
$j =~ /j*/;                    True -- matches anything
$j =~ /j+/;                    True -- matches the first 'j'
$j =~ /j?/;                    True -- matches the empty string at the start
$j =~ /j{2}/;                  False
$j =~ /j{2}/i;                 True -- ignores case
$j =~ /(Jj)+/;                 True -- matches the entire string
 
$k =~ /Jim|Boom/;              True -- matches Boom
$k =~ /(Boom){2}/;             False -- there's a space between Booms
$k =~ /(Boom ){2}/;            False -- the second Boom ends with a comma
$k =~ /(Boom\W){2}/;           True 
$k =~ /\bBoom\b/;              True -- shows word delimiters
$k =~ /\bBoom.*the\b/;         True 
$k =~ /\Bgo\B/;                False -- false, because "go" is a word
$k =~ /\Bgh\B/;                True -- the "gh" is in the middle of "lights"
Note that when you run match.perl, the true results are printed as 1 and the false results as null strings, not zeros.

More examples, this time of substitution:
UNIX> cat sub1.perl
$j = "Jim Plank";

$j =~ s/ .*/i Hendrix/;
print "$j\n";

$j =~ s/i/I/g;
print "$j\n";

$j =~ s/\b\w*\b/Dikembe/;
print "$j\n";

$j =~ s/(\b\w*\b)/Jimi "\1"/;
print "$j\n";
UNIX>
$j =~ s/ .*/i Hendrix/;           Makes 'Jimi Hendrix'
$j =~ s/i/I/g;                    Makes 'JImI HendrIx'
$j =~ s/\b\w*\b/Dikembe/;         Makes 'Dikembe HendrIx'
$j =~ s/(\b\w*\b)/Jimi "\1"/;     Makes 'Jimi "Dikembe" HendrIx'

In addition, the operator "!~" is used for spotting a non-match.

$sentence = "The quick brown fox";
$sentence !~ /the/
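The above expression is true because the string "the" does not appear in $sentence (the match is case sensitive, so "The" does not count). Unfortunately, you cannot use a substitution the way you would use, say, addition, to compute a new value and assign it elsewhere:
$k = $j =~ s/Jim/Jimi/;
The substitution modifies $j in place, and $k receives the number of substitutions made rather than the new string.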
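
A short sketch makes the behavior of that assignment concrete. The variable names $orig and $copy are made up for the example; the copy-then-substitute form in the second half is one common way to get a modified string while keeping the original.

```perl
use strict;
use warnings;

my $j = "Jim Plank";

# s/// edits $j in place; the expression itself yields the number of
# substitutions made, not the new string.
my $k = ($j =~ s/Jim/Jimi/);
print "k = $k\n";    # k = 1
print "j = $j\n";    # j = Jimi Plank

# To keep the original, copy it first and substitute on the copy.
my $orig = "Jim Plank";
(my $copy = $orig) =~ s/Jim/Jimi/;
print "orig = $orig, copy = $copy\n";   # orig = Jim Plank, copy = Jimi Plank
```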

You'll note in the last substitution of sub1.perl, I used the parentheses as memory. This is analogous to \( and \) in sed, except you can use the memory even in the first pattern. For example, see sub2.perl:

UNIX>cat sub2.perl
$j = "Jim Plank";
$j =~ s/(\w*) (\w*)/\1 \1 \2/;
print "$j\n";

$i = "I am the the man";
$i =~ s/(\b\w+\b) \1/\1/;
print "$i\n";

UNIX>
$j = "Jim Plank";
$j =~ s/(\w*) (\w*)/\1 \1 \2/;      Makes 'Jim Jim Plank'

$i = "I am the the man";
$i =~ s/(\b\w+\b) \1/\1/;           Makes 'I am the man' 
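
The remembered pieces are also available outside the pattern as the variables $1, $2, and so on. The sketch below (sample data made up) reads them after a match and uses them in the replacement half of a substitution, where they play the same role as \1 and \2.

```perl
use strict;
use warnings;

my $name = "Jim Plank";

# After a successful match, $1 and $2 hold what the first and second
# pairs of parentheses matched.
if ($name =~ /(\w+) (\w+)/) {
    print "first: $1, last: $2\n";      # first: Jim, last: Plank
}

# In the replacement half of s///, $1 and $2 stand for the same pieces.
(my $swapped = $name) =~ s/(\w+) (\w+)/$2, $1/;
print "$swapped\n";                     # Plank, Jim
```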

Translation

In addition to the regular substitution operator, Perl has a translation operator given by tr. The tr function allows character-by-character translation. The following expression replaces each a with e, each b with d, and each c with f in the variable $sentence. The expression returns the number of substitutions made.

$sentence =~ tr/abc/edf/
Most of the special RE codes do not apply in the tr function. For example, the statement here counts the number of asterisks in the $sentence variable and stores that in the $count variable.

$count = ($sentence =~ tr/*/*/);
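
Both uses of tr can be seen in a few lines. This is a sketch with made-up strings: the first call really changes the string, the second only counts.

```perl
use strict;
use warnings;

my $sentence = "abcabc xyz";

# Replace a with e, b with d, c with f; tr returns how many
# characters it translated.
my $changed = ($sentence =~ tr/abc/edf/);
print "$sentence ($changed characters changed)\n";   # edfedf xyz (6 characters changed)

# Translating * to itself leaves the string alone but still counts.
my $banner = "*** warning ***";
my $count  = ($banner =~ tr/*/*/);
print "$count asterisks\n";                          # 6 asterisks
```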

Fancy string stuff

Inside double quotes, Perl recognizes a number of escape constructs beyond those you might be used to from C. These can make your life simpler or more confusing depending on your point of view. If they make things confusing, just forget I mentioned them. Besides the 'normal' \n for newline and \t for tab, we have:

Construct	Meaning
\cC	Any "control" character (here, CTRL-C)
\\	Backslash
\"	Double quote
\l	Lowercase next letter
\L	Lowercase all following letters until \E
\u	Uppercase next letter
\U	Uppercase all following letters until \E
\Q	Backslash-quote all non-alphanumerics until \E
\E	Terminate \L, \U, or \Q

For example,

UNIX>cat strings.perl
print "This is a demonstration of \"double quotes\" and \\ (backslashes)\n";
print "\lThis is about \LCHANGING CASE\E \Uof text\E inside \udouble quotes\n";
print "\QAnd this is the \"backslash-quote\" option; which is weird.\n";
UNIX>perl5 strings.perl
This is a demonstration of "double quotes" and \ (backslashes)
this is about changing case OF TEXT inside Double quotes
And\ this\ is\ the\ \"backslash\-quote\"\ option\;\ which\ is\ weird\.\
UNIX> 
 




Standard Input and Files

In Perl, files are accessed using "filehandles" (their term). Three are predefined: STDIN, STDOUT, and STDERR. To read a line from a filehandle, enclose it in angle brackets, as in <STDIN>. (Filehandles have no special character preceding them, as scalars, arrays, and hashes do, so it is recommended that you always use all caps for filehandles to keep them from clashing with any future reserved words.) At end of file, the read returns undef. Therefore, stdin.perl copies standard input to standard output:

UNIX> cat stdin.perl
while ($k = <STDIN>) { print $k; }
UNIX>

UNIX> cat input
D
F
C
B
E
A
UNIX> perl stdin.perl < input
D
F
C
B
E
A
UNIX> 
Note that you get the newline with the string. To get rid of the newline, use the chop() procedure, which modifies its argument to get rid of the last character.
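
Because chop() removes the last character whatever it is, calling it on a string that has no trailing newline eats a real character. The sketch below shows this, and, for comparison, Perl 5's chomp(), which removes a trailing newline only.

```perl
use strict;
use warnings;

my $line = "hello\n";
chop($line);                 # removes the trailing "\n"
print "[$line]\n";           # [hello]

chop($line);                 # no newline left, so this removes the "o"
print "[$line]\n";           # [hell]

my $safe = "hello";
chomp($safe);                # nothing to remove, so the string is unchanged
print "[$safe]\n";           # [hello]
```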

You can open a file for input and then use it like STDIN, above. Moreover, you can open a file for output and print to it.

The open function opens a file for input (i.e. for reading). The first parameter is the filehandle, which allows Perl to refer to the file in future. The second parameter is an expression denoting the filename. If the filename was given in quotes then it is taken literally without shell expansion. So the expression '~/notes/todolist' will not be interpreted successfully. If you want to force shell expansion then use angled brackets: that is, use <~/notes/todolist> instead. The close function tells Perl to finish with that file.

In addition, the open statement can also specify a file for output and for appending as well as for input. To do this, prefix the filename with a ">" for output and a ">>" for appending:

open(INFO, $file);     # Open for input
open(INFO, ">$file");  # Open for output
open(INFO, ">>$file"); # Open for appending
open(INFO, "<$file");  # Also open for input

To print something to a file that has already been opened, use the print statement with an extra parameter. For example,

# Writes line to file specified by filehandle INFO
print INFO "This line goes to the file.\n";      
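
Putting the three modes together, here is a sketch that writes a file, appends to it, and reads it back. The file name demo_output.txt is made up for the example, and the "or die" checks are an addition that reports a failed open instead of silently carrying on.

```perl
use strict;
use warnings;

my $file = "demo_output.txt";    # hypothetical scratch file

# ">" creates the file (or empties an existing one).
open(INFO, ">$file") or die "Cannot open $file for output: $!";
print INFO "first line\n";
close(INFO);

# ">>" keeps what is there and adds to the end.
open(INFO, ">>$file") or die "Cannot open $file for appending: $!";
print INFO "second line\n";
close(INFO);

# No prefix (or "<") opens for input.
open(INFO, $file) or die "Cannot open $file for input: $!";
my @lines = <INFO>;              # read every line at once
close(INFO);

print @lines;                    # prints both lines
unlink($file);                   # remove the scratch file
```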

Finally, open can be used to access the standard input (usually the keyboard) and standard output (usually the screen) respectively:

open(INFO, '-');       # Open standard input
open(INFO, '>-');      # Open standard output

For example, catinput.perl copies the file input to the file output. It also shows use of chop():

UNIX> cat catinput.perl
open(F, "input");
open(FOUT, ">output");
while ($k = <F>) { 
  chop($k);            # strip the trailing newline
  print FOUT "$k\n";   # and put it back when writing
}
close(F);
close(FOUT);
UNIX>
UNIX> perl catinput.perl
UNIX> cat output
D
F
C
B
E
A
UNIX> 



Subroutines

Perl allows the user to define their own functions, called subroutines. They may be placed anywhere in your program, but it's probably best to put them all in one place, either at the beginning of the file or at the end. A subroutine has the form
sub subroutine_name
{
        print "Well, Hello there!!!\n Isn't it absolutely peachy today?\n";
}

Note: The definition does not list any parameters. Whatever arguments you pass in a call are delivered to the subroutine in the array @_.

A subroutine is called with an & character in front of its name. The following are three different ways of calling the same subroutine:

&subroutine_name;              # Call the subroutine
&subroutine_name($_);          # Call it with a parameter
&subroutine_name(1+2, $_);     # Call it with two parameters
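
To see what a subroutine actually receives in each case, a sketch like the following can help. The name describe_args is made up; the subroutine just reports the contents of @_.

```perl
use strict;
use warnings;

# describe_args is a hypothetical subroutine that reports its arguments.
sub describe_args {
    my $n    = scalar(@_);                  # number of arguments
    my $text = "got $n argument(s): @_";    # @_ interpolates with spaces
    print "$text\n";
    return $text;
}

$_ = "default";
&describe_args();            # got 0 argument(s):
&describe_args($_);          # got 1 argument(s): default
&describe_args(1+2, $_);     # got 2 argument(s): 3 default
```

Note that expressions such as 1+2 are evaluated before the call, so the subroutine sees 3.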



Returning Values

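The result of a subroutine is always the last thing evaluated. Its arguments are available in the array @_, so $_[0] is the first argument and $_[1] is the second.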

sub max
{
        if ($_[0] > $_[1])
        {
               $_[0];
        }
        else
        {
               $_[1];
        }
}


$biggest = &max(37, 24);       # calling subroutine
print "$biggest\n";            # prints 37
Note: By default, a subroutine returns the value of the last expression it evaluated.



Local Variables

Similar to C, Perl lets you distinguish between local and global variables. The @_ variable is local to the current subroutine, and so are $_[0], $_[1], $_[2], and so on. It is very useful to be able to limit a variable's scope to a single function; the local function does this for the variables you name:
sub sub_scopetest
{
        local($a, $b);                 # Make local variables
        ($a, $b) = ($_[0], $_[1]);     # Assign values
        $a = $a + $b;
        $b = $b * $a;
        printf "Value of a is %d and b is %d\n", $a, $b;   # the local a and b
}
 
$a=23;
$b=45;
&sub_scopetest($a, $b);               # the global $a and $b are not changed
printf "Value of a is %d and b is %d\n", $a, $b;           # still 23 and 45
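
Perl 5 also has my, which dir_tr.perl further down uses. The two are not the same: local gives a global variable a temporary value that called subroutines can see, while my creates a private variable that they cannot. The sketch below contrasts them; all of the names are made up, and our is used only to declare the global under use strict.

```perl
use strict;
use warnings;

our $x = "global";

sub show { return $x; }          # always reads the global $x

sub with_local {
    local $x = "localized";      # temporarily changes the global $x
    return show();               # show() sees the temporary value
}

sub with_my {
    my $x = "lexical";           # a new, private $x; the global is untouched
    return show();               # show() still sees the global
}

print with_local(), "\n";        # localized
print with_my(), "\n";           # global
print "$x\n";                    # global (restored after with_local returned)
```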



Directory Listings

Perl lets you do directory listings with shell-style pattern matching. A simple example is ls.perl, which lists the files in the current directory with the .perl extension:
UNIX>cat ls.perl
foreach $i (<*.perl>) { print "$i\n"; }
UNIX>
UNIX> perl ls.perl
catinput.perl
hw.perl
ls.perl
match.perl
other.perl
phone.perl
reverse.perl
reverse2.perl
revline.perl
revline2.perl
scalar.perl
simp.perl
sort1.perl
sort2.perl
stdin.perl
sub1.perl
sub2.perl
We can use File::Find, a Perl module, to visit all the files in a directory and its subdirectories. This module works on Unix and Windows machines as well as Mac OS machines, but Mac users will want to consult the File::Find documentation for a few issues that arise there and their workarounds.

For example, dir_tr.perl traverses a directory tree. If the directory you point it at contains subdirectories, the script processes all the files in those subdirectories, and in their subdirectories, too. For each file whose name ends in .html, it reads the file, applies a substitution to every line, and writes the result back over the original.

UNIX> cat dir_tr.perl 
#!/usr/bin/perl
use File::Find;
use strict;
my $directory = "/home/yli/cs365";
find(\&process, $directory);
sub process{
        my @outlines; #data we are going to output
        my $line;    #data we are reading line by line

        #print "processing $_/$File::Find::name\n";
        #only parse files that end in .html
        if($File::Find::name=~/\.html$/){
                
                open(FILE, $File::Find::name)or
                die "Cannot open file: $!";

                print"\n".$File::Find::name . "\n";
                while($line=<FILE>){
                        $line =~ s/<TAG([^>]*)>//i;   # TAG is a placeholder: the tag name is missing from these notes
                        push(@outlines, $line);
                }
                close FILE;

                open(OUTFILE, ">$File::Find::name") or
                die "Cannot open file: $!";

                print(OUTFILE @outlines);
                close(OUTFILE);
        
                undef(@outlines);
        }
}
UNIX> pwd                                                               
/home/yli/cs365
UNIX> ./dir_tr.perl                                                     
/home/yli/cs365/lecture/lab3/index.html

/home/yli/cs365/lecture/lab2/lab2.html

/home/yli/cs365/lecture/lab1/lab1.html



Misc

Look at other.perl. This contains code for opening a file for append, writing to a pipe, reading from a pipe and sorting numerically. Try it out.

UNIX>cat other.perl
# Opening a file for append:

open(F, ">>logfile");
print F "Writing to the logfile\n";

# Writing to a pipe  -- this will sort input2 and put the result in output2

open(F2, "|sort > output2");
open(IN, "input2");
print F2 (<IN>);

# Reading from a pipe -- this will sort input3 and put the result in output3

open(F3, "sort input3|");
open(OUT, ">output3");
print OUT (<F3>);

# This sorts input4 numerically and puts the output into output4

open(F4, "input4");
open(OUT4, ">output4");
print OUT4 sort { $a <=> $b } <F4>;
UNIX>

The last part needs some explanation. The <=> operator, called the spaceship operator in the Perl books, compares the two values $a and $b numerically, returning 1 if $a > $b, -1 if $a < $b, and 0 otherwise. (Don't worry about any existing values of $a and $b; they are protected.) This lets sort order its input numerically instead of lexicographically.
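
The difference is easy to see on a small array. This sketch uses made-up numbers and sorts them three ways: with the default string comparison, with the spaceship operator, and with $a and $b swapped to get descending order.

```perl
use strict;
use warnings;

my @nums = (10, 9, 100, 1);

my @as_strings = sort @nums;                  # default: compares as strings
my @as_numbers = sort { $a <=> $b } @nums;    # spaceship: compares as numbers
my @descending = sort { $b <=> $a } @nums;    # swap $a and $b to reverse

print "@as_strings\n";   # 1 10 100 9
print "@as_numbers\n";   # 1 9 10 100
print "@descending\n";   # 100 10 9 1
```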

Command line arguments are in the @ARGV array.
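
As a quick illustration, the following sketch lists whatever arguments it was given. The helper name list_args and the script name args.perl are made up. Unlike C's argv, $ARGV[0] is the first argument, not the name of the program.

```perl
use strict;
use warnings;

# list_args is a hypothetical helper: it labels each value it is given
# with its position.
sub list_args {
    my @out;
    for (my $i = 0; $i < @_; $i++) {
        push(@out, "argument $i is $_[$i]");
    }
    return @out;
}

print scalar(@ARGV), " argument(s)\n";
print "$_\n" foreach list_args(@ARGV);
```

Run as "perl args.perl foo bar", this prints "2 argument(s)", then "argument 0 is foo" and "argument 1 is bar".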

You can exit from a program with exit.



Exercises

Now it is time to put this knowledge to work in a more realistic situation.  Suppose you have a comma-separated value (CSV) file describing some music albums.  Each line holds the artist, album, and song as strings delimited by quotation marks.  You want to sort this information in a hurry, so you decide to throw together a Perl script.  This Perl script takes two command-line arguments: the file to parse and the field to sort on.

Write a Perl script (musicParser.pl) that takes an input CSV file (Music_Listing.csv) and a command-line argument naming the field to sort on.  Based on these parameters, the script prints the values of the requested field, sorted, to standard output.  Be sure to:

- Error check the second command line argument to ensure it is "artist", "album", or "song" (don't worry about checking the validity of the input file).
  If not one of these values, have your script exit.

- Utilize arrays to store the fields while parsing the input file.

- Truncate the line feed that comes at the end of each line (hint: chop() function).

- Close the input file when done.

Here is a sample run for artists:

UNIX>perl musicParser.pl Music_Listing.csv artist
"Blind Faith"
"Cream"
"Derek and the Dominos"
"John Mayall & The Blues Breakers"

 

For songs:

UNIX>perl musicParser.pl Music_Listing.csv song
"Acoustic Jam"
"All Your Love"
"Another Man"
"Anyday"
"Bell Bottom Blues"
"Blue Condition"
"Can't Find My Way Home [Electric Version]"
"Can't Find My Way Home"

...
 

And for a bad command line parameter for the search type:

UNIX>perl musicParser.pl Music_Listing.csv foo

Error. Proper Usage: musicParser.pl INPUTFILE.csv artist|album|song
 

Be sure to comment your code sufficiently.  If you want an extra challenge, strip the quotes that delineate each field in the file before you output the information.

Music Trivia (no credit value): What factor do these albums have in common?

Due date: Wed, April 13th at 11:59pm