Scripts and Utilities -- Perl lecture 1

Adapted from notes first prepared by James Plank and then modified by various CS300 instructors

Perl

Perl stands for ``practical extraction and report language.'' It's yet another portable language that is useful for writing quick and dirty programs.

Pluses: Viewed in its best light, Perl is a language that encapsulates the best features of the shell, sed, grep, awk, tr, C and Cobol. If you are familiar with these tools, you can write very powerful Perl programs rather quickly and easily. In particular, you can write programs that duplicate the functionality of shell scripts but are all in one file (indeed, they are all in one language) and thus are much more efficient.

Minuses: Perl is a jumble. It contains many, many features from many languages and programs. It contains many differing constructs that implement the same functionality. For example, there are at least 5 ways to perform a one-line if statement. While this is good for the programmer, it is extremely bad for everyone but the programmer (and bad for the programmer who tries to read his own program in 6 months).

Perl also has made many poor syntactic decisions that lead to both language inconsistencies and awkward constructs. For example, most built-in Perl commands, such as the open command, require commas to be placed between the arguments. However, the print command requires that there be no comma between its first two arguments (the filehandle and the first item to print) when writing to a file. This decision was made so that the Perl interpreter can tell whether output is being written to a file or to stdout, but the inconsistency can cause a programmer grief when the programmer inserts a comma as a matter of habit. Examples of awkward program constructs include the lack of named parameters for subroutines and the requirement that variables be "declared" as scalars, arrays or dictionaries by prefixing them with a $, @, or % sign.
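To make the comma inconsistency concrete, here is a minimal sketch. The in-memory buffer $report and the lexical filehandle $out are my own choices for illustration (they stand in for a real file so the example has nothing to clean up); they are not something the lecture has introduced.

```perl
use strict;
use warnings;

my $report = "";    # in-memory buffer standing in for a real file (an assumption of this sketch)

# open takes commas between all of its arguments.
open(my $out, '>', \$report) or die "cannot open buffer: $!";

# print takes NO comma between the filehandle and the first item to print.
print $out "total: ", 42, "\n";

# With a comma after the handle, perl would instead treat $out as just
# another item to print and send everything to stdout:
#   print $out, "total: ", 42, "\n";

close($out);
print STDOUT $report;    # shows what was written through the handle
```

The habit to build is simple: commas everywhere except right after the filehandle in print.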

These language issues have led to Perl being called a ``write-only'' language. There are other minuses as well, but I won't go into them further. You can discover them for yourself. A colleague of Dr. Plank (Norman Ramsey, at Virginia) responded to an email sent to him about perl, and his response is worth quoting in its entirety:

Perl is a brilliant mistake, of a kind being repeated over and over again today. It follows an unfortunate trend in computing documentation---we no longer explain how our programs or languages work, we just provide encyclopedic compendia of things you can do with them. Users don't have to understand anything; they just pattern match on these enormous documents until they see something that looks sort of like what they want, then they hack on it until it produces the right answer for some inputs, then they declare victory. (This is why some students can take 50 hours to complete a 2-hour homework assignment.) Perl accommodates this style perfectly.

So this is the mistake. The brilliance is in including so many things people want to do, and in a form that is almost familiar, so they can pattern-match more easily. In fact, the familiarity is really an illusion, and if you're going to program in perl, you can hack without understanding, or you can restrict yourself to a subset you can understand. But if you're going to do the latter, why not program in sh, awk, and sed to begin with?

I've never used apl, so I can't compare.

I have---the thing with apl is that although it is *all* weird, it is weird in a very consistent way. You don't have the illusion of familiarity. You do have a huge set of unreadable glyphs, but they come with a very small set of simple rules for decrypting them, the most important of which are the right-to-left scan rule and the fact that user-defined functions take at most two arguments.

I just got to a part in my manual where they advocate using && as an if statement.

I got past that. I gave up on perl the day I learned I couldn't write a function to return an open file handle (e.g., an open socket) (bvz-that shortcoming was subsequently fixed). Now I use it only when forced.

The Bottom Line: The debate as to which is better: Perl, Python or the latest and greatest scripting language is a heated one. Python has better language design. Perl has the most familiar regular expression syntax. I won't get into it, but if you look, you can find all sorts of opinions. Of course, it's best to formulate your own opinions by learning all of them....

Perl help

Unfortunately, perl is huge, so I'll only be able to give you a flavor of it. The perl manual is online in the form of the perl man pages, which are broken up into a number of subsections. Do ``man perl.'' There is a great deal of online help available for Perl. Here is a good starting page: http://perl.about.com/od/perlmanuals/. There are many such sites; I suggest you make use of them when required.

There are two recommended books in case you want more: ``Learning Perl'' by Schwartz, and ``Programming Perl'' by Schwartz and Wall. Both are published by O'Reilly & Associates.

Calling Syntax

Like awk, perl works on a program. You can specify the program as the first argument to perl, or you can put the path of perl's executable on the first line of the perl program, preceded by #!. In the latter case you can make the file executable in Unix by performing a chmod +x. My perl manual says that you can expect perl to be found in /usr/bin/perl, but in our department, it's in /usr/local/bin/perl. So much for portability. There are also differences between versions of perl. If you are uncertain of which version of perl you are using, you can type perl -v and it will print out version information and exit.
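As a rough sketch of the two calling styles, here is a tiny script. The file name hello.perl is hypothetical, and the #! path is the one my manual suggests, so adjust it to wherever perl lives on your machine.

```perl
#!/usr/bin/perl
# hello.perl: a hypothetical file name used only for this sketch.
# Run it as:    perl hello.perl
# or, after:    chmod +x hello.perl
# run it as:    ./hello.perl

# $] is a built-in variable holding the interpreter's version number.
my $greeting = "Hello from perl version " . $];
print "$greeting\n";
```

Either way of invoking it runs the same program; the #! line only matters when the file is executed directly.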

There is not an interactive interpreter for perl but you can simulate interactivity by typing perl -de 42 which invokes perl's debugger. You can then type a single line of perl commands and execute them. That is handy when you have just a few commands to test but isn't so handy if you want to define a test function.

Finally the conventional suffix for a perl file is .perl.

Perl Basics

Rather than leap into how to write perl scripts, we will discuss some variable types, usage, syntax and concepts first. The next lecture will bring it all together.

Comments

Comments begin with the pound sign (#) just as in the other scripts we have studied. However, they can begin anywhere on a line and extend to the newline.


	#this is a comment at the left margin
	print $red;	# and this is a comment starting elsewhere

Note the semicolon after the print statement. Just like in C, all simple statements end with a semicolon.

Variables and Data Types

Perl has 3 basic variable types we will cover: scalars, arrays, and hashes. Variables do not have to be explicitly declared; they are just used and the type is inferred. Nor do variables have to be initialized. If a variable is used prior to being assigned a value, perl gives it the value undef, which is NOT null but is the undefined value. Later we will talk about testing for this value. For now you just need to know that using such a value will not usually cause an error but may not give the desired results.
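Here is a small sketch of what "no error, but maybe not what you wanted" looks like in practice. The variable names are made up, and the defined function is a preview of the test we will talk about later; I have left warnings off on purpose so that the uses of undef stay silent.

```perl
use strict;    # warnings are deliberately left off so the undef uses below stay silent

my $count;                           # never assigned, so it holds undef
my $total  = $count + 5;             # undef acts as 0 in arithmetic
my $label  = "count=" . $count;      # undef acts as the empty string in a string
my $is_set = defined($count) ? "yes" : "no";

print "$total\n";     # prints 5
print "$label\n";     # prints count=
print "$is_set\n";    # prints no
```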

Scalar variables are variables which contain a single value such as the numbers 4, 984, -432.625 or the strings "a","a rat is in the henhouse", "\tquisling\n". Perl does not really care what kind of data a scalar contains because it will change the representation of the data depending on the context of the usage and in fact does not really have traditionally typed data. Scalar variables can be assigned values using the = just like in other programming languages. Some typical assignments might be
```
	$x = 19;
	$str = "this is the house";
	$dog = "collie";
	$comment = "I love my $dog";
	$pi = 3.1415;
	$arc = $pi * $x;
	$dir = `ls`;
	$str2 = "There are $x files in this dir\n";
```
Note that wherever a scalar is used it is always preceded by a dollar sign. The quotation marks work pretty much as you would expect from what you know of the Bourne shell. Double quotes denote a string and substitutions are performed. Single quotes are a string but substitutions are not performed. Backquotes result in the quoted string being evaluated as though it were a command and the result is then used in the assignment. Also note that perl performs arithmetic operations without the use of an exterior utility, very handy.
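The three kinds of quotes can be compared side by side in a short sketch. The echo command is just a harmless stand-in for whatever command you would really run, and the my declarations and pragmas are my additions.

```perl
use strict;
use warnings;

my $dog    = "collie";
my $double = "I love my $dog";    # substitution happens: I love my collie
my $single = 'I love my $dog';    # no substitution: I love my $dog
my $output = `echo hello`;        # runs the command and captures what it prints

print "$double\n";
print "$single\n";
print $output;                    # the captured text still ends with its newline
```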
Arrays in perl are best described as ordered lists. There is no need to preplan for the size of an array, as perl will modify the size to fit the situation. This does not mean that perl will automatically reduce the size of an array, but you can force that to happen. There are 2 aspects to arrays: the elements of the array and the array itself. When referring to the array as a whole, use the @ notation (@myarray). When referring to an element of an array, we are actually referring to a scalar, so use the $ notation ($myarray[2]). The array elements are always accessed using integers. Arrays are 0-indexed, which means that the first element is at index 0 ($myarray[0]).
To make assignments to array elements we can use the same form as for scalars
```
	$x[0] = 19;
	$str[2] = "this is the house";
	$dog[1] = "collie";
	$dog[2] = $dog[1];
```
or we can use lists to perform the assignment to the entire array
```
	@dogs=("collie","sheppard","hound","mutt");
	@nums=(2,4,6,9);
	@another=();	# an empty array
	@dir = `ls`;    # an array of the directory listing
```
The first form is similar to that used in most programming languages. The second may not be familiar to you but is more like that used by other scripting languages. It is very handy to be able to manipulate the elements of the array as a whole, and perl carries the ability to manipulate arrays even farther: assignments can be made from arrays (or lists) to lists containing variables.
```
	($a,$b,$c) = @nums;  # $a=2 $b=4 $c=6
	($str1,$str2) = ("this", "that");
	($x,@fewer) = @nums; # $x=2 @fewer=(4,6,9)
	($t1,$t2,$t3,$t4) = @fewer; # $t1=4 $t2=6 $t3=9 $t4 is undef
	("this","that") = @dogs;  # MAKES NO SENSE
```
As this demonstrates, perl is very accommodating about how it makes the assignments. If there are more elements on the right side than on the left, the extras are ignored. If there are more on the left than on the right, the extras are assigned the undefined value. And perl will not issue any warnings. The last example makes no sense, and perl does reject it: a string constant cannot be assigned to, so a script containing that line will not compile.
A very handy aspect of perl is that you can get the length of the array by using the keyword scalar to cast an array to a scalar:
```
	$len = scalar @nums; # $len=4
	($len) = @nums;  # treats the left hand side as a list and
			# $len=2, the first element of @nums
```
Alternatively you can get the last index of the array using the notation $#arrayname. For example $#nums returns 3.
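The sketch below pulls together the length, the last index, and the growing and shrinking mentioned earlier. Assigning to $#nums to shrink the array is one way to "force" a reduction; I am assuming that is an acceptable reading of the remark above, since the notes do not say which mechanism they have in mind.

```perl
use strict;
use warnings;

my @nums = (2, 4, 6, 9);
my $len  = scalar @nums;    # 4 elements
my $last = $#nums;          # the last index is 3

$nums[6] = 20;              # assigning past the end grows the array
my $grown = scalar @nums;   # 7; the elements at indexes 4 and 5 are undef

$#nums = 1;                 # assigning to $#nums forces the array to shrink
my $shrunk = scalar @nums;  # 2; @nums is now (2, 4)

print "$len $last $grown $shrunk\n";    # prints 4 3 7 2
```

Note that the last index is always one less than the length, because the indexes start at 0.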
Being a language that really tries to cover all the bases when it comes to everyday programming, perl has some functions which manipulate arrays.
- pop returns the "top" element of the array and shortens the array
  @dogs=("collie","sheppard","hound","mutt");
  pop(@dogs); # @dogs=("collie","sheppard","hound")
- push "pushes" a new element onto the "top" of the array
  push(@dogs,"bowser"); # @dogs = ("collie","sheppard","hound","bowser")
- shift works like shift in the Bourne shell, only on any array: it removes the first element and returns it
  shift(@dogs); # @dogs = ("sheppard","hound","bowser")
- unshift does the opposite of shift
  unshift(@dogs,"queeny"); # @dogs = ("queeny","sheppard","hound","bowser")
And these can be used on the right hand side of an assignment as well as by themselves.
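Here is a sketch of these four functions on the right hand side of an assignment, so you can see what each one hands back. The variable names are made up for the example.

```perl
use strict;
use warnings;

my @dogs = ("collie", "sheppard", "hound", "mutt");

my $last   = pop(@dogs);                 # removes and returns "mutt"
my $first  = shift(@dogs);               # removes and returns "collie"
my $after_push    = push(@dogs, "bowser");     # push returns the new length: 3
my $after_unshift = unshift(@dogs, "queeny");  # unshift also returns the new length: 4

print "$last $first $after_push $after_unshift\n";   # prints mutt collie 3 4
print "@dogs\n";                         # prints queeny sheppard hound bowser
```

So pop and shift give you the element they removed, while push and unshift give you the new size of the array.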
Here is a good place to put in a plug for the -w option. This option causes perl to warn you about things which may be legal but, like the fourth example above, could potentially produce erroneous results. When developing a script I recommend that you use -w, but be sure that you take it out of anything that you submit for the labs or actually put into use. The reason for the latter is that if you are running the script in the background, spurious output can result in the script being halted because it is not connected to an output device.
Hashes are associative arrays. They have no real concept of order but instead consist of key-value pairs. The keys and the data stored in the hash can be of any type and are similar to arrays in assignment and accessing. An entire hash is referenced using the percent notation (%myhash) and an element is referenced using the dollar sign ($myhash{"john"}). Because hashes have no order but instead are key-value pairs, when we initialize hashes or elements of hashes we need to consider this.
```
	%dogs=("black","labrador","red","setter","white","poodle");
	$grades{"john"}=45.6;
	%grades = ("john",45.6,"sarah",90.4);
	%grades = 45.6;		# wrong: a hash needs key-value pairs (-w warns about an odd number of elements)
```
Note that when referencing a single element of a hash, we use braces {} instead of brackets.
We can also make assignments from hashes to other types but the results may not be what you hope for when dealing with assignments to arrays. That order thing again.
```
	@doggy=%dogs; 	
	# @doggy = "black","labrador","red","setter","white","poodle"
	# or "red","setter","white","poodle","black","labrador"
	# or "white","poodle", "black","labrador","red","setter"
	$dog=$dogs{"black"}; # $dog="labrador"
	$dogs{"black"}="newfoundland";
```
The point with ordering of the assignment from a hash to an array is that it all depends on the keys and how they "hash". The only thing you can depend on is that the key-value pairs will be together. There may come a time when you want just the keys to a hash. Well, perl can help you there. There is a function, keys, which when given a hash returns a list of the keys in some arbitrary order. When given an empty hash, keys returns an empty list.
```
	@colors=keys(%dogs); # @colors = ("black","white","red") in some
			     # order
```
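Since the order that keys returns is arbitrary, a common trick is to sort the keys before using them. The sketch below does that; sort and foreach are not things we have covered yet, so treat them as previews rather than part of today's material.

```perl
use strict;
use warnings;

my %dogs = ("black", "labrador", "red", "setter", "white", "poodle");

# sort puts the keys into a repeatable (alphabetical) order.
my @colors = sort(keys(%dogs));

foreach my $color (@colors) {
    print "$color => $dogs{$color}\n";    # each key still finds its own value
}

my %empty = ();
my @none  = keys(%empty);                 # an empty hash gives an empty list
print scalar(@none), "\n";                # prints 0
```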

Operators

Perl uses many operators to compare strings and numeric values. Those of you who have done C programming are familiar with many of these and the rest of you have seen similar usages when using sh, sed and awk. For numeric values the following are the most common operators and their purpose

Operator	Purpose	Example
+	addition	$a= 3 + 4; #$a = 7
-	subtraction	$b= 7 - 3; #$b = 4
*	multiplication	$c= 3 * 4; #$c =12
/	floating point division	$d= 10 / 3; #$d = 3.333333...
%	modulo (always integer)	$e= 10 % 3; #$e = 1
		$e= 10.43 % 3.9; #$e = 1
**	exponentiation	$f= 5 ** 3; #$f = 125
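The operators in the table can be tried out directly. This sketch uses my own variable names; the one result worth staring at is the modulo of two non-integers, where both operands are cut down to integers before the remainder is taken.

```perl
use strict;
use warnings;

my $quotient  = 10 / 3;         # floating point division: 3.333...
my $int_mod   = 10 % 3;         # 1
my $float_mod = 10.43 % 3.9;    # operands are truncated to 10 and 3 first, so also 1
my $power     = 5 ** 3;         # 125

print "$quotient\n";
print "$int_mod $float_mod $power\n";    # prints 1 1 125
```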

Also available is the normal range of numerical comparison operators. These operators return 1 if the comparison is true and, if it is false, something which evaluates as 0 when coerced to an integer.

Operator	Purpose
==	equal
!=	not equal
<	less than
>	greater than
<=	less than or equal to
>=	greater than or equal to
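To see what a comparison actually returns, you can store the result in a scalar and print it. In this sketch (variable names are mine) the false result prints as nothing at all, yet behaves as 0 as soon as it is used in arithmetic.

```perl
use strict;
use warnings;

my $yes = (3 < 4);     # true: 1
my $no  = (3 > 4);     # false: an empty string that counts as 0 in arithmetic
my $no_as_number = $no + 0;

print "yes is '$yes'\n";               # prints yes is '1'
print "no is '$no'\n";                 # prints no is ''
print "no + 0 is $no_as_number\n";     # prints no + 0 is 0
```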

Like C, you cannot do numerical comparisons with strings. If you try, perl will convert the string to some numeric value. Leading whitespace is ignored and trailing non-numeric characters are discarded. Thus " 23.928asldf543" becomes 23.928 and "johnson" and "johnson4" become 0. This same thing will happen if you try to use a string any place perl is expecting a numeric value. There is only one string operator which corresponds to any of the arithmetic operators and that is "." which concatenates 2 strings. It does not modify either string; rather, it returns a new string which is the two operands joined together:

 
	$line="This"." is"; # $line = "This is"
	$line=$line." my country"; # $line = "This is my country"
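The string-to-number conversion described above is easy to check for yourself. In this sketch, adding 0 is just a way of forcing perl to expect a number; the variable names are mine, and warnings are left off because with -w perl would report each of these non-numeric strings.

```perl
use strict;    # warnings left off: with -w perl reports each non-numeric string below

my $partial = " 23.928asldf543" + 0;    # leading space skipped, trailing junk dropped
my $none    = "johnson" + 0;            # no leading number at all, so 0
my $tail    = "johnson4" + 0;           # the 4 comes too late to count, so 0

my $added  = "12" + "3";                # + treats both strings as numbers: 15
my $joined = "12" . "3";                # . treats both as strings: 123

print "$partial $none $tail\n";         # prints 23.928 0 0
print "$added $joined\n";               # prints 15 123
```

The last two lines show why perl needs a separate concatenation operator: the operator, not the data, decides whether you get arithmetic or string behavior.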

And, sensibly, the comparison operators for strings are different from those for numeric values; they are in fact string representations of those other operators.

Operator	Purpose
eq	equal
ne	not equal
lt	less than
gt	greater than
le	less than or equal to
ge	greater than or equal to
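Picking the wrong family of comparison operators is a classic perl mistake, so here is a sketch showing the same values compared both ways. The ?: operator works as it does in C, and the variable names are only for illustration.

```perl
use strict;
use warnings;

# Compared as numbers, 10 is not less than 9 ...
my $num_less = (10 < 9)       ? "true" : "false";
# ... but compared as strings, "10" sorts before "9" because "1" comes before "9".
my $str_less = ("10" lt "9")  ? "true" : "false";

# Compared as numbers, 1.0 and 1 are equal ...
my $num_same = ("1.0" == "1") ? "true" : "false";
# ... but compared as strings they are different sequences of characters.
my $str_same = ("1.0" eq "1") ? "true" : "false";

print "$num_less $str_less\n";    # prints false true
print "$num_same $str_same\n";    # prints true false
```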

This is about as far as I want to go today. We will cover fancy string stuff and give some examples of the things we have and will have covered, a "putting it all together" type thing, next week.