CS360 Lecture notes -- A Unix Programmer

  • Jian Huang
  • CS360
  • Directory: ~huangj/cs360/notes
  • Lecture notes: http://www.cs.utk.edu/~huangj/cs360/360/notes/programmer.html
    A few disjoint topics for the final lectures in the course.

    How to be a Unix Programmer

    Different people have different understandings of the term "Unix programmer". To many, it simply means a programmer who programs on a Unix operating system. This really does not mean much, since in most cases we want to write portable code in standard C or C++ that has little to do with Unix-specific system calls.

    To the professionals, being a Unix programmer means something quite different and goes well beyond just programming. It would be better to use the term "Unix engineer" to emphasize that one should be doing engineering while doing programming. But that term reads a little funny, so let's stay with the original wording of Unix programmer. Just remember this important distinction: knowing all about system calls does not, by itself, make you a Unix programmer.

    From an engineering point of view, true Unix programmers are a bunch that mostly aim to develop software that is either a basic tool or can be reused by other people via an API. To cover all aspects of being a Unix programmer, one should read the book "The Art of Unix Programming", by Eric Raymond. The book also has an online edition. Here, to wrap up our course, I will go over a few of the most important points in Raymond's book and try to open up a few new doors for you to explore.

    There are four important traits of Unix programmers. First, they use very high-level languages. They are the ones that devised C to replace assembly. They are also the ones that developed Perl and similar languages to replace C. Second, they mostly apply the methodology of data-driven programming, and largely detest object-oriented programming. When doing data-driven programming, one clearly distinguishes code from the data structures on which it acts, and designs both so that one can make changes to the logic of the program by editing not the code but the data structure. Data-driven (DD) and object-oriented (OO) programming differ in that:

     a. in DD, the data is not merely the state of some object,
        but actually defines the control flow of the program
     b. the primary concern in OO is encapsulation, while the primary
        concern in DD is writing as little fixed code as possible
    

    In most cases, the is-a and has-a relationships introduced in OO bring with them a large amount of complexity that we normally don't want to see, and really don't have to see when given a well-defined API. This is very important when building complex systems that ought to be reliable, extensible, and easy to use. Operating systems and development environments happen to fall in this category.

    Third, they use code generators whenever possible. A good example is using lex and yacc to develop a parser without writing the parsing code itself by hand. Fourth, they use domain-specific minilanguages. One of the most consistent results from large-scale studies of error patterns in software is that programmer error rates in defects per hundred lines are largely independent of the language in which the programmers are coding. Higher-level languages, which allow you to get more done in fewer lines, mean fewer bugs as well. Unix has a long tradition of hosting little languages specialized for a particular application domain, languages that can enable you to drastically reduce the line count of your programs. Domain-specific language examples include the numerous Unix typesetting languages (troff, eqn, tbl, pic, grap), shell utilities (awk, sed, dc, bc), and software development tools (make, yacc, lex).

    Domain-specific little languages are an extremely powerful design idea. They allow you to push complexity one level upward by defining your own higher-level language to specify the appropriate methods, rules, and algorithms for the task. This leads to a significant reduction of global complexity relative to a design that uses hardwired lower-level code for the same ends. The subject of how to design a good minilanguage and, most importantly, how to realize that you do need one is beyond the scope of CS360. You should go to Raymond's book to get the full dose of enlightenment. In short, whenever the domain primitives in your application area are simple and stereotyped, but the ways in which users are likely to want to apply them are fluid and varying, a minilanguage would be beneficial. Besides the examples above, please also realize that PostScript is one too: every *.ps file you generate for a printer really is a minilanguage program (containing lots of data, of course).


    Sanity Checks

    Very often we hear "hey, why don't you run some sanity checks on your code?", and then we see a puzzled face. Running sanity checks is a tiny bit (seriously) of being a professional software engineer. In general, good development practice should include the following:

       - don't rely on proprietary code unless there are some
         overwhelmingly good reasons
       - write your code to be portable, using GNU autotools for instance
       - test your code thoroughly before release (should this even be a question?)
       - sanity-check your code 
    

    By sanity-check, we mean using every tool available that has a reasonable chance of catching errors a human would be prone to overlook:

               - "gcc -Wall"
               - run tools that look for memory leaks and
                 other runtime errors (Electric Fence, Valgrind, checkergcc, mcheck, mpr, etc.)
               - for Python projects, run PyChecker
                      sourceforge.net/projects/pychecker
               - for Perl, check your code with perl -c (maybe -T, if applicable)
                      use perl -w and "use strict" religiously
    

    To give you a taste of what a sanity-check tool looks like, let's take a brief look at Electric Fence. On any Linux box, you can type man libefence and see a very well written description of Electric Fence. It is used to look for the following kinds of problems in malloc'ed (heap) memory; it does not check globals or the stack:

       memory overruns: accessing memory after the end of a malloc'ed buffer
       memory underruns: accessing memory before the start of a malloc'ed buffer
       other: e.g. accessing a buffer that is already free'ed

    Memory leaks (malloc'ed memory that is never free'ed) are not reported by Electric Fence; use a tool such as Valgrind for those.
    

    Electric Fence works by replacing the normal malloc and free calls with special versions. Electric Fence uses the virtual memory hardware of your computer to place an inaccessible memory page immediately after (or before, at the user's option) each memory allocation. When software reads or writes this inaccessible page, the hardware issues a segmentation fault, stopping the program at the offending instruction. It is then trivial to find the erroneous statement using your favorite debugger. In a similar manner, memory that has been released by free() is made inaccessible, and any code that touches it will get a segmentation fault.

    One of the major problems with using Electric Fence is resource consumption. As we know, most modern hardware allocates and protects memory at the granularity of a page (say, 4KB or 8KB). Electric Fence basically makes each allocation reside on its own page(s) and positions the buffer so that an overrun (or underrun) lands on a separate, inaccessible page, thus causing a memory error. Using Electric Fence is very simple, just do:

          gcc -o myexecutable .... -lefence
    

    Then, running myexecutable stops with a segmentation fault at the first such memory error, if there is one. Running it under gdb shows exactly where the error occurred.

    For example, broken.c is a program that appears to run fine even though it overruns its buffer. I expect every one of you to know exactly why that is. Let's then take a look at how we run the sanity check:

    UNIX> cat broken.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    int main()
    {
      char * s;
      s = malloc(5);
      strcpy(s,"what hu");
      printf("%s\n", s);
      free(s);
    }
    UNIX> gcc broken.c 
    UNIX> a.out
    what hu
    UNIX> gcc broken.c -lefence
    UNIX> a.out
    
      Electric Fence 2.2.0 Copyright (C) 1987-1999 Bruce Perens 
    Segmentation fault
    UNIX> gcc -g broken.c -lefence
    UNIX> gdb a.out
    GNU gdb Red Hat Linux (5.2-2)
    Copyright 2002 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "i386-redhat-linux"...
    (gdb) r
    Starting program: /home/huangj/a.out 
    [New Thread 1024 (LWP 30991)]
    
      Electric Fence 2.2.0 Copyright (C) 1987-1999 Bruce Perens 
    
    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 1024 (LWP 30991)]
    0x420807b6 in strcpy () from /lib/i686/libc.so.6
    (gdb) where
    #0  0x420807b6 in strcpy () from /lib/i686/libc.so.6
    #1  0x080485f8 in main () at broken1.c:9
    #2  0x42017589 in __libc_start_main () from /lib/i686/libc.so.6
    

    Open Source

    Unix programmers know Open Source well. As we already know, the saga of Unix started as an open source effort in the 70's, and continued to succeed whenever open source was upheld and stumbled whenever proprietary practices kicked in. Of course, the term "Open Source" itself was not coined until the end of the last century.

    As a modern Unix programmer, you should at least understand some basic points about open source, mostly on the issue of licensing software. When it comes to licensing, there are rights to copy and redistribute, rights to use, rights to modify for personal use, and rights to redistribute modified copies.

    The Open Source Definition (www.opensource.org/osd.html) constrains licenses by imposing the following requirements:

     - an unlimited right to copy be granted
     - an unlimited right to redistribute in unmodified form be granted
     - an unlimited right to modify for personal use be granted
    

    These guidelines prohibit restrictions on redistribution of modified binaries; this meets the needs of software distributors, who need to be able to ship working code without encumbrance. They allow authors to require that modified sources be redistributed as pristine sources plus patches, thus establishing the author's intentions and an "audit trail" of any changes by others.

    All of the standard licenses (MIT or X Consortium License, BSD, Artistic, GPL/LGPL and Mozilla Public License) meet these requirements.