CS360 Lecture notes -- Cat and its variants. Buffering.


As machines and devices get faster, you need to make the file named "large" bigger and bigger to make the programs in this lecture exhibit large running times. In 2016, the file was roughly 8 MB. If you are running the code from this directory, you may need to make "large" bigger.
This lecture gives more detail on writing "cat" with unix system calls and with the C standard I/O library. It also motivates buffering for performance.

Simpcat

Here are three equivalent ways of writing a simple cat program, which just reads from standard input, and writes to standard output.

simpcat1.c
#include <stdio.h>

int main()
{
  char c;

  c = getchar();
  while(c != EOF) {
    putchar(c);
    c = getchar();
  }
}
simpcat2.c
#include <unistd.h>

int main()
{
  char c;
  int i;

  i = read(0, &c, 1);
  while(i > 0) {
    write(1, &c, 1);
    i = read(0, &c, 1);
  }
}
simpcat3.c
#include <stdio.h>

int main()
{
  char c[1];
  int i;

  i = fread(c, 1, 1, stdin);
  while(i > 0) {    
    fwrite(c, 1, 1, stdout);
    i = fread(c, 1, 1, stdin);
  }
}

Let's look at these a little closer. Copy *.c and makefile to one of your directories, and type "make". Now do the following:

UNIX> sh
sh-2.05b$ time simpcat1 < large > /dev/null

real    0m0.151s
user    0m0.137s
sys     0m0.012s
sh-2.05b$ time simpcat2 < large > /dev/null

real    0m34.675s
user    0m10.037s
sys     0m24.594s
sh-2.05b$ time simpcat3 < large > /dev/null

real    0m0.971s
user    0m0.543s
sys     0m0.014s
sh-2.05b$ exit
UNIX> 
Depending on what machine you're using, you are likely to get different times than the above -- those were on my 2.16 GHz MacBook Pro in 2010. Regardless of the numbers that you get, the ratios between simpcat1, simpcat2 and simpcat3 should be roughly the same.

So, what's going on? /dev/null is a special file in Unix that you can write to, but it never stores anything on disk. We're using it so that you don't create 7.5 MB files in your home directory, which would waste disk space. "Large" is a 7,500,000-byte file. This means that in simpcat1.c, getchar() and putchar() are being called 7.5 million times each, as are read() and write() in simpcat2.c, and fread() and fwrite() in simpcat3.c. Obviously, the culprit in simpcat2.c is the fact that the program is making system calls instead of library calls. Remember that a system call is a request made to the operating system. At each read/write call, the operating system has to take over the CPU (which means saving the state of the simpcat2 program), process the request, and return (which means restoring the state of the simpcat2 program). This is evidently far more expensive than what simpcat1.c and simpcat3.c do. Now, look at simpcat4.c and simpcat5.c:

simpcat4.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  int bufsize;
  char *c;
  int i;

  bufsize = atoi(argv[1]);
  c = (char *) malloc(bufsize*sizeof(char));
  i = 1;
  while (i > 0) {
    i = read(0, c, bufsize);
    if (i > 0) write(1, c, i);
  }
}
simpcat5.c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int bufsize;
  char *c;
  int i;

  bufsize = atoi(argv[1]);
  c = (char *) malloc(bufsize*sizeof(char));
  i = 1;
  while (i > 0) {
    i = fread(c, 1, bufsize, stdin);
    if (i > 0) fwrite(c, 1, i, stdout);
  }
}

These let us read in more than one byte at a time. This is called buffering: You allocate a region of memory in which to store things, so that you can make fewer system/procedure calls. Note that fread() and fwrite() are just like read() and write(), except that they go to the standard I/O library instead of the operating system.
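To make "fewer system calls" concrete, here is a small sketch that copies one file descriptor to another and counts how many times it calls read(). The name copy_counting_reads() is made up for this illustration; it is not one of the programs in the lecture directory.

```c
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper (not part of the lecture's code): copy everything from
   file descriptor in to file descriptor out using a buffer of bufsize bytes.
   Returns the number of read() calls made, or -1 on error. */
long copy_counting_reads(int in, int out, size_t bufsize)
{
  char *buf;
  long nreads = 0;
  ssize_t n;

  buf = malloc(bufsize);
  if (buf == NULL) return -1;

  for (;;) {
    n = read(in, buf, bufsize);
    nreads++;
    if (n <= 0) break;          /* 0 means end of file, -1 means error */
    if (write(out, buf, n) != n) {
      n = -1;                   /* for simplicity, treat a short write as an error */
      break;
    }
  }
  free(buf);
  return (n < 0) ? -1 : nreads;
}
```

On a regular file of 7,500,000 bytes, a one-byte buffer needs 7,500,001 reads (the last one returns 0 to signal end of file), while a 4096-byte buffer needs 1,833.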

The graph below shows their relative speeds (this was in 2016 on my MacBook Pro, running on a roughly 8MB input file):

First, what can we infer now about the standard I/O library? It uses buffering! In other words, when you first call getchar() or fread(), it performs a read() of a large number of bytes into a buffer. Thus, subsequent getchar() or fread() calls will be fast, because most of them are satisfied from the buffer without a system call. When you fread() large segments of memory, simpcat4 and simpcat5 exhibit the same behavior, as fread() doesn't need to buffer -- you are doing the buffering for it.
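Here is a rough sketch of the idea, assuming a single 4096-byte buffer and a single input file descriptor. The names my_getc() and MY_BUFSIZE are invented for this example; a real standard I/O library keeps a buffer per FILE and is considerably more careful.

```c
#include <stdio.h>    /* for EOF */
#include <unistd.h>

#define MY_BUFSIZE 4096        /* assumed size; real libraries choose their own */

static unsigned char my_buf[MY_BUFSIZE];
static ssize_t my_len = 0;     /* number of valid bytes in my_buf */
static ssize_t my_pos = 0;     /* index of the next byte to hand out */

/* Hypothetical buffered reader. It only makes a system call when the
   buffer is empty; every other call just returns the next buffered byte.
   Because there is one shared buffer, it only works for one fd at a time. */
int my_getc(int fd)
{
  if (my_pos >= my_len) {
    my_len = read(fd, my_buf, MY_BUFSIZE);
    my_pos = 0;
    if (my_len <= 0) {         /* end of file or error */
      my_len = 0;
      return EOF;
    }
  }
  return my_buf[my_pos++];     /* a value from 0 to 255, never EOF */
}
```

With this structure, reading 7.5 million characters costs about 1,800 read() calls instead of 7.5 million.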

Why then is getchar() faster than fread(c, 1, 1, stdin)? Because getchar() is optimized for reading one character, and fread() is not.


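What's the lesson behind this? System calls are expensive, so make as few of them as you can: either read in large chunks yourself, or use the standard I/O library and let it do the buffering for you.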

The same is true for writes, even though we didn't go through them in detail in class.
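For the write side, a sketch of the same trick might look like the following. Again, my_putc(), my_flush() and OUT_BUFSIZE are hypothetical names for illustration only. Bytes pile up in a buffer, and write() is only called when the buffer fills or when you flush it explicitly.

```c
#include <unistd.h>

#define OUT_BUFSIZE 4096       /* assumed size, chosen for illustration */

static char out_buf[OUT_BUFSIZE];
static size_t out_len = 0;     /* bytes waiting in out_buf */

/* Hand the pending bytes to the operating system.
   Returns 0 on success, -1 on error (the pending bytes are then discarded). */
int my_flush(int fd)
{
  size_t done = 0;
  ssize_t n;

  while (done < out_len) {
    n = write(fd, out_buf + done, out_len - done);
    if (n <= 0) {
      out_len = 0;
      return -1;
    }
    done += (size_t) n;
  }
  out_len = 0;
  return 0;
}

/* Store one byte. write() is only called when the buffer is already full. */
int my_putc(int c, int fd)
{
  if (out_len == OUT_BUFSIZE && my_flush(fd) < 0) return -1;
  out_buf[out_len++] = (char) c;
  return 0;
}
```

Note that a program using this must call my_flush() before it exits, or the last partial buffer never reaches the file. The standard I/O library does that final flush for you when you call fclose() or when the program exits normally.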

Standard I/O vs System calls.

Each system call has analogous procedure calls from the standard I/O library:
System Call			Standard I/O call
-----------			-----------------
open				fopen
close				fclose
read/write			getchar/putchar
				getc/putc
				fgetc/fputc
				fread/fwrite
				gets/puts
				fgets/fputs
				scanf/printf
				fscanf/fprintf
lseek				fseek
System calls work with integer file descriptors. Standard I/O calls define a structure called a FILE, and work with pointers to these structs.
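On Unix systems, a FILE wraps a file descriptor, and the POSIX function fileno() returns it. The sketch below uses a made-up helper, write_both_ways(), to send one string through the standard I/O library and another through a system call to the same underlying file. The fflush() in the middle matters: without it, the first string could still be sitting in the FILE's buffer when the write() happens, and the output could come out in the wrong order.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: send first through the standard I/O library and second
   through a system call, both to the file descriptor underlying f.
   Returns 0 on success, -1 on error. */
int write_both_ways(FILE *f, const char *first, const char *second)
{
  size_t len = strlen(second);

  if (fputs(first, f) == EOF) return -1;   /* lands in f's buffer */
  if (fflush(f) == EOF) return -1;         /* push the buffer to the OS */
  if (write(fileno(f), second, len) != (ssize_t) len) return -1;
  return 0;
}
```

For example, write_both_ways(stdout, "hello ", "world\n") should print "hello world" on one line.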

To exemplify, the following are versions of the program cat which must be called with a filename as their argument. Cat1.c uses system calls, and cat2.c uses the standard I/O library. Read the man pages for open ("man 2v open") and fopen ("man 3s fopen") to understand their arguments. On most current systems, the commands are "man 2 open" and "man 3 fopen".
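The source of cat1.c and cat2.c is not reproduced in these notes, so what follows is only a sketch of the two styles, not the actual files. The function names, the 8192-byte buffer, and the explicit output parameter are all assumptions made for this illustration.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CAT_BUFSIZE 8192       /* assumed buffer size */

/* System-call flavor: open()/read()/write()/close() on file descriptors.
   Returns 0 on success, -1 on error. */
int cat_with_syscalls(const char *filename, int outfd)
{
  char buf[CAT_BUFSIZE];
  ssize_t n;
  int fd;

  fd = open(filename, O_RDONLY);
  if (fd < 0) { perror(filename); return -1; }
  while ((n = read(fd, buf, sizeof(buf))) > 0) {
    if (write(outfd, buf, n) != n) { close(fd); return -1; }
  }
  close(fd);
  return (n < 0) ? -1 : 0;
}

/* Standard I/O flavor: fopen()/fread()/fwrite()/fclose() on FILE pointers.
   Returns 0 on success, -1 on error. */
int cat_with_stdio(const char *filename, FILE *out)
{
  char buf[CAT_BUFSIZE];
  size_t n;
  FILE *f;
  int failed;

  f = fopen(filename, "r");
  if (f == NULL) { perror(filename); return -1; }
  while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
    if (fwrite(buf, 1, n, out) != n) { fclose(f); return -1; }
  }
  failed = ferror(f);
  fclose(f);
  return failed ? -1 : 0;
}
```

A cat-like main() would call cat_with_syscalls(argv[1], 1) or cat_with_stdio(argv[1], stdout) after checking that argc is 2.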

Try:

UNIX> sh
$ time cat1 large > /dev/null
        0.9 real         0.0 user         0.3 sys  
$ time cat2 large > /dev/null
        1.2 real         0.1 user         0.4 sys  
$ exit
UNIX>
How do these compare to the first numbers?

Finally, fullcat.c contains a version of cat which works much like the real version -- if you omit a filename, then it prints standard input to standard output. Otherwise, it prints out each file specified in the command line arguments. Note how it is similar to both simpcat1.c and cat2.c.
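As a sketch of that structure (this is not the contents of fullcat.c; fullcat_sketch() and copy_stream() are hypothetical names), the logic could be arranged like this:

```c
#include <stdio.h>

/* Copy one stream to out a character at a time, the way simpcat1.c
   copies standard input. */
static void copy_stream(FILE *in, FILE *out)
{
  int c;

  while ((c = getc(in)) != EOF) putc(c, out);
}

/* Hypothetical stand-in for main(): with no filenames, copy standard input;
   otherwise copy each named file in turn. Returns 0 if every file could be
   opened, 1 otherwise. */
int fullcat_sketch(int argc, char **argv, FILE *out)
{
  FILE *f;
  int i, status = 0;

  if (argc == 1) {
    copy_stream(stdin, out);
    return 0;
  }
  for (i = 1; i < argc; i++) {
    f = fopen(argv[i], "r");
    if (f == NULL) {
      perror(argv[i]);         /* report the problem and move on to the next file */
      status = 1;
      continue;
    }
    copy_stream(f, out);
    fclose(f);
  }
  return status;
}
```

A real program would wrap this as: int main(int argc, char **argv) { return fullcat_sketch(argc, argv, stdout); }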

Type 'make clean' when you're done to save disk space, and remove any temporary files. You can erase all the files created from this lecture, since you can re-copy them from my directory.


Chars vs ints

You'll note that getchar() is defined to return an int and not a char. Relatedly, look at simpcat1a.c:
#include <stdio.h>
                       
int main()                 
{                      
  int c;              
                       
  c = getchar();       
  while(c != EOF) {    
    putchar(c);        
    c = getchar();     
  }                    
}                      

The only difference between simpcat1a.c and simpcat1.c is that c is an int instead of a char. Now, why would that matter? Look at the following:
UNIX>  ls -l simpcat1.c simpcat1
-rwxr-xr-x   1 plank       10864 Sep  8 14:03 simpcat1
-rw-r--r--   1 plank         526 Sep 13  1996 simpcat1.c
UNIX>  simpcat1 < simpcat1 > tmp1
UNIX>  simpcat1 < simpcat1.c > tmp2
UNIX>  ls -l tmp1 tmp2
-rw-r--r--   1 plank        1746 Sep  8 14:10 tmp1
-rw-r--r--   1 plank         526 Sep  8 14:10 tmp2
UNIX> 
Notice anything weird? Now:
UNIX>  simpcat1a < simpcat1 > tmp3
UNIX>  ls -l tmp3 
-rw-r--r--   1 plank       10864 Sep  8 14:12 tmp3
UNIX> 
This has to do with what happens when getchar() reads the character 255. We'll talk about it in class. See if you can figure it out.
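If you want to experiment before class, the two hypothetical functions below isolate the comparison that the loop in simpcat1.c and simpcat1a.c performs. One caveat: whether a plain char is signed is up to the compiler and machine, so the first function can return different answers on different systems. The second always returns 0, because EOF is required to be negative and getchar() returns bytes as values from 0 to 255.

```c
#include <stdio.h>

/* Does a byte with value 255 compare equal to EOF once it has been stored
   in a char, as in simpcat1.c? The answer is implementation-defined: it
   depends on whether char is signed and on the value of EOF. */
int byte_255_matches_eof_as_char(void)
{
  char c = (char) 255;

  return c == EOF;
}

/* The same question when the value is kept in an int, as in simpcat1a.c. */
int byte_255_matches_eof_as_int(void)
{
  int c = 255;

  return c == EOF;
}
```

Try printing both results on your machine and compare them with the sizes of tmp1 and tmp3 above.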