CS360 Lecture notes -- Cat and its variants. Buffering.


As machines and devices get faster, you need to make the file named large.txt bigger and bigger to make the programs in this lecture exhibit large running times. In 2018, the file was roughly 25 MB. If you are running the code from this directory, you may need to make large.txt bigger. Do this in the makefile.
This lecture gives more detail on writing "cat" with Unix system calls and with the C standard I/O library. It also motivates buffering for performance.

Simpcat

Here are three equivalent ways of writing a simple cat program, which just reads from standard input, and writes to standard output.

simpcat1.c
#include <stdio.h>
#include <fcntl.h>
                       
int main()                 
{                      
  char c;              
                       
  c = getchar();       
  while(c != EOF) {    
    putchar(c);        
    c = getchar();     
  }                    
  return 0;
}                      
simpcat2.c
#include <unistd.h> 


                       
int main()                 
{                      
  char c;           
  int i;               
                       
  i = read(0, &c, 1);   
  while(i > 0) {       
    write(1, &c, 1);    
    i = read(0, &c, 1); 
  }                    
  return 0;
}                      
simpcat3.c
#include <stdio.h>



int main()
{
  char c[1];
  int i;

  i = fread(c, 1, 1, stdin);
  while(i > 0) {    
    fwrite(c, 1, 1, stdout);
    i = fread(c, 1, 1, stdin);
  }
  return 0;
}

Let's look at these a little closer. Copy *.c and makefile to one of your directories, and type "make". (BTW, when you're done, you'll want to type "make clean", because large.txt is 25MB, and you don't need to waste space on it.) Now do the following:

UNIX> sh
sh-4.2$ time ./simpcat1 < large.txt > /dev/null

real  0m0.440s
user  0m0.435s
sys 0m0.003s
sh-4.2$ time ./simpcat2 < large.txt > /dev/null

real  0m34.017s
user  0m8.044s
sys 0m25.967s
sh-4.2$ time ./simpcat3 < large.txt > /dev/null

real  0m0.976s
user  0m0.951s
sys 0m0.009s
sh-4.2$ exit
UNIX> 
Depending on what machine you're using, you will likely get different times than the above -- those were on my Dell Linux box in 2018. Regardless of the numbers that you get, the ratios between simpcat1, simpcat2 and simpcat3 should be roughly the same.

So, what's going on? /dev/null is a special file in Unix that you can write to, but that never stores anything on disk. We're using it so that you don't create 25 MB files in your home directory and waste disk space. large.txt is a 25,000,000-byte file. This means that in simpcat1.c, getchar() and putchar() are each called 25 million times, as are read() and write() in simpcat2.c, and fread() and fwrite() in simpcat3.c. The culprit in simpcat2.c is that the program makes system calls instead of library calls: about 26 of its 34 seconds are "sys" time, which is time spent inside the operating system. Remember that a system call is a request made to the operating system. At each read()/write() call, the operating system has to take over the CPU (which means saving the state of the simpcat2 program), process the request, and return (which means restoring the state of the simpcat2 program). This is evidently far more expensive than what simpcat1.c and simpcat3.c do. Now, look at simpcat4.c and simpcat5.c:

simpcat4.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  int bufsize;
  char *c;
  int i;

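  if (argc != 2 || atoi(argv[1]) <= 0) {   /* need a positive buffer size */
    fprintf(stderr, "usage: simpcat4 bufsize\n");
    exit(1);
  }
  bufsize = atoi(argv[1]);
  c = (char *) malloc(bufsize*sizeof(char));
  if (c == NULL) { perror("malloc"); exit(1); }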
  i = 1;
  while (i > 0) {
    i = read(0, c, bufsize);
    if (i > 0) write(1, c, i);
  }
  return 0;
}
simpcat5.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  int bufsize;
  char *c;
  int i;

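  if (argc != 2 || atoi(argv[1]) <= 0) {   /* need a positive buffer size */
    fprintf(stderr, "usage: simpcat5 bufsize\n");
    exit(1);
  }
  bufsize = atoi(argv[1]);
  c = (char *) malloc(bufsize*sizeof(char));
  if (c == NULL) { perror("malloc"); exit(1); }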
  i = 1;
  while (i > 0) {
    i = fread(c, 1, bufsize, stdin);
    if (i > 0) fwrite(c, 1, i, stdout);
  }
  return 0;
}

These let us read in more than one byte at a time. This is called buffering: You allocate a region of memory in which to store things, so that you can make fewer system/procedure calls. Note that fread() and fwrite() are just like read() and write(), except that they go to the standard I/O library instead of the operating system.
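
To get a feel for how much buffering helps, it is worth doing the arithmetic. The sketch below is not part of the lecture's code: data_reads() is a hypothetical helper that computes how many read() calls return data when a loop like the one in simpcat4.c consumes a file. It assumes that every read() except the last fills the buffer, which is typical for a regular file, although read() is allowed to return less.

```c
#include <assert.h>

/* Hypothetical helper (not in the lecture directory): the number of
   read() calls that return data when a file of filesize bytes is read
   bufsize bytes at a time.  Assumes each read() but the last fills the
   buffer.  The loop then makes one more read(), which returns 0. */
long data_reads(long filesize, long bufsize)
{
  assert(filesize >= 0 && bufsize > 0);
  return (filesize + bufsize - 1) / bufsize;   /* ceiling division */
}
```

For the 25,000,000-byte large.txt, data_reads(25000000, 1) is 25,000,000, while data_reads(25000000, 4096) is 6,104 and data_reads(25000000, 8192) is 3,052.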

The graph below shows their relative speeds. This was in 2016 on my MacBook Pro, running on a roughly 8MB input file; the numbers will be different when you run this on the current version of large.txt, but the shape of the graph will be the same:

First, what can we infer now about the standard I/O library? It uses buffering! In other words, when you first call getchar() or fread(), it performs a read() of a large number of bytes into a buffer. Thus, subsequent getchar() or fread() calls will be fast, because they just copy bytes out of that buffer. When you fread() large chunks at a time, simpcat4 and simpcat5 exhibit the same behavior, as fread() doesn't need to buffer -- it simply calls read().
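
Here is a minimal sketch of the idea, assuming a 4096-byte buffer. The names (struct inbuf, inbuf_init, inbuf_getc) are made up for illustration, and this is not how any particular C library implements getchar(); it only shows how one read() can serve many one-byte requests.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define INBUF_SIZE 4096          /* assumed size; real libraries pick their own */

struct inbuf {
  int fd;                        /* descriptor to read from */
  unsigned char data[INBUF_SIZE];
  ssize_t len;                   /* number of valid bytes in data */
  ssize_t pos;                   /* index of the next unread byte */
  long nreads;                   /* how many read() calls we have made */
};

void inbuf_init(struct inbuf *b, int fd)
{
  assert(b != NULL);
  b->fd = fd;
  b->len = 0;
  b->pos = 0;
  b->nreads = 0;
}

/* Return the next byte (0..255), or -1 at end of input or on error. */
int inbuf_getc(struct inbuf *b)
{
  assert(b != NULL);
  if (b->pos >= b->len) {        /* buffer empty: refill with one system call */
    b->len = read(b->fd, b->data, INBUF_SIZE);
    b->nreads++;
    b->pos = 0;
    if (b->len <= 0) {
      b->len = 0;
      return -1;
    }
  }
  return b->data[b->pos++];      /* the common case: no system call at all */
}
```

Reading a 25,000,000-byte file one byte at a time through inbuf_getc() makes about 6,100 read() calls instead of 25,000,000.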

Why then is getchar() faster than fread(c, 1, 1, stdin)? Because getchar() is optimized for reading one character, and fread() is not.

Think about it -- fread() needs to read four arguments, and if it's executing code for small values of the size, it at the very least needs to figure out that the size is small before executing the code. getchar() has been written to be really fast for single characters.


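What's the lesson behind this? System calls are expensive, so keep the number of them to a minimum. If you read in big chunks, calling read() directly is fine. If you read a byte or a few bytes at a time, use the standard I/O library, which does the buffering for you.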

The same is true for writes, even though we didn't go through them in detail in class.

Standard I/O vs System calls.

Each system call has analogous procedure calls from the standard I/O library:
System Call			Standard I/O call
-----------			-----------------
open				fopen
close				fclose
read/write			getchar/putchar
				getc/putc
				fgetc/fputc
				fread/fwrite
				gets/puts
				fgets/fputs
				scanf/printf
				fscanf/fprintf
lseek				fseek
System calls work with integer file descriptors. Standard I/O calls define a structure called a FILE, and work with pointers to these structs.
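
The two views are connected: on a Unix system, each FILE sits on top of a file descriptor, and the POSIX call fileno() returns it. The small wrapper below is a hypothetical helper written for these notes, just to make that relationship concrete.

```c
#define _POSIX_C_SOURCE 200809L  /* ask for the POSIX declaration of fileno() */
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: return the file descriptor underneath a
   standard I/O stream. */
int descriptor_of(FILE *stream)
{
  assert(stream != NULL);
  return fileno(stream);
}
```

descriptor_of(stdin) is 0 and descriptor_of(stdout) is 1, the same 0 and 1 that simpcat2.c passes to read() and write().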

To exemplify, cat1.c and cat2.c are versions of the program cat that must be called with a filename as their argument. cat1.c uses system calls, and cat2.c uses the standard I/O library. Both use an 8K buffer for the read()/fread() and write()/fwrite() calls. Read the man pages for open ("man 2 open") and fopen ("man 3 fopen") to understand their arguments.
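
The sources of cat1.c and cat2.c are not reproduced in these notes. As a rough sketch of what the text describes (not the actual files), the two functions below copy a named file using an 8K buffer, once with system calls and once with standard I/O. The function names and the return convention (0 on success, -1 on failure) are assumptions made for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CAT_BUFSIZE 8192

/* System-call version: open/read/write/close on integer descriptors. */
int cat_syscalls(const char *filename, int outfd)
{
  char buf[CAT_BUFSIZE];
  ssize_t n, w, done;
  int fd;

  assert(filename != NULL);
  fd = open(filename, O_RDONLY);
  if (fd < 0) return -1;
  while ((n = read(fd, buf, CAT_BUFSIZE)) > 0) {
    for (done = 0; done < n; done += w) {   /* write() may write less than asked */
      w = write(outfd, buf + done, n - done);
      if (w <= 0) { close(fd); return -1; }
    }
  }
  close(fd);
  return (n < 0) ? -1 : 0;
}

/* Standard I/O version: fopen/fread/fwrite/fclose on FILE pointers. */
int cat_stdio(const char *filename, FILE *out)
{
  char buf[CAT_BUFSIZE];
  size_t n;
  FILE *f;
  int status;

  assert(filename != NULL && out != NULL);
  f = fopen(filename, "r");
  if (f == NULL) return -1;
  while ((n = fread(buf, 1, CAT_BUFSIZE, f)) > 0) {
    if (fwrite(buf, 1, n, out) != n) { fclose(f); return -1; }
  }
  status = ferror(f) ? -1 : 0;
  fclose(f);
  return status;
}
```

A main() in the style of cat1.c would call cat_syscalls(argv[1], 1), and one in the style of cat2.c would call cat_stdio(argv[1], stdout).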

Try:

UNIX> sh -c "time ./cat1 large.txt > /dev/null"

real  0m0.012s
user  0m0.001s
sys 0m0.010s
UNIX> sh -c "time ./cat2 large.txt > /dev/null"

real  0m0.015s
user  0m0.003s
sys 0m0.011s
UNIX> 
How do these compare to the first numbers?

Finally, fullcat.c contains a version of cat which works much like the real version -- if you omit a filename, then it prints standard input to standard output. Otherwise, it prints out each file specified in the command line arguments. Note how it is similar to both simpcat1.c and cat2.c.
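
fullcat.c itself is not shown here. The sketch below is one way the behavior just described could be written, under the assumption that it copies a character at a time with getc()/putc() and opens files with fopen(). The names copy_stream and fullcat_sketch, the extra in/out parameters, and the return value are all made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Copy one stream to another a character at a time, as simpcat1.c does
   for stdin and stdout.  c is an int because getc() returns an int. */
static void copy_stream(FILE *in, FILE *out)
{
  int c;

  while ((c = getc(in)) != EOF) putc(c, out);
}

/* Hypothetical core of a fullcat-like program.  Returns 0 if every
   named file could be opened, and 1 otherwise. */
int fullcat_sketch(int argc, char **argv, FILE *in, FILE *out)
{
  int i, status = 0;
  FILE *f;

  assert(argc >= 1 && argv != NULL);
  if (argc == 1) {               /* no filenames: copy the input stream */
    copy_stream(in, out);
    return 0;
  }
  for (i = 1; i < argc; i++) {   /* otherwise, copy each named file in turn */
    f = fopen(argv[i], "r");
    if (f == NULL) {
      perror(argv[i]);
      status = 1;
      continue;
    }
    copy_stream(f, out);
    fclose(f);
  }
  return status;
}
```

A main() built on this would simply return fullcat_sketch(argc, argv, stdin, stdout).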

Type 'make clean' when you're done to save disk space, and remove any temporary files.


Chars vs ints

You'll note that getchar() is defined to return an int and not a char. Relatedly, look at simpcat1a.c:

#include <stdio.h>
#include <fcntl.h>
                       
int main()                 
{                      
  int c;              
                       
  c = getchar();       
  while(c != EOF) {    
    putchar(c);        
    c = getchar();     
  }                    
  return 0;
}                      

The only difference between simpcat1a.c and simpcat1.c is that c is an int instead of a char. Now, why would that matter? Look at the following:

UNIX> ls -l simpcat1.c simpcat1
-rwxr-xr-x. 1 plank loci  9632 Feb  1 11:28 simpcat1
-rw-r--r--. 1 plank guest  547 Feb  1 11:05 simpcat1.c
UNIX> ./simpcat1 < simpcat1 > tmp1
UNIX> ./simpcat1 < simpcat1.c > tmp2
UNIX> ls -l tmp1 tmp2
-rw-r--r--. 1 plank loci 660 Feb  1 11:29 tmp1
-rw-r--r--. 1 plank loci 547 Feb  1 11:29 tmp2
UNIX> 
Notice anything weird? Now:
UNIX> ./simpcat1a < simpcat1 > tmp3
UNIX> ls -l tmp3
-rw-r--r--. 1 plank loci 9632 Feb  1 11:29 tmp3
UNIX> 
This has to do with what happens when getchar() reads the character 255. We'll talk about it in class. See if you can figure it out.