CS360 Lecture notes -- Cat and its variants. Buffering.


As machines and devices get faster, you need to make the file named large.txt bigger and bigger to make the programs in this lecture exhibit large running times. In 2018, the file was roughly 25 MB. If you are running the code from this directory, you may need to make large.txt bigger. Do this in the makefile.
This lecture gives more detail on writing "cat" with unix system calls and with the C standard I/O library. It also motivates buffering for performance.

Simpcat

Here are three equivalent ways of writing a simple cat program, which just reads from standard input, and writes to standard output.

src/simpcat1.c
/* Using getchar()/putchar(). */

#include <stdio.h>
#include <fcntl.h>
                       
int main()                 
{                      
  char c;              
                       
  c = getchar();       
  while(c != EOF) {    
    putchar(c);        
    c = getchar();     
  }                    
  return 0;
}                      
src/simpcat2.c
/* Using read()/write(). */

#include <unistd.h> 
                       
int main()                 
{                      
  char c;           
                       
  while(read(0, &c, 1) == 1) {
    write(1, &c, 1);    
  }                    
  return 0;
}                      
src/simpcat3.c
/* Using fread()/fwrite(). */

#include <stdio.h>

int main()
{
  char c;
  int i;

  i = fread(&c, 1, 1, stdin);
  while(i > 0) {    
    fwrite(&c, 1, 1, stdout);
    i = fread(&c, 1, 1, stdin);
  }
  return 0;
}

Let's look at these a little more closely. If you have pulled the lecture note repo, cd to this directory and type "make". (BTW, when you're done, you'll want to type "make clean", because large.txt is 25 MB, and you don't need to waste space on it.) Now do the following:

UNIX> sh
sh-3.2$ time bin/simpcat1 < data/large.txt > /dev/null

real	0m1.685s
user	0m1.676s
sys	0m0.007s
sh-3.2$ time bin/simpcat2 < data/large.txt > /dev/null

real	0m23.045s
user	0m9.073s
sys	0m13.798s
sh-3.2$ time bin/simpcat3 < data/large.txt > /dev/null

real	0m2.151s
user	0m1.970s
sys	0m0.006s
sh-3.2$ 
UNIX> 
Depending on what machine you're using, you are likely to get different times than the above -- those were on my Mac in 2021. Regardless of the numbers that you get, the ratios between simpcat1, simpcat2 and simpcat3 should be roughly the same.

So, what's going on? /dev/null is a special file in Unix that you can write to, but it never stores anything on disk. We're using it so that you don't create 25 MB files in your home directory, which would waste disk space. "data/large.txt" is a 25,000,000-byte file. This means that in simpcat1.c, getchar() and putchar() are being called 25 million times each, as are read() and write() in simpcat2.c, and fread() and fwrite() in simpcat3.c. Obviously, the culprit in simpcat2.c is the fact that the program is making system calls instead of library calls. Remember that a system call is a request made to the operating system. This means that at each read/write call, the operating system has to take over the CPU (which means saving the state of the simpcat2 program), process the request, and return (which means restoring the state of the simpcat2 program). This is evidently far more expensive than what simpcat1.c and simpcat3.c do. Now, look at src/simpcat4.c and src/simpcat5.c:

src/simpcat4.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  int bufsize;
  char *c;
  int i;

  bufsize = atoi(argv[1]);
  c = (char *) malloc(bufsize*sizeof(char));
  i = 1;
  while (i > 0) {
    i = read(0, c, bufsize);
    if (i > 0) write(1, c, i);
  }
  return 0;
}
src/simpcat5.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  int bufsize;
  char *c;
  int i;

  bufsize = atoi(argv[1]);
  c = (char *) malloc(bufsize*sizeof(char));
  i = 1;
  while (i > 0) {
    i = fread(c, 1, bufsize, stdin);
    if (i > 0) fwrite(c, 1, i, stdout);
  }
  return 0;
}

These let us read in more than one byte at a time. This is called buffering: You allocate a region of memory in which to store things, so that you can make fewer system/procedure calls. Note that fread() and fwrite() are just like read() and write(), except that they go to the standard I/O library instead of the operating system.

The graph below shows the relative speeds of simpcat4 and simpcat5. This was in 2016 on my MacBook Pro, running on a roughly 8 MB input file. The numbers will be different when you run this on the current version of large.txt, but the shape of the graph will be the same:

First, what can we infer now about the standard I/O library? It uses buffering! In other words, when you first call getchar() or fread(), it performs a read() of a large number of bytes into a buffer. Thus, subsequent getchar() or fread() calls will be fast. When you fread() large chunks at a time, simpcat4 and simpcat5 exhibit the same behavior, as fread() doesn't need to buffer -- it simply calls read().

Why then is getchar() faster than fread(c, 1, 1, stdin)? Because getchar() is optimized for reading one character, and fread() is not.

Think about it -- fread() needs to read four arguments, and if it's executing code for small values of the size, it at the very least needs to figure out that the size is small before executing the code. getchar() has been written to be really fast for single characters.


What's the lesson behind this?

The same is true for writes, even though we didn't go through them in detail in class.

Standard I/O vs System calls.

Each system call has analogous procedure calls from the standard I/O library:
System Call			Standard I/O call
-----------			-----------------
open				fopen
close				fclose
read/write			getchar/putchar
				getc/putc
				fgetc/fputc
				fread/fwrite
				gets/puts
				fgets/fputs
				scanf/printf
				fscanf/fprintf
lseek				fseek
System calls work with integer file descriptors. Standard I/O calls define a structure called a FILE, and work with pointers to these structs.

To exemplify, cat1.c and cat2.c are versions of the program cat which must be called with a filename as their argument. cat1.c uses system calls, and cat2.c uses the standard I/O library. Both use an 8K buffer for the read()/fread() and write()/fwrite() calls. Read the man pages for open ("man 2 open") and fopen ("man 3 fopen") to understand their arguments.

Try:

UNIX> sh -c "time bin/cat1 data/large.txt > /dev/null"         # As you can see,

real	0m0.010s
user	0m0.003s
sys	0m0.006s
UNIX> sh -c "time bin/cat2 data/large.txt > /dev/null"         # Their performance is the same.

real	0m0.010s
user	0m0.003s
sys	0m0.006s
UNIX> 
How do these compare to the first numbers?

Finally, src/fullcat.c contains a version of cat which works much like the real version -- if you omit a filename, then it prints standard input to standard output. Otherwise, it prints out each file specified in the command line arguments. Note how it is similar to both simpcat1.c and cat2.c.

Type 'make clean' when you're done to save disk space, and remove any temporary files.


Chars vs ints

You'll note that getchar() is defined to return an int and not a char. Relatedly, look at simpcat1a.c:

#include <stdio.h>
#include <fcntl.h>
                       
int main()                 
{                      
  int c;              
                       
  c = getchar();       
  while(c != EOF) {    
    putchar(c);        
    c = getchar();     
  }                    
  return 0;
}                      

The only difference between simpcat1a.c and simpcat1.c is that c is an int instead of a char. Now, why would that matter? Look at the following:

UNIX> ls -l src/simpcat1.c bin/simpcat1
-rwxr-xr-x  1 plank  staff  12604 Feb  5 12:17 bin/simpcat1
-rw-r--r--  1 plank  staff    466 Feb  5 12:15 src/simpcat1.c
UNIX> bin/simpcat1 > tmp1.txt
^C
UNIX> bin/simpcat1 < bin/simpcat1 > tmp1.txt
UNIX> bin/simpcat1 < src/simpcat1.c > tmp2.txt
UNIX> ls -l tmp1.txt tmp2.txt
-rw-r--r--  1 plank  staff  3919 Feb  7 23:37 tmp1.txt                     # This file should be 12,604 bytes
-rw-r--r--  1 plank  staff   466 Feb  7 23:38 tmp2.txt
UNIX> 
Notice anything weird? Now:
UNIX> bin/simpcat1a < bin/simpcat1 > tmp3.txt
UNIX> ls -l tmp3.txt
-rw-r--r--  1 plank  staff  12604 Feb  7 23:38 tmp3.txt         # Now the output file is the same size as the input file
UNIX> 
This has to do with what happens when getchar() reads the character 255. We'll talk about it in class. See if you can figure it out.