src/simpcat1.c
/* Using getchar()/putchar(). */

#include <stdio.h>
#include <fcntl.h>

int main()
{
  char c;

  c = getchar();
  while(c != EOF) {
    putchar(c);
    c = getchar();
  }
  return 0;
}
src/simpcat2.c
/* Using read()/write(). */

#include <unistd.h>

int main()
{
  char c;

  while(read(0, &c, 1) == 1) {
    write(1, &c, 1);
  }
  return 0;
}
src/simpcat3.c
/* Using fread()/fwrite(). */

#include <stdio.h>

int main()
{
  char c;
  int i;

  i = fread(&c, 1, 1, stdin);
  while(i > 0) {
    fwrite(&c, 1, 1, stdout);
    i = fread(&c, 1, 1, stdin);
  }
  return 0;
}
Let's look at these a little closer. If you have pulled the lecture note repo, cd to this directory and type "make". (BTW, when you're done, you'll want to type "make clean", because large.txt is 25MB, and you don't need to waste space on it.) Now do the following:
UNIX> sh
sh-3.2$ time bin/simpcat1 < data/large.txt > /dev/null

real    0m1.685s
user    0m1.676s
sys     0m0.007s
sh-3.2$ time bin/simpcat2 < data/large.txt > /dev/null

real    0m23.045s
user    0m9.073s
sys     0m13.798s
sh-3.2$ time bin/simpcat3 < data/large.txt > /dev/null

real    0m2.151s
user    0m1.970s
sys     0m0.006s
sh-3.2$
UNIX>

Depending on what machine you're using, you will likely get different times than the above -- those were on my Mac in 2021. Regardless of the numbers that you get, the ratios between simpcat1, simpcat2 and simpcat3 should be roughly the same.
So, what's going on? /dev/null is a special file in Unix that you can write to, but it never stores anything on disk. We're using it so that you don't create 25MB files in your home directory, which would waste disk space. "data/large.txt" is a 25,000,000-byte file. This means that in simpcat1.c, getchar() and putchar() are being called 25 million times each, as are read() and write() in simpcat2.c, and fread() and fwrite() in simpcat3.c.

The culprit in simpcat2.c is that the program is making system calls instead of library calls. Remember that a system call is a request made to the operating system. At each read()/write() call, the operating system has to take over the CPU (which means saving the state of the simpcat2 program), process the request, and return (which means restoring the state of the simpcat2 program). This is evidently far more expensive than what simpcat1.c and simpcat3.c do.

Now, look at src/simpcat4.c and src/simpcat5.c:
src/simpcat4.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  int bufsize;
  char *c;
  int i;

  bufsize = atoi(argv[1]);
  c = (char *) malloc(bufsize*sizeof(char));
  i = 1;
  while (i > 0) {
    i = read(0, c, bufsize);
    if (i > 0) write(1, c, i);
  }
  return 0;
}
src/simpcat5.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  int bufsize;
  char *c;
  int i;

  bufsize = atoi(argv[1]);
  c = (char *) malloc(bufsize*sizeof(char));
  i = 1;
  while (i > 0) {
    i = fread(c, 1, bufsize, stdin);
    if (i > 0) fwrite(c, 1, i, stdout);
  }
  return 0;
}
These let us read in more than one byte at a time. This is called buffering: You allocate a region of memory in which to store things, so that you can make fewer system/procedure calls. Note that fread() and fwrite() are just like read() and write(), except that they go to the standard I/O library instead of the operating system.
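To make the effect of the buffer size concrete, here is a minimal sketch (not one of the files in the repo) that does the same copy as simpcat4.c but also counts how many read() calls it makes. The function name copy_counting_reads() is made up for this illustration. On a regular 25,000,000-byte file, a one-byte buffer needs 25,000,001 read() calls (the last one reports end-of-file), while an 8192-byte buffer needs only about 3,000.

```c
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: copy everything from in_fd to out_fd using a
   buffer of bufsize bytes.  Returns the number of read() calls made,
   including the final one that reports end-of-file, or -1 on error. */
long copy_counting_reads(int in_fd, int out_fd, size_t bufsize)
{
  char *buf;
  ssize_t n, w, off;
  long calls = 0;

  if (bufsize == 0) return -1;
  buf = malloc(bufsize);
  if (buf == NULL) return -1;

  for (;;) {
    n = read(in_fd, buf, bufsize);
    calls++;
    if (n < 0) { free(buf); return -1; }
    if (n == 0) break;                 /* end of file */
    off = 0;
    while (off < n) {                  /* write() may write fewer bytes than asked */
      w = write(out_fd, buf + off, n - off);
      if (w <= 0) { free(buf); return -1; }
      off += w;
    }
  }
  free(buf);
  return calls;
}
```

A main() that wanted to behave like simpcat4 could call copy_counting_reads(0, 1, atoi(argv[1])) and print the result to standard error.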
The graph below shows their relative speeds (this was in 2016 on my MacBook Pro, running on a roughly 8MB input file. The numbers will be different when you run this on the current version of large.txt, but the shape of the graph will be the same):
First, what can we infer now about the standard I/O library? It uses buffering! In other words, when you first call getchar() or fread(), it performs a read() of a large number of bytes into a buffer. Thus, subsequent getchar() or fread() calls will be fast. When you attempt to fread() large segments of memory, simpcat4 and simpcat5 exhibit the same behavior, as fread() doesn't need to buffer -- it simply calls read().
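Here is a toy sketch of the idea. It is not how any real standard I/O library is written; the names toy_stream, toy_init(), toy_getchar() and the 4096-byte buffer size are all made up for illustration. The point is only that one read() fills a buffer, and later calls are served from memory without a system call.

```c
#include <stdio.h>     /* for EOF */
#include <unistd.h>

#define TOY_BUFSIZE 4096   /* hypothetical buffer size, chosen for illustration */

struct toy_stream {
  int fd;                          /* underlying file descriptor */
  unsigned char buf[TOY_BUFSIZE];  /* bytes fetched by the last read() */
  ssize_t len;                     /* number of valid bytes in buf */
  ssize_t pos;                     /* index of the next byte to hand out */
  long nreads;                     /* how many read() calls have been made */
};

void toy_init(struct toy_stream *s, int fd)
{
  s->fd = fd;
  s->len = 0;
  s->pos = 0;
  s->nreads = 0;
}

/* Return the next byte as an int in 0..255, or EOF at end of file
   or on error. */
int toy_getchar(struct toy_stream *s)
{
  if (s->pos >= s->len) {          /* buffer is empty: refill it with one read() */
    s->len = read(s->fd, s->buf, TOY_BUFSIZE);
    s->nreads++;
    s->pos = 0;
    if (s->len <= 0) { s->len = 0; return EOF; }
  }
  return s->buf[s->pos++];         /* no system call on this path */
}
```

With this sketch, reading a 25,000,000-byte file one character at a time would make roughly 6,100 read() calls instead of 25 million.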
Why then is getchar() faster than fread(c, 1, 1, stdin)? Because getchar() is optimized for reading one character, and fread() is not.
Think about it -- fread() needs to read four arguments, and if it's executing code for small values of the size, it at the very least needs to figure out that the size is small before executing the code. getchar() has been written to be really fast for single characters.
System Call     Standard I/O call
-----------     -----------------
open            fopen
close           fclose
read/write      getchar/putchar
                getc/putc
                fgetc/fputc
                fread/fwrite
                gets/puts
                fgets/fputs
                scanf/printf
                fscanf/fprintf
lseek           fseek

System calls work with integer file descriptors. Standard I/O calls define a structure called a FILE, and work with pointers to these structs.
To exemplify, the following are versions of the program cat which must be called with a filename as their argument. cat1.c uses system calls, and cat2.c uses the standard I/O library. Both use an 8K buffer for the read()/fread() and write()/fwrite() calls. Read the man pages for open ("man 2 open") and fopen ("man 3 fopen") to understand their arguments.
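The listings for cat1.c and cat2.c are in the repo. As a rough sketch of the two styles side by side (these are not the repo files; the function names cat_syscalls() and cat_stdio() and their return conventions are assumptions made for this illustration), the system-call version holds an int file descriptor and the standard I/O version holds a FILE pointer:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BUFSIZE 8192   /* an 8K buffer */

/* System-call style: open() returns an integer file descriptor.
   Copies the file at path to out_fd.  Returns bytes copied, or -1. */
long cat_syscalls(const char *path, int out_fd)
{
  char buf[BUFSIZE];
  ssize_t n, w, off;
  long total = 0;
  int fd;

  fd = open(path, O_RDONLY);
  if (fd < 0) return -1;
  while ((n = read(fd, buf, BUFSIZE)) > 0) {
    for (off = 0; off < n; off += w) {   /* handle short writes */
      w = write(out_fd, buf + off, n - off);
      if (w <= 0) { close(fd); return -1; }
    }
    total += n;
  }
  close(fd);
  return (n < 0) ? -1 : total;
}

/* Standard I/O style: fopen() returns a pointer to a FILE struct.
   Copies the file at path to out.  Returns bytes copied, or -1. */
long cat_stdio(const char *path, FILE *out)
{
  char buf[BUFSIZE];
  size_t n;
  long total = 0;
  FILE *f;

  f = fopen(path, "r");
  if (f == NULL) return -1;
  while ((n = fread(buf, 1, BUFSIZE, f)) > 0) {
    if (fwrite(buf, 1, n, out) != n) { fclose(f); return -1; }
    total += n;
  }
  if (ferror(f)) total = -1;
  fclose(f);
  return total;
}
```

A main() for the first style would call cat_syscalls(argv[1], 1); for the second, cat_stdio(argv[1], stdout).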
Try:
UNIX> sh -c "time bin/cat1 data/large.txt > /dev/null"      # As you can see,

real    0m0.010s
user    0m0.003s
sys     0m0.006s
UNIX> sh -c "time bin/cat2 data/large.txt > /dev/null"      # Their performance is the same.

real    0m0.010s
user    0m0.003s
sys     0m0.006s
UNIX>

How do these compare to the first numbers?
Finally, src/fullcat.c contains a version of cat which works much like the real version -- if you omit a filename, then it prints standard input to standard output. Otherwise, it prints out each file specified in the command line arguments. Note how it is similar to both simpcat1.c and cat2.c.
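The control flow described above can be sketched as follows. This is only an illustration of the "no filenames means standard input" logic, not the contents of src/fullcat.c; the names copy_stream() and cat_files(), and the choice to keep going after a file fails to open, are assumptions.

```c
#include <stdio.h>

/* Copy one stream to another, one character at a time. */
static void copy_stream(FILE *in, FILE *out)
{
  int c;

  while ((c = getc(in)) != EOF) putc(c, out);
}

/* Hypothetical driver.  With no filenames, copy in to out.
   Otherwise copy each named file to out, in order.
   Returns 0 on success, 1 if any file could not be opened. */
int cat_files(int nfiles, char **files, FILE *in, FILE *out)
{
  int i, status = 0;
  FILE *f;

  if (nfiles == 0) {                 /* no filenames given */
    copy_stream(in, out);
    return 0;
  }
  for (i = 0; i < nfiles; i++) {
    f = fopen(files[i], "r");
    if (f == NULL) {                 /* report the error and move on */
      perror(files[i]);
      status = 1;
      continue;
    }
    copy_stream(f, out);
    fclose(f);
  }
  return status;
}
```

A main() would call cat_files(argc - 1, argv + 1, stdin, stdout) and return its result.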
Type 'make clean' when you're done to save disk space, and remove any temporary files.
src/simpcat1a.c
#include <stdio.h>
#include <fcntl.h>

int main()
{
  int c;

  c = getchar();
  while(c != EOF) {
    putchar(c);
    c = getchar();
  }
  return 0;
}
The only difference between simpcat1a.c and simpcat1.c is that c is an int instead of a char. Now, why would that matter? Look at the following:
UNIX> ls -l src/simpcat1.c bin/simpcat1
-rwxr-xr-x  1 plank  staff  12604 Feb  5 12:17 bin/simpcat1
-rw-r--r--  1 plank  staff    466 Feb  5 12:15 src/simpcat1.c
UNIX> bin/simpcat1 > tmp1.txt
^C
UNIX> bin/simpcat1 < bin/simpcat1 > tmp1.txt
UNIX> bin/simpcat1 < src/simpcat1.c > tmp2.txt
UNIX> ls -l tmp1.txt tmp2.txt
-rw-r--r--  1 plank  staff   3919 Feb  7 23:37 tmp1.txt      # This file should be 12,604 bytes
-rw-r--r--  1 plank  staff    466 Feb  7 23:38 tmp2.txt
UNIX>

Notice anything weird? Now:
UNIX> bin/simpcat1a < bin/simpcat1 > tmp3.txt
UNIX> ls -l tmp3.txt
-rw-r--r--  1 plank  staff  12604 Feb  7 23:38 tmp3.txt      # Now the output file is the same size as the input file
UNIX>

This has to do with what happens when getchar() reads the character 255. We'll talk about it in class. See if you can figure it out.