CS360 Lecture notes -- Prsize: recursive directory traversal


To "follow along" with these lecture notes, if you are on the lab machines, simply do:
UNIX> cd /home/plank/cs360/notes/Prsize/test1
If you are on your own machine, then follow these instructions:

Pull the repo, and cd to the Prsize directory. You'll need to change the makefile so that it works on your machine. Here's what my makefile looks like on my home machine:

UNIX> head -n 4 makefile
CC = gcc

CFLAGS  =       -g -I$(HOME)/include
LIB = $(HOME)/lib/libfdr.a
UNIX> make
gcc -g -I/home/plank/include -c prsize1.c
gcc -g -I/home/plank/include -o prsize1 prsize1.o
...
UNIX> 
Next, untar the test directories. This is because bitbucket doesn't do a good job of preserving soft links and directory protections:
UNIX> tar xpfv test-directories.tar
test1/
test1/d1/
test1/d1/f3
test1/f1
test1/f2
test2/
test2/f4-soft
.....
UNIX>
Finally:
UNIX> cd test1
Now, you're set up.
This lecture covers the writing of a command prsize. What prsize does is print the number of bytes taken up by all files reachable from the current directory (soft links are counted by their own size and are not followed). It is a good program as it illustrates using opendir/readdir/closedir, stat, recursion, building path names, and finding hard links.

First, I wrote prsize1.c. This prints the total size of all files in the current directory. It is a simple use of stat and opendir/readdir/closedir. Test it out on your current directory (which is test1):

UNIX> ../prsize1 
4170
You may get a different value (e.g. on my home machine, I got 12314), but if you do a long listing of all files, your value should equal the sum of all of the file sizes. For example:
UNIX> ls -la
total 12
drwxr-xr-x. 3 plank guest   33 Feb 11 14:00 .
drwxr-xr-x. 7 plank guest 4096 Feb 11 14:00 ..
drwxr-xr-x. 2 plank guest   15 Sep 23  1994 d1
-rw-r--r--. 1 plank guest   11 Sep 23  1994 f1
-rw-r--r--. 1 plank guest   15 Sep 23  1994 f2
UNIX> echo "33 4096 + 15 + 11 + 15 + p" | dc
4170
UNIX> 
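Here is a minimal sketch of the opendir/readdir/closedir plus stat loop just described. It is not the code from prsize1.c; the function name cwd_total_size() is made up for illustration, and the error handling is only an assumption about what a reasonable version would do.

```c
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <sys/stat.h>

/* Hypothetical helper (not taken from prsize1.c): add up st_size for
   every entry in the current directory, including "." and "..". */
long cwd_total_size(void)
{
  DIR *d;
  struct dirent *de;
  struct stat buf;
  long total = 0;

  d = opendir(".");
  if (d == NULL) {
    perror("opendir");
    exit(1);
  }

  /* Entry names are relative to ".", so stat() can use them directly. */
  for (de = readdir(d); de != NULL; de = readdir(d)) {
    if (stat(de->d_name, &buf) < 0) {
      fprintf(stderr, "Couldn't stat %s\n", de->d_name);
      exit(1);
    }
    total += buf.st_size;
  }

  closedir(d);
  return total;
}
```

Calling cwd_total_size() from a main() and printing the result with "%ld" should give the same kind of total as the dc sum above, since both add the sizes of ".", ".." and every other entry.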
Now, the next step we'd like to take is to get the program to sum up the sizes of all files reachable from the current directory. To do this, we need to make the program recursive. Instead of putting all our code in the main() routine, we'll instead bundle it into a function, and call that function. Prsize2.c does this. It provides the same functionality as prsize1.c, except that it makes a call to get_size() to find the size. Note there is no recursion yet -- that is for prsize3.c. If you test prsize2, you'll see that it does the same thing as prsize1.
UNIX> ../prsize2 
4170
UNIX>
Now, we want to make prsize2 recursive. Whenever we encounter a directory, we want to find out the size of everything in that directory, so we will call get_size() recursively on that directory. This is done in prsize3.c. Try it out:
UNIX> ../prsize3
prsize: Too many open files
UNIX>
So, what's happening? Well, to check, I put a print statement into prsize3a.c to see when it's making the recursive calls:
UNIX> ../prsize3a
Making recursive call on directory .
Making recursive call on directory .
Making recursive call on directory .
Making recursive call on directory .
....
prsize: Too many open files
UNIX>
Now you can see what's happening. When enumerating files in ".", you come across the file ".". This is a directory, so you make a recursive call on it. This goes into an infinite loop until you run out of open file descriptors, at which point opendir() fails. To fix this, you need to check and see whether or not you are trying to make a recursive call to the "." directory. You need to check for ".." as well. Prsize4.c puts in this code. Now try it out:
UNIX> ../prsize4
Couldn't stat f3
UNIX>
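Before looking at that failure, here is a minimal sketch of the "." and ".." test itself. The helper name is_dot_or_dotdot() is hypothetical; prsize4.c may well write the two strcmp() calls inline instead.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if name is exactly "." or "..",
   and 0 for every other name (including names like ".hidden"). */
int is_dot_or_dotdot(const char *name)
{
  return strcmp(name, ".") == 0 || strcmp(name, "..") == 0;
}
```

A directory entry would then be recursed into only when it is a directory and is_dot_or_dotdot(de->d_name) returns 0.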
Ok, now what's the problem? Well, the program is trying to stat f3 in the directory d1, but it's not working in the directory d1. In other words, prsize4 is called from the directory /home/plank/cs360/notes/Prsize/test1, and makes the call "exists = stat("f3", &buf)". Of course stat is going to return -1, because there is no file f3 in the directory. Instead, we need to look for "d1/f3". In other words, our code has a bug -- we need to be looking for fn/de->d_name in get_size(), and not just de->d_name. Prsize5.c makes this change.
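Building the "fn/de->d_name" path might look like the following sketch. The helper join_path() is an assumption made for illustration and is not part of prsize5.c, which may simply sprintf() into a buffer.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly malloc()ed "dir/name" string.
   The caller must free() it. */
char *join_path(const char *dir, const char *name)
{
  size_t len = strlen(dir) + 1 + strlen(name) + 1;  /* '/' and '\0' */
  char *path = malloc(len);

  if (path == NULL) {
    perror("malloc");
    exit(1);
  }
  snprintf(path, len, "%s/%s", dir, name);
  return path;
}
```

With this, a call on directory "d1" that reads the entry "f3" would stat join_path("d1", "f3"), that is "d1/f3", rather than plain "f3".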
UNIX> ../prsize5
4235
So, this looks ok, except there's still something wrong:
UNIX> ls -la
total 12
drwxr-xr-x. 3 plank guest   33 Feb 11 14:00 .
drwxr-xr-x. 7 plank guest 4096 Feb 11 14:00 ..
drwxr-xr-x. 2 plank guest   15 Sep 23  1994 d1
-rw-r--r--. 1 plank guest   11 Sep 23  1994 f1
-rw-r--r--. 1 plank guest   15 Sep 23  1994 f2
UNIX> echo "33 4096 + 15 + 11 + 15 + p" | dc
4170
UNIX> ls -la d1
total 4
drwxr-xr-x. 2 plank guest 15 Sep 23  1994 .
drwxr-xr-x. 3 plank guest 33 Feb 11 14:00 ..
-rw-r--r--. 1 plank guest 17 Sep 23  1994 f3
UNIX> echo "4170 15 + 33 + 17 + p" | dc
4235
UNIX> 
As you can see, prsize5 is counting d1 and d1/. as separate files, and adding both of their sizes into the total. Same for . and d1/..

This is a drag. To be clearer, look in test2:

UNIX> cd ../test2
UNIX> ../prsize5
4152
UNIX> ls -lai
total 12
486036031 drwxr-xr-x. 2 plank guest   34 Feb 11 14:15 .
962309197 drwxr-xr-x. 7 plank guest 4096 Feb 11 14:00 ..
486036512 -rw-r--r--. 2 plank guest   11 Sep 23  1994 f4
486036512 -rw-r--r--. 2 plank guest   11 Sep 23  1994 f4-hard-link
UNIX> echo "34 4096 + 11 + 11 + p" | dc
4152
UNIX> 
The files f4 and f4-hard-link are links to the same file, so we really shouldn't count them twice. However, prsize5 counts them as being different. So, what we need is for prsize to be able to recognize hard links, and only count them once.

How do you recognize whether two files are links to the same disk file? You use the inode number. This is held in buf.st_ino.
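As a small illustration, the comparison could be sketched like this. The function same_file() is hypothetical. It also compares st_dev, which the notes do not mention; that is an addition here, made because inode numbers are only unique within a single file system.

```c
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if both paths name the same disk file,
   0 if they name different files, and -1 if either stat() call fails. */
int same_file(const char *a, const char *b)
{
  struct stat sa, sb;

  if (stat(a, &sa) < 0 || stat(b, &sb) < 0) return -1;
  return sa.st_ino == sb.st_ino && sa.st_dev == sb.st_dev;
}
```

In test2, same_file("f4", "f4-hard-link") should return 1, because the ls -lai listing shows the same inode number for both names.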

Now, the way we check for duplicate inodes is to maintain a red-black tree of inodes that we have seen so far. Before adding in the size of any file, we check to see if its inode is in the tree. If so, we do nothing. Otherwise, we add in the size, and put the inode into the tree. As inodes are ints, we can use jrb_insert_int and jrb_find_int to access and modify the red-black tree. (Caveat starting in 2015 -- inode numbers are now longs on some systems, so jrb_find_int() is not technically the right thing to use any more. However, I'm not going to change the notes because it won't matter functionally unless you have two different inodes whose lower 32 bits are identical. I'm tempted to memcpy() the inode into a double and use jrb_find_dbl() -- I'll probably do that when I teach it, even though it's kind of disgusting. What I need to do is add _long() to the JRB library, but I don't see myself doing that any time soon.)

The code is in prsize6.c.
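For readers without the JRB library at hand, the check-then-insert logic can be sketched with a plain growable array. This is a swapped-in technique and not what prsize6.c does: the InodeSet type and seen_before() are hypothetical, and the lookup is linear rather than logarithmic. Storing ino_t values directly also sidesteps the int versus long caveat above.

```c
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical stand-in for the red-black tree: a growable array of the
   inode numbers seen so far.  Lookup is linear, so this is only a sketch. */
typedef struct {
  ino_t *inodes;
  size_t count;
  size_t capacity;
} InodeSet;

/* Returns 1 if ino was already in the set.  Otherwise adds it and
   returns 0, so the caller knows to count this file's size. */
int seen_before(InodeSet *set, ino_t ino)
{
  size_t i;

  for (i = 0; i < set->count; i++) {
    if (set->inodes[i] == ino) return 1;
  }

  /* Not found: grow the array if it is full, then record the inode. */
  if (set->count == set->capacity) {
    size_t new_capacity = (set->capacity == 0) ? 16 : set->capacity * 2;
    ino_t *grown = realloc(set->inodes, new_capacity * sizeof(ino_t));
    if (grown == NULL) exit(1);
    set->inodes = grown;
    set->capacity = new_capacity;
  }
  set->inodes[set->count++] = ino;
  return 0;
}
```

The traversal would then add buf.st_size to the total only when seen_before(&set, buf.st_ino) returns 0, starting from an InodeSet initialized to { NULL, 0, 0 }.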

UNIX> ../prsize6
4141                    This is 11 less than before, so it's correct.
UNIX> cd ../test1
UNIX> ../prsize6
4187                    This is 48 less than before, so it's correct as it's not double-counting . and d1.
UNIX>
Now, soft links present a small problem. Look at the test3 directory.
UNIX> cd ../test3
UNIX> ls -la
total 8
drwxr-xr-x. 2 plank guest   55 Sep 24  1996 .
drwxr-xr-x. 7 plank guest 4096 Feb 11 14:00 ..
-rw-r--r--. 1 plank guest   11 Sep 23  1994 f5
lrwxrwxrwx. 1 plank loci     2 Aug  1  2014 f5-soft-link -> f5
lrwxrwxrwx. 1 plank loci     1 Aug  1  2014 soft-link-to-. -> .
UNIX> ../prsize6
Couldn't stat ./soft-link-to-./soft-link-to-./soft-link-to-./soft-link-to-.........
prsize: Too many levels of symbolic links
UNIX> 
So, what has happened? Since we're using stat(), prsize6 doesn't recognize soft links, and thus we have the same infinite loop problem as before. It should be clear what we want -- instead of traversing the link to ".", we want prsize to count the size of the link itself (2 bytes for f5-soft-link and 1 byte for soft-link-to-.). Thus, all we need to do in prsize7.c is use lstat() instead of stat(). This gives information about the soft link itself, instead of the file to which the link points:
UNIX> ../prsize7
4165
UNIX> echo "55 4096 + 11 + 2 + 1 + p" | dc
4165
UNIX> 
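To see the difference between the two calls directly, here is a minimal sketch that builds a file and a soft link to it, then reports sizes with lstat() and stat(). All three function names and any file names passed to them are made up for illustration; nothing here comes from prsize7.c.

```c
#define _POSIX_C_SOURCE 200809L  /* for lstat() and symlink() under strict -std modes */
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

/* Hypothetical setup: creates an 11-byte file and a soft link to it,
   mirroring f5 and f5-soft-link in test3.  Returns 0 on success. */
int make_link_demo(const char *file, const char *link_name)
{
  FILE *f = fopen(file, "w");
  if (f == NULL) return -1;
  fputs("0123456789\n", f);   /* 10 digits plus a newline: 11 bytes */
  fclose(f);
  unlink(link_name);          /* ignore failure: the link may not exist yet */
  return symlink(file, link_name);
}

/* Size reported by lstat(): for a soft link, the size of the link itself. */
long size_of_entry(const char *path)
{
  struct stat buf;
  if (lstat(path, &buf) < 0) return -1;
  return (long) buf.st_size;
}

/* Size reported by stat(): for a soft link, the size of the file it points to. */
long size_of_target(const char *path)
{
  struct stat buf;
  if (stat(path, &buf) < 0) return -1;
  return (long) buf.st_size;
}
```

For a link, size_of_entry() should give the length of the path stored in the link, which is why f5-soft-link shows up as 2 bytes in the listing above, while size_of_target() follows the link and reports the 11 bytes of the file.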
Finally, there's one more bug in this program, and it's really subtle. It has to do with open file descriptors. First, go to the directory test4. Below, I use the find command to show that it is composed of 10 nested directories. You can see that prsize7 works just fine on it:
UNIX> cd ../test4
UNIX> find . -print
.
./1
./1/2
./1/2/3
./1/2/3/4
./1/2/3/4/5
./1/2/3/4/5/6
./1/2/3/4/5/6/7
./1/2/3/4/5/6/7/8
./1/2/3/4/5/6/7/8/9
UNIX> ../prsize7
4228
UNIX>
The reason that it works is that our defaults typically allow for 256 open files per process:
UNIX> limit | grep descriptors
descriptors  256 
UNIX>
Let's use the limit command to set this number to ten instead of 256. Now, prsize7 fails because of too many open files:
UNIX> limit descriptors 10
UNIX> ../prsize7
prsize: Too many open files
UNIX> 
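The same limit can also be inspected and lowered from inside a C program. The sketch below assumes that the shell's "descriptors" limit corresponds to the RLIMIT_NOFILE resource used by getrlimit() and setrlimit(); the two helper names are hypothetical.

```c
#include <sys/resource.h>

/* Returns the current soft limit on open file descriptors,
   or -1 if getrlimit() fails. */
long open_file_limit(void)
{
  struct rlimit rl;

  if (getrlimit(RLIMIT_NOFILE, &rl) < 0) return -1;
  return (long) rl.rlim_cur;
}

/* Sets the soft limit to n, leaving the hard limit alone.
   Returns 0 on success and -1 on failure (for example, if n is
   larger than the hard limit). */
int set_open_file_limit(long n)
{
  struct rlimit rl;

  if (getrlimit(RLIMIT_NOFILE, &rl) < 0) return -1;
  rl.rlim_cur = (rlim_t) n;
  return setrlimit(RLIMIT_NOFILE, &rl);
}
```

Calling set_open_file_limit(10) at the top of main() would be one way to reproduce the failure without touching the shell's settings.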
What's happening is that the recursive calls to get_size() are made in between the opendir() and closedir() calls. That means that each time we make a recursive call, we add one to the number of open files. With only ten open files (and three open to start the process), we run out of file descriptors when we try to open "./1/2/3/4/5/6/7".

The solution to this is to make sure that there are no open files when we make the recursive call. How do we do this? When enumerating the files in a directory, we put all directories into a dllist, and then after closing the directory file, we traverse the list and make the recursive calls. We need to do a strdup() when we put the directories into the dllist. Why? Think it over, or see what happens when you don't do it, and you try to run the program on the test5 directory.
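The close-before-recursing pattern can be sketched on a simpler problem: counting directories instead of summing bytes. This is not prsize8.c. The NameNode list is a hypothetical stand-in for the dllist, and count_dirs() is a made-up name.

```c
#define _POSIX_C_SOURCE 200809L  /* for lstat() and strdup() under strict -std modes */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

/* Hypothetical list node, standing in for a dllist entry. */
typedef struct NameNode {
  char *path;
  struct NameNode *next;
} NameNode;

/* Counts fn plus every directory below it (soft links are not followed).
   Subdirectory paths are saved while fn is open; the recursive calls are
   made only after closedir(). */
int count_dirs(const char *fn)
{
  DIR *d;
  struct dirent *de;
  struct stat buf;
  NameNode *head = NULL, *node;
  size_t len = strlen(fn) + 258;   /* same buffer sizing as the notes' code */
  char *s;
  int count = 1;

  d = opendir(fn);
  if (d == NULL) { perror(fn); exit(1); }
  s = malloc(len);
  if (s == NULL) { perror("malloc"); exit(1); }

  for (de = readdir(d); de != NULL; de = readdir(d)) {
    if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) continue;
    snprintf(s, len, "%s/%s", fn, de->d_name);
    if (lstat(s, &buf) < 0) {
      fprintf(stderr, "Couldn't stat %s\n", s);
      exit(1);
    }
    if (S_ISDIR(buf.st_mode)) {
      node = malloc(sizeof(NameNode));
      if (node == NULL) { perror("malloc"); exit(1); }
      node->path = strdup(s);      /* the list keeps its own copy of the path */
      if (node->path == NULL) { perror("strdup"); exit(1); }
      node->next = head;
      head = node;
    }
  }
  closedir(d);   /* fn's descriptor is released before any recursion */
  free(s);

  /* Recurse on the saved paths, freeing each node once it is done. */
  while (head != NULL) {
    node = head;
    head = head->next;
    count += count_dirs(node->path);
    free(node->path);
    free(node);
  }
  return count;
}
```

Because each call closes its directory before recursing, at most one directory is open at any moment, however deep the nesting goes. Called on test4, count_dirs(".") should return 10, matching the ten lines that find prints there.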

The correct and final version of prsize is in prsize8.c.

UNIX> ../prsize8
4228
UNIX> cd ../test5
UNIX> ls -la
total 4
drwxr-xr-x. 5 plank guest   33 Sep 24  1996 .
drwxr-xr-x. 7 plank guest 4096 Feb 11 14:00 ..
drwxr-xr-x. 2 plank guest   15 Sep 23  1994 d1
drwxr-xr-x. 2 plank guest   15 Sep 23  1994 d2
drwxr-xr-x. 2 plank guest   15 Sep 23  1994 d3
UNIX> ls -la d1 d2 d3
d1:
total 4
drwxr-xr-x. 2 plank guest 15 Sep 23  1994 .
drwxr-xr-x. 5 plank guest 33 Sep 24  1996 ..
-rw-r--r--. 1 plank guest 14 Sep 23  1994 f1

d2:
total 4
drwxr-xr-x. 2 plank guest 15 Sep 23  1994 .
drwxr-xr-x. 5 plank guest 33 Sep 24  1996 ..
-rw-r--r--. 1 plank guest 40 Sep 23  1994 f2

d3:
total 4
drwxr-xr-x. 2 plank guest 15 Sep 23  1994 .
drwxr-xr-x. 5 plank guest 33 Sep 24  1996 ..
-rw-r--r--. 1 plank guest 42 Sep 23  1994 f3
UNIX> echo "33 4096 + 15 + 15 + 15 + 14 + 40 + 42 + p" | dc
4270
UNIX> ../prsize8
4270
UNIX> 
As an aside, find has the same bug as prsize7.c. Here it is in the test4 directory, with the descriptor limit still set to 10:
UNIX> limit | grep descriptors
descriptors  10 
UNIX> find . -print
.
./1
./1/2
./1/2/3
./1/2/3/4
find: './1/2/3/4': Too many open files
UNIX> 
On the flip side, tar handles it correctly:
UNIX> tar cvf ~/junk.tar .
./
./1/
./1/2/
./1/2/3/
./1/2/3/4/
./1/2/3/4/5/
./1/2/3/4/5/6/
./1/2/3/4/5/6/7/
./1/2/3/4/5/6/7/8/
./1/2/3/4/5/6/7/8/9/
UNIX> 
When I first wrote this lecture, in the mid-1990s, I had test4 have 257 subdirectories, rather than 10. That way, I didn't have to mess with the limit command. Within a day, I had an email from our system administrator, complaining that the directory broke the system backup program. It also broke tar. So, I changed the directory to its current structure. I like to think that the good folks who write system tools fixed tar because they stumbled upon my lecture notes. A man can dream, can't he?
Here's prsize8.c, commented for your enjoyment.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>
#include "jrb.h"
#include "dllist.h"

/* This procedure returns the number of bytes in files that are
   reachable from fn.  It does not double-count hard links to
   the same file, and it counts the size of soft links, not the 
   files to which they point. */

int get_size(char *fn, JRB inodes)
{
  DIR *d;
  struct dirent *de;
  struct stat buf;
  int exists;
  int total_size;
  char *s;
  Dllist directories, tmp;

  /* Open the directory for reading. */

  d = opendir(fn);
  if (d == NULL) {
    perror("prsize");
    exit(1);
  }
 
  /* We use s to store file names that are of the form 
     "directory/filename" -- the maximum length filename is
     256 bytes, so this makes sure that the buffer s is 
     big enough. */

  total_size = 0;
  directories = new_dllist();
  s = (char *) malloc(sizeof(char)*(strlen(fn)+258));

  /* Read each filename in the current directory.  
     Construct s as "directory/filename" and call lstat to
     get the inode information about the file. */

  for (de = readdir(d); de != NULL; de = readdir(d)) {
    sprintf(s, "%s/%s", fn, de->d_name);
    exists = lstat(s, &buf);
    if (exists < 0) {
      fprintf(stderr, "Couldn't stat %s\n", s);
      exit(1);

    /* Look up the inode in the inodes tree.  If it's there,
       you ignore it, because you've seen it before.  Otherwise,
       put it into the tree and process it. */

    } else {
      if (jrb_find_int(inodes, buf.st_ino) == NULL) {
        total_size += buf.st_size;
        jrb_insert_int(inodes, buf.st_ino, new_jval_i(0));
      }
    }

    /* If the file is a directory, and not . or .., then append
       it to the directories list so that you don't make recursive
       calls while the directory is opened. */

    if (S_ISDIR(buf.st_mode) && strcmp(de->d_name, ".") != 0 && 
        strcmp(de->d_name, "..") != 0) {
      dll_append(directories, new_jval_s(strdup(s)));
    }
  }

  /* Close the directory, and then make recursive calls to all of
     the directories.  You'll note, I free the directory name after
     the recursive call returns.  I do this to avoid having a memory
     leak due to the strdup() calls above. */

  closedir(d);
  dll_traverse(tmp, directories) {
     total_size += get_size(tmp->val.s, inodes);
     free(tmp->val.s);
  }
   
  /* Perform final free() calls again to avoid memory leaks. */

  free_dllist(directories);
  free(s);

  return total_size;
}

int main()
{
  JRB inodes;

  inodes = make_jrb();
  printf("%d\n", get_size(".", inodes));
  return 0;
}