CS360 Lecture notes -- Red-Black Trees (JRB)

  • Jim Plank
  • Directory: ~plank/plank/classes/cs360/360/www-home/notes/JRB
  • Lecture notes: http://www.cs.utk.edu/~plank/plank/classes/cs360/360/notes/JRB
  • Fri Aug 27 16:03:04 EDT 1999

    Compiling

    In order to use the red-black tree library, you should include the file "jrb.h", which can be found in /blugreen/homes/plank/cs360/include. Instead of including the full path name in your C file, just do:
    #include "jrb.h"
    
    and then compile the program with:
    gcc -I/blugreen/homes/plank/cs360/include
    
    When you link your object files to make an executable, follow the directions in the libfdr lecture notes.

    The makefile in this directory does both of these things for you.

    Red-Black Trees

    Rb-trees are data structures based on balanced binary trees. You don't need to know how they work -- just that they do work, and all operations are in O(log(n)) time, where n is the number of elements in the tree. (If you really want to know more about red-black trees, let me know and I can point you to some texts on them).

    The main struct for rb-trees is the JRB. Like dllists, all rb-trees have a header node. You create an rb-tree by calling make_jrb(), which returns a pointer to the header node of an empty rb-tree. This header points to the main body of the rb-tree, which you don't need to care about, and to the first and last external nodes of the tree. These external nodes are hooked together with flink and blink pointers, so that you can view rb-trees as being dllists with the property that they are sorted, and you can find any node in the tree in O(log(n)) time.

    Like dllists, each node in the tree has a val field, which is a Jval. Additionally, each node has a key field, which is also a Jval. The rb-tree makes sure that the keys are kept sorted. How they are sorted depends on the tree.


    _str, _int, _dbl, _gen

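    The jrb tree routines in jrb.h/jrb.c implement four types of insertion/searching routines. The insertion routines are:

      • jrb_insert_str(): keys are strings.
      • jrb_insert_int(): keys are integers.
      • jrb_insert_dbl(): keys are doubles.
      • jrb_insert_gen(): keys are arbitrary Jvals, ordered by a comparison function that you supply.

    You can't mix and match comparison functions within the same tree. In other words, you shouldn't insert some keys with jrb_insert_str() and some with jrb_insert_int(). To do so is begging for a core dump.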

    To find keys, you use one of jrb_find_str(), jrb_find_int(), jrb_find_dbl() or jrb_find_gen(). Obviously, if you inserted keys with jrb_insert_str(), then you should use jrb_find_str() to find them. If the key that you're looking for is not in the tree, then jrb_find_xxx() returns NULL.

    Finally, there are also: jrb_find_gte_str(), jrb_find_gte_int(), jrb_find_gte_dbl() and jrb_find_gte_gen(). Each returns the jrb tree node whose key is equal to the specified key or, if there is no such node, the node with the smallest key greater than the specified key. If the specified key is greater than every key in the tree, the routine returns a pointer to the sentinel node. Each routine also takes an argument, found, which is set to tell you whether the key was found or not.
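
    To make the return values of jrb_find_gte_xxx() concrete, here is a small sketch of the same "first key greater than or equal to" search done on a plain sorted array of ints rather than on a tree. It is not the library's code: find_gte_sorted() is a made-up name, and the index n (one past the end) stands in for the sentinel node.

```c
#include <stddef.h>

/* Return the index of the first element >= key in a sorted array.
   Returns n (one past the end) when every element is smaller than key;
   that index plays the role of the sentinel node in this sketch.
   *found is set to 1 on an exact match and to 0 otherwise. */
size_t find_gte_sorted(const int *keys, size_t n, int key, int *found)
{
  size_t lo = 0, hi = n;

  /* Binary search for the leftmost position whose element is >= key. */
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (keys[mid] < key) lo = mid + 1;
    else hi = mid;
  }

  *found = (lo < n && keys[lo] == key);
  return lo;
}
```

    With the keys 10, 20 and 30, searching for 20 gives index 1 with found set, searching for 25 gives index 2 (the key 30) with found cleared, and searching for 35 gives index 3, the stand-in for the sentinel.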


    You may use the macros jrb_first(), jrb_last(), jrb_prev() and jrb_next(), just like their counterparts in the dllist library.

    Example programs:

    A two-level tree example

    Suppose we want to sort lines of text by their atoi() value, but when two strings have the same atoi() value, to sort them lexicographically. One way to do this is to use a beefed up comparison function and then insert lines with jrb_insert_gen(), as in nsort2.c. Try it on input_n2.
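
    A beefed up comparison function of this kind might look like the sketch below. This is not the code in nsort2.c: it compares plain C strings and uses qsort() so that it compiles without libfdr, and the names line_compare() and sort_lines() are made up for illustration. A version for jrb_insert_gen() would take two Jvals instead, as golfercomp() does later in these notes.

```c
#include <stdlib.h>
#include <string.h>

/* Order two lines by atoi() value, breaking ties with strcmp(). */
int line_compare(const char *a, const char *b)
{
  int ia = atoi(a), ib = atoi(b);

  if (ia < ib) return -1;
  if (ia > ib) return 1;
  return strcmp(a, b);
}

/* qsort() adapter: each array element is a char *, so qsort()
   hands us pointers to those pointers. */
static int line_compare_qsort(const void *pa, const void *pb)
{
  return line_compare(*(const char * const *) pa,
                      *(const char * const *) pb);
}

/* Sort an array of n lines in place (only the pointers move). */
void sort_lines(char **lines, size_t n)
{
  qsort(lines, n, sizeof(char *), line_compare_qsort);
}
```

    The numeric test comes first, so "2 z" sorts before "10 a" even though strcmp() alone would put "10 a" first.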

    A second way to do this is to have a two-level tree. The first tree has integers as keys and is based on the atoi() value of each line. The val field of each node, however, is another red-black tree. This red-black tree contains each line whose atoi() value is equal to the key of the node, sorted lexicographically. Thus, when you read a line, you first see if its atoi() value is in the tree. If so, you get a pointer to the val field of that node. If not, you insert a new node into the tree whose key is the atoi() value, and whose val field is a new, empty red-black tree. Now, you have a pointer to the red-black tree in the val field of the node whose key is the atoi() value of the string. What you do now is insert the string into this second red-black tree using jrb_insert_str(). When you're done, you have a big two-level red-black tree. You traverse it by traversing the top level tree, and for each node in that tree, you traverse the tree in its val field and print out the strings. See the code. It is in nsort3.c.

    Another Example: ``Golf''

    Here's a typical example of using a red-black tree. Suppose we have a bunch of files with golf scores. Examples are in 1998_Majors and 1999_Majors. The format of these files is:
    Name     sunday-score F total-score
    
    For example, the first few lines of 1999_Majors/Masters are:
    Jose Maria Olazabal                 -1 F -8
    Davis Love III                      -1 F -6
    Greg Norman                         +1 F -5
    Bob Estes                           +0 F -4
    Steve Pate                          +1 F -4
    David Duval                         -2 F -3
    Phil Mickelson                      -1 F -3
    ...
    
    Note that the name can have any number of words.
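
    Since the name can be any number of words, one way to take a line apart is to split it into words and count back from the end: the last three words are the sunday score, the F and the total score, and everything before them is the name. The sketch below does this with strtok() from the standard C library. golf.c itself uses the fields library, and parse_score_line() and MAXFIELDS are names made up for this illustration.

```c
#include <stdio.h>
#include <string.h>

#define MAXFIELDS 100   /* arbitrary limit chosen for this sketch */

/* Split a score line into a golfer name and a total score.
   The last three whitespace-separated fields must be the sunday score,
   "F" and the total score; everything before them is the name.
   Returns 0 on success and -1 on a malformed line.
   Note that strtok() modifies the line. */
int parse_score_line(char *line, char *name, size_t namesize, int *total)
{
  char *fields[MAXFIELDS];
  char *tok;
  int nf = 0, i, sunday;

  if (namesize == 0) return -1;

  for (tok = strtok(line, " \t\n"); tok != NULL; tok = strtok(NULL, " \t\n")) {
    if (nf == MAXFIELDS) return -1;
    fields[nf++] = tok;
  }

  if (nf < 4 || strcmp(fields[nf-2], "F") != 0 ||
      sscanf(fields[nf-1], "%d", total) != 1 ||
      sscanf(fields[nf-3], "%d", &sunday) != 1) return -1;

  /* Join fields 0 .. nf-4 with single spaces, refusing to overflow name. */
  name[0] = '\0';
  for (i = 0; i < nf-3; i++) {
    if (strlen(name) + strlen(fields[i]) + 2 > namesize) return -1;
    if (i > 0) strcat(name, " ");
    strcat(name, fields[i]);
  }
  return 0;
}
```

    Counting from the end is what lets "Davis Love III" and "Jose Maria Olazabal" be handled by the same code as a two-word name.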

    Now, suppose that we want to do some data processing on these files. For example, suppose we'd like to sort the players so that we first print out the players that have played the most tournaments, and then within each group, we print the players with the lowest average scores first.

    This is what golf.c does. It takes score files on the command line, then reads in all the players and scores. Then it sorts them by number of tournaments/average score, and prints them out in that order, along with their score for each tournament. For example, look at score1:

    Jose Maria Olazabal                 -1 F -8
    Davis Love III                      -1 F -6
    Greg Norman                         +1 F -5
    
    and score2:
    Greg Norman                          +1  F +9
    David Frost                          +3  F +10
    Davis Love III                       -2  F +11
    
    The golf program reads in these two files, and ranks the four players by number of tournaments, and then average score:
    UNIX> golf score1 score2
    Greg Norman                              :   2 tournaments :    2.00
       -5 : score1
        9 : score2
    Davis Love III                           :   2 tournaments :    2.50
       -6 : score1
       11 : score2
    Jose Maria Olazabal                      :   1 tournament  :   -8.00
       -8 : score1
    David Frost                              :   1 tournament  :   10.00
       10 : score2
    

    Ok, now how does golf work? Well, it works in three phases. In the first phase, it reads the input files to create a struct for each golfer. The data structure for this is a red-black tree keyed on the golfer's name, and whose val fields are Golfer structs that have the following definition:

    typedef struct {
      char *name;
      int ntourn;
      int tscore;
      Dllist scores;
    } Golfer;
    
    The first three fields are obvious. The last field is a list of the golfer's scores. Each element of the list points to a Score struct with the following definition:
    typedef struct {
      char *tname;             /* File name */
      int score;               /* Total score */
    } Score;
    
    Note, in each file, we are going to ignore the ``sunday score.''

    So, to read in the golfers, we create the jrb tree golfers, and then read in each line of each input file. For each line, we construct the golfer's name, and then we look to see if the golfer has an entry in the golfers tree. If there is no such entry, then one is created. Once the entry is found/created, the score for that file is added. When all the files have been read, phase 1 is completed:

      Golfer *g;
      Score *s;
      JRB golfers, rnode;
      int i, fn;
      int tmp;
      IS is;
      char name[1000];
      Dllist dnode;
    
      golfers = make_jrb();
    
      for (fn = 1; fn < argc; fn++) {
        is = new_inputstruct(argv[fn]);
        if (is == NULL) { perror(argv[fn]); exit(1); }
    
        while(get_line(is) >= 0) {
    
          /* Error check each line */
    
          if (is->NF < 4 || strcmp(is->fields[is->NF-2], "F") != 0 ||
              sscanf(is->fields[is->NF-1], "%d", &tmp) != 1 ||
              sscanf(is->fields[is->NF-3], "%d", &tmp) != 1) {
            fprintf(stderr, "File %s, Line %d: Not the proper format\n",
              is->name, is->line);
            exit(1);
          }
          
          /* Construct the golfer's name */
          strcpy(name, is->fields[0]);
          for (i = 1; i < is->NF-3; i++) {
            strcat(name, " ");
            strcat(name, is->fields[i]);
          }
          
          /* Search for the name */
    
          rnode = jrb_find_str(golfers, name);
    
          /* Create an entry if none exists. */
    
          if (rnode == NULL) {
            g = (Golfer *) malloc(sizeof(Golfer));
            g->name = strdup(name);
            g->ntourn = 0;
            g->tscore = 0;
            g->scores = new_dllist();
            jrb_insert_str(golfers, g->name, new_jval_v(g));
          } else {
            g = (Golfer *) rnode->val.v;
          }
    
          /* Add the information to the golfer's struct */
    
          s = (Score *) malloc(sizeof(Score));
          s->tname = argv[fn];
          s->score = atoi(is->fields[is->NF-1]);
          g->ntourn++;
          g->tscore += s->score;
          dll_append(g->scores, new_jval_v(s));
        }
    
        /* Go on to the next file */
    
        jettison_inputstruct(is);
      }
    
    
    Now, this gives us all the information on the golfers, but they are sorted by the golfers' names, not by number of tournaments / average score. Thus, in phase 2, we construct a second red-black tree which will sort the golfers correctly. To do this, we need to construct our own comparison function that compares golfers by number of tournaments / average score. Here is the comparison function:
    int golfercomp(Jval j1, Jval j2)
    {
      Golfer *g1, *g2;
    
      g1 = (Golfer *) j1.v;
      g2 = (Golfer *) j2.v;
    
      if (g1->ntourn > g2->ntourn) return 1;
      if (g1->ntourn < g2->ntourn) return -1;
      if (g1->tscore < g2->tscore) return 1;
      if (g1->tscore > g2->tscore) return -1;
      return 0;
    }
    
    And here is the part of main where the second red-black tree is built:
    
      sorted_golfers = make_jrb();
    
      jrb_traverse(rnode, golfers) {
        jrb_insert_gen(sorted_golfers, rnode->val, JNULL, golfercomp);
      }
    
    
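    Note that the key you pass to jrb_insert_gen() is a Jval. Here the key is rnode->val, the Jval that holds the pointer to the Golfer struct, and the val is JNULL, since the key already carries everything we need.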
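
    Although golfercomp() compares total scores rather than averages, the result is the same: it only looks at tscore when the two golfers have played the same number of tournaments, and dividing both totals by the same positive count does not change their order. The sketch below shows the same ordering applied to a plain array with qsort(), so that it runs without libfdr. GolferRank, rank_compare() and rank_golfers() are names made up for this illustration, and the adapter swaps its arguments to get best-first order, which golf.c gets by traversing the tree in reverse.

```c
#include <stdlib.h>

/* Cut-down stand-in for the Golfer struct: only the fields used for ranking. */
typedef struct {
  const char *name;
  int ntourn;
  int tscore;
} GolferRank;

/* Same ordering as golfercomp(): more tournaments is "greater"; among
   equal tournament counts, the lower total score is "greater". */
int rank_compare(const GolferRank *g1, const GolferRank *g2)
{
  if (g1->ntourn > g2->ntourn) return 1;
  if (g1->ntourn < g2->ntourn) return -1;
  if (g1->tscore < g2->tscore) return 1;
  if (g1->tscore > g2->tscore) return -1;
  return 0;
}

/* qsort() adapter with the arguments swapped, so the "greatest"
   golfer ends up first in the array. */
static int rank_compare_qsort(const void *a, const void *b)
{
  return rank_compare((const GolferRank *) b, (const GolferRank *) a);
}

/* Sort n golfers in place, best first. */
void rank_golfers(GolferRank *g, size_t n)
{
  qsort(g, n, sizeof(GolferRank), rank_compare_qsort);
}
```

    Applied to the four golfers from score1 and score2, this puts Greg Norman (2 tournaments, total 4) ahead of Davis Love III (2 tournaments, total 5), followed by the two golfers with one tournament each.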

    Finally, the third phase is to traverse the sorted_golfers tree, printing out the correct information for each golfer. This is straightforward, and done below:

      jrb_rtraverse(rnode, sorted_golfers) {
        g = (Golfer *) rnode->key.v;
        printf("%-40s : %3d tournament%1s : %7.2f\n", g->name, g->ntourn,
               (g->ntourn == 1) ? "" : "s", 
               (float) g->tscore / (float) g->ntourn);
        dll_traverse(dnode, g->scores) {
          s = (Score *) dnode->val.v;
          printf("  %3d : %s\n", s->score, s->tname);
        }
      }
    
    Try it out. You'll see that Tiger Woods had the best average score of the players who played in all four majors this year:
    UNIX> golf 1999_Majors/*
    Tiger Woods                              :   4 tournaments :    0.25
       10 : 1999_Majors/British_Open
        1 : 1999_Majors/Masters
      -11 : 1999_Majors/PGA_Champ
        1 : 1999_Majors/US_Open
    Colin Montgomerie                        :   4 tournaments :    3.75
       12 : 1999_Majors/British_Open
       -1 : 1999_Majors/Masters
       -6 : 1999_Majors/PGA_Champ
       10 : 1999_Majors/US_Open
    Davis Love III                           :   4 tournaments :    4.50
       10 : 1999_Majors/British_Open
       -6 : 1999_Majors/Masters
        5 : 1999_Majors/PGA_Champ
        9 : 1999_Majors/US_Open
    Jim Furyk                                :   4 tournaments :    4.50
       11 : 1999_Majors/British_Open
        0 : 1999_Majors/Masters
       -4 : 1999_Majors/PGA_Champ
       11 : 1999_Majors/US_Open
    Nick Price                               :   4 tournaments :    4.75
       17 : 1999_Majors/British_Open
       -3 : 1999_Majors/Masters
       -7 : 1999_Majors/PGA_Champ
       12 : 1999_Majors/US_Open
    ...