CS302 Lecture Notes - Data Structures That Point To Each Other & Traversal


This is a slightly contrived example. You have every DVD starring Kurt Russell from 1980 to 2008, and you have downloaded information about each from allmovie.com. Specifically, for each movie, you have a file, named after the movie's title, that contains the year of the movie followed by all of the actors/actresses and their roles. An example is in Big-Trouble-In-Little-China.txt:

1986
Kurt Russell  - Jack Burton
Kim Cattrall  - Gracie Law
Dennis Dun  - Wang Chi
James Hong  - Lo Pan
Victor Wong  - Egg Shen
...

All the files are in this directory (I want it to be known that this is just a contrived example. The only Kurt Russell DVD I own is Big-Trouble-In-Little-China, one of the best movies of all time, BTW):

UNIX> ls *.txt
3000-Miles-to-Graceland.txt     Miracle.txt
Amber-Waves.txt                 Overboard.txt
Backdraft.txt                   Poseidon.txt
Best-Of-Times.txt               Silkwood.txt
Big-Trouble-In-Little-China.txt Sky-High.txt
Breakdown.txt                   Soldier.txt
Captain-Ron.txt                 Stargate.txt
Dark-Blue.txt                   Swing-Shift.txt
Death-Proof.txt                 Tango-And-Cash.txt
Dreamer.txt                     Tequila-Sunrise.txt
Escape-From-L.A..txt            The-Thing.txt
Escape-From-New-York.txt        Tombstone.txt
Executive-Decision.txt          Unlawful-Entry.txt
Interstate.txt                  Used-Cars.txt
Jiminy-Glick-in-La-La-Wood.txt  Vanilla-Sky.txt
Mean-Season.txt                 Winter-People.txt
UNIX> 
For each movie, there are three pieces of information -- the title, which is in the filename, the year, which is in the first line of the file, and the actors/roles, which are in the remaining lines. We are going to write a program that reads all of this information into three sets of data structures: one for the movies, one for the actors, and one for the years.

We are then going to print out each set of data structures. The point of this exercise is to get further practice with maps, and also to see how to get data structures to point to each other.

We'll build the program incrementally. Our first pass is in kurtproc1.cpp. This defines all of the data structures that we will use:

#include <iostream>
#include <map>
#include <fstream>
#include <string>
#include <cstdlib>
using namespace std;

class Actor {
  public:
    string name;
    map <string, class Movie *> movies;
    map <int, class Year *> years;
};

class Movie {
  public:
    string name;
    int year;
    map <string, Actor *> actors;
};

class Year {
  public:
    int year;
    map <string, Movie *> movies;
    map <string, Actor *> actors;
};

These are straightforward, but there is a quirk. The Actor class has a map of pointers to Movie, but we haven't defined Movie yet. This is a forward reference. To deal with it, we say "class Movie *" instead of "Movie *". This tells the compiler that Movie is a class that we will define later. We do the same with the Year class. There's no way to avoid a forward reference of some kind, because each data structure points to the others.
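
As a minimal sketch of the same idea, the fragment below uses a separate forward declaration ("class Film;") instead of writing "class Film *" inline. The names Performer, Film and cross_link() are hypothetical stand-ins chosen for this illustration; they are not part of kurtproc1.cpp.

```cpp
#include <map>
#include <string>
#include <utility>
using namespace std;

class Film;                        // Forward declaration: Film is a class, defined later.

class Performer {
  public:
    string name;
    map <string, Film *> films;    // Fine: only a pointer to the incomplete type is stored.
};

class Film {
  public:
    string title;
    map <string, Performer *> cast;
};

/* Make a performer and a film point to each other. */
void cross_link(Performer *p, Film *f)
{
  p->films.insert(make_pair(f->title, f));
  f->cast.insert(make_pair(p->name, p));
}
```

Either spelling works because a map only needs to know the size of a pointer, not the layout of the class it points to.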

Note also that all of the maps contain pointers to the various classes rather than instances of the classes themselves. Why? Because if we didn't use pointers, each map would hold its own copy of each class instance, and the maps would not refer to the same things. By using a pointer, if a map contains a pointer to the actor Kurt Russell, that is the same pointer that all of the other maps use.
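
The difference between copies and pointers can be seen in a small sketch. Person and the two functions below are hypothetical names made up for this illustration; each function stores one person in two maps, changes it through the first map, and reports what the second map sees.

```cpp
#include <map>
#include <string>
#include <utility>
using namespace std;

/* Hypothetical stand-in for Actor, reduced to a name and a counter. */
struct Person {
  string name;
  int credits;
};

/* Maps of values: each insert stores its own copy of p. */
int credits_seen_through_copies()
{
  map <string, Person> by_name, by_movie;
  Person p;

  p.name = "Kurt Russell";
  p.credits = 0;
  by_name.insert(make_pair(p.name, p));
  by_movie.insert(make_pair(string("Overboard"), p));

  by_name["Kurt Russell"].credits++;       // Changes only the copy held by by_name.
  return by_movie["Overboard"].credits;    // The copy held by by_movie is untouched.
}

/* Maps of pointers: both maps refer to the one Person made with new. */
int credits_seen_through_pointers()
{
  map <string, Person *> by_name, by_movie;
  Person *p = new Person;
  int seen;

  p->name = "Kurt Russell";
  p->credits = 0;
  by_name.insert(make_pair(p->name, p));
  by_movie.insert(make_pair(string("Overboard"), p));

  by_name["Kurt Russell"]->credits++;      // Changes the single shared Person.
  seen = by_movie["Overboard"]->credits;   // Visible through the other map too.
  delete p;
  return seen;
}
```

The first function returns 0 and the second returns 1, which is exactly the behavior the paragraph above describes.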

Our first pass doesn't deal with the data structures -- it just deals with constructing the movie's name from its filename:

int main(int argc, char **argv)
{
  ifstream fin;
  int i, j;
  string s;

  for (i = 1; i < argc; i++) {
    fin.open(argv[i]);
    if (fin.fail()) {
      cerr << "Problem opening " << argv[i] << endl;
      exit(1);
    }
    s = argv[i];
    j = s.find(".txt");
    if (j == string::npos) {
      cerr << "File does not have a .txt extension: " << s << endl;
      exit(1);
    }
    s.resize(j);
    for (j = 0; j < s.length(); j++) {
      if (s[j] == '-') s[j] = ' ';
    }

    cout << s << endl;

    fin.close();
    fin.clear();
  }
  exit(0);
}

For each file, we strip off the ".txt" suffix and then substitute a space for each hyphen. To strip the suffix, we use the find() method of the string class to get the index of the ".txt" substring. Then we resize the string, which cuts off the ".txt". After converting the hyphens, we print the result.
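
The same find()/resize() steps can be packaged as a function, which makes them easy to test on their own. This is only a sketch: title_from_filename() is a hypothetical helper, not something in kurtproc1.cpp, and it returns an empty string on error where the program prints a message and exits. It also stores the result of find() in a string::size_type rather than an int, which avoids comparing a signed value with string::npos.

```cpp
#include <string>
using namespace std;

/* Turn "Captain-Ron.txt" into "Captain Ron".
   Returns an empty string if the name does not contain ".txt". */
string title_from_filename(const string &filename)
{
  string s = filename;
  string::size_type pos = s.find(".txt");
  string::size_type i;

  if (pos == string::npos) return "";
  s.resize(pos);                      // Cut off ".txt" and anything after it.
  for (i = 0; i < s.size(); i++) {
    if (s[i] == '-') s[i] = ' ';      // Hyphens in the filename stand for spaces.
  }
  return s;
}
```

For example, title_from_filename("Escape-From-L.A..txt") yields "Escape From L.A.", because find() locates the first occurrence of the full four-character substring ".txt".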

We also open and close each file. One quirk of many C++ implementations is that we have to call fin.clear() after closing the file. We have to do this because some C++ implementations don't reset the stream's error state (the state that fin.fail() reports) when they close a file, and if you don't call fin.clear(), the subsequent fin.open() call will fail. Remember this, because it will come up from time to time, and it's really irritating.
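
Here is a hedged sketch of reusing one ifstream for several files. The function count_lines() is a hypothetical example, not part of the kurtproc programs. Standard libraries that follow C++11 or later are specified to clear the flags themselves when open() succeeds, so on those the extra clear() is merely harmless; on older implementations it is what makes the next open() work.

```cpp
#include <fstream>
#include <string>
using namespace std;

/* Count the lines of one file using a caller-supplied stream.
   Returns -1 if the file cannot be opened. */
int count_lines(ifstream &fin, const string &path)
{
  string line;
  int n = 0;

  fin.open(path.c_str());
  if (fin.fail()) {
    fin.clear();                    // Reset the flags so the stream can be reused.
    return -1;
  }
  while (getline(fin, line)) n++;   // Hitting end-of-file sets eofbit and failbit.
  fin.close();                      // close() does not reset those flags.
  fin.clear();                      // So we reset them before the next open().
  return n;
}
```

Calling count_lines(fin, ...) repeatedly with the same fin then behaves the same on old and new compilers.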

In class, I showed how you can double-check the output to make sure it is correct. I did the following sequence of operations:

UNIX> kurtproc1 *.txt | head
3000 Miles to Graceland
Amber Waves
Backdraft
Best Of Times
Big Trouble In Little China
Breakdown
Captain Ron
Dark Blue
Death Proof
Dreamer
UNIX> ls *.txt | sed 's/-/ /g' | sed 's/.txt//' | head
3000 Miles to Graceland
Amber Waves
Backdraft
Best Of Times
Big Trouble In Little China
Breakdown
Captain Ron
Dark Blue
Death Proof
Dreamer
UNIX> kurtproc1 *.txt > output1
UNIX> ls *.txt | sed 's/-/ /g' | sed 's/.txt//' > output2
UNIX> diff output1 output2
UNIX> 
I'll leave it to you to read the sed man page - I am using the command to convert all hyphens to spaces and strip out the ".txt" suffix. I then compare the output of that to the output of kurtproc1. They are the same, so I am satisfied.

kurtproc2.cpp - Reading the movie files

For the next step, we'll read the movie file, extracting the year and the actor lines. Again, we'll test this before messing with the data structures. Here's the new code (kurtproc2.cpp) in the variable declarations:

  int year;

And in the movie file reading:

    fin >> year;
    if (fin.fail()) {
      cerr << "The first line of " << s << " should be the year\n";
      exit(1);
    }
    cout << "Movie: " << s << ". Year: " << year << endl;
    while (!fin.fail()) {
      getline(fin, s);
      if (!fin.fail()) {
        j = s.find(" - ");
        if (j == string::npos) {
          cerr << "Actor specifications should be actor name '-' role name\n";
          cerr << "S: " << s << endl;
          exit(1);
        }
        while (s[j] == ' ') j--;
        s.resize(j+1);
        cout << "Actor: " << s << endl;
      }
    }

Our first test reveals a bug:

UNIX> kurtproc2 *.txt
Movie: 3000 Miles to Graceland. Year: 2001
Actor specifications should be actor name '-' role name
S: 
UNIX> 
It appears that s is an empty string. Why? Because "fin >> year" stops reading right after the digits of the year, leaving the stream positioned just before the newline that ends the first line. The first getline() call therefore reads the empty remainder of that line. We fix this in kurtproc3.cpp, which adds a getline() call to discard the rest of the first line:

   ...
    fin >> year;
    if (fin.fail()) {
      cerr << "The first line of " << s << " should be the year\n";
      exit(1);
    }
    cout << "Movie: " << s << ". Year: " << year << endl;
    getline(fin, s);    // Just one new line of code.
   ...
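
The interaction between >> and getline() can be reproduced without any files by reading from a string. The sketch below is illustrative only; first_line_after_year() is a hypothetical helper, and it uses an istringstream in place of the ifstream that the kurtproc programs use.

```cpp
#include <sstream>
#include <string>
using namespace std;

/* Read an integer with >>, then return the next line that getline() produces.
   If skip_rest is true, one extra getline() first discards the remainder of
   the integer's line, as kurtproc3.cpp does. */
string first_line_after_year(const string &contents, bool skip_rest)
{
  istringstream in(contents);
  int year;
  string s;

  in >> year;                      // Stops just before the '\n' that ends line one.
  if (skip_rest) getline(in, s);   // Consumes the leftover (empty) rest of line one.
  getline(in, s);
  return s;
}
```

With the contents "1986\nKurt Russell  - Jack Burton\n", the function returns an empty string when skip_rest is false and the Kurt Russell line when it is true.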

Now it works as it should. Comparing the program's output with the raw contents of the files shows that the actor names match:

UNIX> kurtproc3 *.txt | head -n 20
Movie: 3000 Miles to Graceland. Year: 2001
Actor: Kurt Russell
Actor: Kevin Costner
Actor: Courteney Cox Arquette
Actor: Christian Slater
Actor: Kevin Pollak
Actor: David Arquette
Actor: Jon Lovitz
Actor: Howie Long
Actor: Thomas Haden Church
Actor: Bokeem Woodbine
Actor: Ice-T
Actor: David Kaye
Actor: Louis Lombardi
Actor: Shawn Michael Howard
Actor: Peter Kent
Actor: Robert "Bobby Z" Zajonc
Movie: Amber Waves. Year: 1982
Actor: Fran Brill
Actor: Wilford Brimley
UNIX>
UNIX> cat *.txt | head -n 20
2001
Kurt Russell  - Michael Zane
Kevin Costner  - Thomas Murphy
Courteney Cox Arquette  - Cybil Waingrow
Christian Slater  - Joseph Hanson
Kevin Pollak  - Federal Marshal Damitry
David Arquette  - Gus Watson
Jon Lovitz  - Jay Peterson
Howie Long  - Jack
Thomas Haden Church  - Federal Marshal Quigley
Bokeem Woodbine  - Benjamin Franklin
Ice-T  - Hamilton
David Kaye  - Jesse Waingrow
Louis Lombardi  - Otto Sinclair
Shawn Michael Howard  - Roller Elvis
Peter Kent  - SWAT Leader
Robert "Bobby Z" Zajonc  - Helicoptor Pilot 
1982
Fran Brill  - Suze Winter
Wilford Brimley  - Pete Alberts
UNIX> 
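
The name-extraction step can also be written as a small function and tested in isolation. This is a sketch under stated assumptions: actor_from_line() is a hypothetical helper, it returns an empty string where the program prints an error and exits, and it adds a "pos > 0" guard (not in the original loop) so that a line beginning with " - " cannot walk off the front of the string.

```cpp
#include <string>
using namespace std;

/* Return the actor's name from a line such as "Dennis Dun  - Wang Chi".
   Returns an empty string if the line has no " - " separator. */
string actor_from_line(const string &line)
{
  string::size_type pos = line.find(" - ");

  if (pos == string::npos) return "";
  /* Back up over any extra spaces that precede the separator. */
  while (pos > 0 && line[pos-1] == ' ') pos--;
  return line.substr(0, pos);
}
```

Note that a hyphen inside a name, as in "Ice-T", is not mistaken for the separator, because the search is for a hyphen with a space on each side.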


kurtproc4.cpp - Making instances of Movie classes

Our next change creates an instance of the Movie class once for every movie file. We'll keep these in a map keyed on the movie's name, and we'll print out the map after we're done reading the files. We won't bother inserting the actors into the movie yet -- we'll just initialize the movie's name and year fields. The code is in kurtproc4.cpp. Here are the new variable declarations:

int main(int argc, char **argv)
{
  ifstream fin;
  int i, j;
  string s;
  int year;
  Movie *m;
  map <string, Movie *> movies;
  map <string, Movie *>::iterator mit;

Here is the code that creates the instance of the movie class and inserts it into the new map:

    m = new Movie;
    m->name = s;

    fin >> year;
    if (fin.fail()) {
      cerr << "The first line of " << s << " should be the year\n";
      exit(1);
    }
    getline(fin, s);

    m->year = year;
    movies.insert(make_pair(m->name, m));

And here is the code that prints out the movies after reading everything:

  for (mit = movies.begin(); mit != movies.end(); mit++) {
    m = mit->second;
    cout << "Movie: " << m->name << ". Year: "<< m->year << ".\n";
  }

When we run it, all looks good:

UNIX> kurtproc4 *.txt | head 
Movie: 3000 Miles to Graceland. Year: 2001.
Movie: Amber Waves. Year: 1982.
Movie: Backdraft. Year: 1991.
Movie: Best Of Times. Year: 1986.
Movie: Big Trouble In Little China. Year: 1986.
Movie: Breakdown. Year: 1997.
Movie: Captain Ron. Year: 1992.
Movie: Dark Blue. Year: 2003.
Movie: Death Proof. Year: 2007.
Movie: Dreamer. Year: 2005.
UNIX> 

kurtproc5.cpp - Making actors and putting them into their proper places

Our next change is to create instances of the Actor class. We want to do this once for each actor, so when we read an actor from the file, the first thing we must do is test to see if we've read that actor before. To do that, we'll maintain an actors map. We only call new and create an actor if a newly read actor is not in the actors map.

Once we have an Actor * for the actor, we will insert the actor into the movie's actor map, and we will insert the movie into the actor's movie map.

The changes are in kurtproc5.cpp. Here are the new variables:

  Actor *a;
  map <string, Actor *> actors;
  map <string, Actor *>::iterator ait;

Here is the new code where we read actors:

    /* Read the actors */

    while (!fin.fail()) {
      getline(fin, s);
      if (!fin.fail()) {
        j = s.find(" - ");
        if (j == string::npos) {
          cerr << "Actor specifications should be actor name '-' role name\n";
          cerr << "S: " << s << endl;
          exit(1);
        }
        while (s[j] == ' ') j--;
        s.resize(j+1);

        /* Check the actors map to see if the actor exists already.  
           If not, create a new actor and put it into the map. */
        
        ait = actors.find(s);
        if (ait == actors.end()) {
          a = new Actor;
          a->name = s;
          actors.insert(make_pair(a->name, a));
        } else {
          a = ait->second;
        }
  
        /* Put movies & actors into each others' maps. */
  
        m->actors.insert(make_pair(a->name, a));
        a->movies.insert(make_pair(m->name, m)); 
      }
    }
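
The "look it up, and create it only if it is missing" step is a pattern that comes up often, so here it is pulled out into a function. This is a sketch, not the code in kurtproc5.cpp: find_or_create() is a hypothetical helper, and the Actor class is cut down to just a name so that the example stands alone.

```cpp
#include <map>
#include <string>
#include <utility>
using namespace std;

/* Reduced version of the Actor class, for illustration only. */
class Actor {
  public:
    string name;
};

/* Return the Actor with the given name, creating it and adding it to
   the map only if it is not there already. */
Actor *find_or_create(map <string, Actor *> &actors, const string &name)
{
  map <string, Actor *>::iterator ait = actors.find(name);
  Actor *a;

  if (ait != actors.end()) return ait->second;   // Seen before: reuse the same pointer.

  a = new Actor;
  a->name = name;
  actors.insert(make_pair(name, a));
  return a;
}
```

Calling it twice with "Kurt Russell" returns the same pointer both times, which is what lets every movie that he appears in refer to a single Actor instance.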

And here's the code at the end that prints out each movie, followed by its actors sorted by name:

  /* Print out the movies */

  for (mit = movies.begin(); mit != movies.end(); mit++) {
    m = mit->second;
    cout << "Movie: " << m->name << ". Year: "<< m->year << ".\n";
    for (ait = m->actors.begin(); ait != m->actors.end(); ait++) {
      a = ait->second;
      cout << "  Actor: " << a->name << endl;
    }
    cout << endl;
  }

A quick scan of some output looks good:

UNIX>  kurtproc5 *.txt | head
Movie: 3000 Miles to Graceland. Year: 2001.
  Actor: Bokeem Woodbine
  Actor: Christian Slater
  Actor: Courteney Cox Arquette
  Actor: David Arquette
  Actor: David Kaye
  Actor: Howie Long
  Actor: Ice-T
  Actor: Jon Lovitz
  Actor: Kevin Costner
UNIX> kurtproc5 *.txt | sed -n '/Vanilla Sky/,/^$/p'
Movie: Vanilla Sky. Year: 2001.
  Actor: Alicia Witt
  Actor: Cameron Diaz
  Actor: Jason Lee
  Actor: Johnny Galecki
  Actor: Kurt Russell
  Actor: Michael Shannon
  Actor: Noah Taylor
  Actor: Penelope Cruz
  Actor: Tilda Swinton
  Actor: Timothy Spall
  Actor: Tom Cruise

UNIX> 
UNIX> sort 3000-Miles-to-Graceland.txt | head
2001
Bokeem Woodbine  - Benjamin Franklin
Christian Slater  - Joseph Hanson
Courteney Cox Arquette  - Cybil Waingrow
David Arquette  - Gus Watson
David Kaye  - Jesse Waingrow
Howie Long  - Jack
Ice-T  - Hamilton
Jon Lovitz  - Jay Peterson
Kevin Costner  - Thomas Murphy
UNIX> sort Vanilla-Sky.txt 
2001
Alicia Witt  - Libby
Cameron Diaz  - Julie Gianni
Jason Lee  - Brian Shelby
Johnny Galecki  - Peter Brown
Kurt Russell  - McCabe
Michael Shannon  - Aaron 
Noah Taylor  - Edmund Ventura
Penelope Cruz  - Sofia Serrano
Tilda Swinton  - Rebecca Dearborn
Timothy Spall  - Thomas Tipp
Tom Cruise  - David Aames
UNIX> 


kurtproc6.cpp - Printing the map of actors at the end

Just to make sure that we have our actors right, kurtproc6.cpp comments out the printing of the movies, and instead prints out each actor, plus the number of movies in which the actor appears. Here's that new code:

  /* Print out the actors */

  for (ait = actors.begin(); ait != actors.end(); ait++) {
    a = ait->second;
    cout << "Actor: " << a->name << ". # Movies: " << a->movies.size() << ".\n";
  }

Again, we can check some of the output to make sure that it is ok:

UNIX> kurtproc6 *.txt | head
Actor: A.J. Langer. # Movies: 1.
Actor: Adam Tomei. # Movies: 1.
Actor: Adrienne Barbeau. # Movies: 1.
Actor: Al Cerullo. # Movies: 1.
Actor: Al Leong. # Movies: 1.
Actor: Al Lewis. # Movies: 1.
Actor: Alan Davidson. # Movies: 1.
Actor: Alan Toy. # Movies: 1.
Actor: Alana Stewart. # Movies: 1.
Actor: Alexis Cruz. # Movies: 1.
UNIX> kurtproc6 *.txt | grep 'Kurt Russell'
Actor: Kurt Russell. # Movies: 32.
UNIX> kurtproc6 *.txt | wc
     911    5573   31877
UNIX> kurtproc6 *.txt | awk '{ n += $NF; print n}' | tail -n 1
977
UNIX>
UNIX> cat *.txt | sed -n 's/  *- .*//p' | sort | head
A.J. Langer
Adam Tomei
Adrienne Barbeau
Al Cerullo
Al Leong
Al Lewis
Alan Davidson
Alan Toy
Alana Stewart
Alexis Cruz
UNIX> grep 'Kurt Russell' *.txt | wc 
      32     160    1501
UNIX> cat *.txt | sed -n 's/  *- .*//p' | sort -u | wc
     911    1929   12745
UNIX> cat *.txt | grep ' - ' | wc
     977    4919   29219
UNIX>
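
Besides comparing counts with shell tools, we could check the pointers themselves by traversing them. The sketch below is an assumption-laden illustration, not part of kurtproc6.cpp: links_are_symmetric() is a hypothetical function, and the classes are reduced to the fields that it needs. It follows every actor to each of that actor's movies, and verifies that the movie points back to the very same Actor.

```cpp
#include <map>
#include <string>
using namespace std;

/* Reduced versions of the classes, for illustration only. */
class Actor {
  public:
    string name;
    map <string, class Movie *> movies;
};

class Movie {
  public:
    string name;
    map <string, Actor *> actors;
};

/* Returns true if, for every actor a and every movie m in a->movies,
   m->actors maps a's name back to the same Actor instance. */
bool links_are_symmetric(const map <string, Actor *> &actors)
{
  map <string, Actor *>::const_iterator ait, back;
  map <string, Movie *>::const_iterator mit;

  for (ait = actors.begin(); ait != actors.end(); ait++) {
    const Actor *a = ait->second;
    for (mit = a->movies.begin(); mit != a->movies.end(); mit++) {
      const Movie *m = mit->second;
      back = m->actors.find(a->name);
      if (back == m->actors.end() || back->second != a) return false;
    }
  }
  return true;
}
```

A check like this would catch a program that inserted the movie into the actor's map but forgot the insert in the other direction.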


kurtproc7.cpp - Adding the year instances

The final program, kurtproc7.cpp, adds instances of the Year class, creating a new one whenever we read a movie from a year that we haven't seen yet. The code should be familiar by this point: each year maintains maps of its movies and its actors, and each actor maintains a map of the years in which he or she appeared. Here is the final code:

#include <iostream>
#include <map>
#include <fstream>
#include <string>
#include <cstdlib>
using namespace std;

class Actor {
  public:
    string name;
    map <string, class Movie *> movies;
    map <int, class Year *> years;
};

class Movie {
  public:
    string name;
    int year;
    map <string, Actor *> actors;
};

class Year {
  public:
    int year;
    map <string, Movie *> movies;
    map <string, Actor *> actors;
};

int main(int argc, char **argv)
{
  ifstream fin;
  int i, j;
  string s;
  int year;
  Movie *m;
  map <string, Movie *> movies;
  map <string, Movie *>::iterator mit;
  Actor *a;
  map <string, Actor *> actors;
  map <string, Actor *>::iterator ait;
  Year *y;
  map <int, Year *> years;
  map <int, Year *>::iterator yit;

  for (i = 1; i < argc; i++) {

    /* Open the movie file */
    fin.open(argv[i]);
    if (fin.fail()) {
      cerr << "Problem opening " << argv[i] << endl;
      exit(1);
    }

    /* Construct the movie's name from the file name */

    s = argv[i];
    j = s.find(".txt");
    if (j == string::npos) {
      cerr << "File does not have a .txt extension: " << s << endl;
      exit(1);
    }
    s.resize(j);
    for (j = 0; j < s.length(); j++) {
      if (s[j] == '-') s[j] = ' ';
    }

    /* Create the movie instance, read the year and insert the movie */

    m = new Movie;
    m->name = s;

    fin >> year;
    if (fin.fail()) {
      cerr << "The first line of " << s << " should be the year\n";
      exit(1);
    }
    getline(fin, s);

    m->year = year;
    movies.insert(make_pair(m->name, m));

    /* Find/Create the year and add the movie to the year map */

    yit = years.find(year);
    if (yit == years.end()) {
      y = new Year;
      y->year = year;
      years.insert(make_pair(year, y));
    } else {
      y = yit->second;
    }
    y->movies.insert(make_pair(m->name, m));

    /* Read the actors */

    while (!fin.fail()) {
      getline(fin, s);
      if (!fin.fail()) {
        j = s.find(" - ");
        if (j == string::npos) {
          cerr << "Actor specifications should be actor name '-' role name\n";
          cerr << "S: " << s << endl;
          exit(1);
        }
        while (s[j] == ' ') j--;
        s.resize(j+1);

        /* Check the actors map to see if the actor exists already.  
           If not, create a new actor and put it into the map. */
        
        ait = actors.find(s);
        if (ait == actors.end()) {
          a = new Actor;
          a->name = s;
          actors.insert(make_pair(a->name, a));
        } else {
          a = ait->second;
        }
  
        /* Put movies & actors into each others' maps. */
  
        m->actors.insert(make_pair(a->name, a));
        a->movies.insert(make_pair(m->name, m)); 

        /* Put years & actors into each others' maps. 
           Duplicates will be ignored, but that's ok */
  
        y->actors.insert(make_pair(a->name, a));
        a->years.insert(make_pair(year, y));   

      }
    }

    fin.close();
    fin.clear();
  }

  /* Print out the movies */

/*  for (mit = movies.begin(); mit != movies.end(); mit++) {
    m = mit->second;
    cout << "Movie: " << m->name << ". Year: "<< m->year << ".\n";
    for (ait = m->actors.begin(); ait != m->actors.end(); ait++) {
      a = ait->second;
      cout << "  Actor: " << a->name << endl;
    }
    cout << endl;
  }
 */

  /* Print out the actors */

  for (ait = actors.begin(); ait != actors.end(); ait++) {
    a = ait->second;
    cout << "Actor: " << a->name << ". # Movies: " << a->movies.size() << ". Years:";
    for (yit = a->years.begin(); yit != a->years.end(); yit++) {
      cout << " " << yit->first;
    }
    cout << ".\n";
  }

  /* Print out years: */

  for (yit = years.begin(); yit != years.end(); yit++) {
    y = yit->second;
    cout << "Year: " << y->year << ".  Actors: " << y->actors.size() << ".\n";
    for (mit = y->movies.begin(); mit != y->movies.end(); mit++) {
      m = mit->second;
      cout << "  Movie: " << m->name << ".\n";
    }
  }
}

We can sanity check this a little:

UNIX> kurtproc7 *.txt | head
Actor: A.J. Langer. # Movies: 1. Years: 1996.
Actor: Adam Tomei. # Movies: 1. Years: 2005.
Actor: Adrienne Barbeau. # Movies: 1. Years: 1981.
Actor: Al Cerullo. # Movies: 1. Years: 1981.
Actor: Al Leong. # Movies: 1. Years: 1986.
Actor: Al Lewis. # Movies: 1. Years: 1980.
Actor: Alan Davidson. # Movies: 1. Years: 2003.
Actor: Alan Toy. # Movies: 1. Years: 1984.
Actor: Alana Stewart. # Movies: 1. Years: 1984.
Actor: Alexis Cruz. # Movies: 1. Years: 1994.
UNIX> kurtproc7 *.txt | tail
Year: 2004.  Actors: 45.
  Movie: Jiminy Glick in La La Wood.
  Movie: Miracle.
Year: 2005.  Actors: 47.
  Movie: Dreamer.
  Movie: Sky High.
Year: 2006.  Actors: 29.
  Movie: Poseidon.
Year: 2007.  Actors: 31.
  Movie: Death Proof.
UNIX> grep 2004 *.txt
Jiminy-Glick-in-La-La-Wood.txt:2004
Miracle.txt:2004
UNIX> grep 2005 *.txt
Dreamer.txt:2005
Sky-High.txt:2005
UNIX> kurtproc7 *.txt | grep 'Kurt Russell'
Actor: Kurt Russell. # Movies: 32. Years: 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1991 1992 1993 1994 1996 1997 1998 2001 2002 2003 2004 2005 2006 2007.
UNIX> kurtproc7 *.txt | grep 'Kurt Russell' | sed 's/.*Years://'
 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1991 1992 1993 1994 1996 1997 1998 2001 2002 2003 2004 2005 2006 2007.
UNIX> kurtproc7 *.txt | grep 'Kurt Russell' | sed 's/.*Years://' | awk '{ print NF }'
24
UNIX> cat *.txt | grep '^....$' | sort -u | wc
      24      24     120
UNIX>
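
The reason 32 movies produce only 24 years for Kurt Russell is that insert() on a map does nothing when the key is already present, as the comment in kurtproc7.cpp says. The sketch below shows that behavior directly; count_distinct_years() is a hypothetical function written for this illustration. It relies on the fact that insert() returns a pair whose second field is false when the key was already in the map.

```cpp
#include <map>
#include <utility>
#include <vector>
using namespace std;

/* Insert each year into a map and return the number of distinct years.
   The number of inserts that were ignored as duplicates is stored in ignored. */
int count_distinct_years(const vector <int> &years_seen, int &ignored)
{
  map <int, bool> years;
  vector <int>::size_type i;

  ignored = 0;
  for (i = 0; i < years_seen.size(); i++) {
    /* .second is false when the key was already in the map. */
    if (!years.insert(make_pair(years_seen[i], true)).second) ignored++;
  }
  return (int) years.size();
}
```

Given the years 1986, 1986, 2001, 2005 and 2005, the function returns 3 and reports 2 ignored inserts.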