CS302 Lecture Notes - STL Sets and Maps Example


Sets and maps are two very powerful parts of the STL. They keep their elements in sorted order and let you insert and search in logarithmic time. They are typically implemented with balanced binary search trees, but you don't see the underlying implementation.

Sets

A set is an ordered collection of data, such as ints or strings. You may insert elements into the set, and then you may find them, or traverse the set in order. As a simple example, the program simple_set.cpp uses a set to sort the lines of standard input:

#include <set>
#include <string>
#include <iostream>
using namespace std;

int main()
{
  string s;
  set <string> names;
  set <string>::iterator nit;

  while(!cin.fail()) {
    getline(cin, s);
    if (!cin.fail()) names.insert(s);
  }

  for (nit = names.begin(); nit != names.end(); nit++) {
    cout << *nit << endl;
  }
} 

Instead of using push_back(), like you do with lists or vectors, you use insert(), which puts the string into the right place. The traversal is exactly like traversing a list.

UNIX> cat input-1.txt
Tim
David
Adrian
Hamza
UNIX> simple_set < input-1.txt
Adrian
David
Hamza
Tim
UNIX> 
The first question you should have is: "What about duplicate entries?" For example, let's try input-2.txt:
UNIX> cat input-2.txt
Tim
David
Adrian
Hamza
Tim
UNIX> simple_set < input-2.txt
Adrian
David
Hamza
Tim
UNIX> 
As you can see, it does not insert duplicates. If you want to allow duplicates, you use a multiset, as in simple_multiset.cpp:

#include <set>
#include <string>
#include <iostream>
using namespace std;

int main()
{
  string s;
  multiset <string> names;
  multiset <string>::iterator nit;

  while(!cin.fail()) {
    getline(cin, s);
    if (!cin.fail()) names.insert(s);
  }

  for (nit = names.begin(); nit != names.end(); nit++) {
    cout << *nit << endl;
  }
} 

UNIX> simple_multiset < input-2.txt
Adrian
David
Hamza
Tim
Tim
UNIX> 

Maps

Although sets are nice, they are a little limited. Often we want to store key-value pairs, where we search on the key and retrieve the value associated with it. For that, we use a map. When you declare a map, you specify the types of the key and the value. For example, the following declarations are for a map whose keys are strings and whose values are integers, and for an iterator over that map.

map <string, int> names;
map <string, int>::iterator nit;

We'll write a simple example. This example assumes that input is as in Roster.txt: it is composed of first and last names of people. (Our example is all the NFL players in 2009 whose last names begin with "A", in random order.) We'll use a map as declared above to keep track of the last names and how many players have each last name. The program for this is in simple_map.cpp:

#include <stdio.h>
#include <iostream>
#include <string>
#include <map>
using namespace std;

int main()
{
  map <string, int> names;
  map <string, int>::iterator nit;
  string fn, ln;
  
  while (!cin.eof()) {
    cin >> fn >> ln;
    if (!cin.fail()) {
      nit = names.find(ln);
      if (nit == names.end()) {
        names.insert(make_pair(ln, 1));
      } else {
        nit->second++;
      }
    }
  }

  for (nit = names.begin(); nit != names.end(); nit++) {
    cout << "Last name: " << nit->first << ". Number of players: " << nit->second << endl;
  }
}

When you insert into a map, since you are inserting two things (a key and value), you must combine them into a pair with the make_pair() procedure. The types of the arguments must match the types specified in the declaration -- in this case, they must be a string and an integer.

The iterator for a map is different, too. Dereferencing it does not give you a single element; it gives you a pair. You grab the key from an iterator with "->first" and the value with "->second". Yes, I wish they were called key and val, but that is life. When we run it on Roster.txt, we get:

UNIX> simple_map < Roster.txt
Last name: Abdallah. Number of players: 1
Last name: Abdullah. Number of players: 2
Last name: Abiamiri. Number of players: 1
Last name: Abraham. Number of players: 1
Last name: Adams. Number of players: 7
.....
We can check for correctness with grep:
UNIX> grep Abdallah Roster.txt
Nader Abdallah
UNIX> grep Adams Roster.txt
Gaines Adams
Jamar Adams
Anthony Adams
Michael Adams
Titus Adams
Flozell Adams
Mike Adams
UNIX> grep Adams Roster.txt | wc
       7      14      90
UNIX> 
Like sets, you traverse the maps in ascending order, and you can't insert duplicate keys. Since simple_map.cpp performs the find() and only performs the insert() when the key is not found, the limitation on duplicate keys is not a problem. If you need duplicate keys, use a multimap.


Maps and Sets together

The next program is a more detailed example. This program reads an input file like Roster.txt and prints out the players sorted by last name. When two players have the same last name, they are sorted by first name. The program is in sort_names_1.cpp:

#include <stdio.h>
#include <iostream>
#include <string>
#include <set>
#include <map>
using namespace std;

typedef set <string> fnset;

int main()
{
  map <string, fnset *> lnames;
  map <string, fnset *>::iterator lnit;
  fnset *fnames;
  fnset::iterator fnit;
  int i;
  string fn, ln, name;
  
  while (!cin.eof()) {
    cin >> fn;
    if (!cin.fail()) {
      cin >> ln;
      lnit = lnames.find(ln);
      if (lnit == lnames.end()) {
        fnames = new fnset;
        lnames.insert(make_pair(ln, fnames));
      } else {
        fnames = lnit->second;
      }
      fnames->insert(fn);
    }
  }

  for (lnit = lnames.begin(); lnit != lnames.end(); lnit++) {
    fnames = lnit->second;
    for (fnit = fnames->begin(); fnit != fnames->end(); fnit++) {
      cout << *fnit << " " << lnit->first << endl;
    }
  }
}

The program uses a map to sort the last names. The "second" field of the map is a pointer to a set, which sorts the first names that belong to that last name. When you read in a name, you check the last name to see if it's in the map. If so, then it sets fnames to be the set of first names with that last name. If not, it creates a new fnames set and inserts it and the last name into the map. Last, it inserts the first name into the set.

When it's done reading input, it does a nested traversal to print out all of the names.

Note the typedef statement to make the program read more easily.

This program will not print out duplicate names, because sets don't hold duplicate entries. If you wanted it to print out duplicate names, you would have to use a multiset.


UNIX> sort_names_1 < Roster.txt | head
Nader Abdallah
Hamza Abdullah
Husain Abdullah
Victor Abiamiri
John Abraham
Anthony Adams
Flozell Adams
Gaines Adams
Jamar Adams
Michael Adams
UNIX> 

You should use pointers as the val part of a map

The program above uses a pointer to a fnset rather than simply using a fnset. You may wonder, "Why not just use a fnset, so I don't have to mess with pointers?" The reason is that C++'s habit of making copies of things makes this an inefficient and often bug-prone ordeal. First, take a look at sort_names_bad.cpp. This is a version of sort_names_1.cpp that doesn't use the pointer.

#include <stdio.h>
#include <iostream>
#include <string>
#include <set>
#include <map>
using namespace std;

typedef set <string> fnset;

int main()
{
  map <string, fnset> lnames;
  map <string, fnset>::iterator lnit;
  fnset fnames;
  fnset::iterator fnit;
  int i;
  string fn, ln, name;
  
  while (!cin.eof()) {
    cin >> fn;
    if (!cin.fail()) {
      cin >> ln;
      lnit = lnames.find(ln);
      if (lnit == lnames.end()) {
        lnames.insert(make_pair(ln, fnames));
      } else {
        fnames = lnit->second;
      }
      fnames.insert(fn);
    }
  }

  for (lnit = lnames.begin(); lnit != lnames.end(); lnit++) {
    fnames = lnit->second;
    for (fnit = fnames.begin(); fnit != fnames.end(); fnit++) {
      cout << *fnit << " " << lnit->first << endl;
    }
  }
}

This program is very buggy. Take a simple example:

UNIX> head -n 2 Roster.txt
Adam Anderson
Andy Alleman
UNIX> head -n 2 Roster.txt | sort_names_1
Andy Alleman
Adam Anderson
UNIX> head -n 2 Roster.txt | sort_names_bad
Adam Alleman
UNIX> 
Yuck. What's going on? Well, two things. Let's concentrate on the most egregious: the program uses the same fnames variable both to collect first names and as the value that it inserts into the map for a new last name. Since fnames is never cleared, each new last name gets a copy of whatever first names have been inserted into fnames so far. That's wrong. Let's fix that by having two fnsets: fnames, which we'll use to insert first names, and fnames_empty, which we use to put an empty set into a newly created last name map. The result is in sort_names_bad2.cpp:

#include <stdio.h>
#include <iostream>
#include <string>
#include <set>
#include <map>
using namespace std;

typedef set <string> fnset;

int main()
{
  map <string, fnset> lnames;
  map <string, fnset>::iterator lnit;
  fnset fnames, fnames_empty;
  fnset::iterator fnit;
  int i;
  string fn, ln, name;
  
  while (!cin.eof()) {
    cin >> fn;
    if (!cin.fail()) {
      cin >> ln;
      lnit = lnames.find(ln);
      if (lnit == lnames.end()) {
        lnames.insert(make_pair(ln, fnames_empty));
        lnit = lnames.find(ln);
      }
      fnames = lnit->second;
      fnames.insert(fn);
    }
  }

  for (lnit = lnames.begin(); lnit != lnames.end(); lnit++) {
    fnames = lnit->second;
    for (fnit = fnames.begin(); fnit != fnames.end(); fnit++) {
      cout << *fnit << " " << lnit->first << endl;
    }
  }
}

This one still doesn't work:

UNIX> head -n 2 Roster.txt | sort_names_bad2
UNIX> 
Why? The culprit lies in these two lines:

      fnames = lnit->second;
      fnames.insert(fn);

The first of these lines makes a copy of lnit->second. You insert the first name into the copy, which does not modify the fnset that is actually in lnit->second. To fix this, you need to insert directly into lnit->second. I do this in sort_names_bad3.cpp:

#include <stdio.h>
#include <iostream>
#include <string>
#include <set>
#include <map>
using namespace std;

typedef set <string> fnset;

int main()
{
  map <string, fnset> lnames;
  map <string, fnset>::iterator lnit;
  fnset fnames, fnames_empty;
  fnset::iterator fnit;
  int i;
  string fn, ln, name;
  
  while (!cin.eof()) {
    cin >> fn;
    if (!cin.fail()) {
      cin >> ln;
      lnit = lnames.find(ln);
      if (lnit == lnames.end()) {
        lnames.insert(make_pair(ln, fnames_empty));
        lnit = lnames.find(ln);
      }
      lnit->second.insert(fn);
    }
  }

  for (lnit = lnames.begin(); lnit != lnames.end(); lnit++) {
    fnames = lnit->second;
    for (fnit = fnames.begin(); fnit != fnames.end(); fnit++) {
      cout << *fnit << " " << lnit->first << endl;
    }
  }
}

This works as it should:

UNIX> sort_names_1 < Roster.txt > out1.txt
UNIX> sort_names_bad3 < Roster.txt > out2.txt
UNIX> diff out1.txt out2.txt
UNIX> 
So, now you say, "Ok, it works. Why can't I do this?" The answer is twofold. First, you can't assign lnit->second to an ordinary fnset variable without making a copy, which is not only inconvenient, it makes your programs very hard to read. Second, you'll find yourself setting variables to lnit->second and making copies when you don't have to. For example, look at the for loop that prints out the names:

  for (lnit = lnames.begin(); lnit != lnames.end(); lnit++) {
    fnames = lnit->second;
    for (fnit = fnames.begin(); fnit != fnames.end(); fnit++) {
      cout << *fnit << " " << lnit->first << endl;
    }
  }

It is making copies of lnit->second. Even though it's not a bug, it's extremely inefficient in terms of both time and memory. Get into the habit of using pointers in the second field of your maps.


The return value of insert()

In class we looked at the prototype for the insert() method of a set (not a multiset):

    pair<iterator, bool> set::insert(const TYPE& val);

The "(const TYPE& val)" simply means that it works with the type that you specify when you declare the set: val is a constant reference to a value of that type.

The return value is a pair much like what you pass to the insert() call of a map. Its first field will be an iterator for the set, and the second will be a boolean. If the element is inserted, then the iterator will point to the newly inserted element. Otherwise, you tried to insert a duplicate, and the iterator is to the value already in the set. The second field reports whether the item was inserted or not.

To see usage, take a look at setreturn.cpp:

#include <set>
#include <string>
#include <iostream>
using namespace std;

typedef set <string> string_set;

int main()
{
  string s;
  string_set names;
  string_set::iterator nit;
  pair <string_set::iterator, bool> retval;
  

  while(!cin.fail()) {
    getline(cin, s);
    if (!cin.fail()) {
      retval = names.insert(s);
      if (retval.second) {
        cout << s << ": Successfully inserted.\n";
      } else {
        cout << s << ": Duplicate not inserted.\n";
      }
    }
  }
} 

Note how it returns a pair, whose fields you access with dots rather than arrows. Why then do you use arrows in iterators on maps? Because those iterators point to pairs -- they are not pairs themselves.

UNIX> cat input-2.txt 
Tim
David
Adrian
Hamza
Tim
UNIX> setreturn < input-2.txt
Tim: Successfully inserted.
David: Successfully inserted.
Adrian: Successfully inserted.
Hamza: Successfully inserted.
Tim: Duplicate not inserted.
UNIX>