CS140 Lecture Notes - STL Sets and Maps


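Sets and maps are two very powerful parts of the STL. They let you insert and search for elements in logarithmic time, and they keep their elements in sorted order, so inserting n elements and then traversing the container sorts them in O(n log n) time. That gives you efficient implementations of two extremely important functionalities: searching and sorting.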

Sets

A set is an ordered collection of data, such as ints or strings. You may insert elements into the set, and then you may find them, or traverse the set in order. You do insertion just like calling push_back() or push_front() on a list. The difference is that the item goes into its proper place in the set, rather than on the back or front of a list.

When you traverse a set, you use an iterator, just as you do with lists. Thus, the simple program simple_set.cpp employs a set to sort the lines of standard input:

#include <set>
#include <string>
#include <iostream>
using namespace std;

int main()
{
  string s;
  set <string> names;
  set <string>::iterator nit;

  while(getline(cin, s)) names.insert(s);

  for (nit = names.begin(); nit != names.end(); nit++) {
    cout << *nit << endl;
  }
} 

To repeat, instead of using push_back(), like you do with lists or vectors, you use insert(), which puts the string into the right place. The traversal is exactly like traversing a list.

UNIX> cat input-1.txt
Jack Journey
Mackenzie Olympia
James Splotch
Dylan Ache
UNIX> simple_set < input-1.txt
Dylan Ache
Jack Journey
James Splotch
Mackenzie Olympia
UNIX> 
The first question you should have is: "What about duplicate entries?" For example, let's try input-2.txt, which has two duplicate entries:
UNIX> cat input-2.txt
John Bevy
Xavier Ornately
Nicholas Wyatt Fecund
Max Inadvertent III
John Bevy
Max Inadvertent III
UNIX> simple_set < input-2.txt
John Bevy
Max Inadvertent III
Nicholas Wyatt Fecund
Xavier Ornately
UNIX> 
As you can see, it does not insert duplicates. If you want to allow duplicates, you use a multiset, as in simple_multiset.cpp. The only difference with this program is the declaration of names and nit:

  multiset <string> names;
  multiset <string>::iterator nit;

Everything else is the same, and the duplicate entries each get their own entry in the multiset:

UNIX> simple_multiset < input-2.txt
John Bevy
John Bevy
Max Inadvertent III
Max Inadvertent III
Nicholas Wyatt Fecund
Xavier Ornately
UNIX> 
We can use the find() method of a set or multiset to see if an element is in the set or multiset. This is done in log time, which means very fast -- much faster than traversing all elements of the set to find it. find() returns an iterator to the element in the set if it is found. If the element is not in the set, find() returns an iterator equal to the one returned by the end() method.

Here's an example in simple_set_find.cpp:

#include <set>
#include <string>
#include <fstream>
#include <iostream>
#include <cstdio>
#include <cstdlib>
using namespace std;

int main(int argc, char **argv)
{
  string s;
  ifstream f;
  set <string> names;

  if (argc != 2) { cerr << "usage: simple_set_find file\n"; exit(1); }

  f.open(argv[1]);
  if (f.fail()) { perror(argv[1]); exit(1); }

  while(getline(f, s)) names.insert(s);
  f.close();

  while (1) {
    cout << "Enter a name: ";
    cout.flush();                    // Don't worry about this too much -- I do this
                                     // to make sure that the string is printed to the screen.
                                     // Sometimes, partial lines aren't printed immediately,
                                     // and cout.flush() forces the partial line to be printed.
    if (!getline(cin, s)) exit(0);
    if (names.find(s) == names.end()) {
      cout << s << " is not in the set.\n";
    } else {
      cout << s << " is in the set.\n";
    }
  }
} 

The program reads a file and puts each line into a set. It then reads lines from standard input and prints whether the line is in the set. For example:

UNIX> cat input-3.txt
Madelyn Psychotic
Joseph Halverson
Aidan Pooh
Bailey Cycad
Wyatt Advantageous
UNIX> simple_set_find input-3.txt
Enter a name: Aidan Pooh
Aidan Pooh is in the set.
Enter a name: Jim Plank
Jim Plank is not in the set.
Enter a name: 
<CNTL-D>
UNIX> 

Maps

Although sets are nice, they are a little limited. Often we want to store key-value pairs, where we search on the key and get back the value associated with it. For that, we use a map. When you declare a map, you specify the type of the key and the type of the value. For example, the following declaration is for a map whose keys are strings and whose values are integers. I also include the declaration for the map's iterator.

map <string, int> names;
map <string, int>::iterator nit;

We'll write a simple example. This example assumes that input is as in Roster.txt: it is composed of first and last names of people. (Our example is all the NFL players in 2009 whose last names begin with "A", in random order). We'll use a map as declared above, and what we are going to do is keep track of the last names, and how many players have each last name. The program for this is in simple_map.cpp:

#include <stdio.h>
#include <iostream>
#include <string>
#include <map>
using namespace std;

int main()
{
  map <string, int> names;
  map <string, int>::iterator nit;
  string fn, ln;
  
  while (cin >> fn >> ln) {
    nit = names.find(ln);
    if (nit == names.end()) {
      names.insert(make_pair(ln, 1));
    } else {
      nit->second++;
    }
  }

  for (nit = names.begin(); nit != names.end(); nit++) {
    cout << "Last name: " << nit->first << ". Number of players: " << nit->second << endl;
  }
}

When you insert into a map, since you are inserting two things (a key and value), you must combine them into a pair with the make_pair() procedure. The types of the arguments must match the types specified in the declaration -- in this case, they must be a string and an integer.

The iterator for a map is different, too. Instead of simply dereferencing it with "*", you grab the key from an iterator with "->first" and the value with "->second". Yes, I wish they were called key and val, but that is life. When we run it on Roster.txt, we get:

UNIX> simple_map < Roster.txt
Last name: Abdallah. Number of players: 1
Last name: Abdullah. Number of players: 2
Last name: Abiamiri. Number of players: 1
Last name: Abraham. Number of players: 1
Last name: Adams. Number of players: 7
.....
We can check for correctness with grep:
UNIX> grep Abdallah Roster.txt
Nader Abdallah
UNIX> grep Adams Roster.txt
Gaines Adams
Jamar Adams
Anthony Adams
Michael Adams
Titus Adams
Flozell Adams
Mike Adams
UNIX> grep Adams Roster.txt | wc
       7      14      90
UNIX> 
Like sets, you traverse the maps in ascending order, and you can't insert duplicate keys. Since simple_map.cpp calls find() and only performs insert() when the key is not found, the limitation on duplicate keys is not a problem. If you need duplicate keys, use a multimap.


Writing that last program with a multiset

As observed in class, we could have written that last program with a multiset or even a vector. Let's consider the multiset. Suppose we insert all the last names into the multiset. We then traverse the multiset, maintaining a string pn that holds the string in the previous element of the multiset, plus a count of the number of times that we have seen that string. If the current string equals the previous string, then we simply increment the count. Otherwise, we print the previous string and its count, and then reset the count. At the end of the traversal, we print out the last element. The code is in nnames_multiset.cpp:

#include <cstdio>
#include <iostream>
#include <string>
#include <set>
using namespace std;

int main()
{
  multiset <string> names;
  multiset <string>::iterator nit;
  string fn, ln, pn;
  int count;
  
  while (cin >> fn >> ln) names.insert(ln);

  
  for (nit = names.begin(); nit != names.end(); nit++) {
    if (nit != names.begin()) {
      if (*nit == pn) {
        count++;
      } else { 
        printf("%-20s %d\n", pn.c_str(), count);
        pn = *nit;
        count = 1;
      }
    } else {
      pn = *nit;
      count = 1;
    }
  }

  if (names.size() > 0) printf("%-20s %d\n", pn.c_str(), count);
}

Compared to the map, that's a pretty convoluted piece of code. However, make sure that you can step through it and convince yourself that it works.

UNIX> head -n 10 Roster.txt
Russell Allen
Gaines Adams
Aundrae Allison
David Anderson
Adrian Arrington
Hamza Abdullah
Tim Anderson
Devin Aromashodu
Asher Allen
Eric Alexander
UNIX> head -n 10 Roster.txt | nnames_multiset
Abdullah             1
Adams                1
Alexander            1
Allen                2
Allison              1
Anderson             2
Aromashodu           1
Arrington            1
UNIX> 

Maps and Sets together

The next program is a more detailed example. This program reads an input file like Roster.txt and prints out the players sorted by last name. When two players have the same last name, they are sorted by first name. The program is in sort_names_1.cpp:

#include <stdio.h>
#include <iostream>
#include <string>
#include <set>
#include <map>
using namespace std;

typedef set <string> fnset;

int main()
{
  map <string, fnset *> lnames;
  map <string, fnset *>::iterator lnit;
  fnset *fnames;
  fnset::iterator fnit;
  string fn, ln;
  
  while (cin >> fn >> ln) {
    lnit = lnames.find(ln);
    if (lnit == lnames.end()) {
      fnames = new fnset;
      lnames.insert(make_pair(ln, fnames));
    } else {
      fnames = lnit->second;
    }
    fnames->insert(fn);
  }

  for (lnit = lnames.begin(); lnit != lnames.end(); lnit++) {
    fnames = lnit->second;
    for (fnit = fnames->begin(); fnit != fnames->end(); fnit++) {
      cout << *fnit << " " << lnit->first << endl;
    }
  }
}

The program uses a map to sort the last names. The "second" field of each map entry is a pointer to a set, which sorts the first names that belong to that last name. When the program reads in a name, it checks whether the last name is in the map. If so, it sets fnames to be the set of first names with that last name. If not, it creates a new fnames set and inserts it and the last name into the map. Last, it inserts the first name into the set.

When it's done reading input, it does a nested traversal to print out all of the names.

Note the typedef statement to make the program read more easily.

This program will not print out duplicate names, because sets don't hold duplicate entries. If you wanted it to print out duplicate names, you would have to use a multiset.

UNIX> sort_names_1 < Roster.txt | head
Nader Abdallah
Hamza Abdullah
Husain Abdullah
Victor Abiamiri
John Abraham
Anthony Adams
Flozell Adams
Gaines Adams
Jamar Adams
Michael Adams
UNIX> 

You should use pointers as the val part of a map

The program above uses a pointer to an fnset rather than simply using an fnset. You may wonder, "Why not just use an fnset, so I don't have to mess with pointers?" The reason is that C++'s habit of making copies of things makes this an inefficient and often bug-prone ordeal. First, take a look at sort_names_bad.cpp. This is a version of sort_names_1.cpp that doesn't use the pointer.

#include <stdio.h>
#include <iostream>
#include <string>
#include <set>
#include <map>
using namespace std;

typedef set <string> fnset;

int main()
{
  map <string, fnset> lnames;
  map <string, fnset>::iterator lnit;
  fnset fnames;
  fnset::iterator fnit;
  string fn, ln;
  
  while (cin >> fn >> ln) {
    lnit = lnames.find(ln);
    if (lnit == lnames.end()) {
      lnames.insert(make_pair(ln, fnames));
    } else {
      fnames = lnit->second;
    }
    fnames.insert(fn);
  }

  for (lnit = lnames.begin(); lnit != lnames.end(); lnit++) {
    fnames = lnit->second;
    for (fnit = fnames.begin(); fnit != fnames.end(); fnit++) {
      cout << *fnit << " " << lnit->first << endl;
    }
  }
}

This program is very buggy. Take a simple example:

UNIX> head -n 2 Roster.txt
Adam Anderson
Andy Alleman
UNIX> head -n 2 Roster.txt | sort_names_1
Andy Alleman
Adam Anderson
UNIX> head -n 2 Roster.txt | sort_names_bad
Adam Alleman
UNIX> 
Yuck. What's going on? Well, two things. Let's concentrate on the most egregious: the program reuses the single variable fnames both to insert first names into a set and as the "new" set that gets inserted into the map with each new last name. That's wrong, because fnames is no longer empty after the first insertion, and the map stores a copy of whatever fnames holds at that moment. We don't need a variable to create a new set when we insert a last name into the map. Instead we can simply call the constructor for the set using fnset(): (sort_names_bad2.cpp)

#include <stdio.h>
#include <iostream>
#include <string>
#include <set>
#include <map>
using namespace std;

typedef set <string> fnset;

int main()
{
  map <string, fnset> lnames;
  map <string, fnset>::iterator lnit;
  fnset fnames;
  fnset::iterator fnit;
  string fn, ln;
  
  while (cin >> fn >> ln) {
    lnit = lnames.find(ln);
    if (lnit == lnames.end()) {
      lnames.insert(make_pair(ln, fnset()));
      lnit = lnames.find(ln);
    }
    fnames = lnit->second;
    fnames.insert(fn);
  }

  for (lnit = lnames.begin(); lnit != lnames.end(); lnit++) {
    fnames = lnit->second;
    for (fnit = fnames.begin(); fnit != fnames.end(); fnit++) {
      cout << *fnit << " " << lnit->first << endl;
    }
  }
}

This one still doesn't work:

UNIX> head -n 2 Roster.txt | sort_names_bad2
UNIX> 
Why? The culprit lies in these two lines:

      fnames = lnit->second;
      fnames.insert(fn);

The first of these lines makes a copy of lnit->second. You insert the first name into the copy, which does not modify the fnset that is actually in lnit->second. To fix this, you need to either insert directly into lnit->second by writing:

lnit->second.insert(fn);

or you need to store a reference to the fnset stored in lnit->second by writing:

fnset &names = lnit->second;
names.insert(fn);

This is the first time that you have seen a reference variable used outside a function declaration. A reference variable is much like a pointer in that it refers to an existing variable or object rather than holding its own copy (in this case it refers to an fnset object). However, unlike a pointer, you use the . operator to access an object's fields via a reference variable. Also, unlike a pointer variable, you cannot change the object to which the reference variable refers. This means that you must initialize the reference variable when you declare it, as I have done above.

Unfortunately reference variables are confusing because they act like a pointer variable, but use different syntax. Nonetheless, much C++ code is written using reference variables, and hence you should be introduced to them, even though we don't want you to use them in your code for this course. Reference variables tend to be safer than pointers, because when used in conjunction with stack-allocated objects, they can be used to avoid the memory problems associated with heap-allocated objects. Unfortunately, when you use stack-allocated objects, it is often easy to inadvertently copy these objects.

I have fixed the copy problem mentioned above in the following code. It still performs one copy of an fnset that should be avoided. Can you spot it? sort_names_bad3.cpp:

#include <stdio.h>
#include <iostream>
#include <string>
#include <set>
#include <map>
using namespace std;

typedef set <string> fnset;

int main()
{
  map <string, fnset> lnames;
  map <string, fnset>::iterator lnit;
  fnset fnames;
  fnset::iterator fnit;
  string fn, ln;
  
  while (cin >> fn >> ln) {
    lnit = lnames.find(ln);
    if (lnit == lnames.end()) {
      lnames.insert(make_pair(ln, fnset()));
      lnit = lnames.find(ln);
    }
    fnset &names = lnit->second;
    names.insert(fn);
  }

  for (lnit = lnames.begin(); lnit != lnames.end(); lnit++) {
    fnames = lnit->second;
    for (fnit = fnames.begin(); fnit != fnames.end(); fnit++) {
      cout << *fnit << " " << lnit->first << endl;
    }
  }
}

At least this code works as it should:

UNIX> sort_names_1 < Roster.txt > out1.txt
UNIX> sort_names_bad3 < Roster.txt > out2.txt
UNIX> diff out1.txt out2.txt
UNIX> 
Have you spotted the inadvertent copy? If you haven't, look at the for loop that prints out the names:

  for (lnit = lnames.begin(); lnit != lnames.end(); lnit++) {
    fnames = lnit->second;
    for (fnit = fnames.begin(); fnit != fnames.end(); fnit++) {
      cout << *fnit << " " << lnit->first << endl;
    }
  }

It is making copies of lnit->second. Even though it's not a bug, it's extremely inefficient in terms of both time and memory. We can fix it by declaring a second reference variable named printnames:

  for (lnit = lnames.begin(); lnit != lnames.end(); lnit++) {
    fnset &printnames = lnit->second;
    for (fnit = printnames.begin(); fnit != printnames.end(); fnit++) {
      cout << *fnit << " " << lnit->first << endl;
    }
  }

So, now you say, "Ok, it works. Why can't I do this?" When you get a job it is quite possible that you will be allowed to do so. However, in this class we want you to use the pointer approach instead. The reason is threefold. First, since reference variables cannot be re-used (i.e., you cannot make them refer to different objects), you typically need to litter your code with declarations for reference variables, as I had to do above. Second, you'll find yourself forgetting to declare reference variables. Instead you will set variables to lnit->second and make copies when you don't mean to. In the worst case this will lead to logic errors in your program. In a slightly better case, you will create inadvertent copies of an object, which will both slow down your program, since the copy must be created, and use excessive memory. Finally, reference variables are confusing and will almost certainly lead to problems with your code that you will find difficult to resolve. Leave them for when you are a more experienced C++ programmer. For the time being, get into the habit of using pointers in the second field of your maps.


The return value of insert()

In class we looked at the prototype for the insert() method of a set (not a multiset):

    pair<iterator, bool> set::insert(const TYPE& val);

The "(const TYPE& val)" simply means that insert() takes a value of the type that you specify when you declare the set.

The return value is a pair much like what you pass to the insert() call of a map. Its first field will be an iterator for the set, and the second will be a boolean. If the element is inserted, then the iterator will point to the newly inserted element. Otherwise, you tried to insert a duplicate, and the iterator points to the value already in the set. The second field reports whether the item was inserted or not.

To see usage, take a look at setreturn.cpp:

#include <set>
#include <string>
#include <iostream>
using namespace std;

typedef set <string> string_set;

int main()
{
  string s;
  string_set names;
  string_set::iterator nit;
  pair <string_set::iterator, bool> retval;
  

  while(getline(cin, s)) {
    retval = names.insert(s);
    if (retval.second) {
      cout << s << ": Successfully inserted.\n";
    } else {
      cout << s << ": Duplicate not inserted.\n";
    }
  }
} 

Note how it returns a pair, whose fields you access with dots rather than arrows. Why then do you use arrows in iterators on maps? Because those iterators point to pairs -- they are not pairs themselves.

UNIX> cat input-2.txt
John Bevy
Xavier Ornately
Nicholas Wyatt Fecund
Max Inadvertent III
John Bevy
Max Inadvertent III
UNIX> setreturn < input-2.txt
John Bevy: Successfully inserted.
Xavier Ornately: Successfully inserted.
Nicholas Wyatt Fecund: Successfully inserted.
Max Inadvertent III: Successfully inserted.
John Bevy: Duplicate not inserted.
Max Inadvertent III: Duplicate not inserted.
UNIX>