CS302 Lecture Notes

CS302 Lecture Notes - Using a Map as an Associative Array

September 10, 2009
James S. Plank
Directory: /home/plank/cs302/Notes/Map_Assoc_Array

There's a nice manual on this topic at http://www.cplusplus.com/reference/stl/map/operator[]/.

The STL overloads the '[]' operator on its map data structure, which allows you to treat a map as an "Associative Array." In other words, you can use the map as an array whose indices are the keys (the "first" part of each entry in the map). For example, the file 2009-Masters.txt contains the scores of the 2009 Masters golf tournament. The format of each line is the score followed by the golfer's name.

Suppose I want to write a quick and dirty program that prints out all the players and scores in that file, sorted by first name. I can do it very easily using a map as an associative array. The program is in simpgolf.cpp:

#include <cstdio>
#include <iostream>
#include <map>
#include <string>
using namespace std;

int main()
{
  map <string, int> golfer;
  map <string, int>::iterator git;
  int score;
  string name, s;

  while (!cin.fail()) {
    cin >> score >> name;
    if (!cin.fail()) {
      getline(cin, s);
      golfer[name+s] = score;      /* Here's the associative array line */
    }
  }

  for (git = golfer.begin(); git != golfer.end(); git++) {
    printf("%-30s %5d\n", git->first.c_str(), git->second);
  }
}

The commented line treats the map as an associative array -- indexed by a string. (You'll note, I construct the golfers' names by reading the first name, and then using getline() to read in the rest of the name). When I insert an element into the array, it inserts it into the map. At the end of the program, I traverse the map using an iterator and print out the names and scores:

UNIX> head 2009-Masters.txt
276     Angel Cabrera
276     Chad Campbell
276     Kenny Perry
278     Shingo Katayama
279     Phil Mickelson
280     John Merrick
280     Steve Flesch
280     Tiger Woods
280     Steve Stricker
281     Hunter Mahan
UNIX> sed 's/\(...\)\(.*\)/\2      \1/' 2009-Masters.txt | sort | head -n 10
        Aaron Baddeley      284
        Adam Scott      299
        Alvaro Quiros      306
        Andres Romero      297
        Angel Cabrera      276
        Anthony Kim      286
        Ben Crenshaw      309
        Ben Curtis      288
        Bernhard Langer      303
        Billy Mayfair      305
UNIX> simpgolf < 2009-Masters.txt | head -n 10
Aaron Baddeley                   284
Adam Scott                       299
Alvaro Quiros                    306
Andres Romero                    297
Angel Cabrera                    276
Anthony Kim                      286
Ben Crenshaw                     309
Ben Curtis                       288
Bernhard Langer                  303
Billy Mayfair                    305
UNIX>

That seems quite convenient, but you should ask yourself: what happens if you access a map in this way and the key isn't there? The answer is that it inserts the key into the map with a default value (zero for an int, the empty string for a string). To illustrate, take a look at tiger_and_jack.cpp:

#include <iostream>
#include <map>
#include <string>
using namespace std;

int main()
{
  map <string, int> golfer;
  map <string, int>::iterator git;
  int score;
  string name, s;

  while (!cin.fail()) {
    cin >> score >> name;
    if (!cin.fail()) {
      getline(cin, s);
      golfer[name+s] = score;
    }
  }

  cout << "Number of golfers: " << golfer.size() << endl;
  cout << "Tiger Woods' Score: " << golfer["Tiger Woods"] << endl;
  cout << "Jack Nicklaus' Score: " << golfer["Jack Nicklaus"] << endl;
  cout << "Number of golfers: " << golfer.size() << endl;
}

This program reads in the golfers just like simpgolf.cpp. After reading in the golfers, it prints the size of the map, then it prints Tiger Woods' score and Jack Nicklaus' score. Finally it prints the size of the map again. Below we run it on 2009-Masters.txt:

UNIX> tiger_and_jack < 2009-Masters.txt
Number of golfers: 96
Tiger Woods' Score: 280
Jack Nicklaus' Score: 0
Number of golfers: 97
UNIX>

Since Jack Nicklaus didn't play in the 2009 Masters, he has no score. When we try to look him up in the map, it creates a new entry for him with a default score of zero. For this reason, looking him up in the map has increased the size of the map by one.

Is that what you want? It doesn't really matter -- that's what happens, and you should be aware of it.

Temptation, Temptation

Now that we know about this feature, we may be tempted to use it often. For example, suppose I want to see how the professional golfers did in all four major tournaments of 2009. I have the info for each of these tournaments in 2009-Masters.txt, 2009-British_Open.txt, 2009-US-Open.txt, and 2009-PGA-Championship.txt.

In allmajors1.cpp, I modified simpgolf.cpp so that the val field of the map is a vector of scores. Then when I traverse the map, I print out the average scores of the golfers who played in all four tournaments:

#include <cstdio>
#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

typedef vector <int> ivector;

int main()
{
  map <string, ivector> golfer;
  map <string, ivector>::iterator git;
  int score;
  string name, s;
  double total;
  int i;

  while (!cin.fail()) {
    cin >> score >> name;
    if (!cin.fail()) {
      getline(cin, s);
      golfer[name+s].push_back(score);
    }
  }

  for (git = golfer.begin(); git != golfer.end(); git++) {
    if (git->second.size() == 4) {
      total = 0;
      for (i = 0; i < git->second.size(); i++) total += git->second[i];
      printf("%-30s %10.2lf\n", git->first.c_str(), total/4.0);
    }
  }
}

When I run it, all looks good:

UNIX> cat 2009* | allmajors1 | head
Adam Scott                         302.50
Alvaro Quiros                      302.00
Andres Romero                      294.50
Angel Cabrera                      287.75
Anthony Kim                        292.50
Ben Curtis                         295.00
Boo Weekley                        296.25
Brandt Snedeker                    305.25
Briny Baird                        304.75
Bubba Watson                       297.00
UNIX>

However, I really want to sort by the best (lowest) scores. To do that, I can use a second map keyed on doubles. This is in allmajors2.cpp. I add the second map and an iterator to the variable declarations:

  map <double, string> averages;
  map <double, string>::iterator ait;

And insert by treating the map as an associative array, before printing it out:

  for (git = golfer.begin(); git != golfer.end(); git++) {
    if (git->second.size() == 4) {
      total = 0;
      for (i = 0; i < git->second.size(); i++) total += git->second[i];
      averages[total/4.0] = git->first;
    }
  }
  for (ait = averages.begin(); ait != averages.end(); ait++) {
    printf("%8.2lf %s\n", ait->first, ait->second.c_str());
  }
}

When I run it, all appears well:

UNIX> cat *.txt | allmajors2 | head
  284.50 Ross Fisher
  284.75 Henrik Stenson
  285.00 Lee Westwood
  285.25 Rory McIlroy
  286.50 Camilo Villegas
  287.00 Vijay Singh
  287.25 Kenny Perry
  287.75 Jim Furyk
  288.75 Soren Hansen
  289.25 Retief Goosen
UNIX>

But where is Tiger Woods? Certainly he would be in the top ten:

UNIX> cat *.txt | allmajors1 | grep Tiger
Tiger Woods                        287.00
UNIX> cat *.txt | allmajors1 | wc
      50     150    2100
UNIX> cat *.txt | allmajors2 | wc
      41     123     909
UNIX>

I hope you see what has happened. We used a map, whose keys must be unique. Tiger Woods and Vijay Singh both averaged 287.00, so when we stored Vijay Singh's name under the key 287.00, it replaced Tiger's. We need to use a multimap instead. However, if we try to do so as in allmajors3.cpp, which simply replaces the map with a multimap, it will not compile, since you cannot use a multimap as an associative array:

UNIX> g++ -o allmajors3 allmajors3.cpp
allmajors3.cpp: In function 'int main()':
allmajors3.cpp:32: error: no match for 'operator[]' in 'averages[(total / 4.0e+0)]'
UNIX>

Instead, you have to go back to using the insert() method explicitly. This is done in allmajors4.cpp:

  for (git = golfer.begin(); git != golfer.end(); git++) {
    if (git->second.size() == 4) {
      total = 0;
      for (i = 0; i < git->second.size(); i++) total += git->second[i];
      averages.insert(make_pair(total/4.0, git->first));
    }
  }
  for (ait = averages.begin(); ait != averages.end(); ait++) {
    printf("%8.2lf %s\n", ait->first, ait->second.c_str());
  }
}

Now when we run it, we not only see Tiger Woods, we don't lose any golfers.

UNIX> cat *.txt | allmajors4 | head
  284.50 Ross Fisher
  284.75 Henrik Stenson
  285.00 Lee Westwood
  285.25 Graeme McDowell
  285.25 Rory McIlroy
  286.50 Camilo Villegas
  287.00 Tiger Woods
  287.00 Vijay Singh
  287.25 Kenny Perry
  287.75 Angel Cabrera
UNIX> cat *.txt | allmajors4 | wc
      50     150    1111
UNIX>

Beware of performance

A final note. You may be inclined to use maps instead of vectors since they are so convenient. In particular, you don't have to resize them or worry about empty elements. Let's take an example that can easily happen if you are lazy. Suppose I want to write a histogram-like program. It is going to take as input a bunch of data points which are doubles. It will then round each data point to an integer, and keep track of how many data points round to each value. For example, consider the following data points:

( 6.2, 5.8, 2.3, 1.7, 2.0 )

Our program will organize these as two data points that round to 6 and three that round to 2.

I'm going to write four versions of this program. They will all assume that our data points are nonnegative. The first uses a map, which it traverses like an array. It is in histomap1.cpp. In case you didn't know, you can put the cin >> d statement into the while loop if you want, and it will return true/false depending on whether the statement successfully read the double:

#include <iostream>
#include <map>
#include <cmath>
using namespace std;

int main()
{
  map <double, int> histo;
  double i;
  double d;

  while (cin >> d) histo[rint(d)]++;

  for (i = 0; i <= histo.rbegin()->first; i++) {
    if (histo[i] > 0) cout << i << " " << histo[i] << endl;
  }
}

It runs fine on our example above:

UNIX> echo 6.2 5.8 2.3 1.7 2.0 | histomap1
2 3
6 2
UNIX>

The second implementation (histomap2.cpp) is identical to the first, except we use an iterator to traverse the map rather than a loop counter:

#include <iostream>
#include <map>
#include <cmath>
using namespace std;

int main()
{
  map <double, int> histo;
  map <double, int>::iterator hit;
  double d;

  while (cin >> d) histo[rint(d)]++;

  for (hit = histo.begin(); hit != histo.end(); hit++) {
    cout << hit->first  << " " << hit->second << endl;
  }
}

And a third implementation (histomap3.cpp) also traverses the map, but uses the map as an array to print out the values. I'm only including the for loop here:

  for (hit = histo.begin(); hit != histo.end(); hit++) {
    d = hit->first;
    cout << d << " " << histo[d] << endl;
  }

Finally, a fourth implementation uses a vector instead of a map. The resize() method allows you to specify the value to give to any new elements when the vector grows. The implementation is in histovector.cpp:

#include <iostream>
#include <vector>
#include <cmath>
using namespace std;

int main()
{
  vector <int> histo;
  double i;
  double d;

  while (cin >> d) {
    i = rint(d);
    if (histo.size() <= i) histo.resize(i+1, 0);
    histo[i]++;
  }

  for (i = 0; i < histo.size(); i++) {     /* Use <, not <=, so that histo[i] stays in bounds */
    if (histo[i] > 0) cout << i << " " << histo[i] << endl;
  }
}

All the implementations produce the same output on the same input, so they are all correct. However, let's think about how each of them is going to fare on different input files. For tiny input files, they should all perform equivalently. For example, tinyinput has the five example numbers above, and all four programs take essentially no time:

UNIX> time histomap1 < tinyinput > /dev/null
0.001u 0.002s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
UNIX> time histomap2 < tinyinput > /dev/null
0.001u 0.002s 0:00.00 0.0%      0+0k 0+1io 0pf+0w
UNIX> time histomap3 < tinyinput > /dev/null
0.001u 0.001s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
UNIX> time histovector < tinyinput > /dev/null
0.001u 0.001s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
UNIX>

The input file sparseinput has four values, two of which round to 4 and two of which round to 1,000,000:

4.2 1000000.1 1000000.2 3.8

When we run our four implementations, we see that histomap1 performs the worst. Why? Because when we traverse the map, we look at histo[i] for each value of i from 0 to 1,000,000. Doing so inserts each missing value of i into the map, meaning our map ends up with roughly 1,000,000 elements instead of two.

Histomap2 and histomap3 are much faster, because they only look at the two values in the map. Histovector is slower than those two because it does create a vector of roughly 1,000,000 elements. It is still faster than histomap1, because the underlying implementation of the vector is an array, and the underlying implementation of the map is a tree. Thus, the creation of the map is O(n log(n)), where n=1,000,000, whereas the creation of the vector is O(n).

UNIX> time histomap1 < sparseinput > /dev/null
1.081u 0.049s 0:01.13 99.1%     0+0k 0+0io 0pf+0w
UNIX> time histomap2 < sparseinput > /dev/null
0.001u 0.001s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
UNIX> time histomap3 < sparseinput > /dev/null
0.001u 0.001s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
UNIX> time histovector < sparseinput > /dev/null
0.023u 0.008s 0:00.03 66.6%     0+0k 0+0io 0pf+0w
UNIX>

Finally, the program biggen.cpp generates 10,000,000 numbers uniformly distributed between 0 and 100,000:

#include <cstdio>
#include <cstdlib>
#include <iostream>
using namespace std;

int main()
{
  int i;

  for (i = 0; i < 10000000; i++) printf("%.2lf\n", drand48()*100000);
}

When we use it as input, the vector version of the program outperforms the others, although not drastically, because its insert operations take less time (and memory):

UNIX> time sh -c "biggen | histomap1 > /dev/null"
48.255u 0.309s 0:42.73 113.6%   0+0k 0+2io 0pf+0w
UNIX> time sh -c "biggen | histomap2 > /dev/null"
48.434u 0.304s 0:42.87 113.6%   0+0k 0+0io 0pf+0w
UNIX> time sh -c "biggen | histomap3 > /dev/null"
48.327u 0.298s 0:42.76 113.6%   0+0k 0+0io 0pf+0w
UNIX> time sh -c "biggen | histovector > /dev/null"
45.019u 0.289s 0:39.46 114.7%   0+0k 0+0io 0pf+0w
UNIX>

What's the lesson? Use the proper data structure for the job. Sometimes the characteristics of the input dictate what data structure you should use.