Suppose I want to write a quick and dirty program that prints out all the players and scores in that file, sorted by first name. I can do it very easily using a map as an associative array. The program is in simpgolf.cpp:
#include <cstdio>
#include <iostream>
#include <map>
#include <string>
using namespace std;

int main()
{
  map <string, int> golfer;
  map <string, int>::iterator git;
  int score;
  string name, s;

  while (!cin.fail()) {
    cin >> score >> name;
    if (!cin.fail()) {
      getline(cin, s);
      golfer[name+s] = score;     /* Here's the associative array line */
    }
  }

  for (git = golfer.begin(); git != golfer.end(); git++) {
    printf("%-30s %5d\n", git->first.c_str(), git->second);
  }
  return 0;
}
The commented line treats the map as an associative array -- indexed by a string. (You'll note, I construct the golfers' names by reading the first name, and then using getline() to read in the rest of the name). When I insert an element into the array, it inserts it into the map. At the end of the program, I traverse the map using an iterator and print out the names and scores:
UNIX> head 2009-Masters.txt
276 Angel Cabrera
276 Chad Campbell
276 Kenny Perry
278 Shingo Katayama
279 Phil Mickelson
280 John Merrick
280 Steve Flesch
280 Tiger Woods
280 Steve Stricker
281 Hunter Mahan
UNIX> sed 's/\(...\)\(.*\)/\2 \1/' 2009-Masters.txt | sort | head -n 10
Aaron Baddeley 284
Adam Scott 299
Alvaro Quiros 306
Andres Romero 297
Angel Cabrera 276
Anthony Kim 286
Ben Crenshaw 309
Ben Curtis 288
Bernhard Langer 303
Billy Mayfair 305
UNIX> simpgolf < 2009-Masters.txt | head -n 10
Aaron Baddeley                   284
Adam Scott                       299
Alvaro Quiros                    306
Andres Romero                    297
Angel Cabrera                    276
Anthony Kim                      286
Ben Crenshaw                     309
Ben Curtis                       288
Bernhard Langer                  303
Billy Mayfair                    305
UNIX>

That seems quite convenient, but you should ask yourself: what happens if you try to access a map in this way and the key isn't there? The answer is that it inserts the key into the map with a default value (zero for an int, the empty string for a string). To illustrate, take a look at tiger_and_jack.cpp:
#include <iostream>
#include <map>
#include <string>
using namespace std;

int main()
{
  map <string, int> golfer;
  map <string, int>::iterator git;
  int score;
  string name, s;

  while (!cin.fail()) {
    cin >> score >> name;
    if (!cin.fail()) {
      getline(cin, s);
      golfer[name+s] = score;
    }
  }

  cout << "Number of golfers: " << golfer.size() << endl;
  cout << "Tiger Woods' Score: " << golfer["Tiger Woods"] << endl;
  cout << "Jack Nicklaus' Score: " << golfer["Jack Nicklaus"] << endl;
  cout << "Number of golfers: " << golfer.size() << endl;
  return 0;
}
This program reads in the golfers just like simpgolf.cpp. After reading in the golfers, it prints the size of the map, then it prints Tiger Woods' score and Jack Nicklaus' score. Finally it prints the size of the map again. Below we run it on 2009-Masters.txt:
UNIX> tiger_and_jack < 2009-Masters.txt
Number of golfers: 96
Tiger Woods' Score: 280
Jack Nicklaus' Score: 0
Number of golfers: 97
UNIX>

Since Jack Nicklaus didn't play in the 2009 Masters, he has no score. When we try to look him up in the map, it creates a new entry for him with a default score of zero. For this reason, looking him up in the map has increased the size of the map by one.
Is that what you want? It doesn't really matter -- that's what happens, and you should be aware of it.
In allmajors1.cpp, I modified simpgolf.cpp so that the val field of the map is a vector of scores. Then when I traverse the map, I print out the average scores of the golfers who played in all four tournaments:
#include <cstdio>
#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

typedef vector <int> ivector;

int main()
{
  map <string, ivector> golfer;
  map <string, ivector>::iterator git;
  int score;
  string name, s;
  double total;
  size_t i;

  while (!cin.fail()) {
    cin >> score >> name;
    if (!cin.fail()) {
      getline(cin, s);
      golfer[name+s].push_back(score);   /* Creates the vector on first use */
    }
  }

  for (git = golfer.begin(); git != golfer.end(); git++) {
    if (git->second.size() == 4) {
      total = 0;
      for (i = 0; i < git->second.size(); i++) total += git->second[i];
      printf("%-30s %10.2lf\n", git->first.c_str(), total/4.0);
    }
  }
  return 0;
}
When I run it, all looks good:
UNIX> cat 2009* | allmajors1 | head
Adam Scott                         302.50
Alvaro Quiros                      302.00
Andres Romero                      294.50
Angel Cabrera                      287.75
Anthony Kim                        292.50
Ben Curtis                         295.00
Boo Weekley                        296.25
Brandt Snedeker                    305.25
Briny Baird                        304.75
Bubba Watson                       297.00
UNIX>

However, I really want to sort by the best (lowest) scores. To do that, I can use a second map keyed on doubles. This is in allmajors2.cpp. I add the second map and an iterator to the variable declarations:
map <double, string> averages;
map <double, string>::iterator ait;
And insert by treating the map as an associative array, before printing it out:
  for (git = golfer.begin(); git != golfer.end(); git++) {
    if (git->second.size() == 4) {
      total = 0;
      for (i = 0; i < git->second.size(); i++) total += git->second[i];
      averages[total/4.0] = git->first;
    }
  }

  for (ait = averages.begin(); ait != averages.end(); ait++) {
    printf("%8.2lf %s\n", ait->first, ait->second.c_str());
  }
}
When I run it, all appears well:
UNIX> cat *.txt | allmajors2 | head
  284.50 Ross Fisher
  284.75 Henrik Stenson
  285.00 Lee Westwood
  285.25 Rory McIlroy
  286.50 Camilo Villegas
  287.00 Vijay Singh
  287.25 Kenny Perry
  287.75 Jim Furyk
  288.75 Soren Hansen
  289.25 Retief Goosen
UNIX>

But where is Tiger Woods? Certainly he would be in the top ten:
UNIX> cat *.txt | allmajors1 | grep Tiger
Tiger Woods                        287.00
UNIX> cat *.txt | allmajors1 | wc
50 150 2100
UNIX> cat *.txt | allmajors2 | wc
41 123 909
UNIX>

I hope you see what has happened. A map holds only one value per key, so when we stored Vijay Singh under the key 287, it replaced Tiger's entry, which had the same average. That is also why allmajors2 prints 41 lines instead of 50. We need to use a multimap instead. However, if we simply replace the map with a multimap, as in allmajors3.cpp, the program will not compile, since you cannot use a multimap as an associative array:
UNIX> g++ -o allmajors3 allmajors3.cpp
allmajors3.cpp: In function 'int main()':
allmajors3.cpp:32: error: no match for 'operator[]' in 'averages[(total / 4.0e+0)]'
UNIX>

Instead, you have to go back to using the insert() method explicitly. This is done in allmajors4.cpp:
  for (git = golfer.begin(); git != golfer.end(); git++) {
    if (git->second.size() == 4) {
      total = 0;
      for (i = 0; i < git->second.size(); i++) total += git->second[i];
      averages.insert(make_pair(total/4.0, git->first));
    }
  }

  for (ait = averages.begin(); ait != averages.end(); ait++) {
    printf("%8.2lf %s\n", ait->first, ait->second.c_str());
  }
}
Now when we run it, we not only see Tiger Woods, we don't lose any golfers.
UNIX> cat *.txt | allmajors4 | head
  284.50 Ross Fisher
  284.75 Henrik Stenson
  285.00 Lee Westwood
  285.25 Graeme McDowell
  285.25 Rory McIlroy
  286.50 Camilo Villegas
  287.00 Tiger Woods
  287.00 Vijay Singh
  287.25 Kenny Perry
  287.75 Angel Cabrera
UNIX> cat *.txt | allmajors4 | wc
50 150 1111
UNIX>
Our program will organize these as two data points that round to 6 and three that round to 2.
I'm going to write four versions of this program. They will all assume that our data points are nonnegative. The first uses a map, which it traverses like an array. It is in histomap1.cpp. In case you didn't know, you can put the cin >> d statement into the while loop if you want, and it will return true/false depending on whether the statement successfully read the double:
#include <iostream>
#include <map>
#include <cmath>
using namespace std;

int main()
{
  map <double, int> histo;
  double i;
  double d;

  while (cin >> d) histo[rint(d)]++;

  /* Traverse the map like an array, from 0 up to the largest key. */
  for (i = 0; i <= histo.rbegin()->first; i++) {
    if (histo[i] > 0) cout << i << " " << histo[i] << endl;
  }
  return 0;
}
It runs fine on our example above:
UNIX> echo 6.2 5.8 2.3 1.7 2.0 | histomap1
2 3
6 2
UNIX>

The second implementation (histomap2.cpp) is identical to the first, except we use an iterator to traverse the map rather than a numeric index.
#include <iostream>
#include <map>
#include <cmath>
using namespace std;

int main()
{
  map <double, int> histo;
  map <double, int>::iterator hit;
  double d;

  while (cin >> d) histo[rint(d)]++;

  for (hit = histo.begin(); hit != histo.end(); hit++) {
    cout << hit->first << " " << hit->second << endl;
  }
  return 0;
}
And a third implementation (histomap3.cpp) also traverses the map, but uses the map as an array to print out the values. I'm only including the for loop here:
  for (hit = histo.begin(); hit != histo.end(); hit++) {
    d = hit->first;
    cout << d << " " << histo[d] << endl;
  }
Finally, a fourth implementation uses a vector instead of a map. The resize() method takes an optional second argument, which is the value given to any new elements when the vector grows. The implementation is in histovector.cpp:
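#include <iostream>
#include <vector>
#include <cmath>
using namespace std;

int main()
{
  vector <int> histo;
  size_t i;
  double d;

  while (cin >> d) {
    i = (size_t) rint(d);                            /* Data points are nonnegative. */
    if (histo.size() <= i) histo.resize(i+1, 0);     /* New elements start at zero. */
    histo[i]++;
  }

  /* Stop before histo.size(): the last valid index is histo.size()-1. */
  for (i = 0; i < histo.size(); i++) {
    if (histo[i] > 0) cout << i << " " << histo[i] << endl;
  }
  return 0;
}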
All the implementations produce the same output on the same input, so they are all correct. However, let's think about how each is going to fare on different input files. For tiny input files, they should all perform equivalently. For example, tinyinput has the five example numbers above, and all four programs take essentially no time:
UNIX> time histomap1 < tinyinput > /dev/null
0.001u 0.002s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
UNIX> time histomap2 < tinyinput > /dev/null
0.001u 0.002s 0:00.00 0.0% 0+0k 0+1io 0pf+0w
UNIX> time histomap3 < tinyinput > /dev/null
0.001u 0.001s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
UNIX> time histovector < tinyinput > /dev/null
0.001u 0.001s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
UNIX>

The input file sparseinput has four values, two of which round to 4 and two of which round to 1,000,000:
4.2 1000000.1 1000000.2 3.8 |
When we run our four implementations, we see that histomap1 performs the worst. Why? Because when we traverse the map, we look at histo[i] for each value of i from 0 to 1,000,000. Doing so inserts each missing value into the map, meaning our map ends up with roughly 1,000,000 elements instead of two.
Histomap2 and histomap3 are much faster, because they only look at the two values that are actually in the map. Histovector is slower than those two because it does create a vector of 1,000,000 elements. It is still faster than histomap1, because the underlying implementation of the vector is an array, and the underlying implementation of the map is a tree. Thus, the creation of the map is O(n log(n)), where n=1,000,000, whereas the creation of the vector is O(n).
UNIX> time histomap1 < sparseinput > /dev/null
1.081u 0.049s 0:01.13 99.1% 0+0k 0+0io 0pf+0w
UNIX> time histomap2 < sparseinput > /dev/null
0.001u 0.001s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
UNIX> time histomap3 < sparseinput > /dev/null
0.001u 0.001s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
UNIX> time histovector < sparseinput > /dev/null
0.023u 0.008s 0:00.03 66.6% 0+0k 0+0io 0pf+0w
UNIX>

Finally, the program biggen.cpp generates 10,000,000 numbers uniformly distributed between 0 and 100,000 (the code multiplies drand48() by 100000):
#include <cstdio>
#include <cstdlib>
#include <iostream>
using namespace std;

int main()
{
  int i;

  for (i = 0; i < 10000000; i++) printf("%.2lf\n", drand48()*100000);
  return 0;
}
When we use it as input, the vector version of the program outperforms the others, although not drastically, because its insert operations take less time (and memory):
UNIX> time sh -c "biggen | histomap1 > /dev/null"
48.255u 0.309s 0:42.73 113.6% 0+0k 0+2io 0pf+0w
UNIX> time sh -c "biggen | histomap2 > /dev/null"
48.434u 0.304s 0:42.87 113.6% 0+0k 0+0io 0pf+0w
UNIX> time sh -c "biggen | histomap3 > /dev/null"
48.327u 0.298s 0:42.76 113.6% 0+0k 0+0io 0pf+0w
UNIX> time sh -c "biggen | histovector > /dev/null"
45.019u 0.289s 0:39.46 114.7% 0+0k 0+0io 0pf+0w
UNIX>

What's the lesson? Use the proper data structure for the job. Sometimes the characteristics of the input dictate what data structure you should use.