Suppose I want to write a quick and dirty program that prints out all the players and scores in that file, sorted by first name. I can do it very easily using a map as an associative array. The program is in simpgolf.cpp:
#include <cstdio>
#include <iostream>
#include <map>
#include <string>
using namespace std;

int main()
{
  map <string, int> golfer;
  map <string, int>::iterator git;
  int score;
  string name, s;

  while (cin >> score >> name) {
    getline(cin, s);
    golfer[name+s] = score;      // Here's the associative array line.
  }

  for (git = golfer.begin(); git != golfer.end(); git++) {
    printf("%-30s %5d\n", git->first.c_str(), git->second);
  }
  return 0;
}
The commented line treats the map as an associative array indexed by a string. (Note that I construct each golfer's name by reading the first name and then using getline() to read in the rest of the name.) Assigning to an element of the array inserts that key into the map if it is not already there. At the end of the program, I traverse the map using an iterator and print out the names and scores:
UNIX> head 2009-Masters.txt
276 Angel Cabrera
276 Chad Campbell
276 Kenny Perry
278 Shingo Katayama
279 Phil Mickelson
280 John Merrick
280 Steve Flesch
280 Tiger Woods
280 Steve Stricker
281 Hunter Mahan
UNIX> sed 's/\(...\)\(.*\)/\2 \1/' 2009-Masters.txt | sort | head -n 10
Aaron Baddeley 284
Adam Scott 299
Alvaro Quiros 306
Andres Romero 297
Angel Cabrera 276
Anthony Kim 286
Ben Crenshaw 309
Ben Curtis 288
Bernhard Langer 303
Billy Mayfair 305
UNIX> simpgolf < 2009-Masters.txt | head -n 10
Aaron Baddeley 284
Adam Scott 299
Alvaro Quiros 306
Andres Romero 297
Angel Cabrera 276
Anthony Kim 286
Ben Crenshaw 309
Ben Curtis 288
Bernhard Langer 303
Billy Mayfair 305
UNIX>
That seems quite convenient, but you should ask yourself: what happens if you access a map in this way and the key isn't there? The answer is that it inserts the key into the map with a default value (zero for an int, the empty string for a string). To illustrate, take a look at tiger_and_jack.cpp:
#include <iostream>
#include <map>
#include <string>
using namespace std;

int main()
{
  map <string, int> golfer;
  map <string, int>::iterator git;
  int score;
  string name, s;

  while (cin >> score >> name) {
    getline(cin, s);
    golfer[name+s] = score;
  }

  cout << "Number of golfers: " << golfer.size() << endl;
  cout << "Tiger Woods' Score: " << golfer["Tiger Woods"] << endl;
  cout << "Jack Nicklaus' Score: " << golfer["Jack Nicklaus"] << endl;
  cout << "Number of golfers: " << golfer.size() << endl;
  return 0;
}
This program reads in the golfers just like simpgolf.cpp. After reading in the golfers, it prints the size of the map, then it prints Tiger Woods' score and Jack Nicklaus' score. Finally it prints the size of the map again. Below we run it on 2009-Masters.txt:
UNIX> tiger_and_jack < 2009-Masters.txt
Number of golfers: 96
Tiger Woods' Score: 280
Jack Nicklaus' Score: 0
Number of golfers: 97
UNIX>
Since Jack Nicklaus didn't play in the 2009 Masters, he has no score. When we try to look him up in the map, it creates a new entry for him with a default score of zero. For this reason, looking him up in the map has increased the size of the map by one.
Is that what you want? It doesn't really matter -- that's what happens, and you should be aware of it.
In allmajors1.cpp, I modified simpgolf.cpp so that the val field of the map is a vector of scores. Then when I traverse the map, I print out the average scores of the golfers who played in all four tournaments:
#include <cstdio>
#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

typedef vector <int> ivector;

int main()
{
  map <string, ivector> golfer;
  map <string, ivector>::iterator git;
  int score, i;
  string name, s;
  double total;

  while (cin >> score >> name) {
    getline(cin, s);
    golfer[name+s].push_back(score);     // Creates an empty vector the first time a name appears.
  }

  for (git = golfer.begin(); git != golfer.end(); git++) {
    if (git->second.size() == 4) {
      total = 0;
      for (i = 0; i < git->second.size(); i++) total += git->second[i];
      printf("%-30s %10.2lf\n", git->first.c_str(), total/4.0);
    }
  }
  return 0;
}
When I run it, all looks good:
UNIX> cat 2009* | allmajors1 | head
Adam Scott 302.50
Alvaro Quiros 302.00
Andres Romero 294.50
Angel Cabrera 287.75
Anthony Kim 292.50
Ben Curtis 295.00
Boo Weekley 296.25
Brandt Snedeker 305.25
Briny Baird 304.75
Bubba Watson 297.00
UNIX> cat 2009* | grep Snedeker | awk '{ l++; n += $1; print n/l }'
309
306
306
305.25
UNIX> cat 2009* | grep Weekley | awk '{ l++; n += $1; print n/l }'
282
292
292.333
296.25
UNIX>
However, I really want to sort by the best (lowest) scores. To do that, I can use a second map keyed on doubles. This is in allmajors2.cpp. I add the second map and iterator in the variable declarations:
  map <double, string> averages;
  map <double, string>::iterator ait;
I then insert into it by treating the map as an associative array, and print it out afterward:
  for (git = golfer.begin(); git != golfer.end(); git++) {
    if (git->second.size() == 4) {
      total = 0;
      for (i = 0; i < git->second.size(); i++) total += git->second[i];
      averages[total/4.0] = git->first;
    }
  }

  for (ait = averages.begin(); ait != averages.end(); ait++) {
    printf("%8.2lf %s\n", ait->first, ait->second.c_str());
  }
}
When I run it, all appears well:
UNIX> cat 2009*.txt | allmajors2 | head
284.50 Ross Fisher
284.75 Henrik Stenson
285.00 Lee Westwood
285.25 Rory McIlroy
286.50 Camilo Villegas
287.00 Vijay Singh
287.25 Kenny Perry
287.75 Jim Furyk
288.75 Soren Hansen
289.25 Retief Goosen
UNIX>
But where is Tiger Woods? Certainly he would be in the top ten (at least in 2009, before his sex scandals and subsequent tanking of his golf game...):
UNIX> cat 2009*.txt | allmajors1 | grep Tiger
Tiger Woods 287.00
UNIX> cat 2009*.txt | allmajors1 | wc
50 150 2100
UNIX> cat 2009*.txt | allmajors2 | wc
41 123 909
UNIX>
I hope you see what has happened. We used a map, whose keys are unique. That means that when we inserted Vijay Singh's average of 287, it replaced Tiger's entry, which had the same key. We need to use a multimap instead. However, if we try to do so as in allmajors3.cpp, which simply replaces the map with a multimap, it will not compile, since you cannot use a multimap as an associative array:
UNIX> g++ -o allmajors3 allmajors3.cpp
allmajors3.cpp: In function 'int main()':
allmajors3.cpp:28: error: no match for operator[] in averages[(total / 4.0e+0)]
UNIX>
Instead, you have to go back to using the insert() method explicitly. This is done in allmajors4.cpp:
  for (git = golfer.begin(); git != golfer.end(); git++) {
    if (git->second.size() == 4) {
      total = 0;
      for (i = 0; i < git->second.size(); i++) total += git->second[i];
      averages.insert(make_pair(total/4.0, git->first));
    }
  }

  for (ait = averages.begin(); ait != averages.end(); ait++) {
    printf("%8.2lf %s\n", ait->first, ait->second.c_str());
  }
}
Now when we run it, we not only see Tiger Woods, we don't lose any golfers.
UNIX> cat 2009*.txt | allmajors4 | head
284.50 Ross Fisher
284.75 Henrik Stenson
285.00 Lee Westwood
285.25 Graeme McDowell
285.25 Rory McIlroy
286.50 Camilo Villegas
287.00 Tiger Woods
287.00 Vijay Singh
287.25 Kenny Perry
287.75 Angel Cabrera
UNIX> cat 2009*.txt | allmajors4 | wc
50 150 1111
UNIX>
Our program will organize these as two data points that round to 6 and three that round to 2.
I'm going to write four versions of this program. They will all assume that our data points are nonnegative. The first uses a map, which it traverses like an array. It is in histomap1.cpp:
#include <iostream>
#include <map>
#include <cmath>
using namespace std;

int main()
{
  map <double, int> histo;
  double i;
  double d;

  while (cin >> d) histo[rint(d)]++;

  for (i = 0; i <= histo.rbegin()->first; i++) {
    if (histo[i] > 0) cout << i << " " << histo[i] << endl;
  }
  return 0;
}
It runs fine on our example above:
UNIX> echo 6.2 5.8 2.3 1.7 2.0 | histomap1
2 3
6 2
UNIX>
The second implementation (histomap2.cpp) is identical to the first, except we use an iterator to traverse the map rather than a numeric loop variable.
#include <iostream>
#include <map>
#include <cmath>
using namespace std;

int main()
{
  map <double, int> histo;
  map <double, int>::iterator hit;
  double d;

  while (cin >> d) histo[rint(d)]++;

  for (hit = histo.begin(); hit != histo.end(); hit++) {
    cout << hit->first << " " << hit->second << endl;
  }
  return 0;
}
And a third implementation (histomap3.cpp) also traverses the map, but uses the map as an array to print out the values. I'm only including the for loop here:
  for (hit = histo.begin(); hit != histo.end(); hit++) {
    d = hit->first;
    cout << d << " " << histo[d] << endl;
  }
Finally, a fourth implementation uses a vector instead of a map. The implementation is in histovector.cpp:
#include <iostream>
#include <vector>
#include <cmath>
using namespace std;

int main()
{
  vector <int> histo;
  double i;
  double d;

  while (cin >> d) {
    i = rint(d);
    if (histo.size() <= i) histo.resize(i+1, 0);   // Grow the vector so that index i exists.
    histo[i]++;
  }

  for (i = 0; i < histo.size(); i++) {             // Strictly less than: histo[histo.size()] is out of bounds.
    if (histo[i] > 0) cout << i << " " << histo[i] << endl;
  }
  return 0;
}
All the implementations produce the same output on the same input, so they are all correct. However, let's think about how they are each going to fare on different input files. For tiny input files, they should all work equivalently. For example, tinyinput has the five example numbers above, and all four programs perform equivalently:
UNIX> time histomap1 < tinyinput > /dev/null
0.001u 0.002s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
UNIX> time histomap2 < tinyinput > /dev/null
0.001u 0.002s 0:00.00 0.0% 0+0k 0+1io 0pf+0w
UNIX> time histomap3 < tinyinput > /dev/null
0.001u 0.001s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
UNIX> time histovector < tinyinput > /dev/null
0.001u 0.001s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
UNIX>
The input file sparseinput has four values, two of which round to 4 and two of which round to 1,000,000:
4.2 1000000.1 1000000.2 3.8
When we run our four implementations, we see that histomap1 performs the worst. Why? Because when we traverse the map, we look at histo[i] for each value of i from 0 to 1,000,000. Doing so inserts each value into the map, meaning our map has roughly 1,000,000 elements instead of two.
Histomap2 and histomap3 are much faster, because they only look at two values in the map. Histovector is slower than those two because it creates a vector of roughly 1,000,000 elements. It is still faster than histomap1, because the underlying implementation of the vector is an array, and the underlying implementation of the map is a tree. Thus, the creation of the map is O(n log(n)), where n = 1,000,000, whereas the creation of the vector is O(n).
UNIX> time histomap1 < sparseinput > /dev/null
1.081u 0.049s 0:01.13 99.1% 0+0k 0+0io 0pf+0w
UNIX> time histomap2 < sparseinput > /dev/null
0.001u 0.001s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
UNIX> time histomap3 < sparseinput > /dev/null
0.001u 0.001s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
UNIX> time histovector < sparseinput > /dev/null
0.023u 0.008s 0:00.03 66.6% 0+0k 0+0io 0pf+0w
UNIX>
Finally, the program biggen.cpp generates 10,000,000 numbers uniformly distributed between 0 and 1,000,000.
#include <cstdio>
#include <cstdlib>
using namespace std;

int main()
{
  int i;

  for (i = 0; i < 10000000; i++) printf("%.2lf\n", drand48()*1000000);
  return 0;
}
When we use it as input, the vector version of the program outperforms the others, because its insert operations take less time (and memory):
UNIX> time sh -c "biggen | histomap1 > /dev/null"
44.328u 0.839s 0:41.09 109.8% 0+0k 0+0io 0pf+0w
UNIX> time sh -c "biggen | histomap2 > /dev/null"
42.677u 0.824s 0:39.44 110.2% 0+0k 0+0io 0pf+0w
UNIX> time sh -c "biggen | histomap3 > /dev/null"
43.462u 0.850s 0:40.24 110.1% 0+0k 0+3io 0pf+0w
UNIX> time sh -c "biggen | histovector > /dev/null"
26.350u 0.722s 0:22.99 117.7% 0+0k 0+0io 0pf+0w
UNIX>
What's the lesson? Use the proper data structure for the job. Sometimes the characteristics of the input dictate what data structure you should use.