CS202 Lecture Notes

CS202 Lecture Notes - Unordered Sets and Maps

James S. Plank
Original Notes: March 24, 2023
Last revision: Fri Mar 24 11:13:27 EDT 2023
Directory: /home/jplank/cs202/Notes/Unordered

Introduction

C++11 introduced a very powerful pair of data structures: unordered_set and unordered_map. You use them almost exactly the same way as set and map, except for one key difference -- their elements are not kept in sorted order.

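That means that if you iterate through them, the elements come out in an arbitrary order. What you gain for that difference is speed. The unordered times below are average-case times; with enough hash collisions, a single insert(), find() or erase() can degrade to O(n):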

Operation	Set/Map	Unordered Set/Map
`insert()`	O(log n)	O(1)
`find()`	O(log n)	O(1)
`erase()`	O(log n)	O(1)
`begin()`	O(1)	O(1)
`end()`	O(1)	O(1)
Traversal	O(n)	O(n)

They are implemented using a hash table that gets resized and rehashed when it becomes too full. So, if you don't need the sorting feature of a set or map, you should use the unordered data structures.

An example

I have two very simple programs in src/store_find_set.cpp and src/store_find_unordered.cpp. They are the exact same, except the first one uses a set and the second uses an unordered set.

The programs take two files as arguments and a "Y/N" for printing. They read the words in the first file into a set or unordered set. Then they read the words in the second file and try to find each one. If you say "Y" for printing, they print everything. Otherwise, they simply print how many words they found.

The code is straightforward, and you should have no trouble reading it:

#include <set>
#include <string>
#include <iostream>
#include <fstream>
using namespace std;

int main(int argc, char **argv)
{
  ifstream data_file, to_find_file;
  bool print;
  set <string> data;
  set <string>::const_iterator f;
  string s;
  int found;

  /* Parse the command line. */

  try {
    if (argc != 4) throw (string) "usage: store_find_set data_file to_find_file print(Y/N)\n";
    data_file.open(argv[1]);
    if (data_file.fail()) throw (string) "can't open " + argv[1];
    to_find_file.open(argv[2]);
    if (to_find_file.fail()) throw (string) "can't open " + argv[2];
    print = (argv[3][0] == 'Y');
  } catch (const string &s) {
    cerr << s << endl;
    return 1;
  }
 
  /* Read the data file. */

  while (data_file >> s) data.insert(s);

  if (print) {
    cout << "Data:" << endl;
    for (f = data.begin(); f != data.end(); f++) cout << *f << endl;
    cout << endl;
  }
 
  data_file.close();

  /* Read the to_find_file, and try to find each word in the data file */

  found = 0;

  while (to_find_file >> s) {
    f = data.find(s);
    if (f != data.end()) found++;
    if (print) cout << s << ": " << ((f == data.end()) ? "Not found" : "Found") << endl;
  }

  if (print) cout << endl;
  cout << "Found " << found << endl;
  return 0;
}

First, let's run it and look at output. I have the following files:

txt/phones-small.txt - Ten random phone numbers.
txt/phones-big.txt - 100,000 random phone numbers.
txt/pfind-small.txt - Five phone numbers from txt/phones-small.txt and five random phone numbers.
txt/pfind-big.txt - 50,000 phone numbers from txt/phones-big.txt and 50,000 random phone numbers.

Below, I'll run the two programs on the small files:

UNIX> bin/store_find_set txt/phones-small.txt txt/pfind-small.txt Y
Data:
009-759-6084         # The only difference in the two outputs is that this is sorted.
062-707-0682
161-804-8876
276-780-5793
366-672-5281
392-698-1589
639-049-9982
874-615-3750
927-211-9485
943-433-6132

067-449-4119: Not found
634-692-2465: Not found
087-310-7338: Not found
062-707-0682: Found
750-158-1494: Not found
927-211-9485: Found
639-049-9982: Found
366-672-5281: Found
158-103-1526: Not found
276-780-5793: Found

Found 5
UNIX> bin/store_find_unordered txt/phones-small.txt txt/pfind-small.txt Y
Data:
927-211-9485         # And this is not sorted.
639-049-9982
366-672-5281
161-804-8876
943-433-6132
276-780-5793
009-759-6084
062-707-0682
874-615-3750
392-698-1589

067-449-4119: Not found
634-692-2465: Not found
087-310-7338: Not found
062-707-0682: Found
750-158-1494: Not found
927-211-9485: Found
639-049-9982: Found
366-672-5281: Found
158-103-1526: Not found
276-780-5793: Found

Found 5
UNIX>

Now, let's time them on the big files:

UNIX> time bin/store_find_set txt/phones-big.txt txt/pfind-big.txt N
Found 50000

real	0m0.522s
user	0m0.516s
sys	0m0.005s
UNIX> time bin/store_find_unordered txt/phones-big.txt txt/pfind-big.txt N
Found 50000

real	0m0.180s
user	0m0.174s
sys	0m0.005s
UNIX>

The difference is significant! Even though log n is pretty small (in this example, the base-2 log of 100,000 is roughly 17), it is enough to make the programs run at significantly different speeds. The comparison actually understates the gap, because both programs spend a lot of their time reading the files, and that cost is the same for each.

Regardless, I hope this is convincing enough for you to pay attention to these data structures and use them when you don't need sorting.

Is there an unordered_multiset?

Yes, and unordered_multimap.

Bottom Line

I know I'm repeating myself -- if you need storage and retrieval, and you don't need your data to be sorted, you will gain significant speedups if you use unordered_set and unordered_map instead of set and map.