Challenge 05: Comparing different hash functions


Problem overview

This will be a "hands on" version of the discussion in Dr. Plank's notes. Just as in class on Tuesday, we'll be using separate chaining, so you should have a vector of lists (of type string).

Instead of submitting actual code, submit a brief report (report.txt) on Canvas giving the runtime of each of the small toy programs below. Again, I've benchmarked these on tesla1 and have a sense of how long they should take.

Inspiration

This exercise gives a more practical look at collisions, load factors, and the value of better hash functions, using string data as input.

Input / Output

You will write a series of 10-15 line programs that have this skeleton:

#include <iostream>
#include <vector>
#include <list>
#include <algorithm>
#include <fstream>
#include <string>

using namespace std;

// from Dr. Plank's lecture notes on hashing: the simplest hash function you can think of
unsigned int bad_hash(const string &s)
{
  size_t i;
  unsigned int h;

  h = 0;

  for (i = 0; i < s.size(); i++) {
    h += s[i];
  }
  return h;
}

int main() {

  string line;
  int cnt;
  vector<list<string> > data;
  data.resize(200000);

  int h;
  int collisions = 0;

  while (getline(cin, line)) {

    h = bad_hash(line) % 200000;
    data[h].push_back (line);

    if (data[h].size() > 1)
      collisions++;

  }

  // compute load factor of hash table here

}
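
If it helps, here is one possible way to flesh out the skeleton for tasks 2 and 3 below. This is only a sketch: the names total, min_h, and max_h are illustrative rather than required, the 200,000-bucket table and bad_hash are taken from the skeleton above, and the load factor is computed as the number of stored strings divided by the number of buckets.

// Sketch: the skeleton's bad_hash plus load factor and min/max bookkeeping (tasks 2 and 3)
#include <iostream>
#include <vector>
#include <list>
#include <string>

using namespace std;

unsigned int bad_hash(const string &s)
{
  unsigned int h = 0;
  for (size_t i = 0; i < s.size(); i++) h += s[i];
  return h;
}

int main() {

  string line;
  vector<list<string> > data;
  data.resize(200000);

  int collisions = 0;
  int total = 0;                   // how many strings we inserted
  int min_h = 200000, max_h = -1;  // smallest / largest bucket index seen

  while (getline(cin, line)) {

    int h = bad_hash(line) % 200000;
    if (h < min_h) min_h = h;
    if (h > max_h) max_h = h;

    data[h].push_back(line);
    if (data[h].size() > 1) collisions++;
    total++;
  }

  // load factor = number of stored strings / number of buckets
  cout << "load factor: " << (double) total / data.size() << endl;
  cout << "min h: " << min_h << "   max h: " << max_h << endl;
  cout << "collisions:  " << collisions << endl;

  return 0;
}

Run it the same way as the skeleton, e.g. "./a.out < names_100000.txt".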

In your groups, complete the following tasks:

  1. We will be using the same list of 100k names from Dr. Plank that we used on Tuesday.
  2. Compute the load factor of your hash table using the "bad hash" function and include it in the report you will upload on Friday.
  3. Also compute the min and max values of h in your code and include those values in your report.
  4. Swap out the bad hash function for DJB (a sketch appears after this list). Repeat steps 2 and 3. Which hash function do you think is better, and why?
  5. Now try using the STL find algorithm. In my sample code, I put the names in a simple vector and searched for the 80,000th through the 95,000th name in the entire vector (a timing sketch appears after this list). If you have the time/desire, you can use file streams. A simple way to generate a subset of names is to use the UNIX tool tail as follows: "tail -n 15000 names_100000.txt > query.txt", which will save the last 15k lines in a new file. Report the time (using "time ./a.out" or similar) of three separate runs using DJB. For reference, searching for the 80,000th through the 95,000th name in a simple vector containing all 100k names takes roughly 13 seconds on tesla1.
  6. Swap out the DJB hash function for ACM_hash. Repeat steps 2, 3 and 5 and include the values in the report you will upload on Canvas.
  7. In your own words, is the ACM_hash better, worse, or roughly the same as DJB? Why?
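
For task 4, here is a sketch of a DJB-style hash you could drop into the skeleton in place of bad_hash (it assumes the same includes and using namespace std). This is the common djb2 formulation, starting at 5381 and folding in each character as h = h*33 + c; double-check it against the version in Dr. Plank's notes.

// djb2-style hash: h starts at 5381 and each character folds in as h = h*33 + c
unsigned int djb_hash(const string &s)
{
  unsigned int h = 5381;

  for (size_t i = 0; i < s.size(); i++) {
    h = (h << 5) + h + s[i];    // same as h * 33 + s[i]
  }
  return h;
}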

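For task 5, here is a rough sketch of the vector-plus-find baseline described above. It reads every name from standard input into a plain vector and then linearly searches for 15,000 of them (indices 80,000 through 94,999, an approximation of "the 80,000th to the 95,000th name"). Time it with something like "time ./a.out < names_100000.txt". A hash-table version of the same experiment would instead hash each query with DJB and search only the matching bucket's list.

// Sketch: linear-search baseline using the STL find algorithm on a plain vector
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>

using namespace std;

int main() {

  string line;
  vector<string> names;

  // read all of the names from standard input into a plain vector
  while (getline(cin, line)) names.push_back(line);

  // linearly search the whole vector for 15,000 of the names
  int found = 0;
  for (size_t i = 80000; i < 95000 && i < names.size(); i++) {
    if (find(names.begin(), names.end(), names[i]) != names.end()) found++;
  }

  cout << "found " << found << " names" << endl;
  return 0;
}
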
Rubric

We will grade your report using the following pass/fail rubric:

+0.5  Test values and runtimes reported
+0.5  Questions above also answered in your report

Testing your code prior to submission

I'm not going to run your code, so there's no need to worry about using git. Copy and paste the skeleton above if you want to, but it's not required.

Submission

To submit your report, you must upload a report.txt on Canvas prior to the deadline. We highly recommend that all members of a group upload a version prior to the deadline, as this material will likely show up on the final exam.