CS494 Lecture Notes

CS494 Lecture Notes - MinHash

James S. Plank
Directory: /home/plank/cs494/Notes/Min-Hash
Original notes: October, 2017
Most recent revision: Tue Feb 27 00:12:53 EST 2024

Reference Material

Of course, there's Wikipedia. It has math. These notes are pretty much self-encapsulated.

The Problem Solved By MinHash

MinHash is a pretty esoteric algorithm. However, it is interesting because like a Bloom Filter, it leverages the randomness of hashing to solve a problem "quickly," but probabilistically. You'll see why I need the quotes below. This is why I typically teach MinHash right after I teach Bloom Filters.

It addresses the problem of set similarity. In other words, you have two sets of elements. How similar are they? The metric of similarity is the Jaccard similarity coefficient. This is a very natural metric, defined as follows:

Jaccard similarity of sets A and B equals |A ∩ B| / |A ∪ B|.

Think about this for a minute. If two sets are identical, then their Jaccard similarity is one. If they are completely disjoint, then their Jaccard similarity is zero. If set A is twice the size of set B and contains set B completely, then the two sets' Jaccard similarity is 0.5. The same is true if the two sets are the same size, and share exactly two thirds of their elements.

If we like Jaccard similarity as a metric, then think about how you can calculate it. In this discussion, let's assume that A and B are the same size. A straightforward way is to store all of the values of A in a C++ set, and then look up each value of B. That will let you calculate |A ∩ B| and |A ∪ B|, and then you can calculate the Jaccard similarity. What's the running time of that? Our favorite: O(n log(n)), where |A| = |B| = n. The space complexity is O(n), because the size of a tree is proportional to the number of nodes.

If the elements of A and B are large, we can hash them. Since O(n log(n)) is pretty fast, why would we need to improve it? Well, for the same reason as a Bloom Filter -- when they get really big, you start running out of time and space. So a heuristic is nice.

MinHash with k hash functions -- a Theoretician's Delight

This is the classic presentation of MinHash, which is elegant, but as we'll see, also useless. What you do is hash each element of A and each element of B, and look at the minimum hash value of each set. As it turns out, the probability of these two hashes being equal is exactly the Jaccard similarity of the two sets. Wow -- that is elegant!

Think about an example. If A and B are identical, then their minimum hashes will clearly be the same. If they are disjoint, and we assume no hash collisions (which I'm going to assume for the remainder of this writeup), then their minimum hashes will clearly be different. Let's take the example where A contains B and is twice its size. Then it is pretty intuitive that the minimum hash of B has a 50% chance of being the minimum hash of A. That's nice.

How about the fourth example -- A and B are the same size and contain 2/3 the same elements. In that case, their Jaccard similarity is 0.5. Let's look at an example:

A = { 1, 2, 3 } and B = { 2, 3, 4 }

There are four total elements, and when we hash them, each of the four has an equal probability (0.25) of being the minimum. When 1 or 4 is the minimum, then the two sets will have different minimum hashes. When 2 or 3 is the minimum, then the two sets will have the same minimum hashes. It works!

It's not hard to prove in general -- there are |A ∪ B| total elements, so there are |A ∪ B| total hashes. The minimum hash is a random element chosen from the |A ∪ B| hashes, so it has a probability of:

|A ∩ B| / |A ∪ B|

of being part of the intersection of the two sets. Again -- elegant.

Unfortunately, in order to approximate a probability from a sample drawn from that probability, we need a lot of samples. So, with this version of MinHash, you calculate k different hashes of each value of A and B. For each hash, you compare the minimum values, and from that you approximate the probability. The running time of this is O(nk) so if k is small, then this will be a win. It is also O(1) in terms of space, which is attractive. However, Wikipedia tells us that the error in this technique is 1/sqrt(k). So, for 10% error, you need 100 hash functions. For 1% error, you need 10,000 hash functions. Yick. That's friggin expensive, and if that's all there were to MinHash, I'd simply go with the O(n log(n)) algorithm. Or maybe even use a Bloom Filter! But, there's more, so keep reading.

MinHash with one hash function

While the theory of MinHash above is clean and simple, the theory in this version of MinHash is less clear to me, so I choose to ignore it and trust the theoreticians. My implementation confirms that they are right. What you do with this version of MinHash is calculate one hash of every value, and store the smallest k hashes in each set. Then you calculate the Jaccard similarity of those two smaller sets of values. Their Jaccard similarity will approximate A and B's similarity, and the error bound is the same as the previous MinHash. Why is this better? Well, the running time is O(n log(k)), and the space is O(k). This is a nice tradeoff on the exact calculations, and the "k hash" calculations.

This version of MinHash has an additional desirable property -- as k approaches n, the accuracy approaches 1, and when k ≥ n, the accuracy is one. That is not true for the k-hash function version of MinHash.

Can we use a Bloom Filter?

Sure -- create a Bloom Filter from A, and then calculate the Jaccard similarity by looking up each element of B. The running time of that is O(nk) and the space is O(n) (remember, though, that the k here is different from the MinHash k).

What about the false positives? Believe it or not, they can help you and not hurt you! Let's consider an example that we're going to see below. Two very large sets have 90% of their elements in common. Their Jaccard similarity is thus 90/110 = 0.818182. Suppose we use a Bloom filter where m = 10n and k = 7. Our predicted false positive rate will be 0.0082 (please see the Bloom Filter lecture notes for an explanation).

How does that impact the error of the Jaccard similarity calculation? Let's do a concrete example where A and B both contain 1,000,000 elements, with 900,000 of them being shared. We have put all of the elements of A into the Bloom Filter. Now we test every eleemnt of B. In that test, 900,000 of the elements will return a true positive. 820 of the remaining 100,000 elements will return a false positive. The remaining 99,180 elements will return a true negative. We can leverage the fact that we can estimate the false positive rate to calculate the true Jaccard similarity:

Let the number of true positives be TP.
Let the number of false positives be FP.
Let the number of true negatives be TN.
Let the false positive rate be R
|A ∩ B| = TP
|A ∪ B| = TP + 2FP + 2TN
The Jaccard similarity is: TP / (TP + 2FP + 2TN)

So now, with our Bloom Filter, we know (TP+FP), TN and R. We can use R to calculate FP as follows:

FP = R(FP+TN)
FP - R(FP) = R(TN)
(1 - R)(FP) = R(TN)
FP = R(TN)/(1-R)

That gives us all we need for a proper calculation! In our example above, we have:

TP+FP = 900,820
TN = 99,180
R = 0.0082
FP = 0.0082 * 99,180 / (1-0.0082) = 820
TP = 900,820-820 = 900,000
Our Jaccard similarity is thus 900,000/(900,000 + 2x820 + 2x99,180) = 0.818182.
How cool is that?

MinHash with m sets

Let's suppose we want to compare the pair-wise similarity of m sets. Then we have the following table to compare time and space costs:

Technique	Time Complexity	Space Complexity	Accuracy
Direct calculation using sets	O(m²n log(n))	O(mn)	Perfect
Direct calculation using sorted vectors	O(mn log(n) + m²n)	O(mn)	Perfect
MinHash with k hash functions	O(mnk + m²k)	O(mk)	Error is 1/sqrt(k)
MinHash with one hash function	O(mn log(k) + m²k)	O(mk)	Error is 1/sqrt(k)
Bloom Filters	O(mnk + m²n)	O(mn)	Error is really small.

I'm going to project that for this problem, you'll quantify solutions as follows:

If m and n are relatively small, then you simply do the direct calculation.
As they grow, a Bloom Filter solution will perform the best with the best accuracy.
There may be a point where the space requirement of the Bloom Filter renders it slower than Min-Hash with one hash function.
There will never be a time when MinHash with k hash functions is a good solution.

A Usage Scenario

I was bored, and I was trying to figure out a practical usage scenario, so this is what I came up with. First, I generated a data set representative of each faculty member in our department (as of Fall, 2017). What I did to create this data set was try to grab the 100 most cited papers of each faculty member. If I couldn't do this via Google Scholar, I simply did the best that I could. And you should note that some professors (e.g. Dr. Jantz and Dr. Schuchard) are quite early in their careers, and don't have 100 papers yet.

For each paper, I added every word of the paper's title, that was 5 letters or longer, to the professor's set. I didn't do anything smart, like remove plurals. I simply used the words (lower case, of course). So, for example, I had words like "erasure" "checkpointing" and "neuromorphic" that are indicative of my research career. I also had words like "approach", "impact" and "effect", which are less distinctive. The words for each professor is in the directory Words:

Plank	Abidi	Beck	Berry	Blalock	Cao	Costinett	Day
Dean	Djouadi	Dongarra	Emrich	Fathy	Gregor	Gu	Huang
Islam	Jantz	Langston	Fran Li	Husheng Li	Liu	Maclennan	Materassi
Mcfarlane	Mockus	Parker	Peterson	Pulgar	Qi	Rose	Schuchard
Stella Sun	Kai Sun	Tolbert	Tomsovic	BVZ	Wang	Wu

Now, why would we care about such data sets? Well, maybe we are interviewing faculty candidates, and for a given candidate, we want to see who he/she is most similar to. For that, I have three more data sets:

Ethan Miller, a professor from Santa Cruz who does research in storage systems, security, and cloud computing.
Eitan Yaakobi, an assistant professor from Technion who does more theoretical research in erasure codes.
Bob Tarjan, Turing award winning professor from Princeton who does graph theory.

All of these data sets are relatively small. Here's a histogram, including Miller, Yaakobi and Tarjan:

So, I calculated the Jaccard similarity for each pair of professors in the data set using a shell script (see below). And from that, I generated a heat map for similarity:

You can see some similar pairs of professors there: Me and Dr. Beck, Dr. (F) Li and Dr. Tomsovic, Dr. Costinett and Dr. Tolbert, and Dr. Islam and Dr. McFarlane. You can also see that Drs. Jantz, Schuard and Day are not overly similar to anyone, because they don't have very many publications yet.

A practical exploration

As we know, an algorithm is no fun unless you implement it, and by implementing, you gain some intuition about how algorithms work. So, we will implement all of the algorithms above. For performance testing, I'm going to use three files in the Names directory:

UNIX> wc Names/*
 1000000 1000000 16098423 Names/01.txt
 1000000 1000000 16089116 Names/02.txt
 1000000 1000000 16079936 Names/03.txt
 3000000 3000000 48267475 total
UNIX>

Each has a million unique names. 90% of the names in 01.txt are in 02.txt, and 90% of the names in 02.txt are in 03.txt. This will be a good stress-test for the programs.

Here's the bottom line of my implementations:

As you can see, you can make the direct calculation of Jaccard similarity pretty fast; however, the 1-hash version of MinHash is significantly faster up to k=10K or so, which gets you under 1% error. The k-hash version of MinHash is completely useless.

What about the Bloom Filter? Its performance depends on the settings of k and m. When you use the "recommended" settings of m = 10n and k=7, you get speed that is on par with "Jaccard Vector With Strings." In other words, not useful. On the flip side, the error is incredibly low -- 0.00026. Remember, the false positive rate is not the error, but it is leveraged to predict false positive and hone the error.

So, maybe we can get away with smaller values of m and k. Below is a graph where I plot time and error for various settings of m and k:

As you can see, the performance of pretty much all values of m are similar when k ≤ 4. Why would that be? Well, we can get all four hashes from one call to MD5(). The error is incredibly low -- you'd need k=1,000,000 for min-hash to get that low, and that's the number of items in the sets!

For that reason, in the overall performance graph above, I use m = 3n and k = 1.

Conclusion: Bloom Filters 1, Min-Hash 0.

Implementation #1: A shell script.

If someone put a gun to my head and told me to write a really inefficient shell script for min-hash, I'd use sort and wc:

You can calculate |A ∪ B| with "cat A B | sort -u | wc".
You can calculate |A| + |B| as "cat A B | wc".
Because |A| + |B| double-counts the elements in |A ∩ B|, you can calculate |A ∩ B| as |A| + |B| - |A ∪ B|.

My shell script is in minhash.sh:

if [ $# -ne 2 ]; then
  echo 'usage: sh minhash.sh f1 f2' >&2
  exit 1
fi

total=`cat $1 $2 | wc | awk '{ print $1 }'`
union=`cat $1 $2 | sort -u | wc | awk '{ print $1 }'`
intersection=`echo $total $union | awk '{ print $1-$2 }'`
jaccard=`echo $intersection $union | awk '{ print $1 / $2 }'`

echo "Total:        $total"
echo "Union:        $union"
echo "Intersection:  $intersection"
echo "Jaccard:      $jaccard"

It's not too slow on the professorial words files (this is on my 2.2 GHz Macbook Pro):

UNIX> time sh minhash.sh Words/01-Plank.txt Words/02-Abidi.txt 
Total:        552
Union:        511
Intersection:  41
Jaccard:      0.0802348
0.014u 0.015s 0:00.02 100.0%	0+0k 0+0io 0pf+0w
UNIX>

But it's pretty brutal on the larger files:

UNIX> time sh minhash.sh Names/01.txt Names/02.txt
Total:        2000000
Union:        1100000
Intersection:  900000
Jaccard:      0.818182
43.121u 0.227s 0:43.27 100.1%	0+0k 0+9io 0pf+0w
UNIX>

On the flip side, this gives us a way to verify that our other implementations are correct. Let's go ahead and make two files:

Words-Verify.txt contains the Jaccard similarity for each pair of files in Words, plus Miller, Yaakobi and Tarjan.
Names-Verify.txt contains the Jaccard similarity for each pair of files in Names:

I do this in the shell script make-verify.sh. It's pretty pokey:

UNIX> time sh make-verify.sh Words/* Miller.txt Yaakobi.txt Tarjan.txt > Words-Verify.txt
27.107u 30.734s 0:35.39 163.4%	0+0k 0+4io 0pf+0w
UNIX> time sh make-verify.sh Names/* > Names-Verify.txt
391.009u 1.973s 6:32.34 100.1%	0+0k 0+112io 0pf+0w
UNIX>

We can see the five professors who are most similar to me:

UNIX> grep '^Words.01' Words-Verify.txt | sort -nr -k 3 | head -n 5
Words/01-Plank.txt             Words/01-Plank.txt             1.000000
Words/01-Plank.txt             Words/03-Beck.txt              0.267717
Words/01-Plank.txt             Miller.txt                     0.160633
Words/01-Plank.txt             Words/16-Huang.txt             0.138614
Words/01-Plank.txt             Words/06-Cao.txt               0.129754
UNIX>

That does not surprise me at all (I've written quite a few papers with Dr. Beck).

calc-error.cpp -- calculate the mean error in two files of Jaccard similarity.

I won't go over this program -- it's straightforward, reading two files like Words-Verify.txt, and calculating the mean error between the two. At the moment, we don't have anything substantive to test it on, but if we run it on Words-Verify.txt and Words-Verify.txt, it will show zero error:

UNIX> g++ -o calc-error calc-error.cpp
UNIX> calc-error Words-Verify.txt Words-Verify.txt
0.00000000
UNIX> calc-error Names-Verify.txt Names-Verify.txt 
0.00000000
UNIX>

control.cpp -- just read the data

This program allows you to factor out the time spent reading the data from disk. For the Names file, this is around four tenths of a second:

UNIX> time control Names/*
0.411u 0.012s 0:00.42 100.0%	0+0k 0+0io 0pf+0w
UNIX>

jaccard-set-lazy.cpp -- using sets and some lazy programming.

In jaccard-set-lazy.cpp, I do a quick and dirty implementation of Jaccard similarity. For each file on the command line, I read the lines and insert them into a set. Then, for each pair of sets, I run through one and try to find it in the other. Here's the main loop -- easy enough for an in-class CS302 lab, and no comments necessary.

  for (i = 0; i < sets.size(); i++) {
    for (j = 0; j < sets.size(); j++) {
      Total = sets[i].size() + sets[j].size();
      Intersection = 0;
      for (lit = sets[j].begin(); lit != sets[j].end(); lit++) {
        if (sets[i].find(*lit) != sets[i].end()) Intersection++;
      }
      Union = Total - Intersection;
      printf("%-30s %-30s %.6lf\n", argv[i+1], argv[j+1], Intersection / Union);
    }
  }
}

I compile it with -O3, and while it is smoking fast on the small data set, it's pretty slow on the large one. You'll see that I verify it against the shell script, and I get zero error, as anticipated.

UNIX> time jaccard-set-lazy Words/* Miller.txt Yaakobi.txt Tarjan.txt > tmp.txt
0.062u 0.001s 0:00.06 100.0%	0+0k 0+3io 0pf+0w
UNIX> calc-error Words-Verify.txt tmp.txt
0.000000
UNIX> time jaccard-set-lazy Names/* > tmp.txt
7.696u 0.120s 0:07.82 99.8%	0+0k 0+3io 0pf+0w
UNIX> calc-error Names-Verify.txt tmp.txt
0.000000
UNIX>

jaccard-set-linear.cpp and jaccard-set-sort.cpp -- Speeding the implementation up

I call the above program "lazy" because the code that calculates Jaccard similarity is O(n log(n)) rather than O(n) like it could be -- I fix that in jaccard-set-linear.cpp: Instead of calling find(), I use two iterators and run them through both sets simultaneously:

  for (i = 0; i < sets.size(); i++) {
    for (j = 0; j < sets.size(); j++) {
      Total = sets[i].size() + sets[j].size();
      Intersection = 0;
      liti = sets[i].begin(); 
      litj = sets[j].begin(); 
      while (liti != sets[i].end() && litj != sets[j].end()) {
        if (*liti == *litj) {
          Intersection++;
          liti++;
          litj++;
        } else if (*liti < *litj) {
          liti++;
        } else {
          litj++;
        }
      }
      Union = Total - Intersection;
      printf("%-30s %-30s %.6lf\n", argv[i+1], argv[j+1], Intersection / Union);
    }
  }

This speeds us up a little:

UNIX> time jaccard-set-linear Names/* > tmp.txt
5.436u 0.116s 0:05.56 99.6%	0+0k 0+0io 0pf+0w
UNIX> calc-error Names-Verify.txt tmp.txt
0.000000
UNIX>

Instead of using a set, we can read the lines into vectors and sort them. While that has the same big-O complexity as the set implementation, it should be more time and space efficient. The code is in jaccard-sort.cpp, and it's really straightforward:

  /* Sort the vectors. */

  for (i = 0; i < sets.size(); i++) sort(sets[i].begin(), sets[i].end());

  /* For each pair of sets, calculate the Jaccard similarity directly. */

  for (i = 0; i < sets.size(); i++) {
    for (j = 0; j < sets.size(); j++) {
      Total = sets[i].size() + sets[j].size();
      Intersection = 0;
      ip = 0;
      jp = 0;
      while (ip < sets[i].size() && jp < sets[j].size()) {
        if (sets[i][ip] == sets[j][jp]) {
          Intersection++;
          ip++;
          jp++;
        } else if (sets[i][ip] < sets[j][jp]) {
          ip++;
        } else {
          jp++;
        }
      }
      Union = Total - Intersection;
      printf("%-30s %-30s %.6lf\n", argv[i+1], argv[j+1], Intersection / Union);
    }
  }
}

It also is more than twice as fast:

UNIX> time jaccard-sort Names/* > tmp.txt
1.782u 0.054s 0:01.84 99.4%	0+0k 0+3io 0pf+0w
UNIX> calc-error Names-Verify.txt tmp.txt
0.000000
UNIX>

A final speedup -- using hashes instead of strings

Our final direct calculation code does one last speedup -- instead of storing and comparing the strings, let's calculate hashes and store/compare them. That's in jaccard-sort-hash.cpp. Here's the code that calculates the hashes using the MD5 hash function. I copy the first 8 bytes into an unsigned long long, and store that instead of the string:

    while (getline(f, s)) {
      MD5((const unsigned char *) s.c_str(), s.size(), md5_buf);
      memcpy(&ull, md5_buf, sizeof(unsigned long long));
      sets[i-1].push_back(ull);
    }
    f.close();
  }

The rest of the code is identical. That shaves 21% off the running time:

UNIX> time jaccard-sort-hash Names/* > tmp.txt
1.408u 0.025s 0:01.43 99.3%	0+0k 0+0io 0pf+0w
UNIX> calc-error Names-Verify.txt tmp.txt
0.000000
UNIX>

Min Hash with k hash functions

This is a little bit of a pain to code. If you recall from the Bloom Filter Lecture Notes, you can create 256 k hash functions from a single good one by adding an extra byte to the data, and then incrementing it for each hash function. With MD5 generating 16-byte hashes, we can generate 1024 4-byte hashes with this technique. Unfortunately, with an error of 1/sqrt(k), we may have to have values of k well over 1024 to get the desired error bounds, so we're going to do something different.

What we're going to do is figure out how many bytes of hashes we need, and allocate room, padded to a multiple of 16. Then, for each chunk of 16 bytes, we are going to calculate an MD5 hash, where we XOR a number from 0 to 2¹⁶-1 with the first two bytes of the string. After calculating each hash, we'll XOR with the number again, which will turn the two bytes back to their original values. This will generate up to 2²⁰ bytes of hashes, which should be enough. We are going to make sure that our strings have at least two characters, so that this works.

For each data set, we'll maintain a similar region of hashes, and we'll use memcmp() to makes sure that we maintain the minimum hash for each of the k hashes. We're going to have the number of bytes in each hash be a variable on the command line. That way, we can use two-byte hashes when the number of elements of the set is small, and three or four-byte hashes when it's bigger.

Let's look at some code, and print some state. My code is in min-hash-k.cpp. Here are the important variable declarations:

int main(int argc, char **argv)
{
  vector <string> files;                // Filenames
  vector <unsigned char *> min_hashes;  // The minimum hashes for each file.
  int k;                                // The number of hashes
  int bbh;                              // Bytes per hash 
  int hash_buf_size;                    // Size of the hash buffers (k*bbh) padded to 16
  unsigned int ff;                      // An integer that holds 0xffffffff
  unsigned char *hash;                  // Where we calculate the hashes for each string.

First, we calculate how big the hash buffers should be, and then we allocate a buffer for hash and a vector of hash buffers for each file. We initialize the file buffers to all f's, and print them out:

  /* Calculate the number of bytes for all of the hashes, and allocate a
     hash buffer for temporary use, and a hash buffer for each data set
     to hold the minimum hashes for each data set.   Set each byte of 
     these buffers to 0xff, which is their maximum value, regardless of the
     size of the hash. */

  hash_buf_size = k * bbh;
  if (hash_buf_size % 16 != 0) hash_buf_size += (16 - hash_buf_size % 16);
  ff = 0xffffffff;

  hash = (unsigned char *) malloc(hash_buf_size);
  min_hashes.resize(files.size());
  for (i = 0; i < min_hashes.size(); i++) {
    min_hashes[i] = (unsigned char *) malloc(hash_buf_size);
    for (j = 0; j < hash_buf_size; j += sizeof(int)) {
      memcpy(min_hashes[i]+j, &ff, sizeof(int));
    }
  }

  /* Error check code #1: Print out the initial values of all the hashes,
     which should all be ff's */

  for (i = 0; i < min_hashes.size(); i++) {
    printf("%20s ", files[i].c_str());
    for (j = 0; j < k * bbh; j++) printf("%02x", min_hashes[i][j]);
    printf("\n");
  }

Let's test this on the three files in Names, using 6 for k ahd 3 for bbh. We need 18 bytes for the hash regions, which, when padded to 16 is 32 bytes:

UNIX> min-hash-k 6 3 Names/*.txt
        Names/01.txt ffffffffffffffffffffffffffffffffffff
        Names/02.txt ffffffffffffffffffffffffffffffffffff
        Names/03.txt ffffffffffffffffffffffffffffffffffff
UNIX> echo ffffffffffffffffffffffffffffffffffff | wc
       1       1      37
UNIX>

That last command verifies that we have 18 bytes in the hash buf -- each byte is two hex digits, and the echo command adds a newline, so 37 characters is the correct number of characters.

Next, here is the code that reads in each string, calculates the hashes, and then compares the k hashes to the min_hashes, setting min_hashes when the hashes are smaller. This also has the code that xor's the first two bytes of the string, so that you can generate up to 2¹⁶ different MD5 hashes:

  /* Read the data sets.  For each value, you're going to calculate the k hashes
     and then update the minimum hashes for the data set. */

  for (findex = 0; findex < files.size(); findex++) {
    f.clear();
    f.open(files[findex].c_str());
    if (f.fail()) { perror(files[findex].c_str()); exit(1); }
    while (getline(f, s)) {
      if (s.size() < 2) {
        fprintf(stderr, "File %s - can't have one-character strings.\n", files[findex].c_str());
        exit(1);
      }

      /* Here is where we calculate the hash_buf_size bytes of hashes. */

      j = 0;
      sz = s.size();
      for (i = 0; i < hash_buf_size; i += 16) {
        s[0] ^= (j & 0xff);
        s[1] ^= (j >> 8);
        MD5((unsigned char *) s.c_str(), sz, hash+i);
        s[0] ^= (j & 0xff);
        s[1] ^= (j >> 8);
        j++;
      }

      /* And here is where we compare each unit of bbh bytes with the unit in min_hashes,
         and if it's smaller, we set the bbh bytes of min_hashes to the bytes in hash: */

      j = 0;
      for (i = 0; i < k * bbh; i += bbh) {
        if (memcmp(hash+i, min_hashes[findex]+i, bbh) < 0) {
          memcpy(min_hashes[findex]+i, hash+i, bbh);
        }
      }

      /* Error check code #2: Print the hashes and the min hashes. */

      printf("%-20s %-20s\n", files[findex].c_str(), s.c_str());
      printf("  hash: ");
      for (i = 0; i < k*bbh; i++) printf("%s%02x", (i%bbh == 0) ? " " : "", hash[i]);
      printf("\n  minh: ");
      for (i = 0; i < k*bbh; i++) printf("%s%02x", (i%bbh == 0) ? " " : "", min_hashes[findex][i]);
      printf("\n");
    }
    f.close();
  }

Let's test -- I'm going to create a small file with three words, and then take a look at the output when k is 3 and bbh is 2:

UNIX> ( echo Give ; echo Him ; echo Six ) > junk.txt
UNIX> cat junk.txt
Give
Him
Six
UNIX> min-hash-k 3 2 junk.txt
junk.txt             Give                
  hash:  2f35 5d9f a7ac
  minh:  2f35 5d9f a7ac
junk.txt             Him                 
  hash:  b582 f0dd d1c3
  minh:  2f35 5d9f a7ac
junk.txt             Six                 
  hash:  e6fb c0b9 673f
  minh:  2f35 5d9f 673f
UNIX>

As you can see, with the string "Give", all three hashes were set to the hashes of "Give". that's because they were all initialized to 0xffff. With "Him", the three hashes were all bigger than the hashes for "Give", so min_hash was unchanged. With "Six", the last hash, 0x673f, was smaller than 0xa7ac, so the third hash of min_hash was changed.

Let's do a second test where we have to calculate a second hash, just to make sure that it's different from the first:

UNIX> min-hash-k 2 16 junk.txt
junk.txt             Give                
  hash:  2f355d9fa7accc561d3edc335de2fbcf e5d9de39f7ca1ba2637e5640af3ae8aa
  minh:  2f355d9fa7accc561d3edc335de2fbcf e5d9de39f7ca1ba2637e5640af3ae8aa
junk.txt             Him                 
  hash:  b582f0ddd1c3852810de9cb577293351 07ec74942cd2e4040c9c6c62cfdfaa4f
  minh:  2f355d9fa7accc561d3edc335de2fbcf 07ec74942cd2e4040c9c6c62cfdfaa4f
junk.txt             Six                 
  hash:  e6fbc0b9673f8c86726688d7607fc8f5 bd717113c3cafa1681fe96b05c8b3645
  minh:  2f355d9fa7accc561d3edc335de2fbcf 07ec74942cd2e4040c9c6c62cfdfaa4f
UNIX>

And, let's do a final test where we generate over 256 hashes, to make sure that our XOR code is working on the second byte as well. To do that, we'll use k equal to 258 and bbh equal to 16 again. We'll grep for "hash" to isolate the lines that have the hash values, and then we'll put those lines into a temporary file:

UNIX> min-hash-k 258 16 junk.txt | grep hash > tmp.txt

Next, let's use awk to print out all of the hash values, one per line:

UNIX> awk '{ for (i=2; i <= NF; i++) print $i }' < tmp.txt | head
2f355d9fa7accc561d3edc335de2fbcf
e5d9de39f7ca1ba2637e5640af3ae8aa
ee759d6725400149983a4b7ba847130f
096f6d7168640882498c00b9142932e7
3501ace2a69bb89b3981554306a60a57
a5ebcbceb6e1ec0356db8e4b7faf5d94
e37b483220e5802754d870863327567d
b7f4c393b08b1866b8409e3f169fdc89
7a3c17bf9a2d9b42123cc229ecb04bce
4276d507cb57cf5eedac31e10cbe3f54
UNIX>

Finally, let's count the unique hash values with sort -u. The answer should be 258*3 = 774.

UNIX> awk '{ for (i=2; i <= NF; i++) print $i }' < tmp.txt | sort -u | wc
     774     774   25542
UNIX>

Ok -- I'm happy. Now that we have min_hashes set for every data file, we simply compare them. Here's the code for that. Do you see how nice memcmp() and memcpy() have been to use?

  /* Error check #3: Let's print out the min hashes, so that we can double-check. */

  for (findex = 0; findex < files.size(); findex++) {
    printf("%-10s ", files[findex].c_str());
    for (i = 0; i < k*bbh; i++) printf("%s%02x", (i%bbh == 0) ? " " : "", min_hashes[findex][i]);
    printf("\n");
  }

  /* For each pair of files, compare the hashes. */

  for (i = 0; i < files.size(); i++) {
    for (j = 0; j < files.size(); j++) {
      Intersection = 0;
      for (l = 0; l < k*bbh; l += bbh) {
        if (memcmp(min_hashes[i]+l, min_hashes[j]+l, bbh) == 0) Intersection++;
      }
      printf("%-30s %-30s %.6lf\n", files[i].c_str(), files[j].c_str(), Intersection / (double) k);
    }
  }
  exit(0);
}

As always, there's a temptation to simply let this code rip on our bigger data sets, but let's do a small test first, to make sure that everything makes sense. Here, I'm going to make three files with three words each. The first two share two words, and the second two share one word. The first and third share zero words:

UNIX> rm -f junk*.txt
UNIX> ( echo Give ; echo Him ; echo Six ) > junk1.txt
UNIX> ( echo Give ; echo Him ; echo Ten ) > junk2.txt
UNIX> ( echo Eight ; echo Nine ; echo Ten ) > junk3.txt
UNIX> min-hash-k 5 4 junk1.txt junk2.txt junk3.txt
junk1.txt   2f355d9f 673f8c86 10de9cb5 5de2fbcf 07ec7494
junk2.txt   2f355d9f 38dca5ef 10de9cb5 5de2fbcf 07ec7494
junk3.txt   24db1121 38dca5ef 172c33cf 3def2f3f 08cd928e
junk1.txt                      junk1.txt                      1.000000
junk1.txt                      junk2.txt                      0.800000
junk1.txt                      junk3.txt                      0.000000
junk2.txt                      junk1.txt                      0.800000
junk2.txt                      junk2.txt                      1.000000
junk2.txt                      junk3.txt                      0.200000
junk3.txt                      junk1.txt                      0.000000
junk3.txt                      junk2.txt                      0.200000
junk3.txt                      junk3.txt                      1.000000
UNIX>

You can verify easily by looking at the hashes, that junk1.txt and junk2.txt share hashes 0, 2, 3 and 4. junk2.txt and junk3.txt share hash 1. And junk1.txt and junk3.txt share nothing. The output looks good! We'll comment out our error checking code, and do a little testing on our two data sets. Let's start with the Names data set, because it's easier to look at the output.

UNIX> time min-hash-k 10 4 Names/*.txt
Names/01.txt  00002300 000021a5 000000bd 000002f9 000008a4 00001488 000017f6 000008c9 00000dc5 00000dc4
Names/02.txt  00002300 00000126 000000bd 000002f9 000008a4 00001488 000017f6 00000628 00000dc5 00000dc4
Names/03.txt  00002300 00000126 000000bd 000002f9 000008a4 00001488 000017f6 00000628 00000dc5 00000dc4
Names/01.txt                   Names/01.txt                   1.000000
Names/01.txt                   Names/02.txt                   0.800000
Names/01.txt                   Names/03.txt                   0.800000
Names/02.txt                   Names/01.txt                   0.800000
Names/02.txt                   Names/02.txt                   1.000000
Names/02.txt                   Names/03.txt                   1.000000
Names/03.txt                   Names/01.txt                   0.800000
Names/03.txt                   Names/02.txt                   1.000000
Names/03.txt                   Names/03.txt                   1.000000
2.837u 0.021s 0:02.86 99.6%	0+0k 2+0io 0pf+0w
UNIX>

I forgot to comment out the error checking code, but that's ok, because it's a good sanity check. Does that look buggy to you? It scared me at first, because every hash started with 0000. However, think about it -- you have 1,000,000 words. What are the chances that one will start with 0000? The answer to that is 1 - (65535/65536)^1,000,000. You can calculate that one by hand:

UNIX> echo "" | awk '{ l=1.0; for (i = 0; i < 1000000; i++) { l *= (65535/65536); print l}}' | head
0.999985
0.999969
0.999954
0.999939
0.999924
0.999908
0.999893
0.999878
0.999863
0.999847
UNIX> echo "" | awk '{ l=1.0; for (i = 0; i < 1000000; i++) { l *= (65535/65536); print l}}' | tail
2.36157e-07
2.36154e-07
2.3615e-07
2.36146e-07
2.36143e-07
2.36139e-07
2.36136e-07
2.36132e-07
2.36128e-07
2.36125e-07
UNIX>

Yes, the chances are really high that all of the numbers start with 0000. Ok, back to the test:

UNIX> cat Names-Verify.txt
Names/01.txt                   Names/01.txt                   1.000000
Names/01.txt                   Names/02.txt                   0.818182
Names/01.txt                   Names/03.txt                   0.666667
Names/02.txt                   Names/01.txt                   0.818182
Names/02.txt                   Names/02.txt                   1.000000
Names/02.txt                   Names/03.txt                   0.818182
Names/03.txt                   Names/01.txt                   0.666667
Names/03.txt                   Names/02.txt                   0.818182
Names/03.txt                   Names/03.txt                   1.000000
UNIX> time min-hash-k 10 4 Names/*.txt > junk.txt
2.933u 0.015s 0:02.95 99.6%	0+0k 0+2io 0pf+0w
UNIX> cat junk.txt
Names/01.txt                   Names/01.txt                   1.000000
Names/01.txt                   Names/02.txt                   0.800000
Names/01.txt                   Names/03.txt                   0.800000
Names/02.txt                   Names/01.txt                   0.800000
Names/02.txt                   Names/02.txt                   1.000000
Names/02.txt                   Names/03.txt                   1.000000
Names/03.txt                   Names/01.txt                   0.800000
Names/03.txt                   Names/02.txt                   1.000000
Names/03.txt                   Names/03.txt                   1.000000
UNIX> calc-error Names-Verify.txt junk.txt
0.074074
UNIX> time min-hash-k 100 4 Names/*.txt > junk.txt
20.184u 0.017s 0:20.20 99.9%	0+0k 0+4io 0pf+0w
UNIX> cat junk.txt
Names/01.txt                   Names/01.txt                   1.000000
Names/01.txt                   Names/02.txt                   0.870000
Names/01.txt                   Names/03.txt                   0.710000
Names/02.txt                   Names/01.txt                   0.870000
Names/02.txt                   Names/02.txt                   1.000000
Names/02.txt                   Names/03.txt                   0.820000
Names/03.txt                   Names/01.txt                   0.710000
Names/03.txt                   Names/02.txt                   0.820000
Names/03.txt                   Names/03.txt                   1.000000
UNIX> calc-error Names-Verify.txt junk.txt
0.021549
UNIX> time min-hash-k 1000 4 Names/*.txt > junk.txt
197.086u 0.038s 3:17.12 99.9%	0+0k 0+2io 0pf+0w
UNIX> calc-error Names-Verify.txt junk.txt
0.011326
UNIX>

Well, my conclusion from this is that this version of min-hash, much like bubble-sort, is almost irresponsible to teach. It makes sense -- for this data set, log(n) is 20, so setting k to 1000 is going to be a clear loser time-wise. And all for an accuracy lower than 99 percent? This is terrible. As you'll see below, the one-hash version of min-hash is very nice, so this version, with all of its yucky coding, is destined for the trash heap. So it goes.

MinHash with one hash function -- this is much better

For this version of MinHash, you need to keep track of the smallest (or biggest) k hashes. The easiest way to do that is to go back to using a set. The relevant code is in min-hash-1.cpp. I use unsigned long long's to store the hashes, and I keep the k largest, because it is easier to delete from the front of a set rather than from the back. In the code below, I print out the hashes so that I can error check.

  for (findex = 0; findex < files.size(); findex++) {
    f.clear();
    f.open(files[findex].c_str());
    if (f.fail()) { perror(files[findex].c_str()); exit(1); }
    while (getline(f, s)) {
      MD5((unsigned char *) s.c_str(), s.size(), hash);
      memcpy((unsigned char *) &ll, hash, sizeof(long long));
      
      /* Error check code 1: Print out the hashes. */
      printf("%-20s 0x%016llx\n", s.c_str(), ll);

      if (min_hashes[findex].size() < k) {
        min_hashes[findex].insert(ll);
      } else {
        liti = min_hashes[findex].begin();
        if (ll > *liti) {
          min_hashes[findex].insert(ll);
          if (min_hashes[findex].size() > k) min_hashes[findex].erase(liti);
        }
      }
    }
  }

  /* Error check code #2: Print out the min hashes. */

  for (findex = 0; findex != files.size(); findex++) {
    printf("%s\n", files[findex].c_str());
    for (liti = min_hashes[findex].begin(); liti != min_hashes[findex].end(); liti++) {
      printf("  0x%016llx\n", *liti);
    }
  }

Here's my error checking code. First, I set k to ten, so that it keeps all of the hashes:

UNIX> ( echo Give ; echo Him ; echo Six ; echo Touchdown ; echo Tennessee ) > junk1.txt
UNIX> min-hash-1 10 junk1.txt
Give                 0x56ccaca79f5d352f
Him                  0x2885c3d1ddf082b5
Six                  0x868c3f67b9c0fbe6
Touchdown            0x8a3287e254ab0f3b
Tennessee            0x5265f51b083bc5a5
junk1.txt
  0x2885c3d1ddf082b5
  0x5265f51b083bc5a5
  0x56ccaca79f5d352f
  0x868c3f67b9c0fbe6
  0x8a3287e254ab0f3b
UNIX>

And next I set k to two, and keep the two largest hashes:

UNIX> min-hash-1 2 junk1.txt
Give                 0x56ccaca79f5d352f
Him                  0x2885c3d1ddf082b5
Six                  0x868c3f67b9c0fbe6
Touchdown            0x8a3287e254ab0f3b
Tennessee            0x5265f51b083bc5a5
junk1.txt
  0x868c3f67b9c0fbe6
  0x8a3287e254ab0f3b
UNIX>

Now, I'll calculate the Jaccard similarity of the set, just like I did above in jaccard-set-linear.cpp. First, test for correctness:

UNIX> min-hash-1 10000000 Names/* > junk.txt
UNIX> calc-error Names-Verify.txt junk.txt
0.000000
UNIX>

And let's time it and check accuracy.

UNIX> time min-hash-1 10 Names/* > junk.txt
1.162u 0.011s 0:01.17 100.0%	0+0k 0+1io 0pf+0w
UNIX> calc-error Names-Verify.txt junk.txt
0.062160
UNIX> time min-hash-1 100 Names/* > junk.txt
1.231u 0.012s 0:01.24 100.0%	0+0k 0+0io 0pf+0w
UNIX> calc-error Names-Verify.txt junk.txt
0.054892
UNIX> time min-hash-1 1000 Names/* > junk.txt
1.242u 0.013s 0:01.25 100.0%	0+0k 0+1io 0pf+0w
UNIX> calc-error Names-Verify.txt junk.txt
0.006717
UNIX> time min-hash-1 10000 Names/* > junk.txt
1.218u 0.013s 0:01.23 99.1%	0+0k 0+3io 0pf+0w
UNIX> calc-error Names-Verify.txt junk.txt
0.003341
UNIX>

It should be clear that this version is far superior to the other version.

Bloom Filter Implementation

I've written this up and timed it, but I don't have any commentary or writeup. Someday.