CS494 Lecture Notes - The Rabin-Karp String-Searching Algorithm


Reference Material

There's always Wikipedia: https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm. This page is intended to be an introduction, and then a demonstration of how you can use Rabin-Karp to search faster than the find() method of C++ strings.

The Problem Solved By The Rabin-Karp Algorithm

You have two strings -- call them p for "pattern" and f for "file". Your goal is to find the first occurrence of p in f (or all occurrences, etc). If the gun were to your head, and you didn't have language features to help you write a program to solve this problem, you'd probably do something like:

  for (i = 0; i + p.size() <= f.size(); i++) {
    if (strncmp(p.c_str(), f.c_str()+i, p.size()) == 0) {
      printf("Found: %d\n", (int) i);
      return 0;
    }
  }
  printf("Not found.\n");
  return 0;

The worst-case running time of this is O(p.size() * f.size()), which is slow indeed. For example, if p is a thousand 'a' characters followed by a 'b', and f is a million 'a' characters, then nearly every call to strncmp() scans a thousand characters before failing. With Rabin-Karp, you reduce this to O(f.size()) expected time.


How does it work?

It revolves around something called a "rolling hash", and falls out quite naturally from the properties of the rolling hash. What is a "rolling hash"? It is a hashing algorithm that lets you do the following: given the hash of a string a+s (a character a followed by a string s), you can compute the hash of s+b (that same string s followed by a character b) in O(1) time, without rehashing s. Intuitively, the hash function lets you "subtract" a and "add" b. Let's think of a simple (and bad) hash function that would work -- adding up the characters in a string. Then you can literally subtract a's value and add b's value.

We know that adding ASCII values makes for a really bad hash function, so for the moment, let's just assume that we have a good rolling hash function.

What you do is hash p. Then consider the string s made from the first p.size() characters of f, and hash s. Then you repeat the following:

  1. If the hash of s equals the hash of p, verify with a direct comparison like strncmp(), since two different strings can hash to the same value. If they really match, you're done.
  2. Otherwise, "roll" the hash: "subtract" the first character of s and "add" the next character of f, so that s becomes the next window of p.size() characters of f. If there is no next character, then p isn't in f.

Because the rolling hash function does this modification in O(1) time, this algorithm runs in O(f.size()) expected time, independent of p's size (a hash collision forces an O(p.size()) verification, but with a good hash function, collisions are rare). That's awesome.

Let's see it in action

So, in 2018, I created a rolling hash function using Galois-Field arithmetic, and I was unable to demonstrate that Rabin-Karp was any faster than the gun-to-the-head solution. When I returned to it in 2020, I realized that the hashing implementation was way too complicated, and decided to simplify it. I decided to start with the following hash function from the CS140 lecture on hashing, which we call "djb hash", named after its author, Dan Bernstein, and described in http://www.cse.yorku.ca/~oz/hash.html:

unsigned int djb_hash(const string &s)
{
  size_t i;
  unsigned int h;
  
  h = 5381;

  for (i = 0; i < s.size(); i++) {
    h = (h << 5) + h + s[i];
  }
  return h;
}

This function isn't going to work as a rolling hash, in my opinion, because the addition can carry into neighboring bits (which makes a character's contribution hard to "subtract"), and because we shift the top five bits of h away at each step. I came up with the following modification, which has all of the desirable properties for a rolling hash function:

unsigned int djb_hash(const string &s)
{
  size_t i;
  unsigned int h;
  
  h = 5381;

  for (i = 0; i < s.size(); i++) {
    h = (h << 5) ^ (s[i]) ^ (h >> 27);
  }
  return h;
}

Why does this seem good for a rolling hash function? Well, we never shift any bits away -- (h << 5) ^ (h >> 27) is just h circular-shifted left by five bits. The hash of a string s of size n is going to contain the number 5381 circular-shifted left by 5n%32 bits, XOR'd with each character of s circular-shifted by a deterministic number of bits. So, given the hash of a+s, where a+s has n characters, I can calculate the hash of s+b as follows (let hs be 5381 circular-shifted left by 5n%32 bits):

  1. XOR the hash with hs. This "subtracts" the 5381 term.
  2. Circular-shift the result left by five bits.
  3. XOR with a circular-shifted left by 5n%32 bits. This "subtracts" a.
  4. XOR with b. This "adds" b.
  5. XOR with hs. This "adds" the 5381 term back in.

Let's do an example. Let's suppose we've hashed the string "govol", and we next want to hash the string "ovols". First, let's run the program bin/djb_altered on them -- this will simply print out their hashes:
UNIX> ( echo govol ; echo ovols ) | bin/djb_altered
205936038
214610393
UNIX> 
I've written the program src/govols.cpp to calculate the second hash from the first. Here's the output, so you can see the calculation:
UNIX> bin/govols
Starting value     = 0x0c4655a6 =  205936038      # Here's the hash of "govol"
hs                 = 0x0a00002a =  167772202      # This is h, circular shifted to the left by 25 (because the size of "govol" is 5, and 5*5 % 32 = 25).
x = value xor hs   = 0x0646558c =  105272716      # This "subtracts" h
x circ<<= 5        = 0xc8cab180 = 3368726912      # Now the circular shift by five.
'g'                = 0x00000067 =        103      # These three lines "subtract" 'g'
g = 'g' circ<< 25  = 0xce000000 = 3456106496
x ^= g             = 0x06cab180 =  113947008
's'                = 0x00000073 =        115      # And these two "add" 's'
x ^= 's'           = 0x06cab1f3 =  113947123
x ^= hs            = 0x0ccab1d9 =  214610393      # We "add" h back in, and voila, we have the hash of "ovols"!
UNIX> 

Four programs to speed test

I have four programs to speed test:
  1. src/control.cpp -- this simply reads a file from standard input, and then exits.
  2. src/cpp_find.cpp -- this simply reads a file from standard input and a pattern on the command line, and uses C++'s find() method to find the pattern in the file. Newlines in the file are turned into spaces in the string.
  3. src/strcmp.cpp -- this works like src/cpp_find, except it runs through each possible starting place in the file string, and calls strncmp() to find the pattern.
  4. src/rabin_karp.cpp -- this uses Rabin-Karp with the rolling hash scheme detailed above. I calculate the hash a little differently: I basically do the algorithm without the 5381 term, and then when I need a hash value to compare, I XOR hs in at the end:

      /* Hash the pattern and exit if the pattern is too big. */
    
      hash = djb_hash(pattern);
      if (file.size() < pattern.size()) {
        printf("file is smaller than pattern\n");
        return 0;
      }
    
      /* Calculate the DJB hash of the first pattern.size() characters of the file,
         without the h term. */
    
      sh = 0;
      for (i = 0; i < pattern.size(); i++) {
        sh = (sh << 5) | (sh >> (32-5));
        sh ^= file[i];
      }
      
    
      /* Calculate hs */
    
      l = pattern.size() * 5;
      shift = l % 32;
      hs = 5381;
      if (shift != 0) hs = ((hs << shift) ^ (hs >> (32-shift)));
    
      /* The loop below only checks windows starting at index 1, so first
         check whether the pattern matches at index 0. */
    
      if ((sh^hs) == hash) {
        if (strncmp(pattern.c_str(), file.c_str(), pattern.size()) == 0) {
          printf("Found at index 0\n");
          return 0;
        }
      }
    
      /* Do the Rabin-Karp algorithm.  i is the index of the next character;
         j is the index of the first character of the current window. */
    
      j = 0;
      while (i < file.size()) {
        sh = (sh << 5) | (sh >> (32-5));    // Rolling hash -- do the circular shift.
        sh ^= file[i];                      // Add in the next character.
        tmp = file[j];                      // Subtract the first character.
        if (shift != 0) tmp = ((tmp << shift) ^ (tmp >> (32-shift)));
        sh ^= tmp;
    
        if ((sh^hs) == hash) {               // If the hashes match, then verify
          if (strncmp(pattern.c_str(), file.c_str()+j+1, pattern.size()) == 0) {
            printf("Found at index %d\n", j+1);
            return 0;
          }
        }
        i++;
        j++;
      }
    
      return 0;
    }
    

Just as an aside, I have to guard these statements with if's because the morons who implemented bit shift decided that shifting a 32-bit integer by 32 or more bits should be undefined rather than zero -- and when shift is zero, 32-shift is 32:

    if (shift != 0) tmp = ((tmp << shift) ^ (tmp >> (32-shift)));

Here's the quote from stackoverflow:

I wonder how many hours of programmer time have been wasted "discovering" this feature like I did...

Anyway. You have to think about how to concoct a scenario where Rabin-Karp is better than find(). I have two in scripts/time.sh:

  1. "negligible" -- 1,000,000 names in txt/names.txt. The pattern is the last name in the file.
  2. "significant" -- the file txt/file.txt is composed of 139,265 lines, each of which has 50 'a' characters separated by spaces. The very last line has "b c d". The pattern is in txt/pattern.txt, and is composed of 1000 'a' characters separated by spaces, and then a 'b'.
The two files are big -- 15.3 MB and 13.3 MB respectively. Let's take a look at the timings (my Mac in 2020):
UNIX> sh scripts/time.sh negligible
control          -1 real 0m0.776s
cpp_find   16095573 real 0m0.735s
strcmp     16095573 real 0m0.816s
rabin_karp 16095573 real 0m0.720s
UNIX> sh scripts/time.sh significant
control          -1 real 0m0.584s
cpp_find   13924500 real 0m1.096s
strcmp     13924500 real 0m1.157s
rabin_karp 13924500 real 0m0.608s
UNIX> 
As you can see, Rabin-Karp barely outperforms the C++ find() routine on the "negligible" test. That's because when two strings differ in their first few characters, C++'s find() (and strncmp()) fail instantly. The "significant" test is much more challenging, because there are a lot of places in the file that start with 1000 a's and spaces, and find()/strncmp() have to do a lot of comparisons before failing. That's why Rabin-Karp performs so much better. If you subtract the control time from the others, Rabin-Karp is roughly 21 times faster than C++'s find()!