CS494 Lecture Notes - The Rabin-Karp String-Searching Algorithm
- James S. Plank
- Directory: /home/plank/cs494/Notes/Rabin-Karp
- Original notes (kinda): November, 2018.
- Most recent revision:
Wed Nov 4 17:27:30 EST 2020
Reference Material
There's always Wikipedia: https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm. This page is intended to be an
introduction, and then a demonstration of how you can use Rabin-Karp to search faster than
the find() method of C++ strings.
The Problem Solved By The Rabin-Karp Algorithm
You have two strings -- call them p for "pattern" and f for "file".
Your goal is to find the first occurrence of p in f (or all occurrences, etc).
If the gun were to your head, and you didn't have language features to help you write
a program to solve this problem, you'd probably do something like:
/* Assume p and f are C++ strings, and i is an int. */

for (i = 0; i + p.size() <= f.size(); i++) {
  if (strncmp(p.c_str(), f.c_str()+i, p.size()) == 0) {
    printf("Found: %d\n", i);
    return 0;
  }
}
printf("Not found.\n");
return 0;
The worst-case running time of this is O(p.size() * f.size()), which is slow indeed.
With Rabin-Karp, you reduce this to O(f.size()) in the expected case (a pathological
number of hash collisions can still slow it down, but with a good hash, collisions are rare).
How does it work?
It revolves around something called a "rolling hash", and falls out quite naturally from the properties
of the rolling hash. What is a "rolling hash"? It is a hashing algorithm that lets you do
the following:
- Suppose you have a string of the form a+s, where a is a character, and s is a string.
- And suppose you've generated a hash of this string.
- Then generating the hash of s+b, where b is a character, can be done in O(1) time.
Intuitively, the hash function lets you "subtract" a and "add" b. Let's think
of a simple (and bad) hash function that would work -- summing the characters of the string.
Then you can literally subtract a's value and add b's value.
We know that adding ASCII values makes for a really bad hash function, so for the moment, let's
just assume that we have a good Rolling Hash function.
What you do is hash p. Then consider the string s made from the first p.size()
characters of f, and hash s.
Then you repeat the following:
- If the two hash values match, then use strncmp() to verify that
p and s are indeed identical, and not the result of a hash collision.
- Change s by deleting its first character and then appending the
next character from f. Use the rolling hash function to calculate the
new s's hash.
Done. Because the rolling hash function does this modification in O(1) time, this algorithm
runs in O(f.size()) expected time, independent of p's size. That's awesome.
Let's see it in action
So, in 2018, I created a rolling hash function using Galois-Field arithmetic, and I was unable
to demonstrate that Rabin-Karp was any faster than the gun-to-the-head solution. When I returned
to it in 2020, I realized that the hashing implementation was way too complicated, and decided
to simplify it. I decided to start with the following hash function from the CS140 lecture on hashing,
which we
call "djb hash", named after its author, Dan Bernstein, and described in http://www.cse.yorku.ca/~oz/hash.html:
unsigned int djb_hash(const string &s)
{
  size_t i;
  unsigned int h;

  h = 5381;
  for (i = 0; i < s.size(); i++) {
    h = (h << 5) + h + s[i];
  }
  return h;
}
This function isn't going to work as a rolling hash, in my opinion, because the carries from
addition make it hard to "subtract" an old character later, and because we're shifting the top
five bits of h away at each step. I
came up with the following modification, which has all of the desirable properties for a
rolling hash function:
unsigned int djb_hash(const string &s)
{
  size_t i;
  unsigned int h;

  h = 5381;
  for (i = 0; i < s.size(); i++) {
    h = (h << 5) ^ (s[i]) ^ (h >> 27);
  }
  return h;
}
Why does this seem good for a rolling hash function? Well, we never shift any bits away --
the XOR of (h << 5) with (h >> 27) is a circular shift. The hash of a string s of size
n is going to contain the number 5381
circular-shifted left by 5n%32 bits, XOR'd with each character of s circular-shifted left by
a deterministic number of bits (5 times the number of characters that follow it, mod 32).
So, given the hash of a+s, I can calculate the
hash of s+b by doing the following:
- Let hs be 5381, circular-shifted left by (5|a+s|)%32 bits.
- First, XOR the hash with hs -- that "subtracts" the 5381 term.
- Next, do a circular shift of the hash to the left by 5 bits.
- XOR the hash with a, circular-shifted left by (5|a+s|)%32 bits. That "subtracts" a.
- XOR the hash with b. That "adds" b.
- Finally, XOR the hash with hs -- that puts the 5381 term back into the hash.
Let's do an example. Let's suppose we've hashed the string "govol", and we next want to hash
the string "ovols". First, let's run the program bin/djb_altered on them -- this will
simply print out their hashes:
UNIX> ( echo govol ; echo ovols ) | bin/djb_altered
205936038
214610393
UNIX>
I've written the program src/govols.cpp to calculate
the second hash from the first. Here's the output, so you can see the calculation:
UNIX> bin/govols
Starting value = 0x0c4655a6 = 205936038 # Here's the hash of "govol"
hs = 0x0a00002a = 167772202 # This is h, circular shifted to the left by 25 (because the size of "govol" is 5.)
x = value xor hs = 0x0646558c = 105272716 # This "subtracts" h
x circ<<= 5 = 0xc8cab180 = 3368726912 # Now the circular shift by five.
'g' = 0x00000067 = 103 # These three lines "subtract" 'g'
g = 'g' circ<< 25 = 0xce000000 = 3456106496
x ^= g = 0x06cab180 = 113947008
's' = 0x00000073 = 115 # And these two "add" 's'
x ^= 's' = 0x06cab1f3 = 113947123
x ^= hs = 0x0ccab1d9 = 214610393 # We "add" h back in, and voila, we have the hash of "ovols"!
UNIX>
Four programs to speed test
I have four programs to speed test:
- src/control.cpp -- this simply reads a file from standard
input, and then exits.
- src/cpp_find.cpp -- this simply reads a file from standard
input and a pattern on the command line, and uses C++'s find() method to find the
pattern in the file. Newlines in the file are turned into spaces in the string.
- src/strcmp.cpp -- this works like src/cpp_find,
except it runs through each possible starting place in the file string, and calls strncmp()
to find the pattern.
- src/rabin_karp.cpp -- this uses Rabin-Karp with the
rolling hash scheme detailed above. I calculate the hash a little differently: I basically
do the algorithm without the 5381 term, and then when I need to compare hashes, I XOR in hs at the end:
/* Hash the pattern, and exit if the file is smaller than the pattern. */

hash = djb_hash(pattern.c_str(), pattern.size());
if (file.size() < pattern.size()) {
  printf("file is smaller than pattern\n");
  return 0;
}

/* Calculate the DJB hash of the first pattern.size() characters of the file,
   without the 5381 term. */

sh = 0;
for (i = 0; i < pattern.size(); i++) {
  sh = (sh << 5) | (sh >> (32-5));
  sh ^= file[i];
}

/* Calculate hs: 5381, circular-shifted left by (5 * pattern.size()) % 32 bits. */

l = pattern.size() * 5;
shift = l % 32;
hs = 5381;
if (shift != 0) hs = ((hs << shift) ^ (hs >> (32-shift)));

/* Check the very first window of the file, so a match at index 0 isn't missed. */

if ((sh^hs) == hash &&
    strncmp(pattern.c_str(), file.c_str(), pattern.size()) == 0) {
  printf("Found at index 0\n");
  return 0;
}

/* Do the Rabin-Karp algorithm. i is the index of the next character to add,
   and j is the index of the first character of the current window. */

j = 0;
while (i < file.size()) {
  sh = (sh << 5) | (sh >> (32-5));   // Rolling hash -- do the circular shift.
  sh ^= file[i];                     // Add in the next character.
  tmp = file[j];                     // Subtract the first character.
  if (shift != 0) tmp = ((tmp << shift) ^ (tmp >> (32-shift)));
  sh ^= tmp;
  if ((sh^hs) == hash) {             // If the hashes match, then verify.
    if (strncmp(pattern.c_str(), file.c_str()+j+1, pattern.size()) == 0) {
      printf("Found at index %d\n", j+1);
      return 0;
    }
  }
  i++;
  j++;
}
return 0;
Just as an aside, I have to do the following if statements because the morons
who implemented bit shift decided that shifting an integer by more than 31 bits should
be undefined rather than zero:
if (shift != 0) tmp = ((tmp << shift) ^ (tmp >> (32-shift)));
Here's the relevant quote from the C standard (found on stackoverflow):
The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
I wonder how many hours of programmer time have been wasted "discovering" this feature like I did...
Anyway. You have to think about how to concoct a scenario where Rabin-Karp is better than find(). I have two in
scripts/time.sh:
- "negligible" -- 1,000,000 names in
txt/names.txt. The pattern is the last name in the file,
- "significant" -- the file
txt/file.txt is composed of 139265 lines, each of which has 50 'a' characters separated by a space. The very last line has "b c d".
The pattern is in
txt/pattern.txt, and is composed of 1000 'a' characters separated
by a space, and then a 'b'.
The two files are big -- 15.3 MB and 13.3 MB respectively. Let's take a look at the timings
(my Mac in 2020):
UNIX> sh scripts/time.sh negligible
control -1 real 0m0.776s
cpp_find 16095573 real 0m0.735s
strcmp 16095573 real 0m0.816s
rabin_karp 16095573 real 0m0.720s
UNIX> sh scripts/time.sh significant
control -1 real 0m0.584s
cpp_find 13924500 real 0m1.096s
strcmp 13924500 real 0m1.157s
rabin_karp 13924500 real 0m0.608s
UNIX>
As you can see, Rabin-Karp barely outperforms the C++ find() routine on the "negligible"
test. That's because when you're comparing two strings that differ in their first characters,
C++ find() (and strncmp()) return instantly. The "significant" test is much
more challenging, because there are a lot of starting points in the file that match 1000 a's and
spaces before failing, so find()/strncmp() have to do a lot of comparisons. That's why Rabin-Karp
performs so much better. If you subtract the control time from the numbers, Rabin-Karp is 21 times
faster than C++ find()!