  for (i = 0; i + p.size() <= f.size(); i++) {
    if (strncmp(p.c_str(), f.c_str()+i, p.size()) == 0) {
      printf("Found: %d\n", i);
      return 0;
    }
  }
  printf("Not found.\n");
  return 0;
}
The worst-case running time of this is O(p.size() * f.size()), which is slow indeed. With Rabin-Karp, you reduce this to O(f.size()) in the expected case (the occasional hash collision still costs a verification pass).
We know that adding ASCII values makes for a really bad hash function, so for the moment, let's just assume that we have a good rolling hash function.
What you do is hash p. Then consider the string s made from the first p.size() characters of f, and hash s. Then you repeat the following: if the two hashes are equal, compare p and s character by character and report a match if they really are equal; otherwise, slide s one character to the right in f and use the rolling hash to compute the new hash from the old one in constant time. A sketch of that loop is below.
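Here is a minimal sketch of that loop. The two helpers are hypothetical placeholders, not code from this writeup: hash_of(s) returns the hash of a string, and roll(h, n, a, b) turns the hash of an n-character window that begins with character a into the hash of the window that drops a and appends b. You would have to supply both for this to link.

#include <string>
#include <cstring>
using namespace std;

unsigned int hash_of(const string &s);                                          /* Assumed. */
unsigned int roll(unsigned int h, size_t n, unsigned char a, unsigned char b);  /* Assumed. */

/* Return the index of the first occurrence of p in f, or -1 if there is none. */
int rabin_karp_search(const string &p, const string &f)
{
  unsigned int hp, hs;
  size_t j;

  if (f.size() < p.size()) return -1;
  hp = hash_of(p);                                  /* Hash the pattern. */
  hs = hash_of(f.substr(0, p.size()));              /* Hash the first window of f. */

  for (j = 0; ; j++) {
    /* If the hashes match, verify with a real character-by-character comparison. */
    if (hs == hp && strncmp(p.c_str(), f.c_str()+j, p.size()) == 0) return (int) j;
    if (j + p.size() == f.size()) return -1;        /* That was the last window. */
    hs = roll(hs, p.size(), f[j], f[j+p.size()]);   /* Slide the window one character. */
  }
}

Now, what about the hash function itself? Here's the classic djb hash: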
unsigned int djb_hash(const string &s)
{
  size_t i;
  unsigned int h;

  h = 5381;
  for (i = 0; i < s.size(); i++) {
    h = (h << 5) + h + s[i];
  }
  return h;
}
This function isn't going to work as a rolling hash, in my opinion, because of the potential integer overflow with addition, and the fact that we're shifting away five bits of h at a time. I came up with the following modification, which has all of the desirable properties of a rolling hash function:
unsigned int djb_hash(const string &s)
{
  size_t i;
  unsigned int h;

  h = 5381;
  for (i = 0; i < s.size(); i++) {
    h = (h << 5) ^ (s[i]) ^ (h >> 27);
  }
  return h;
}
Why does this seem good for a rolling hash function? Well, we never shift any bits away. The hash of a string of size n is going to contain the number 5381 circular-shifted left by 5n%32 bits, plus each character of s circular-shifted left by a number of bits that depends only on that character's position and the string's length. So, given the hash of a+s (where a is the first character), I can calculate the hash of s+b (drop a, append the character b) by doing the following. For example, here are the altered djb hashes of "govol" and "ovols":
UNIX> ( echo govol ; echo ovols ) | bin/djb_altered
205936038
214610393
UNIX>
I've written the program src/govols.cpp to calculate the second hash from the first. Here's the output, so you can see the calculation:
UNIX> bin/govols
Starting value = 0x0c4655a6 =  205936038    # Here's the hash of "govol"
hs = 0x0a00002a =  167772202                # This is h, circular shifted to the left by 25 (because the size of "govol" is 5.)
x = value xor hs = 0x0646558c =  105272716  # This "subtracts" h
x circ<<= 5 = 0xc8cab180 = 3368726912       # Now the circular shift by five.
'g' = 0x00000067 =        103               # These three lines "subtract" 'g'
g = 'g' circ<< 25 = 0xce000000 = 3456106496
x ^= g = 0x06cab180 =  113947008
's' = 0x00000073 =        115               # And these two "add" 's'
x ^= 's' = 0x06cab1f3 =  113947123
x ^= hs = 0x0ccab1d9 =  214610393           # We "add" h back in, and voila, we have the hash of "ovols"!
UNIX>
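Here is the same recipe packaged as a small stand-alone program. It is an illustration only -- a hypothetical roll() helper, not src/govols.cpp -- and it assumes unsigned int is 32 bits, but it should reproduce the "govol" to "ovols" calculation above.

#include <cstdio>

/* Circular left shift of a 32-bit value.  Shifting by 32 is undefined, so 0 is special-cased. */
unsigned int rotl32(unsigned int x, unsigned int s)
{
  if (s == 0) return x;
  return (x << s) | (x >> (32 - s));
}

/* Given hash = altered-djb hash of (a + s), where the window size is n,
   return the altered-djb hash of (s + b). */
unsigned int roll(unsigned int hash, unsigned int n, unsigned char a, unsigned char b)
{
  unsigned int hs, x;

  hs = rotl32(5381, (5 * n) % 32);   /* The 5381 term of an n-character string. */
  x = hash ^ hs;                     /* "Subtract" that term. */
  x = rotl32(x, 5);                  /* The circular shift by five. */
  x ^= rotl32(a, (5 * n) % 32);      /* "Subtract" the departing character. */
  x ^= b;                            /* "Add" the arriving character. */
  x ^= hs;                           /* "Add" the 5381 term back in. */
  return x;
}

int main()
{
  /* Rolling from "govol" (205936038) to "ovols" should print 214610393. */
  printf("%u\n", roll(205936038, 5, 'g', 's'));
  return 0;
}

The searching code below does these same operations inline on the file and the pattern: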
  /* Hash the pattern and exit if the pattern is too big. */

  hash = djb_hash(pattern.c_str(), pattern.size());
  if (file.size() < pattern.size()) {
    printf("file is smaller than pattern\n");
    return 0;
  }

  /* Calculate the DJB hash of the first pattern.size() characters of the file,
     without the h term. */

  sh = 0;
  for (i = 0; i < pattern.size(); i++) {
    sh = (sh << 5) | (sh >> (32-5));
    sh ^= file[i];
  }

  /* Calculate hs */

  l = pattern.size() * 5;
  shift = l % 32;
  hs = 5381;
  if (shift != 0) hs = ((hs << shift) ^ (hs >> (32-shift)));
  /* Check for a match at index 0, since the rolling loop below starts its checks at index 1. */

  if ((sh^hs) == hash) {
    if (strncmp(pattern.c_str(), file.c_str(), pattern.size()) == 0) {
      printf("Found at index 0\n");
      return 0;
    }
  }

  /* Do the Rabin-Karp algorithm.  i is the next character, j is the first character
     of the current window. */

  j = 0;
  while (i < file.size()) {
    sh = (sh << 5) | (sh >> (32-5));   // Rolling hash -- do the circular shift.
    sh ^= file[i];                     // Add in the next character.
    tmp = file[j];                     // Subtract the first character.
    if (shift != 0) tmp = ((tmp << shift) ^ (tmp >> (32-shift)));
    sh ^= tmp;
    if ((sh^hs) == hash) {             // If the hashes match, then verify.
      if (strncmp(pattern.c_str(), file.c_str()+j+1, pattern.size()) == 0) {
        printf("Found at index %d\n", j+1);
        return 0;
      }
    }
    i++;
    j++;
  }
  return 0;
}
One detail in that code deserves a comment -- the test on shift before rotating tmp:

if (shift != 0) tmp = ((tmp << shift) ^ (tmp >> (32-shift)));

If shift is zero (which happens whenever pattern.size() is a multiple of 32), the right-hand side would shift a 32-bit value by 32 bits, and shifting by the full width of the type is undefined behavior in C and C++ -- there's a Stack Overflow answer that spells this out. When shift is zero the rotation is a no-op anyway, so the guard simply skips it.
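For what it's worth, another common way to dodge the problem (just an illustration, not code from this writeup) is to mask the shift count, which makes the rotation well defined even when the count is zero:

/* A branch-free 32-bit rotate.  Both shift counts are masked into the range [0,31],
   so s == 0 falls out naturally as x | x == x.  Assumes unsigned int is 32 bits. */
unsigned int rotl32(unsigned int x, unsigned int s)
{
  s &= 31;
  return (x << s) | (x >> ((32 - s) & 31));
}

Most compilers recognize this pattern and compile it to a single rotate instruction.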
Anyway. You have to think about how to concoct a scenario where Rabin-Karp is better than find(). I have two in scripts/time.sh:
UNIX> sh scripts/time.sh negligible
control -1
real    0m0.776s
cpp_find 16095573
real    0m0.735s
strcmp 16095573
real    0m0.816s
rabin_karp 16095573
real    0m0.720s
UNIX> sh scripts/time.sh significant
control -1
real    0m0.584s
cpp_find 13924500
real    0m1.096s
strcmp 13924500
real    0m1.157s
rabin_karp 13924500
real    0m0.608s
UNIX>
As you can see, Rabin-Karp barely outperforms the C++ find() routine on the "negligible" test. That's because when you're comparing two strings, and they differ in their first characters, C++ find() (and strncmp()) return instantly. The "significant" test is much more challenging, because there are a lot of strings that start with 1000 a's and spaces, and find()/strncmp() have to do a lot of comparisons before failing. That's why Rabin-Karp performs so much better. If you subtract the control from the numbers, Rabin-Karp is 21 times faster than C++ find()!
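scripts/time.sh itself isn't shown here, but to get a feel for the "significant" case, here is a hypothetical generator in the same spirit -- my construction, not the author's script. The text is mostly runs of 1000 a's separated by spaces, so find() and strncmp() burn hundreds of character comparisons at almost every starting position before they hit a mismatch, while the rolling hash still does constant work per position.

#include <cstdio>
#include <string>
using namespace std;

int main()
{
  string block(1000, 'a');                /* A run of 1000 a's ... */
  string file, pattern;
  size_t i, where;

  block += ' ';                           /* ... followed by a space. */
  for (i = 0; i < 10000; i++) file += block;
  pattern = string(1000, 'a') + 'b';      /* Differs from the runs only at character 1000. */
  file += pattern;                        /* Plant the one real match at the very end. */

  where = file.find(pattern);
  printf("File size %lu.  find() returns %lu.\n",
         (unsigned long) file.size(), (unsigned long) where);
  return 0;
}

Every starting position inside a run matches the pattern for up to a thousand characters before the space breaks the comparison, which is exactly the behavior that slows down the character-by-character approaches and leaves Rabin-Karp untouched.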