You are to write the following method of the Solution class:
vector <int> findSubstring(string s, vector <string> &words);
Here are some examples:
ex1.txt:  s = "fredsaiddropdeadsaidfred"    p = "fred", "said"
          Matches start at indices 0 and 16.

ex2.txt:  s = "fredfofofofofofred"          p = "fred", "fofo", "fofo"
          Matches start at indices 0 and 6.

ex3.txt:  s = "fredfofofofofofofofred"      p = "fred", "fofo", "fofo"
          Matches start at indices 0 and 10.

ex4.txt:  s = "fredfofofofofofofred"        p = "fofo", "fofo"
          Matches start at indices 4, 6 and 8.
So instead, I decided to hash the strings to 64-bit integers, and pray that Leetcode's testing wouldn't exploit any collisions. I used the following hash function, which I use in my CS494 lecture on the Rabin-Karp algorithm:
unsigned long long hash(const string &s)
{
  size_t i;
  unsigned long long h;

  h = 0;
  for (i = 0; i < s.size(); i++) {
    h = (h << 5 | h >> 59) ^ s[i];
  }
  return h;
}
This hash function does a 5-bit circular bit-shift for each character. I assumed that for these small strings, it would do a good job. It also has the feature that it is a "rolling" hash function: I can calculate the hashes of all of the words in s with just one pass -- just like the Rabin-Karp algorithm. I didn't anticipate using that feature, but it was there if I needed it.
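In case the rolling property is unclear, here's a sketch of what the update would look like. This is just an illustration (rotl() and roll() are my names, not part of any solution): by the time a w-character window is complete, its oldest character has been rotated 5*(w-1) bits, so we can cancel it with one XOR and then shift in the new character.

/* Circular left shift by b bits. */
unsigned long long rotl(unsigned long long x, size_t b)
{
  b %= 64;
  return (b == 0) ? x : (x << b | x >> (64 - b));
}

/* Slide a w-character window one character to the right in O(1).
   h is the hash of the current window; "oldest" is the character
   leaving the window, and "newest" is the one entering it. */
unsigned long long roll(unsigned long long h, size_t w, char oldest, char newest)
{
  h ^= rotl((unsigned char) oldest, 5 * (w - 1));      /* Cancel the oldest character. */
  return (h << 5 | h >> 59) ^ (unsigned char) newest;  /* Shift in the newest. */
}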
With this hash function, I can now represent each of the strings in p with a single long long.
ex1.txt:   Word   Hash      #-Times
           fred   3327172   1
           said   3672388   1

ex2.txt:   Word   Hash      #-Times
           fofo   3322031   2
           fred   3327172   1

ex3.txt:   Word   Hash      #-Times
           fofo   3322031   2
           fred   3327172   1

ex4.txt:   Word   Hash      #-Times
           fofo   3322031   2
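Building that table is a one-liner with a C++ map. I'll call it pmap, since that's the name I use below (this fragment is my sketch of it, using the hash() function above):

map <unsigned long long, int> pmap;    /* Hash of each word in p -> #-Times. */

for (i = 0; i < p.size(); i++) pmap[hash(p[i])]++;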
Since every word in p has the same length, a match can only start at one of p[0].size() offsets modulo the word length. So my main loop was doubly nested -- the outer loop runs over each offset, and the inner loop walks through s one word at a time from that offset:

for (i = 0; i < p[0].size(); i++) {
  for (j = i; j + p[0].size() <= s.size(); j += p[0].size()) {

    // Now I'll evaluate the word beginning at index j in s
Plus, I maintained two data structures:
deque <unsigned long long> hdeq;     -- Hashes of the most recent run of words in s
                                        that correspond to words in p.
map <unsigned long long, int> hmap;  -- Hashes in hdeq, and the number of times each
                                        hash appears in hdeq.
We're going to use these data structures to keep track of how many consecutive substrings in s are also in p. The hashes of those substrings are held in hdeq, which will be ordered by the order in which they appear in s. hmap is used to identify which hashes are in hdeq and how many times each of those hashes is in hdeq.
Here's how we use them:
Now, at the beginning of the inner loop above, I cleared both data structures. Then, for each word w, starting with index j above, I did the following:

1. Calculate the hash h of w.
2. If h is not in pmap, then w is not a word in p, so no match can contain it: clear both hdeq and hmap and move on to the next word.
3. Otherwise, push h onto the back of hdeq and increment its count in hmap.
4. It's now possible that h's count in hmap is higher than its count in pmap -- our current run contains too many copies of w.
5. Whenever hdeq holds p.size() hashes, we have a match: it starts (p.size()-1)*p[0].size() characters before j. Record it, then pop the front hash (lowering its count in hmap) so the scan can continue.

Step 4 is the tricky one.
The way to handle it is to start deleting hashes from the front of hdeq. That's why we use a deque -- because we're pushing to the back and popping from the front. As we delete a hash from hdeq, we also lower that hash's count in hmap. We do this until w's count in hmap matches its count in pmap.
Use the "fredfred" example. When we process the first "fred", we have:
hdeq = [3327172], hmap = { (3327172,1) }, pmap = { (3327172,1), (3672388,1) }
When we process the second "fred", we have:
hdeq = [3327172,3327172], hmap = { (3327172,2) }, pmap = { (3327172,1), (3672388,1) }
Because the count in hmap is too high, we delete the hash at the front of hdeq and lower its count in hmap. We're left with:
hdeq = [3327172], hmap = { (3327172,1) }, pmap = { (3327172,1), (3672388,1) }
Basically, we identified that we had to throw away that first "fred".
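To make the bookkeeping concrete, here is a sketch of what the body of the inner loop might look like. This is my illustration of the steps above rather than a complete solution -- wl is my shorthand for p[0].size(), and rv is the vector of match indices that findSubstring() will return:

h = hash(s.substr(j, wl));

if (pmap.find(h) == pmap.end()) {        /* Not a word in p: start over. */
  hdeq.clear();
  hmap.clear();
} else {
  hdeq.push_back(h);                     /* Record the word and its count. */
  hmap[h]++;
  while (hmap[h] > pmap[h]) {            /* Too many copies of w: */
    hmap[hdeq.front()]--;                /* trim from the front. */
    hdeq.pop_front();
  }
  if (hdeq.size() == p.size()) {         /* Full window: record a match, */
    rv.push_back(j - (p.size()-1) * wl); /* then slide by one word. */
    hmap[hdeq.front()]--;
    hdeq.pop_front();
  }
}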
If I hadn't used the hashes, then every word comparison would have been a string comparison, which multiplies the running time by another factor of p[0].size(). Maybe that would have been fast enough, but I didn't want to find out.
As with the "Median of Two Sorted Arrays" problem, I don't want to deny you the pleasure of writing this program yourself, so I have not provided any solutions. Or testing. Sorry.