Leetcode.com problem 30: "Substring with Concatenation of All Words"

James S. Plank

Fri Aug 5 09:59:36 AM EDT 2022

Although this is a hard problem, you can solve it using techniques from CS202: hashing and data structures from the STL.


In case Leetcode's servers are down

Here is a summary of the problem:

You are to write the following method of the Solution class:

vector <int> findSubstring(string s, vector <string> &words);

Here are the details. I'm going to call the vector of words p. Every word in p has the same length, and you are to return the starting indices in s of every substring that is a concatenation of all of the words in p, each used exactly once, in any order.

Examples:

     012345678901234567890123   ex1.txt
s = "fredsaiddropdeadsaidfred"  p = "fred", "said"
     ^               ^
     0              16

     012345678901234567         ex2.txt
s = "fredfofofofofofred"        p = "fred", "fofo", "fofo"
     ^     ^
     0     6

     0123456789012345678901     ex3.txt
s = "fredfofofofofofofofred"    p = "fred", "fofo", "fofo"
     ^         ^
     0         10

     01234567890123456789       ex4.txt
s = "fredfofofofofofofred"      p = "fofo", "fofo"
         ^ ^ ^     
         4 6 8


The case for hashing

Clearly, we are going to have to identify strings from p in s. The easiest thing would be to store the strings from p in a set or map, and use the substr() method to extract substrings from s. However, since the strings in p can be up to 30 characters long, all of that substring extraction and string comparison may be too expensive computationally.

So instead, I decided to hash the strings to 64-bit integers, and pray that Leetcode's testing wouldn't exploit any collisions. I used the following hash function, which I use in my CS494 lecture on the Rabin-Karp algorithm:

unsigned long long hash(const string &s)
{
  size_t i;
  unsigned long long h;

  h = 0;
  for (i = 0; i < s.size(); i++) {
    h = (h << 5 | h >> 59) ^ s[i];
  }
  return h;
}
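
If you want a quick sanity check of the hash function, here's a tiny test program (just an illustration -- it's not part of any solution). The values it prints match the tables in the next section:

#include <cstdio>
#include <string>

unsigned long long hash(const std::string &s)     /* Same function as above. */
{
  size_t i;
  unsigned long long h;

  h = 0;
  for (i = 0; i < s.size(); i++) h = (h << 5 | h >> 59) ^ s[i];
  return h;
}

int main()
{
  printf("%llu\n", hash("fred"));   /* Prints 3327172 */
  printf("%llu\n", hash("said"));   /* Prints 3672388 */
  printf("%llu\n", hash("fofo"));   /* Prints 3322031 */
  return 0;
}

(I avoided "using namespace std" here so that the global name hash() doesn't clash with std::hash.)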

This hash function does a 5-bit circular bit-shift for each character. I assumed that for these small strings, it would do a good job. It also has the feature that it is a "rolling" hash function, meaning I can calculate the hashes of all of the word-sized windows of s with just one pass -- just like in the Rabin-Karp algorithm. I didn't anticipate using that feature, but it was there if I needed it.

With this hash function, I can now represent each of the strings in p with a single long long.
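
Although my solution below doesn't end up using the rolling feature, here's a sketch of how it would work, in case you're curious (window_hashes and rot_left are my names, purely for illustration). To slide a window one character to the right, you rotate the hash, XOR in the new character, and XOR out the old one, whose contribution has been rotated left by 5*l bits by the time it falls off the front:

#include <string>
#include <vector>
using namespace std;

/* Circular left shift of a 64-bit value by k bits. */
unsigned long long rot_left(unsigned long long h, unsigned int k)
{
  k %= 64;
  return (k == 0) ? h : (h << k | h >> (64 - k));
}

/* Return the hashes of every l-character window of s.  Each hash
   after the first is computed in O(1) from the previous one. */
vector <unsigned long long> window_hashes(const string &s, size_t l)
{
  size_t i;
  unsigned long long h;
  vector <unsigned long long> rv;

  if (l == 0 || s.size() < l) return rv;
  h = 0;
  for (i = 0; i < l; i++) h = rot_left(h, 5) ^ s[i];     /* Hash of the first window. */
  rv.push_back(h);
  for (i = l; i < s.size(); i++) {
    h = rot_left(h, 5) ^ s[i] ^ rot_left(s[i-l], 5 * l); /* Slide the window by one. */
    rv.push_back(h);
  }
  return rv;
}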


Maintaining a map of hashes

Now, we'd like to use a set or map to keep track of the words in p. Since we can have duplicate words, we're going to use a map. Its keys will be the hashes of the words in p, and its values will be the number of times each word appears in p. Here's what the map looks like for our example files. For the sake of later discussion, we're going to call this pmap. (There's a sketch of how to build it after the tables.)

Word  Hash    #-Times
fred  3327172   1           ex1.txt
said  3672388   1

Word  Hash    #-Times
fofo  3322031   2           ex2.txt
fred  3327172   1

Word  Hash    #-Times
fofo  3322031   2           ex3.txt
fred  3327172   1

Word  Hash    #-Times
fofo  3322031   2           ex4.txt
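

Building pmap takes just a few lines. Here's a sketch, assuming the hash() function from above (build_pmap is my name for it -- it's not part of the Solution class):

#include <map>
#include <string>
#include <vector>

unsigned long long hash(const std::string &s);   /* Defined above. */

/* Keys are the hashes of the words in p; values are the number of
   times each word appears in p. */
std::map <unsigned long long, int> build_pmap(const std::vector <std::string> &p)
{
  size_t i;
  std::map <unsigned long long, int> pmap;

  for (i = 0; i < p.size(); i++) pmap[hash(p[i])]++;
  return pmap;
}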


Our main loops and our main data structures

My main loops started in this manner:

for (i = 0; i < p[0].size(); i++) {
  for (j = i; j + p[0].size() <= s.size(); j += p[0].size()) {
    // Now I'll evaluate each word beginning with j in s

Plus, I maintained two data structures:

deque <unsigned long long>      hdeq;  -- Hashes of the most recent run of consecutive
                                          words in s (the ones just in front of you)
                                          that are also words in p.
map   <unsigned long long, int> hmap;  -- Hashes in hdeq, and the number of times
                                          each hash appears in hdeq.

We're going to use these data structures to keep track of how many consecutive substrings in s are also in p. The hashes of those substrings are held in hdeq, which will be ordered by the order in which they appear in s. hmap is used to identify which hashes are in hdeq and how many times each of those hashes is in hdeq.

Here's how we use them. At the beginning of the inner loop above, I cleared both data structures. Then, for each word w starting at index j above, I did the following:

- Compute w's hash. If it's not a key of pmap, then no answer can contain w, so clear hdeq and hmap, and move on to the next j.
- Otherwise, push the hash onto the back of hdeq, and increment its count in hmap.
- If the hash's count in hmap now exceeds its count in pmap, pop hashes off the front of hdeq, decrementing their counts in hmap, until the two counts match. (This happens when a word appears more times in the current run than it does in p -- for example, scanning "fofofofofred" for p = "fofo", "fred": when you see the second "fofo", the first one has to go.)
- If hdeq now holds p.size() hashes, then the last p.size() words are a permutation of p, so record the starting index, which is j - (p.size()-1)*p[0].size(). Then pop the front of hdeq, decrementing its count in hmap, so that the window can keep sliding.

You now have the structure to solve the problem. What's the running time? Well, your outer loop runs p[0].size() times. Each pass of your inner loop is O(s.size() * log(p.size())). So:

O(s.size() * log(p.size()) * p[0].size())

If I hadn't used the hashes, then the string comparisons in the maps would have multiplied in another factor of p[0].size(). Maybe that would have been fast enough, but I didn't want to find out.


Timing

My solution was faster than 98% of the C++ solutions, and it used less memory than 96.86% of them. I assume that's all because of the hashing. I'm guessing that I could have used hash tables (unordered_map) instead of the maps, and it would have been faster. I'm not going to do that.

As with the "Median of Two Sorted Arrays" problem, I don't want to deny you the pleasure of writing this program yourself, so I have not provided any solutions. Or testing. Sorry.