You are to write the following method of the Solution class:
vector <int> findSubstring(string s, vector <string> &words);
Here are some examples:
ex1.txt:  s = "fredsaiddropdeadsaidfred"    p = "fred", "said"
          Matches start at indices 0 and 16.

ex2.txt:  s = "fredfofofofofofred"          p = "fred", "fofo", "fofo"
          Matches start at indices 0 and 6.

ex3.txt:  s = "fredfofofofofofofofred"      p = "fred", "fofo", "fofo"
          Matches start at indices 0 and 10.

ex4.txt:  s = "fredfofofofofofofred"        p = "fofo", "fofo"
          Matches start at indices 4, 6 and 8.
So instead, I decided to hash the strings to 64-bit integers, and pray that Leetcode's testing wouldn't exploit any collisions. I used the following hash function, which I use in my CS494 lecture on the Rabin-Karp algorithm:
unsigned long long hash(const string &s)
{
  size_t i;
  unsigned long long h;

  h = 0;
  for (i = 0; i < s.size(); i++) {
    h = (h << 5 | h >> 59) ^ s[i];
  }
  return h;
}
This hash function does a 5-bit circular bit-shift for each character. I assumed that for these small strings, it would do a good job. It also has the feature that it is a "rolling" hash function: I can calculate the hashes of all of the words in s with just one pass -- just like the Rabin-Karp algorithm. I didn't anticipate using that feature, but it was there if I needed it.
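In case the rolling property is unclear, here's a sketch of what the update would look like. This is just an illustration (rotl() and roll() are my names, not part of any solution): by the time a w-character window is complete, its oldest character has been rotated 5*(w-1) bits, so we can cancel it with one XOR and then shift in the new character.

/* Circular left shift by b bits. */
unsigned long long rotl(unsigned long long x, size_t b)
{
  b %= 64;
  return (b == 0) ? x : (x << b | x >> (64 - b));
}

/* Slide a w-character window one character to the right in O(1).
   h is the hash of the current window; "oldest" is the character
   leaving the window, and "newest" is the one entering it. */
unsigned long long roll(unsigned long long h, size_t w, char oldest, char newest)
{
  h ^= rotl((unsigned char) oldest, 5 * (w - 1));      /* Cancel the oldest character. */
  return (h << 5 | h >> 59) ^ (unsigned char) newest;  /* Shift in the newest. */
}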
With this hash function, I can now represent each of the strings in p with a single long long.
ex1.txt:   Word   Hash      #-Times
           fred   3327172   1
           said   3672388   1

ex2.txt:   Word   Hash      #-Times
           fofo   3322031   2
           fred   3327172   1

ex3.txt:   Word   Hash      #-Times
           fofo   3322031   2
           fred   3327172   1

ex4.txt:   Word   Hash      #-Times
           fofo   3322031   2
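Building that table is a one-liner with a C++ map. I'll call it pmap, since that's the name I use below (this fragment is my sketch of it, using the hash() function above):

map <unsigned long long, int> pmap;    /* Hash of each word in p -> #-Times. */

for (i = 0; i < p.size(); i++) pmap[hash(p[i])]++;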
Since every word in p has the same length, a match can only start at one of p[0].size() offsets modulo the word length. So my main loop was doubly nested -- the outer loop runs over each offset, and the inner loop walks through s one word at a time from that offset:

for (i = 0; i < p[0].size(); i++) {
  for (j = i; j + p[0].size() <= s.size(); j += p[0].size()) {

    // Now I'll evaluate the word beginning at index j in s
Plus, I maintained two data structures:
deque <unsigned long long> hdeq;     -- Hashes of the most recent run of words in s
                                        that correspond to words in p.
map <unsigned long long, int> hmap;  -- Hashes in hdeq, and the number of times each
                                        hash appears in hdeq.
We're going to use these data structures to keep track of how many consecutive substrings in s are also in p. The hashes of those substrings are held in hdeq, which will be ordered by the order in which they appear in s. hmap is used to identify which hashes are in hdeq and how many times each of those hashes is in hdeq.
Here's how we use them:
Now, at the beginning of the inner loop above, I cleared both data structures. Then, for each word w, starting with index j above, I did the following:

1. Calculate the hash h of w.
2. If h is not in pmap, then w is not a word in p, so no match can contain it: clear both hdeq and hmap and move on to the next word.
3. Otherwise, push h onto the back of hdeq and increment its count in hmap.
4. It's now possible that h's count in hmap is higher than its count in pmap -- our current run contains too many copies of w.
5. Whenever hdeq holds p.size() hashes, we have a match: it starts (p.size()-1)*p[0].size() characters before j. Record it, then pop the front hash (lowering its count in hmap) so the scan can continue.

Step 4 is the tricky one.
The way to handle it is to start deleting hashes from the front of hdeq. That's why we use a deque -- because we're pushing to the back and popping from the front. As we delete a hash from hdeq, we also lower that hash's count in hmap. We do this until w's count in hmap matches its count in pmap.
Use the "fredfred" example. When we process the first "fred", we have:
hdeq = [3327172], hmap = { (3327172,1) }, pmap = { (3327172,1), (3672388,1) }
When we process the second "fred", we have:
hdeq = [3327172,3327172], hmap = { (3327172,2) }, pmap = { (3327172,1), (3672388,1) }
Because the count in hmap is too high, we delete the hash at the front of hdeq and lower its count in hmap. We're left with:
hdeq = [3327172], hmap = { (3327172,1) }, pmap = { (3327172,1), (3672388,1) }
Basically, we identified that we had to throw away that first "fred".
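To make the bookkeeping concrete, here is a sketch of what the body of the inner loop might look like. This is my illustration of the steps above rather than a complete solution -- wl is my shorthand for p[0].size(), and rv is the vector of match indices that findSubstring() will return:

h = hash(s.substr(j, wl));

if (pmap.find(h) == pmap.end()) {        /* Not a word in p: start over. */
  hdeq.clear();
  hmap.clear();
} else {
  hdeq.push_back(h);                     /* Record the word and its count. */
  hmap[h]++;
  while (hmap[h] > pmap[h]) {            /* Too many copies of w: */
    hmap[hdeq.front()]--;                /* trim from the front. */
    hdeq.pop_front();
  }
  if (hdeq.size() == p.size()) {         /* Full window: record a match, */
    rv.push_back(j - (p.size()-1) * wl); /* then slide by one word. */
    hmap[hdeq.front()]--;
    hdeq.pop_front();
  }
}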
If I hadn't used the hashes, then every word comparison would have been a string comparison, which multiplies the running time by another factor of p[0].size(). Maybe that would have been fast enough, but I didn't want to find out.
As with the "Median of Two Sorted Arrays" problem, I don't want to deny you the pleasure of writing this program yourself, so I have not provided any solutions. Or testing. Sorry.