This one originally comes from a topcoder article by vorthys. You are to write a program that takes two strings and returns the length of the largest common subsequence of the two strings. A subsequence of a string is any string that you can create by deleting one or more letters of the original string. For example, "dog" is a subsequence if "dodger", but "red" is not. Even though the characters appear in "dodger", you cannot create "red" by deleting characters from "dodger."
You can compile the programs with "make subseq".
You should see that these examples are related. In the first, the longest subsequence is "Fyy", and in the second, the longest subsequence is "Accc". This example highlights the difficulty of this problem -- If you start looking for subsequences from the beginning of either string, the result may be affected by the end of the string. We'll return to these examples in a bit. Let's try another example:
Man, that looks hard. Don't even bother trying to solve it. However, one thing you do know -- the maximum subsequence starts with a 9. Why? It has to -- you can prove this to yourself by contradiction: If the subsequence is S and S doesn't start with 9, then there is a valid subsequence which is "9"+S.
I'm hoping these examples help you spot the recursion: You will compare the first letters of both substrings, and do different things depending on whether they equal each other:
Let's hack that up. It's in src/subseq1.cpp and is straightforward:
int Subseq::MCS(const string &s1, const string &s2) { int r1, r2; /* Base case -- if either string is empty, return 0. */ if (s1.size() == 0 || s2.size() == 0) return 0; /* If the first characters of the two strings equal each other, then the answer is one plus the MCS of the two string with the first characters chopped off. */ if (s1[0] == s2[0]) return 1 + MCS(s1.substr(1), s2.substr(1)); /* Otherwise, the answer is either: - r1: The MCS of the 1st string, and the 2nd string without its first character - r2: The MCS of the 2nd string, and the 1st string without its first character */ r1 = MCS(s1, s2.substr(1)); r2 = MCS(s1.substr(1), s2); return (r1 > r2) ? r1 : r2; // Return the maximum of r1 and r2. } |
It seems to work on our examples:
UNIX> bin/subseq1 AbbbcccFyy FxxxyyyAc 3 UNIX> bin/subseq1 AbbbcccFyy FxxxyyyAccc 4 UNIX> bin/subseq1 9009709947053769973735856707811337348914 9404888367379074754510954922399291912221 <CNTL-C> UNIX>However, it hangs on that last one. Why? Exponential blow-up of duplicate calls, of course. So, let's move to Step Two:
class Subseq { public: map <string, int> cache; int MCS(const string &s1, const string &s2); }; int Subseq::MCS(const string &s1, const string &s2) { string key; // We'll turn the arguments into a string for the cache. int r1, r2; /* Take care of the base cases without worrying about the cache. */ if (s1.size() == 0 || s2.size() == 0) return 0; /* Create the memoization key, and check the cache. */ key = s1 + ":"; key += s2; if (cache.find(key) != cache.end()) return cache[key]; /* If it's not in the cache, then do the recursion and put the answer into the cache. */ if (s1[0] == s2[0]) { cache[key] = 1 + MCS(s1.substr(1), s2.substr(1)); } else { r1 = MCS(s1, s2.substr(1)); r2 = MCS(s1.substr(1), s2); cache[key] = (r1 > r2) ? r1 : r2; } return cache[key]; } |
This works fine on our examples:
UNIX> bin/subseq2 AbbbcccFyy FxxxyyyAc 3 UNIX> bin/subseq2 AbbbcccFyy FxxxyyyAccc 4 UNIX> time bin/subseq2 9009709947053769973735856707811337348914 9404888367379074754510954922399291912221 15 real 0m0.007s user 0m0.004s sys 0m0.002s UNIX>However, this solution should make you feel uneasy. It makes me feel uneasy. The reason is that each memoization key makes a copy of the two strings, and each call to substr() creates a new string which is just one character smaller than the previous string. That's a lot of memory. To hammer home this point, the file txt/sub-big.txt has two 3000-character strings. When we call bin/subseq2 on it, it hangs because it is making so many copies of large strings:
UNIX> bin/subseq2 `cat txt/sub-big.txt`Think about it -- each substring is a suffix of the previous one, so for S1 and S2 there are roughly 3000 suffixes. That means potentially 3000 * 3000 calls to MCS(), each time creating strings of size up to 3000, and memoization keys of size up to 6000. That's a huge amount of time and memory.
Since we are only using suffixes of S1 and S2, we can represent them with indices to their first characters, and we call MCS() on these indices rather than on the strings. I've put that code into src/subseq3.cpp. It doesn't memoize.
class Subseq { public: string s1, s2; int MCS(size_t i1, size_t i2); }; /* This code doesn't memoize -- it's like the step 1 code that worked on strings, only now we're working on indices, and not making any copies of strings. */ int Subseq::MCS(size_t i1, size_t i2) { int r1, r2; if (s1.size() == i1 || s2.size() == i2) return 0; if (s1[i1] == s2[i2]) { return 1 + MCS(i1+1, i2+1); } else { r1 = MCS(i1, i2+1); r2 = MCS(i1+1, i2); return (r1 > r2) ? r1 : r2; } } |
It works, although since it doesn't memoize, it hangs on the long input.
UNIX> bin/subseq3 AbbbcccFyy FxxxyyyAc 3 UNIX> bin/subseq3 AbbbcccFyy FxxxyyyAccc 4 UNIX> bin/subseq3 9009709947053769973735856707811337348914 9404888367379074754510954922399291912221We memoize in src/subseq4.cpp. Our cache is on the indices, so it's a two-dimensional vector, and is of size s1.size() * s2.size():
class Subseq { public: string s1, s2; vector < vector <int> > cache; // Here's our memoization cache. int MCS(size_t i1, size_t i2); }; int Subseq::MCS(size_t i1, size_t i2) { int r1, r2; size_t i; /* Base case. */ if (s1.size() == i1 || s2.size() == i2) return 0; /* Create the cache if necessary -- sentinelize with -1. */ if (cache.size() == 0) { cache.resize(s1.size()); for (i = 0; i < s1.size(); i++) cache[i].resize(s2.size(), -1); } /* If the answer is in the cache, return it. */ if (cache[i1][i2] != -1) return cache[i1][i2]; /* Otherwise, calculate it with recursion. */ if (s1[i1] == s2[i2]) { cache[i1][i2] = 1 + MCS(i1+1, i2+1); } else { r1 = MCS(i1, i2+1); r2 = MCS(i1+1, i2); cache[i1][i2] = (r1 > r2) ? r1 : r2; } return cache[i1][i2]; } |
Now this version is fast, and it even works on the huge input in under a second!
UNIX> bin/subseq4 AbbbcccFyy FxxxyyyAc 3 UNIX> bin/subseq4 AbbbcccFyy FxxxyyyAccc 4 UNIX> bin/subseq4 9009709947053769973735856707811337348914 9404888367379074754510954922399291912221 15 UNIX> time bin/subseq4 `cat txt/sub-big.txt` 891 real 0m0.123s user 0m0.110s sys 0m0.012s UNIX>
int Subseq::MCS() { int r1, r2, i1, i2; size_t i; /* Create the cache, and fill it with zero's. You'll note that cache[i][s2.size()] and cache[s1.size()][i] should equal zero, so we don't have to calculate them. */ cache.resize(s1.size()+1); for (i = 0; i < cache.size(); i++) cache[i].resize(s2.size()+1, 0); /* Now start with i1 and i2 being s1.size()-1 and s2.size()-1 respectively, and fill in the cache from high to low. */ for (i1 = (int) s1.size()-1; i1 >= 0; i1--) { for (i2 = (int) s2.size()-1; i2 >= 0; i2--) { if (s1[i1] == s2[i2]) { cache[i1][i2] = 1 + cache[i1+1][i2+1]; } else { r1 = cache[i1][i2+1]; r2 = cache[i1+1][i2]; cache[i1][i2] = (r1 > r2) ? r1 : r2; } } } /* The final answer is in cache[0][0]. */ return cache[0][0]; } |
This is much faster than the previous code:
UNIX> time bin/subseq5 `cat txt/sub-big.txt` 891 real 0m0.039s user 0m0.026s sys 0m0.011s UNIX>
int Subseq::MCS() { int r1, r2, i, i1, i2, index, other; cache.resize(2); for (i = 0; i < cache.size(); i++) cache[i].resize(s2.size()+1, 0); for (i1 = s1.size()-1; i1 >= 0; i1--) { index = i1%2; other = (i1+1)%2; for (i2 = s2.size()-1; i2 >= 0; i2--) { if (s1[i1] == s2[i2]) { cache[index][i2] = 1 + cache[other][i2+1]; } else { r1 = cache[index][i2+1]; r2 = cache[other][i2]; cache[index][i2] = (r1 > r2) ? r1 : r2; } } } return cache[0][0]; } |
It runs a little faster than before, presumably because it's using less memory:
UNIX> time bin/subseq6 `cat txt/sub-big.txt` 891 real 0m0.026s user 0m0.021s sys 0m0.003s UNIX>
UNIX> time bin/subseq3 `cat txt/sub-big-2.txt` # Step 1 2592 real 0m0.007s user 0m0.003s sys 0m0.003s UNIX> time bin/subseq4 `cat txt/sub-big-2.txt` # Step 2 2592 real 0m0.021s user 0m0.009s sys 0m0.009s UNIX> time bin/subseq5 `cat txt/sub-big-2.txt` # Step 3 2592 real 0m0.038s user 0m0.025s sys 0m0.011s UNIX> time bin/subseq6 `cat txt/sub-big-2.txt` # Step 4 2592 real 0m0.021s user 0m0.017s sys 0m0.003s UNIX> wc txt/sub-big-2.txt # The input is two strings of size 2592. 2 2 5186 txt/sub-big-2.txt UNIX> sort -u txt/sub-big-2.txt | wc # And the two strings are identical. 1 1 2593 UNIX>What is happening? Well, since the strings are identical, the recursive version always calls MCS(i1+1,i2+1). There is no exponential blow-up, and the answer is found in 2593 recursive calls. It's fast!
Now, in steps 2 and 3, they both employ a cache of size 2593*2593. In step 2, creating that cache dominates the running time, because like step 1, it only makes 2593 recursive calls. In step three, once you make the cache, you still have to fill it in -- every element. That's why it takes so long.
In step 4, we're still filling in 2593*2953 cache entries, but our cache's size is only 2593*2. That's why it's faster than step 3, but it doesn't gain significantly over step 2, and is still inferior to step 1.