Example Program #3: The Maximum Subsequence Problem

- November 18, 2009
- James S. Plank
- Directory:
**/home/plank/cs302/Notes/DynamicProgramming**

This one comes from a topcoder article by vorthys. You are to write a program that takes two strings and returns the length of the largest common subsequence of the two strings. A subsequence is a subset of the letters of a string that appears in the order given by the subsequence. For example, "dog" is a subsequence if "dodger", but "red" is not, because the characters don't appear in "dodger" in the correct order.

As always, we start with **Step 1**, which is to spot the recursive solution. To do so,
let's take a look at two examples. Here's the first example:

You should see that these examples are related. In the first, the longest subsequence is "Fyy", and in the second, the longest subsequence is "Accc". This example highlights the difficulty of this problem -- If you start looking for subsequences from the beginning of either string, the result may be affected by the end of the string. We'll return to these examples in a bit. Let's try another example:

Man, that looks hard. Don't even bother trying to solve it. However, one thing you do know --
the maximum subsequence starts with a 9. Why? It has to -- you can prove this to yourself
by contradiction: If the subsequence is *S* and *S* doesn't start with 9, then there is
a valid subsequence which is "9"+*S*.

I'm hoping these examples help you spot the recursion: You will compare the first letters of both substrings, and do different things depending on whether they equal each other:

- If
**S1[0] == S2[0]**, then the answer is one plus the maximum subsequence of**S1.substr(1)**and**S2.substr(1)**. (This method returns the substring starting at index 1 and going to the end of the string). - If
**S1[0] != S2[0]**, then you know that the maximum subsequence either does not start with**S1[0]**or does not start with**S2[0]**. Thus, you should determine the maximum subsequence of**S1**and**S2.substr(1)**, and the maximum subsequence of**S2**and**S1.substr(1)**. The answer has to be one of these.

Let's hack that up. It's in
**subseq1.cpp**. (There's a **main()** that reads the
two strings from the command line.

int LCS(string s1, string s2) { int r1, r2; /* Base case -- if either string is empty, return 0. */ if (s1.size() == 0 || s2.size() == 0) return 0; /* If the first characters of the two strings equal each other, then the answer is one plus the LCS of the two string with the first characters chopped off. */ if (s1[0] == s2[0]) return 1 + LCS(s1.substr(1), s2.substr(1)); /* Otherwise, the answer is either: - The LCS of the 1st string, and the 2nd string without its first character. - The LCS of the 2nd string, and the 1st string without its first character. */ r1 = LCS(s1, s2.substr(1)); r2 = LCS(s1.substr(1), s2); return (r1 > r2) ? r1 : r2; } |

It seems to work on our examples:

UNIX>However, it hangs on that last one. Why? Exponential blow-up of duplicate calls. Let's memoize. This is an easy memoization on the strings -- just concatenate them with a character that can't be in either string -- I'll choose a colon here. I'm going to turn the solution into a class so that storing the cache is easier. It's ing++ -o subseq1 subseq1.cppUNIX>subseq1 AbbbcccFyy FxxxyyyAc3 UNIX>subseq1 AbbbcccFyy FxxxyyyAccc4 UNIX>subseq1 9009709947053769973735856707811337348914 9404888367379074754510954922399291912221

class Subseq { public: map <string, int> cache; int LCS(string s1, string s2); }; int Subseq::LCS(string s1, string s2) { string key; int r1, r2; if (s1.size() == 0 || s2.size() == 0) return 0; key = s1 + ":"; key += s2; if (cache.find(key) != cache.end()) return cache[key]; if (s1[0] == s2[0]) { cache[key] = 1 + LCS(s1.substr(1), s2.substr(1)); } else { r1 = LCS(s1, s2.substr(1)); r2 = LCS(s2, s1.substr(1)); cache[key] = (r1 > r2) ? r1 : r2; } return cache[key]; } |

This works fine on our examples:

UNIX>However, this solution should make you feel uneasy. It makes me feel uneasy. The reason is that each call tog++ -o subseq2 subseq2.cppUNIX>subseq2 AbbbcccFyy FxxxyyyAc3 UNIX>subseq2 AbbbcccFyy FxxxyyyAccc4 UNIX>subseq2 9009709947053769973735856707811337348914 940488836737907475451095492239929191222115 UNIX>

UNIX>Think about it -- each substring is a suffix of the previous one, so for S1 and S2 there are roughly 3000 suffixes. That means potentially 3000 * 3000 calls tosubseq2 `cat sub-big.txt`

Since we are only using suffixes of S1 and S2, we can represent them with indices to their
first characters, and we call **LCS()** on these indices rather than on the strings.
I've put that code into **subseq3.cpp**. It doesn't memoize.

class Subseq { public: string s1, s2; int LCS(int i1, int i2); }; int Subseq::LCS(int i1, int i2) { int r1, r2; if (s1.size() == i1 || s2.size() == i2) return 0; if (s1[i1] == s2[i2]) { return 1 + LCS(i1+1, i2+1); } else { r1 = LCS(i1, i2+1); r2 = LCS(i1+1, i2); return (r1 > r2) ? r1 : r2; } } |

It works, although since it doesn't memoize, it hangs on the long input.

UNIX>We memoize insubseq3 AbbbcccFyy FxxxyyyAc3 UNIX>subseq3 AbbbcccFyy FxxxyyyAccc4 UNIX>subseq3 9009709947053769973735856707811337348914 9404888367379074754510954922399291912221

class Subseq { public: string s1, s2; vector < vector <int> > cache; int LCS(int i1, int i2); }; int Subseq::LCS(int i1, int i2) { int r1, r2, i; if (s1.size() == i1 || s2.size() == i2) return 0; if (cache.size() == 0) { cache.resize(s1.size()); for (i = 0; i < s1.size(); i++) cache[i].resize(s2.size(), -1); } if (cache[i1][i2] != -1) return cache[i1][i2]; if (s1[i1] == s2[i2]) { cache[i1][i2] = 1 + LCS(i1+1, i2+1); } else { r1 = LCS(i1, i2+1); r2 = LCS(i1+1, i2); cache[i1][i2] = (r1 > r2) ? r1 : r2; } return cache[i1][i2]; } |

Now this version is fast, and it even works on the huge input in under a second without compiler optimization!

UNIX>It's faster with optimization:g++ -o subseq4 subseq4.cppUNIX>subseq4 AbbbcccFyy FxxxyyyAc3 UNIX>subseq4 AbbbcccFyy FxxxyyyAccc4 UNIX>subseq4 9009709947053769973735856707811337348914 940488836737907475451095492239929191222115 UNIX>time subseq4 `cat sub-big.txt`891 0.930u 0.040s 0:00.96 101.0% 0+0k 16+0io 0pf+0w UNIX>

UNIX>We can perform step #3 on this by realizing that you perform the recursion on larger indices, so you build the cache from high to low. This is ing++ -O3 -o subseq4 subseq4.cppUNIX> time subseq4 `cat sub-big.txt` 891 0.310u 0.040s 0:00.33 106.0% 0+0k 0+0io 0pf+0w UNIX>

int Subseq::LCS() { int r1, r2, i, i1, i2; cache.resize(s1.size()+1); for (i = 0; i < cache.size(); i++) cache[i].resize(s2.size()+1, -1); for (i = 0; i <= s1.size(); i++) cache[i][s2.size()] = 0; for (i = 0; i <= s2.size(); i++) cache[s1.size()][i] = 0; for (i1 = s1.size()-1; i1 >= 0; i1--) { for (i2 = s2.size()-1; i2 >= 0; i2--) { if (s1[i1] == s2[i2]) { cache[i1][i2] = 1 + cache[i1+1][i2+1]; } else { r1 = cache[i1][i2+1]; r2 = cache[i1+1][i2]; cache[i1][i2] = (r1 > r2) ? r1 : r2; } } } return cache[0][0]; } |

This is much faster than the previous code:

UNIX>Finally, we can perform step 4 on this by only keeping two rows, and performing the arithmetic ontime subseq5 `cat sub-big.txt `891 0.052u 0.036s 0:00.09 88.8% 0+0k 40+0io 0pf+0w UNIX>

int Subseq::LCS() { int r1, r2, i, i1, i2, index, other; cache.resize(2); for (i = 0; i < cache.size(); i++) cache[i].resize(s2.size()+1, 0); for (i1 = s1.size()-1; i1 >= 0; i1--) { index = i1%2; other = (i1+1)%2; for (i2 = s2.size()-1; i2 >= 0; i2--) { if (s1[i1] == s2[i2]) { cache[index][i2] = 1 + cache[other][i2+1]; } else { r1 = cache[index][i2+1]; r2 = cache[other][i2]; cache[index][i2] = (r1 > r2) ? r1 : r2; } } } return cache[0][0]; } |

It runs at the same speed as before, but uses much less memory:

UNIX>Here's a curiousity -- take a look at the timings and the output and see if you can figure out what's going on:time subseq6 `cat sub-big.txt `891 0.052u 0.000s 0:00.05 100.0% 0+0k 0+0io 0pf+0w UNIX>

UNIX>Why is this interesting? Well,wc sub-big.txt2 2 5353 sub-big.txt UNIX>wc sub-big-2.txt2 2 5186 sub-big-2.txt UNIX>time subseq4 `cat sub-big-2.txt`2592 0.000u 0.036s 0:00.04 75.0% 0+0k 0+0io 0pf+0w UNIX>time subseq5 `cat sub-big-2.txt`2592 0.060u 0.020s 0:00.08 100.0% 0+0k 0+0io 0pf+0w UNIX>time subseq6 `cat sub-big-2.txt`2592 0.044u 0.004s 0:00.05 80.0% 0+0k 0+0io 0pf+0w UNIX>

Take a look at the output -- the maximum subsequence size is 2592, pretty much half the size of the input file. You can confirm that the two strings are in fact the same:

UNIX>Why wouldsort -u sub-big-2.txt | wc1 1 2593 UNIX>

I have a much larger call graph for "AbbbcccFyy" and "FxxxyyyAc" below (click on it to see the full-size picture). In this graph, the red edges are calls when the first characters don't match, and the blue edges are when they do match. The calls go down and to the right. I have widened the calls that lead to the correct answer.