CS302 Lecture Notes - Dynamic Programming
Example Program #3: The Maximum Subsequence Problem


This one comes from a topcoder article by vorthys. You are to write a program that takes two strings and returns the length of the largest common subsequence of the two strings. A subsequence is a subset of the letters of a string that appears in the order given by the subsequence. For example, "dog" is a subsequence if "dodger", but "red" is not, because the characters don't appear in "dodger" in the correct order.

As always, we start with Step 1, which is to spot the recursive solution. To do so, let's take a look at two examples. Here's the first example:

S1 = "AbbbcccFyy", S2 = "FxxxyyyAc"

And here's the second example.

S1 = "AbbbcccFyy", S2 = "FxxxyyyAccc"

You should see that these examples are related. In the first, the longest subsequence is "Fyy", and in the second, the longest subsequence is "Accc". This example highlights the difficulty of this problem -- If you start looking for subsequences from the beginning of either string, the result may be affected by the end of the string. We'll return to these examples in a bit. Let's try another example:

S1 = "9009709947053769973735856707811337348914", S2 = "9404888367379074754510954922399291912221"

Man, that looks hard. Don't even bother trying to solve it. However, one thing you do know -- the maximum subsequence starts with a 9. Why? It has to -- you can prove this to yourself by contradiction: If the subsequence is S and S doesn't start with 9, then there is a valid subsequence which is "9"+S.

I'm hoping these examples help you spot the recursion: You will compare the first letters of both substrings, and do different things depending on whether they equal each other:

Let's hack that up. It's in subseq1.cpp. (There's a main() that reads the two strings from the command line.

int LCS(string s1, string s2)
{
  int r1, r2;

  /* Base case -- if either string is empty, return 0. */

  if (s1.size() == 0 || s2.size() == 0) return 0;

  /* If the first characters of the two strings equal each other,
     then the answer is one plus the LCS of the two string with
     the first characters chopped off. */

  if (s1[0] == s2[0]) return 1 + LCS(s1.substr(1), s2.substr(1));

  /* Otherwise, the answer is either:
       - The LCS of the 1st string, and the 2nd string without its first character.
       - The LCS of the 2nd string, and the 1st string without its first character.
   */   

  r1 = LCS(s1, s2.substr(1));
  r2 = LCS(s1.substr(1), s2);
  return (r1 > r2) ? r1 : r2;
}

It seems to work on our examples:

UNIX> g++ -o subseq1 subseq1.cpp
UNIX> subseq1 AbbbcccFyy FxxxyyyAc
3
UNIX> subseq1 AbbbcccFyy FxxxyyyAccc
4
UNIX> subseq1 9009709947053769973735856707811337348914 9404888367379074754510954922399291912221
However, it hangs on that last one. Why? Exponential blow-up of duplicate calls. Let's memoize. This is an easy memoization on the strings -- just concatenate them with a character that can't be in either string -- I'll choose a colon here. I'm going to turn the solution into a class so that storing the cache is easier. It's in subseq2.cpp:

class Subseq {
  public:
    map <string, int> cache;
    int LCS(string s1, string s2);
};

int Subseq::LCS(string s1, string s2)
{
  string key;
  int r1, r2;

  if (s1.size() == 0 || s2.size() == 0) return 0;

  key = s1 + ":";
  key += s2;
  if (cache.find(key) != cache.end()) return cache[key];

  if (s1[0] == s2[0]) {
   cache[key] = 1 + LCS(s1.substr(1), s2.substr(1));
  } else {
    r1 = LCS(s1, s2.substr(1));
    r2 = LCS(s2, s1.substr(1));
    cache[key] = (r1 > r2) ? r1 : r2;
  }
  return cache[key];
}

This works fine on our examples:

UNIX> g++ -o subseq2 subseq2.cpp
UNIX> subseq2 AbbbcccFyy FxxxyyyAc
3
UNIX> subseq2 AbbbcccFyy FxxxyyyAccc
4
UNIX> subseq2 9009709947053769973735856707811337348914 9404888367379074754510954922399291912221
15
UNIX> 
However, this solution should make you feel uneasy. It makes me feel uneasy. The reason is that each call to LCS makes a copy of its arguments, and each call to substr() creates a new string which is just one character smaller than the previous string. That's a lot of memory. To hammer home this point, the file sub-big.txt has two 3000-character strings. When we call subseq2 on it, it hangs because it is making so many copies of large strings:
UNIX> subseq2 `cat sub-big.txt`
Think about it -- each substring is a suffix of the previous one, so for S1 and S2 there are roughly 3000 suffixes. That means potentially 3000 * 3000 calls to LCS(), each time creating strings of size up to 3000. That's a huge amount of time and memory.

Since we are only using suffixes of S1 and S2, we can represent them with indices to their first characters, and we call LCS() on these indices rather than on the strings. I've put that code into subseq3.cpp. It doesn't memoize.

class Subseq {
  public:
    string s1, s2;
    int LCS(int i1, int i2);
};

int Subseq::LCS(int i1, int i2)
{
  int r1, r2;

  if (s1.size() == i1 || s2.size() == i2) return 0;

  if (s1[i1] == s2[i2]) {
    return 1 + LCS(i1+1, i2+1);
  } else {
    r1 = LCS(i1, i2+1);
    r2 = LCS(i1+1, i2);
    return (r1 > r2) ? r1 : r2;
  }
}

It works, although since it doesn't memoize, it hangs on the long input.

UNIX> subseq3 AbbbcccFyy FxxxyyyAc
3
UNIX> subseq3 AbbbcccFyy FxxxyyyAccc
4
UNIX> subseq3 9009709947053769973735856707811337348914 9404888367379074754510954922399291912221
We memoize in subseq4.cpp. Our cache is on the indices, and is thus of size s1.size() * s2.size():

class Subseq {
  public:
    string s1, s2;
    vector < vector <int> > cache;
    int LCS(int i1, int i2);
};

int Subseq::LCS(int i1, int i2)
{
  int r1, r2, i;

  if (s1.size() == i1 || s2.size() == i2) return 0;

  if (cache.size() == 0) {
    cache.resize(s1.size());
    for (i = 0; i < s1.size(); i++) cache[i].resize(s2.size(), -1);
  }

  if (cache[i1][i2] != -1) return cache[i1][i2];

  if (s1[i1] == s2[i2]) {
   cache[i1][i2] = 1 + LCS(i1+1, i2+1);
  } else {
    r1 = LCS(i1, i2+1);
    r2 = LCS(i1+1, i2);
    cache[i1][i2] = (r1 > r2) ? r1 : r2;
  }
  return cache[i1][i2]; 
}

Now this version is fast, and it even works on the huge input in under a second without compiler optimization!

UNIX> g++ -o subseq4 subseq4.cpp
UNIX> subseq4 AbbbcccFyy FxxxyyyAc
3
UNIX> subseq4 AbbbcccFyy FxxxyyyAccc
4
UNIX> subseq4 9009709947053769973735856707811337348914 9404888367379074754510954922399291912221
15
UNIX> time subseq4 `cat sub-big.txt`
891
0.930u 0.040s 0:00.96 101.0%  0+0k 16+0io 0pf+0w
UNIX> 
It's faster with optimization:
UNIX> g++ -O3 -o subseq4 subseq4.cpp
UNIX> time subseq4 `cat sub-big.txt`
891
0.310u 0.040s 0:00.33 106.0%  0+0k 0+0io 0pf+0w
UNIX> 
We can perform step #3 on this by realizing that you perform the recursion on larger indices, so you build the cache from high to low. This is in subseq5.cpp:

int Subseq::LCS()
{
  int r1, r2, i, i1, i2;

  cache.resize(s1.size()+1);
  for (i = 0; i < cache.size(); i++) cache[i].resize(s2.size()+1, -1);

  for (i = 0; i <= s1.size(); i++) cache[i][s2.size()] = 0;
  for (i = 0; i <= s2.size(); i++) cache[s1.size()][i] = 0;

  for (i1 = s1.size()-1; i1 >= 0; i1--) {
    for (i2 = s2.size()-1; i2 >= 0; i2--) {
      if (s1[i1] == s2[i2]) {
       cache[i1][i2] = 1 + cache[i1+1][i2+1];
      } else {
        r1 = cache[i1][i2+1];
        r2 = cache[i1+1][i2];
        cache[i1][i2] = (r1 > r2) ? r1 : r2;
      }
    }
  }
  return cache[0][0]; 
}

This is much faster than the previous code:

UNIX> time subseq5 `cat sub-big.txt `
891
0.052u 0.036s 0:00.09 88.8% 0+0k 40+0io 0pf+0w
UNIX> 
Finally, we can perform step 4 on this by only keeping two rows, and performing the arithmetic on i1 mod 2. Now, the cache only contains (2*s2.size()) entries. The code is in subseq6.cpp:

int Subseq::LCS()
{
  int r1, r2, i, i1, i2, index, other;

  cache.resize(2);
  for (i = 0; i < cache.size(); i++) cache[i].resize(s2.size()+1, 0);

  for (i1 = s1.size()-1; i1 >= 0; i1--) {
    index = i1%2;
    other = (i1+1)%2;
    for (i2 = s2.size()-1; i2 >= 0; i2--) {
      if (s1[i1] == s2[i2]) {
       cache[index][i2] = 1 + cache[other][i2+1];
      } else {
        r1 = cache[index][i2+1];
        r2 = cache[other][i2];
        cache[index][i2] = (r1 > r2) ? r1 : r2;
      }
    }
  }
  return cache[0][0]; 
}

It runs at the same speed as before, but uses much less memory:

UNIX> time subseq6 `cat sub-big.txt `
891
0.052u 0.000s 0:00.05 100.0%  0+0k 0+0io 0pf+0w
UNIX> 
Here's a curiousity -- take a look at the timings and the output and see if you can figure out what's going on:
UNIX> wc sub-big.txt
   2    2 5353 sub-big.txt
UNIX> wc sub-big-2.txt
   2    2 5186 sub-big-2.txt
UNIX> time subseq4 `cat sub-big-2.txt`
2592
0.000u 0.036s 0:00.04 75.0% 0+0k 0+0io 0pf+0w
UNIX> time subseq5 `cat sub-big-2.txt`
2592
0.060u 0.020s 0:00.08 100.0%  0+0k 0+0io 0pf+0w
UNIX> time subseq6 `cat sub-big-2.txt`
2592
0.044u 0.004s 0:00.05 80.0% 0+0k 0+0io 0pf+0w
UNIX> 
Why is this interesting? Well, sub-big-2.txt is nearly the same size as sub-big.txt. subseq5 and subseq6 take about the same amount of time to run on it, but subseq4 is much faster. Why would that be the case?

Take a look at the output -- the maximum subsequence size is 2592, pretty much half the size of the input file. You can confirm that the two strings are in fact the same:

UNIX> sort -u sub-big-2.txt | wc
      1       1    2593
UNIX> 
Why would subseq4 be so much faster than the others -- well, because it simply does the one recursion at every step, ending after 2592 recursive calls. subseq5, on the other hand, builds that 2592x2592 cache without realizing that pretty much none of the cache is getting used. Same with subseq6, except it only stores the last two rows.

Some Call Graphs

In case some pictures help, here's the call graph of calling subseq1.cpp on "Dog" and "Dodger". You should see how this gives us an answer of three.

I have a much larger call graph for "AbbbcccFyy" and "FxxxyyyAc" below (click on it to see the full-size picture). In this graph, the red edges are calls when the first characters don't match, and the blue edges are when they do match. The calls go down and to the right. I have widened the calls that lead to the correct answer.