CS302 Lecture Notes - Dynamic Programming

CS302 Lecture Notes - Dynamic Programming
Example Program #3: The Maximum Subsequence Problem

November 18, 2009
Latest revision: November, 2020.
James S. Plank
Directory: /home/jplank/cs302/Notes/DynamicProgramming

This problem also appears as Leetcode problem #1143.

This one originally comes from a topcoder article by vorthys. You are to write a program that takes two strings and returns the length of the largest common subsequence of the two strings. A subsequence of a string is any string that you can create by deleting one or more letters of the original string. For example, "dog" is a subsequence if "dodger", but "red" is not. Even though the characters appear in "dodger", you cannot create "red" by deleting characters from "dodger."

You can compile the programs with "make subseq".

Step One

As always, we start with Step 1, which is to spot the recursive solution. To do so, let's take a look at two examples. Here's the first example:

S1 = "AbbbcccFyy", S2 = "FxxxyyyAc"

And here's the second example.

S1 = "AbbbcccFyy", S2 = "FxxxyyyAccc"

You should see that these examples are related. In the first, the longest subsequence is "Fyy", and in the second, the longest subsequence is "Accc". This example highlights the difficulty of this problem -- If you start looking for subsequences from the beginning of either string, the result may be affected by the end of the string. We'll return to these examples in a bit. Let's try another example:

S1 = "9009709947053769973735856707811337348914", S2 = "9404888367379074754510954922399291912221"

Man, that looks hard. Don't even bother trying to solve it. However, one thing you do know -- the maximum subsequence starts with a 9. Why? It has to -- you can prove this to yourself by contradiction: If the subsequence is S and S doesn't start with 9, then there is a valid subsequence which is "9"+S.

I'm hoping these examples help you spot the recursion: You will compare the first letters of both substrings, and do different things depending on whether they equal each other:

If S1[0] == S2[0], then the answer is one plus the maximum subsequence of S1.substr(1) and S2.substr(1). (This method returns the substring starting at index 1 and going to the end of the string).
If S1[0] != S2[0], then you know that the maximum subsequence either does not start with S1[0] or does not start with S2[0]. Thus, you should determine the maximum subsequence of S1 and S2.substr(1), and the maximum subsequence of S2 and S1.substr(1). The answer has to be one of these.

Let's hack that up. It's in src/subseq1.cpp and is straightforward:

int Subseq::MCS(const string &s1, const string &s2)
{
  int r1, r2;

  /* Base case -- if either string is empty, return 0. */

  if (s1.size() == 0 || s2.size() == 0) return 0;

  /* If the first characters of the two strings equal each other,
     then the answer is one plus the MCS of the two string with
     the first characters chopped off. */

  if (s1[0] == s2[0]) return 1 + MCS(s1.substr(1), s2.substr(1));

  /* Otherwise, the answer is either:
       - r1: The MCS of the 1st string, and the 2nd string without its first character
       - r2: The MCS of the 2nd string, and the 1st string without its first character
   */   

  r1 = MCS(s1, s2.substr(1));
  r2 = MCS(s1.substr(1), s2);
  return (r1 > r2) ? r1 : r2;   // Return the maximum of r1 and r2.
}

It seems to work on our examples:

UNIX> bin/subseq1 AbbbcccFyy FxxxyyyAc
3
UNIX> bin/subseq1 AbbbcccFyy FxxxyyyAccc
4
UNIX> bin/subseq1 9009709947053769973735856707811337348914 9404888367379074754510954922399291912221
<CNTL-C>
UNIX>

However, it hangs on that last one. Why? Exponential blow-up of duplicate calls, of course. So, let's move to Step Two:

Step Two

This is an easy memoization on the strings -- just concatenate them with a character that can't be in either string -- I'll choose a colon here. Now, I'm going to turn the solution into a class so that storing the cache is easier. It's in src/subseq2.cpp:

class Subseq {
  public:
    map <string, int> cache;
    int MCS(const string &s1, const string &s2);
};

int Subseq::MCS(const string &s1, const string &s2)
{
  string key;    // We'll turn the arguments into a string for the cache.
  int r1, r2;

  /* Take care of the base cases without worrying about the cache. */

  if (s1.size() == 0 || s2.size() == 0) return 0;

  /* Create the memoization key, and check the cache. */

  key = s1 + ":";
  key += s2;
  if (cache.find(key) != cache.end()) return cache[key];

  /* If it's not in the cache, then do the recursion and put the answer into the cache. */

  if (s1[0] == s2[0]) {
   cache[key] = 1 + MCS(s1.substr(1), s2.substr(1));
  } else {
    r1 = MCS(s1, s2.substr(1));
    r2 = MCS(s1.substr(1), s2);
    cache[key] = (r1 > r2) ? r1 : r2;
  }
  return cache[key];
}

This works fine on our examples:

UNIX> bin/subseq2 AbbbcccFyy FxxxyyyAc
3
UNIX> bin/subseq2 AbbbcccFyy FxxxyyyAccc
4
UNIX> time bin/subseq2 9009709947053769973735856707811337348914 9404888367379074754510954922399291912221
15

real	0m0.007s
user	0m0.004s
sys	0m0.002s
UNIX>

However, this solution should make you feel uneasy. It makes me feel uneasy. The reason is that each memoization key makes a copy of the two strings, and each call to substr() creates a new string which is just one character smaller than the previous string. That's a lot of memory. To hammer home this point, the file txt/sub-big.txt has two 3000-character strings. When we call bin/subseq2 on it, it hangs because it is making so many copies of large strings:

UNIX> bin/subseq2 `cat txt/sub-big.txt`

Think about it -- each substring is a suffix of the previous one, so for S1 and S2 there are roughly 3000 suffixes. That means potentially 3000 * 3000 calls to MCS(), each time creating strings of size up to 3000, and memoization keys of size up to 6000. That's a huge amount of time and memory.

Since we are only using suffixes of S1 and S2, we can represent them with indices to their first characters, and we call MCS() on these indices rather than on the strings. I've put that code into src/subseq3.cpp. It doesn't memoize.

class Subseq {
  public:
    string s1, s2;
    int MCS(size_t i1, size_t i2);
};

/* This code doesn't memoize -- it's like the step 1 code that worked on strings,
   only now we're working on indices, and not making any copies of strings. */

int Subseq::MCS(size_t i1, size_t i2)
{
  int r1, r2;

  if (s1.size() == i1 || s2.size() == i2) return 0;

  if (s1[i1] == s2[i2]) {
    return 1 + MCS(i1+1, i2+1);
  } else {
    r1 = MCS(i1, i2+1);
    r2 = MCS(i1+1, i2);
    return (r1 > r2) ? r1 : r2;
  }
}

It works, although since it doesn't memoize, it hangs on the long input.

UNIX> bin/subseq3 AbbbcccFyy FxxxyyyAc
3
UNIX> bin/subseq3 AbbbcccFyy FxxxyyyAccc
4
UNIX> bin/subseq3 9009709947053769973735856707811337348914 9404888367379074754510954922399291912221

We memoize in src/subseq4.cpp. Our cache is on the indices, so it's a two-dimensional vector, and is of size s1.size() * s2.size():

class Subseq {
  public:
    string s1, s2;
    vector < vector <int> > cache;      // Here's our memoization cache.
    int MCS(size_t i1, size_t i2);
};

int Subseq::MCS(size_t i1, size_t i2)
{
  int r1, r2;
  size_t i;

  /* Base case. */

  if (s1.size() == i1 || s2.size() == i2) return 0;

  /* Create the cache if necessary -- sentinelize with -1. */

  if (cache.size() == 0) {
    cache.resize(s1.size());
    for (i = 0; i < s1.size(); i++) cache[i].resize(s2.size(), -1);
  }

  /* If the answer is in the cache, return it. */

  if (cache[i1][i2] != -1) return cache[i1][i2];

  /* Otherwise, calculate it with recursion. */

  if (s1[i1] == s2[i2]) {
   cache[i1][i2] = 1 + MCS(i1+1, i2+1);
  } else {
    r1 = MCS(i1, i2+1);
    r2 = MCS(i1+1, i2);
    cache[i1][i2] = (r1 > r2) ? r1 : r2;
  }
  return cache[i1][i2]; 
}

Now this version is fast, and it even works on the huge input in under a second!

UNIX> bin/subseq4 AbbbcccFyy FxxxyyyAc
3
UNIX> bin/subseq4 AbbbcccFyy FxxxyyyAccc
4
UNIX> bin/subseq4 9009709947053769973735856707811337348914 9404888367379074754510954922399291912221
15
UNIX> time bin/subseq4 `cat txt/sub-big.txt`
891

real	0m0.123s
user	0m0.110s
sys	0m0.012s
UNIX>

Step 3

We can perform step #3 on this by realizing that you perform the recursion on larger indices, so you build the cache from high to low, rather than low to high in the Fibonacci and Coin programs. This is in src/subseq5.cpp:

int Subseq::MCS()
{
  int r1, r2, i1, i2;
  size_t i;

  /* Create the cache, and fill it with zero's. 
     You'll note that cache[i][s2.size()] and cache[s1.size()][i] should equal zero,
     so we don't have to calculate them. */

  cache.resize(s1.size()+1);
  for (i = 0; i < cache.size(); i++) cache[i].resize(s2.size()+1, 0);

  /* Now start with i1 and i2 being s1.size()-1 and s2.size()-1 respectively, and 
     fill in the cache from high to low.  */

  for (i1 = (int) s1.size()-1; i1 >= 0; i1--) {
    for (i2 = (int) s2.size()-1; i2 >= 0; i2--) {
      if (s1[i1] == s2[i2]) {
       cache[i1][i2] = 1 + cache[i1+1][i2+1];
      } else {
        r1 = cache[i1][i2+1];
        r2 = cache[i1+1][i2];
        cache[i1][i2] = (r1 > r2) ? r1 : r2;
      }
    }
  }

  /* The final answer is in cache[0][0]. */

  return cache[0][0]; 
}

This is much faster than the previous code:

UNIX> time bin/subseq5 `cat txt/sub-big.txt`
891

real	0m0.039s
user	0m0.026s
sys	0m0.011s
UNIX>

Step 4

Finally, we can perform step 4 on this by only keeping two rows, and performing the arithmetic on i1 mod 2. Now, the cache only contains (2*s2.size()) entries. The code is in

int Subseq::MCS()
{
  int r1, r2, i, i1, i2, index, other;

  cache.resize(2);
  for (i = 0; i < cache.size(); i++) cache[i].resize(s2.size()+1, 0);

  for (i1 = s1.size()-1; i1 >= 0; i1--) {
    index = i1%2;
    other = (i1+1)%2;
    for (i2 = s2.size()-1; i2 >= 0; i2--) {
      if (s1[i1] == s2[i2]) {
       cache[index][i2] = 1 + cache[other][i2+1];
      } else {
        r1 = cache[index][i2+1];
        r2 = cache[other][i2];
        cache[index][i2] = (r1 > r2) ? r1 : r2;
      }
    }
  }
  return cache[0][0]; 
}

It runs a little faster than before, presumably because it's using less memory:

UNIX> time bin/subseq6 `cat txt/sub-big.txt`
891

real	0m0.026s
user	0m0.021s
sys	0m0.003s
UNIX>

When Step 1 is faster than the others

Here's a curiosity -- take a look at the timings and the outputs and see if you can figure out what's going on:

UNIX> time bin/subseq3 `cat txt/sub-big-2.txt`                 # Step 1
2592

real	0m0.007s
user	0m0.003s
sys	0m0.003s
UNIX> time bin/subseq4 `cat txt/sub-big-2.txt`                 # Step 2
2592

real	0m0.021s
user	0m0.009s
sys	0m0.009s
UNIX> time bin/subseq5 `cat txt/sub-big-2.txt`                 # Step 3
2592

real	0m0.038s
user	0m0.025s
sys	0m0.011s
UNIX> time bin/subseq6 `cat txt/sub-big-2.txt`                 # Step 4
2592

real	0m0.021s
user	0m0.017s
sys	0m0.003s
UNIX> wc txt/sub-big-2.txt                                    # The input is two strings of size 2592.
       2       2    5186 txt/sub-big-2.txt
UNIX> sort -u txt/sub-big-2.txt | wc                          # And the two strings are identical.
       1       1    2593
UNIX>

What is happening? Well, since the strings are identical, the recursive version always calls MCS(i1+1,i2+1). There is no exponential blow-up, and the answer is found in 2593 recursive calls. It's fast!

Now, in steps 2 and 3, they both employ a cache of size 2593*2593. In step 2, creating that cache dominates the running time, because like step 1, it only makes 2593 recursive calls. In step three, once you make the cache, you still have to fill it in -- every element. That's why it takes so long.

In step 4, we're still filling in 2593*2953 cache entries, but our cache's size is only 2593*2. That's why it's faster than step 3, but it doesn't gain significantly over step 2, and is still inferior to step 1.