CS302 Lecture Notes - Dynamic Programming
Example program #4: ConvertibleStrings

James S. Plank

Original Notes: Thu Nov 14 21:59:54 EST 2013.
Latest revision: Mon Nov 9 10:27:28 EST 2020

This is from Topcoder SRM 591, Division 2, 500-point problem.
Problem Statement.

In case Topcoder's servers are not working, here is a summary of the problem:

You are given two strings with upper-case characters from 'A' to 'I'.
Call these strings, A and B.
They are the same length which is ≤ 50.
You are going to choose some number of indices, and remove the characters at these indices in both A and B.
When you are done, you should be able to convert A to B with a simple substitution cipher -- in other words, for each character c from 'A' to 'I', there is a substitution S[c], such that S[A[i]] always equals B[i].
Return the minimum number of indices that you have to delete to achieve this.

Examples

0: A: "DD"
   B: "FF"
   Answer: 0 -- If you change D's to F's, you're done.  No deletion is required.
    	
1: A: "AAAA"
   B: "ABCD"
   Answer: 3 -- Since you can only map A's to one character.  Whichever it is -- A, B, C or D,
                you'll have to delete the other indices from both strings.
    	
2: A: "AAIAIA"
   B: "BCDBEE"
   Answer: 3 -- Delete indices 1, 2 and 5, and A becomes "AAI" and B becomes "BBE".  That works.

3: A: "ABACDCECDCDAAABBFBEHBDFDDHHD"
   B: "GBGCDCECDCHAAIBBFHEBBDFHHHHE"
   Answer: 9 -- We'll have to program this one to get it right.

Approach with Dynamic Programming

This one screams dynamic programming. As always, the hard part is to spot the recursion. Here's how I thought about it. You have your two strings, A and B. Consider the first character of each. Either you are going to remove that character from each string, or you are going to keep the character, which means that you'll match the character in A with the character in B. In either case, you can solve a smaller sub-problem, and use that solution to solve your problem.

Let's think about it in terms of a concrete example. I work Example 2 to completion below, but I'm going to start with a harder one here to motivate the recursion. I've put this in the main as example 4.

A = "DEFDEDFFDEED", B = "WYZYXYWYZYXY"

Now, consider the first character of each string -- this is the character 'D' for A, and 'W' for B. Our solution is going to be one of the following:

Either remove those first characters from both strings, and solve the subproblem of A = "EFDEDFFDEED", B = "YZYXYWYZYXY". The total number of character removals will be the solution to the subproblem, plus one for removing the initial 'D' and 'W'.
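Or, we match 'D' to 'W'. This will have implications for the rest of the two strings. Because the cipher is one-to-one, every 'D' in A has to line up with a 'W' in B, and no other character in A may line up with a 'W'. So whenever we see a 'D' in A or a 'W' in B, we need to consider the fact that we have matched 'D' to 'W'. Let's take a look at the two strings, and the D's and W's: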
```
        " D E F D E D F F D E E D " 
          |     |   | |   |     | 
          |     |   | |   |     | 
          M     X   X X   X     X 
          |     |   | |   |     | 
          |     |   | |   |     | 
        " W Y Z Y X Y W Y Z Y X Y " 
```
There are six indices i where either A[i] equals 'D' or B[i] equals 'W'. In all but the first, the D's and W's don't line up with each other. That means that if we match 'D' and 'W', we are going to have to remove those five indices from both strings, incurring a penalty of five.
Now, to make the recursive call, we should remove all six occurrences, basically getting rid of all of the D's and W's. We will make a recursive call with A = "EFEFEE" and B = "YZXYYX", and add five to the answer.

Whichever of these approaches yields the smaller number will be the answer.
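
The "match" step above can be sketched in code. This is only a minimal illustration, not the program linked later in these notes: the struct MatchResult and the function match_first are hypothetical names, and the sketch assumes both strings are non-empty and of equal length.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for the "match the first characters" step.
struct MatchResult {
    int removals;       // indices deleted because they conflict with the match
    std::string restA;  // what remains of a for the recursive call
    std::string restB;  // what remains of b for the recursive call
};

// Match a[0] to b[0], then drop every index that involves either character.
// Assumes a and b are non-empty and have the same length.
MatchResult match_first(const std::string &a, const std::string &b) {
    assert(!a.empty() && a.size() == b.size());
    MatchResult m{0, "", ""};
    for (std::size_t i = 0; i < a.size(); i++) {
        const bool hitA = (a[i] == a[0]);
        const bool hitB = (b[i] == b[0]);
        if (hitA && hitB) continue;   // agrees with the match: kept, costs nothing
        if (hitA || hitB) {           // only one side agrees: must be deleted
            m.removals++;
            continue;
        }
        m.restA += a[i];              // index is unaffected by this match
        m.restB += b[i];
    }
    return m;
}
```

Calling match_first("DEFDEDFFDEED", "WYZYXYWYZYXY") yields five removals and the subproblem "EFEFEE" / "YZXYYX", which agrees with the numbers worked out by hand above.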

Let's run through a second example, this time all the way to completion. This is example 2 from the Topcoder problem:

A = "AAIAIA", B = "BCDBEE"

Suppose we remove the first character from A and B. Then the number of overall removals is going to be one plus the minimum number of removals when you set A to "AIAIA" and B to "CDBEE".

Now suppose instead that we don't remove the first character. Then 'A' in A will match with 'B' in B. So, we run through both strings, and whenever there is an 'A' in A, or a 'B' in B, we'll have to decide whether this will cause us to remove the characters, or whether they match appropriately. Let's draw the same picture as above:

```
        " A A I A I A " 
          | |   |   | 
          | |   |   | 
          M X   M   X 
          | |   |   | 
          | |   |   | 
        " B C D B E E "
```

As you can see, two of them match, and two of them must be removed. For the recursive call, we'll remove all of those indices, leaving us with A = "II" and B = "DE". We'll add two to the recursive call, because of the two non-matching characters above.

So, to summarize, we are going to do two things with the first characters of A and B:

Remove them from both strings and solve the subproblem. The answer is the solution to the sub-problem, plus one.
Match them. This may cause us to remove other non-matching characters in the remainder of the string. Let's call the number of such removals R. Create the sub-problem by deleting all instances of the first character in A (and their corresponding characters in B), and all instances of the first character in B (and their corresponding characters in A). The answer is the solution to the sub-problem, plus R.

Whichever of these solutions is the minimum is our answer.

There's your recursion. Now, this is a dynamic program, so you have to memoize. I had my cache be a map that I key on a concatenation of A and B. With the example above, the first key is "AAIAIABCDBEE".

Hack it up. This one is a really nice practice DP problem.

My solution is here (there is a main() in that program so that you can run it from the command line).
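
The linked program is not reproduced here. What follows is a separate, minimal sketch of the memoized recursion as described above, with the cache keyed on the concatenation of A and B. The function name least_removals and the global cache are my own choices for illustration, and the sketch assumes the two strings have equal length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Cache keyed on a + b. Both strings have the same length, so the
// concatenation identifies the pair unambiguously.
static std::map<std::string, int> cache;

// Hypothetical name: returns the minimum number of indices to delete.
int least_removals(const std::string &a, const std::string &b) {
    assert(a.size() == b.size());
    if (a.empty()) return 0;  // base case: nothing left to convert

    const std::string key = a + b;
    std::map<std::string, int>::const_iterator it = cache.find(key);
    if (it != cache.end()) return it->second;

    // Choice 1: delete index 0 from both strings.
    int best = 1 + least_removals(a.substr(1), b.substr(1));

    // Choice 2: match a[0] to b[0]. Count the conflicting indices (r)
    // and build the subproblem from the indices that involve neither character.
    int r = 0;
    std::string restA, restB;
    for (std::size_t i = 0; i < a.size(); i++) {
        const bool hitA = (a[i] == a[0]);
        const bool hitB = (b[i] == b[0]);
        if (hitA && hitB) continue;        // consistent with the match
        if (hitA || hitB) { r++; continue; }  // conflict: delete
        restA += a[i];
        restB += b[i];
    }
    best = std::min(best, r + least_removals(restA, restB));

    cache[key] = best;
    return best;
}
```

On the four Topcoder examples listed at the top of these notes, this sketch returns 0, 3, 3 and 9.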

If you want to walk through this in detail, the picture below shows the call graph of example 2. Each node makes two recursive calls, which are represented by edges to other nodes. When the edges leave the bottom of a node, it's because we are removing the first characters and recursively calling the procedure on the remaining characters. For that reason, the edge weights are always one.

When the edge leaves the right of a node, it's because we are matching the first characters, and then we have to remove any non-matching characters in the remaining strings. The edge weights are variable now. For the starting node, the edge to its right has a weight of two, because we have to remove two characters when we match 'A' to 'B'. For the node "IA EE", the edge to the right goes to the empty string with a weight of one, because when you assign 'I' to 'E', you have to remove the 'A'.

"But Dr. Plank, is this Dynamic Programming, Topological Sort, or Dijkstra's Algorithm?"

Good question. You'll note that the graph above is a directed acyclic graph, and you are looking for the shortest path from the starting node to the node with the null string. So you can solve it in three different ways:

As a shortest path problem. In fact, I'm guessing that this is truly the most efficient way to solve the problem, because what you do in this instance is create the graph as you process Dijkstra's algorithm. You start just with the starting node ("AAIAIA BCDBEE") and put it onto the Dijkstra map. Then you process the map. When you process a node, you process both edges, creating the nodes if necessary, and putting them onto the map. In this way, you don't actually have to create all of the nodes, because Dijkstra's algorithm terminates when it finds the minimum path.
As a topological sort. Each time you remove a node, you know the node's minimum distance to the starting node. When you remove the last node, you have your answer. You know, this is really the "step 3" solution to the problem -- removing the recursion.
As a dynamic program. If you think about it, the dynamic program is like a DFS on an acyclic graph. That's pretty cool.
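
To make the shortest-path view concrete, here is a hedged sketch that builds the graph lazily while running Dijkstra's algorithm. It swaps in a std::priority_queue where the text above talks about a map, and the function name shortest_path_removals is hypothetical. Nodes are pairs of strings, the two outgoing edges are "remove the first characters" (weight 1) and "match the first characters" (weight equal to the number of conflicts), and the search stops as soon as the empty-string node is removed from the queue.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical name: Dijkstra over lazily generated (a, b) nodes.
int shortest_path_removals(const std::string &a0, const std::string &b0) {
    assert(a0.size() == b0.size());
    typedef std::tuple<int, std::string, std::string> Entry;  // (distance, a, b)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > pq;
    std::map<std::string, int> dist;  // best known distance, keyed on a + b

    dist[a0 + b0] = 0;
    pq.push(Entry(0, a0, b0));

    while (!pq.empty()) {
        const Entry e = pq.top();
        pq.pop();
        const int d = std::get<0>(e);
        const std::string &a = std::get<1>(e);
        const std::string &b = std::get<2>(e);

        if (a.empty()) return d;        // reached the empty-string node
        if (d > dist[a + b]) continue;  // stale queue entry, already improved

        // Create a child node if needed and record a shorter distance.
        auto relax = [&](const std::string &na, const std::string &nb, int w) {
            const std::string key = na + nb;
            std::map<std::string, int>::iterator it = dist.find(key);
            if (it == dist.end() || d + w < it->second) {
                dist[key] = d + w;
                pq.push(Entry(d + w, na, nb));
            }
        };

        // Edge 1: remove the first characters (weight 1).
        relax(a.substr(1), b.substr(1), 1);

        // Edge 2: match the first characters (weight = number of conflicts).
        int r = 0;
        std::string restA, restB;
        for (std::size_t i = 0; i < a.size(); i++) {
            const bool hitA = (a[i] == a[0]);
            const bool hitB = (b[i] == b[0]);
            if (hitA && hitB) continue;
            if (hitA || hitB) { r++; continue; }
            restA += a[i];
            restB += b[i];
        }
        relax(restA, restB, r);
    }
    return -1;  // not reached: the empty-string node is always reachable
}
```

The match edge can have weight zero, which is fine for Dijkstra's algorithm since it only requires non-negative weights. I have not measured whether this is actually faster than the memoized recursion.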

A final note on using enumeration to solve this problem

Given the constraints, enumeration is a possible way to solve this problem. Think to yourself -- what kind of enumeration will work?

Div/mod? No, we're not enumerating fixed-digit numbers in a base or fixed-size words from an alphabet.
Power set? Always a thought. You could enumerate subsets of A, and for each subset, check to see if the corresponding subset of B will work, and if so, the answer is (|A|-|subset|). Unfortunately, though, A can be up to 50 characters, so this is too expensive (2⁵⁰).
Permutations? Hmmm -- the substitution cipher is a permutation of the characters from 'A' to 'I'. So, we could generate all of those permutations, and for each permutation, determine how many characters you'd have to delete so that the remaining characters are legal according to the cipher. How many permutations? factorial(9), which is under 1,000,000. It would work! My guess is that this is why Topcoder kept the number of legal characters so low.
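
Here is a minimal sketch of the permutation enumeration, using std::next_permutation from the standard library. The function name enumerate_removals is hypothetical, and the sketch assumes every character in both strings is between 'A' and 'I', as the problem guarantees (so it will not work on the 'W'-'Z' example used earlier in these notes).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical name: try all 9! ciphers and keep the cheapest one.
// Assumes every character of a and b lies in 'A'..'I'.
int enumerate_removals(const std::string &a, const std::string &b) {
    assert(a.size() == b.size());
    std::string perm = "ABCDEFGHI";  // perm[c - 'A'] plays the role of S[c]
    int best = static_cast<int>(a.size());

    // perm starts in sorted order, so the loop visits every permutation once.
    do {
        int removed = 0;
        for (std::size_t i = 0; i < a.size(); i++) {
            // Index i must be deleted if the cipher does not send a[i] to b[i].
            if (perm[a[i] - 'A'] != b[i]) removed++;
        }
        best = std::min(best, removed);
    } while (std::next_permutation(perm.begin(), perm.end()));

    return best;
}
```

With 362,880 permutations and at most 50 characters per check, that is roughly 18 million character comparisons in the worst case.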