CS302 Lecture Notes - Dynamic Programming
Example program #4: ConvertibleStrings
James S. Plank
Original Notes: Thu Nov 14 21:59:54 EST 2013.
Latest revision: Mon Nov 9 10:27:28 EST 2020
This is from Topcoder SRM 591, Division 2, 500-point problem.
Problem Statement.
In case Topcoder's servers are not working, here is a summary of the problem:
- You are given two strings with upper-case characters from 'A' to 'I'.
- Call these strings A and B.
- They have the same length, which is ≤ 50.
- You are going to choose some number of indices, and remove the characters
at these indices in both A and B.
- When you are done, you should be able to convert A to B with
a simple substitution cipher -- in other words, for each character c from 'A' to
'I', there is a substitution S[c], such that S[A[i]] always equals B[i]. The cipher
is a permutation, so no two characters map to the same character.
- Return the minimum number of indices that you have to delete to achieve this.
Examples
0: A: "DD"
B: "FF"
Answer: 0 -- If you change D's to F's, you're done. No deletion is required.
1: A: "AAAA"
B: "ABCD"
Answer: 3 -- Since you can only map A's to one character. Whichever it is -- A, B, C or D,
you'll have to delete the other indices from both strings.
2: A: "AAIAIA"
B: "BCDBEE"
Answer: 3 -- Delete indices 1, 2 and 5, and A becomes "AAI" and B becomes "BBE". That works.
3: A: "ABACDCECDCDAAABBFBEHBDFDDHHD"
B: "GBGCDCECDCHAAIBBFHEBBDFHHHHE"
Answer: 9 -- We'll have to program this one to get it right.
Approach with Dynamic Programming
This one screams dynamic programming. As always, the hard part is to
spot the recursion. Here's how I thought about it. You have your
two strings, A and B. Consider the first character of each. Either
you are going to remove that character from each string, or you are going to keep
the character, which means that you'll match the character in A with the character in B.
In either case, you can solve a smaller sub-problem, and use that solution to solve
your problem.
Let's think about it in terms of a concrete example. I work Example 2 to completion
below, but I'm going to start with a harder one here to motivate the recursion.
I've put this in the main as example 4.
A = "DEFDEDFFDEED", B = "WYZYXYWYZYXY"
Now, consider the first character of each string -- this is the character 'D' for A,
and 'W' for B. Our solution is going to be one of the following:
- Either remove those first characters from both strings, and solve the subproblem
of A = "EFDEDFFDEED", B = "YZYXYWYZYXY". The total number of character removals will
be the solution to the subproblem, plus one for removing the initial 'D' and 'W'.
- Or, we match 'D' to 'W'. This will have implications for the rest of the two strings.
Whenever we see a 'D' in A or a 'W' in B, we need to consider the fact that we have matched
'D' to 'W'. Let's take a look at the two strings, and the D's and W's
(M marks an index where the pair matches, X an index that must be removed):
" D E F D E D F F D E E D "
  |     |   | |   |     |
  |     |   | |   |     |
  M     X   X X   X     X
  |     |   | |   |     |
  |     |   | |   |     |
" W Y Z Y X Y W Y Z Y X Y "
There are six indices i where either A[i] equals 'D' or B[i] equals 'W'. In all but the
first, the D's and W's don't match. That means that if we match 'D' and 'W', we are going to
have to remove those five characters from both strings, incurring a penalty of five.
Now, to make the recursive call, we should remove all six occurrences, basically getting rid of
all of the D's and W's. We will make a recursive call with A = "EFEFEE" and B = "YZXYYX", and
add five to the answer.
Whichever of these approaches yields the smaller number will be the answer.
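The "match" step above can be sketched as a small helper. This is only an illustration, not the code from the linked solution: the names MatchResult and match_first are hypothetical, and the helper assumes that both strings are non-empty and have the same length.

```cpp
#include <string>

// Result of matching the first character of A with the first character of B.
struct MatchResult {
  int penalty;        // Number of conflicting indices that must be removed.
  std::string a, b;   // What is left of A and B for the recursive call.
};

// Hypothetical helper (the name is an assumption for this sketch).
// Assumes a and b are non-empty and have the same length.
MatchResult match_first(const std::string &a, const std::string &b)
{
  MatchResult r;
  r.penalty = 0;
  for (size_t i = 0; i < a.size(); i++) {
    bool a_hit = (a[i] == a[0]);
    bool b_hit = (b[i] == b[0]);
    if (a_hit != b_hit) {
      r.penalty++;            // Only one side is involved: a conflict, so remove it.
    } else if (!a_hit) {
      r.a.push_back(a[i]);    // Neither side is involved: keep it for the subproblem.
      r.b.push_back(b[i]);
    }                         // Both sides are involved: a match, dropped at no cost.
  }
  return r;
}
```

On the example above, match_first("DEFDEDFFDEED", "WYZYXYWYZYXY") gives a penalty of 5 and leaves "EFEFEE" and "YZXYYX".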
Let's run through a second example, this time all the way to completion. This is
example 2 from the Topcoder problem:
A = "AAIAIA", B = "BCDBEE"
Suppose we remove the first character from A and B.
Then the number of overall removals is going to be one plus the minimum number
of removals when you set A to "AIAIA" and B to "CDBEE".
Now suppose instead that we don't remove the first character. Then
'A' in A will match with 'B' in B. So, we run through both strings,
and whenever there is an 'A' in A, or a 'B' in B, we'll have to decide
whether this will cause us to remove the characters, or whether they
match appropriately. Let's draw the same picture as above:
" A A I A I A "
  | |   |   |
  | |   |   |
  M X   M   X
  | |   |   |
  | |   |   |
" B C D B E E "
As you can see, two of them match, and two of them must be removed. For the recursive
call, we'll remove all of those indices, leaving us with A = "II" and B = "DE". We'll
add two to the recursive call, because of the two non-matching characters above.
So, to summarize, we are going to do two things with the first characters of A
and B:
- Remove them from both strings and solve the subproblem. The answer is
the solution to the sub-problem, plus one.
- Match them. This may cause us to remove other non-matching characters
in the remainder of the string. Let's call the number of such removals R.
Create the sub-problem by deleting all instances
of the first character in A (and their corresponding characters in B),
and all instances of the first character in B (and their corresponding characters in A).
The answer is the solution to the sub-problem, plus R.
Whichever of these solutions is the minimum is our answer.
There's your recursion. Now, this is a dynamic program, so you have to memoize.
I had my cache be a map that I key on a concatenation
of A and B. With the example above, the first key is "AAIAIABCDBEE".
Hack it up. This one is a really nice practice DP problem.
My solution is here (there is a main() in that
program so that you can run it from the command line).
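If you want something to compare your own attempt against, here is a minimal sketch of the recursion with the cache keyed on the concatenation of A and B. It is not necessarily the same as the linked solution. The names Solver and min_removals are assumptions made for this sketch, and it assumes the two strings always have the same length (which also makes the concatenated key unambiguous).

```cpp
#include <algorithm>
#include <map>
#include <string>

// Hypothetical names throughout; this is a sketch, not the linked solution.
struct Solver {
  std::map<std::string, int> cache;   // Key: A concatenated with B.

  int min_removals(const std::string &a, const std::string &b)
  {
    if (a.empty()) return 0;          // Base case: nothing left, nothing to remove.

    std::string key = a + b;
    std::map<std::string, int>::iterator it = cache.find(key);
    if (it != cache.end()) return it->second;

    // Choice 1: remove the first character of each string.
    int best = 1 + min_removals(a.substr(1), b.substr(1));

    // Choice 2: match a[0] with b[0].  Count the conflicts (R) and
    // build the subproblem without any a[0]'s or b[0]'s.
    int r = 0;
    std::string na, nb;
    for (size_t i = 0; i < a.size(); i++) {
      bool a_hit = (a[i] == a[0]);
      bool b_hit = (b[i] == b[0]);
      if (a_hit != b_hit) {
        r++;
      } else if (!a_hit) {
        na.push_back(a[i]);
        nb.push_back(b[i]);
      }
    }
    best = std::min(best, r + min_removals(na, nb));

    cache[key] = best;
    return best;
  }
};
```

Declaring a Solver and calling min_removals("AAIAIA", "BCDBEE") on it returns 3, which agrees with example 2.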
If you want to walk through this in detail, the picture below shows the call graph
of example 2. Each node makes two recursive calls, which are represented by edges to
other nodes. When the edges leave the bottom of a node, it's because we are removing the
first characters and recursively calling the procedure on the remaining characters. For
that reason, the edge weights are always one.
When the edge leaves the right of a node, it's because we are matching the first characters,
and then we have to remove any non-matching characters in the remaining strings. The edge
weights are variable now. For the starting node, the edge to its right has
a weight of two, because we have to remove two characters when we match 'A' to 'B'.
For the node "IA EE", the edge to the right goes to the empty string with a weight of one, because
when you assign 'I' to 'E', you have to remove the 'A'.
"But Dr. Plank, is this Dynamic Programming, Topological Sort, or Dijkstra's Algorithm?"
Good question. You'll note that the graph above is a directed acyclic graph, and you are looking
for the shortest path from the starting node to the node with the null string.
So you can solve it in three different ways:
- As a shortest path problem. In fact, I'm guessing that this is truly the most efficient
way to solve the problem, because what you do in this instance is create the graph as you
process Dijkstra's algorithm. You start just with the starting node ("AAIAIA BCDBEE") and
put it onto the Dijkstra map. Then you process the map. When you process a node, you process
both edges, creating the nodes if necessary, and putting them onto the map. In this way,
you don't actually have to create all of the nodes, because Dijkstra's algorithm terminates
when it finds the minimum path.
- As a topological sort. Each time you remove a node,
you know the node's minimum distance
to the starting node. When you remove the last node, you have your answer.
You know, this is really the "step 3" solution to the problem -- removing the recursion.
- As a dynamic program. If you think about it, the dynamic program is like a DFS on an
acyclic graph. That's pretty cool.
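To make the shortest-path view concrete, here is a rough sketch of Dijkstra's algorithm that creates nodes only as it reaches them. The names Node and dijkstra_removals are hypothetical. I'm using a std::set of (distance, node) pairs in place of the "Dijkstra map"; that is a choice made for this sketch, and a multimap would work too. Note that an edge for matching can have a weight of zero, which Dijkstra's algorithm handles fine.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

typedef std::pair<std::string, std::string> Node;   // What remains of (A, B).

// Hypothetical function name.  Assumes a and b have the same length.
int dijkstra_removals(const std::string &a, const std::string &b)
{
  std::map<Node, int> dist;                       // Best known distance to each node.
  std::set< std::pair<int, Node> > frontier;      // Unprocessed nodes, ordered by distance.

  Node start(a, b);
  dist[start] = 0;
  frontier.insert(std::make_pair(0, start));

  while (!frontier.empty()) {
    std::pair<int, Node> top = *frontier.begin();
    frontier.erase(frontier.begin());
    int d = top.first;
    const std::string &x = top.second.first;
    const std::string &y = top.second.second;
    if (x.empty()) return d;                      // Reached the empty string: done.

    Node next[2];
    int weight[2];

    // Edge 0: remove the first characters.  The weight is always one.
    next[0] = Node(x.substr(1), y.substr(1));
    weight[0] = 1;

    // Edge 1: match the first characters.  The weight is the number of conflicts.
    weight[1] = 0;
    for (size_t i = 0; i < x.size(); i++) {
      bool x_hit = (x[i] == x[0]);
      bool y_hit = (y[i] == y[0]);
      if (x_hit != y_hit) {
        weight[1]++;
      } else if (!x_hit) {
        next[1].first.push_back(x[i]);
        next[1].second.push_back(y[i]);
      }
    }

    // Relax both edges, creating the destination nodes if necessary.
    for (int e = 0; e < 2; e++) {
      int nd = d + weight[e];
      std::map<Node, int>::iterator it = dist.find(next[e]);
      if (it == dist.end()) {
        dist[next[e]] = nd;
        frontier.insert(std::make_pair(nd, next[e]));
      } else if (nd < it->second) {
        frontier.erase(std::make_pair(it->second, next[e]));
        it->second = nd;
        frontier.insert(std::make_pair(nd, next[e]));
      }
    }
  }
  return -1;   // Not reached: the empty string is reachable from every node.
}
```

Calling dijkstra_removals("AAIAIA", "BCDBEE") returns 3, the same answer as the dynamic program.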
A final note on using enumeration to solve this problem
Given the constraints, enumeration is a possible way to solve this problem. Think to yourself -- what
kind of enumeration will work?
- Div/mod? No, we're not enumerating fixed-digit numbers in a base or fixed-size words from an alphabet.
- Power set? Always a thought. You could enumerate subsets of A, and for each subset, check to see
if the corresponding subset of B will work, and if so, the answer is (|A|-|subset|). Unfortunately,
though, A can be up to 50 characters, so this is too expensive (2^50).
- Permutations? Hmmm -- the substitution cipher is a permutation of the characters from 'A'
to 'I'. So, we could generate all of those permutations, and for each permutation, determine
how many characters you'd have to delete so that the remaining characters are legal according to the cipher.
How many permutations? factorial(9), which is under 1,000,000. It would work! My guess is that this
is why Topcoder kept the number of legal characters so low.
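Here is a minimal sketch of the permutation enumeration, using std::next_permutation from the standard library. The function name min_removals_by_permutation is hypothetical, and the sketch assumes that both strings have the same length and contain only the characters 'A' through 'I', as the problem guarantees. (That means example 4 above, which uses 'W' through 'Z', is not legal input here.)

```cpp
#include <algorithm>
#include <string>

// Hypothetical function name.  Assumes a and b have the same length and
// use only the characters 'A' through 'I'.
int min_removals_by_permutation(const std::string &a, const std::string &b)
{
  // perm[c - 'A'] plays the role of S[c].  Starting from the sorted string
  // lets next_permutation visit all factorial(9) permutations.
  std::string perm = "ABCDEFGHI";
  int best = (int) a.size();

  do {
    // Every index where the cipher disagrees with B has to be deleted.
    int removed = 0;
    for (size_t i = 0; i < a.size(); i++) {
      if (perm[a[i] - 'A'] != b[i]) removed++;
    }
    best = std::min(best, removed);
  } while (std::next_permutation(perm.begin(), perm.end()));

  return best;
}
```

With at most 50 characters per string, that is roughly 362,880 * 50 comparisons, or about 18 million, which is well within reach.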