Hints for SRM 682, D2, 550-Pointer (TopBiologist)

James S. Plank

Writeup: Mon Feb 29 14:35:24 EST 2016
Problem Statement.

My thoughts went instantly to enumeration here. If you enumerate DNA strings, how many will you have to enumerate until you get one that's not in the sequence?

There are 4 one-letter strings.
There are 16 two-letter strings.
There are 64 three-letter strings.
There are 256 four-letter strings.
There are 1024 five-letter strings.
There are 4096 six-letter strings.

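Since sequence is limited to 2000 characters, it can hold a maximum of 1996 five-letter substrings (2000 - 5 + 1) and 1995 six-letter substrings (2000 - 6 + 1). There are only 1024 five-letter strings, so they might all be present, but there are 4096 six-letter strings, so at least one of those must be missing. So, if you enumerate strings in order of length, you'll probably stop during the five-letter strings, and definitely during the six-letter strings. In the worst case, you test all 1364 strings of one to five letters, plus at most 1996 six-letter strings, so you call find() a few thousand times on a 2000-character string -- that should fall right within topcoder's limits.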
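If you want to check that arithmetic for other limits, here is a small sketch. The function name guaranteedMissLength is my own invention for illustration, not part of the problem. It finds the smallest length k where the number of k-letter strings (4^k) exceeds the number of k-letter substrings that an n-character sequence can hold (n - k + 1).

```cpp
// Hypothetical helper (not part of the problem): returns the smallest
// length k for which 4^k is larger than the number of k-letter
// substrings that a sequence of n characters can hold.  At that length,
// some k-letter string must be missing from the sequence.
int guaranteedMissLength(int n)
{
    long long count = 1;                      // Will hold 4^k.
    for (int k = 1; ; k++) {
        count *= 4;
        // A sequence of n characters has n-k+1 substrings of length k,
        // or none at all when k is bigger than n.
        long long windows = (n >= k) ? (long long) n - k + 1 : 0;
        if (count > windows) return k;
    }
}
```

With n = 2000, the loop passes k = 5 (1024 is not bigger than 1996) and stops at k = 6 (4096 is bigger than 1995), which matches the reasoning above.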

How do you enumerate strings? I'd recommend the following strategy. You keep a vector of all strings that you have found in sequence, and initialize it with "". Then, for each element e in the vector, create four strings by appending each of the four characters (A, C, G and T) to e. Look in sequence for each string as you create it, and if it's not there, you're done. If it is, then append it to the vector, and keep going. Since the vector holds its strings in order of length, the first string that you fail to find is one of the shortest missing strings.
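Here is a minimal sketch of that strategy. I've written it as a standalone function rather than as the topcoder class and method, and the function name and signature are my assumptions, so adapt them to what the problem statement requires.

```cpp
#include <string>
#include <vector>

// Sketch of the enumeration strategy.  The name and signature are
// assumptions for illustration; check them against the problem statement.
std::string findShortestNewSequence(const std::string &sequence)
{
    const std::string letters = "ACGT";
    std::vector<std::string> found;   // Strings known to be in sequence.
    found.push_back("");              // The empty string is in every sequence.

    // The vector grows while we traverse it, so use an index, not an iterator.
    for (std::size_t i = 0; i < found.size(); i++) {
        for (char c : letters) {
            std::string candidate = found[i] + c;
            if (sequence.find(candidate) == std::string::npos) {
                return candidate;     // The first miss is a shortest missing string.
            }
            found.push_back(candidate);
        }
    }
    return "";   // Never reached, because a finite sequence always misses some string.
}
```

For example, calling it on "ACGT" finds all four one-letter strings, then fails on "AA", so it returns "AA".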

Can you do this more efficiently when your return value is five or six letters? Yes -- for each length l, starting at one and incrementing, use substr() to grab each substring of sequence with l characters, and insert it into a set. You can now look for the enumerated strings of length l in the set rather than by using find() on the string. You don't have to do this, but it would improve upon the running time.
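Here is one way that the set-based version might look. Again, this is a sketch, and the function name is hypothetical. It works one length at a time: it builds the set of substrings of that length, and then tests each candidate of that length against the set.

```cpp
#include <set>
#include <string>
#include <vector>

// Sketch of the set-based variant.  The name is an assumption for
// illustration only.
std::string findShortestNewSequenceWithSet(const std::string &sequence)
{
    const std::string letters = "ACGT";

    // Strings of length len-1 that are in sequence.  Start with "".
    std::vector<std::string> previous(1, "");

    for (std::size_t len = 1; ; len++) {
        // Put every substring of sequence with exactly len characters into a set.
        std::set<std::string> present;
        for (std::size_t i = 0; i + len <= sequence.size(); i++) {
            present.insert(sequence.substr(i, len));
        }

        // Extend each string from the previous round by one letter,
        // and look it up in the set instead of calling find().
        std::vector<std::string> current;
        for (const std::string &p : previous) {
            for (char c : letters) {
                std::string candidate = p + c;
                if (present.count(candidate) == 0) return candidate;
                current.push_back(candidate);
            }
        }
        previous.swap(current);
    }
}
```

The loop always terminates: once len is bigger than the size of sequence, the set is empty, so the first candidate is returned. It tests candidates in the same order as the find() version, so it should return the same string.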