SRM 682, D2, 550-Pointer (TopBiologist)

James S. Plank

Thu Sep 13 17:18:10 EDT 2018. Last change: Wed Sep 16 00:27:27 EDT 2020

In case Topcoder's servers are down

Here is a summary of the problem:

Examples

# Sequence              A correct answer
- --------------------  -----------------
0 "AGGTCTA"             "AC"  ("AA", "GA" and "TG" will work too, as will others)
1 "AGACGACGGAGAACGA"    "T"
2 "A"                   "C"
3 "AAGATACACCGGCTTCGTG" "CAT"

Testing yourself

Testing yourself is a little different, because you can give multiple correct answers to a problem. So, I've written a shell script check_answers.sh. You give it the name of the correct answers.txt file, and the name of the file containing the output of your tests.sh. It then checks to make sure that your answers are indeed legal and minimal:
UNIX> g++ main.cpp
UNIX> sh tests.sh > tmp.txt
UNIX> sh check_answers.sh answers.txt tmp.txt
All good
UNIX> 

Hints and discussion

This is an enumeration problem. You are wanting to enumerate all l-length strings composed of the characters 'A', 'C', 'G' and 'T', where l starts with one, and increments by one after each enumeration. During the enumeration, if you generate a string which is not in sequence, then return that string.

In the parlance of the CS302 lecture on enumeration, this is a "div/mod" enumeration where n equals 4, and l starts at one.

Now, let's analyze how many strings we will be enumerating:

Since sequence is limited to 2000 characters, it can hold a maximum of 1995 different six-letter strings, so you're guaranteed to finish the problem before you finish enumerating the six-letter strings.

My Solution, #1

This is about as straightforward as you can get with a div/mod enumeration (in Solution-1.cpp):

#include <string>
#include <iostream>
#include <cstdio>
#include <cstdlib>
using namespace std;

class TopBiologist {
  public:
    string findShortestNewSequence(string sequence);
};

string TopBiologist::findShortestNewSequence(string sequence)
{
  int l, n;
  string letters = "ACGT";
  int i, j, k, top;
  string totest;

  n = letters.size();

  /* Keep enumerating this for successively larger strings. */

  for (l = 1; true;= 6; l++) {

    /* Calculate n^l */

    top = 1;
    for (i = 0; i < l; i++) top *= n;

     /* Enumerate each string, and see if it is in sequence.  If not, return it. */

    for (i = 0; i < top; i++) {
      j = i;
      totest.clear();
      for (k = 0; k < l; k++) {
        totest.push_back(letters[j%n]);
        j /= n;
      }
      if (sequence.find(totest) == string::npos) return totest;
    }
  }
  return "";
}


Running times and other solutions

Does any part of this make you uneasy? The sequence.find(totest) is really inefficient. In particular, if the length of the answer is l, and the length of the string is n, then the running time of this program is:

O(4ln).

It's fast enough for Topcoder, but you can make it better:

My Solution, #2

Before each of the enumerations, make an unordered_set all of the distinct l-letter substrings of sequence, and then use the find() method of the unordered set to look for totest. That turns the enumeration to:

O(n + 4l).

The O(n) term is the time to create the set, and the O(4l) term is doing the enumeration and looking up each word in the set.

Interestingly, when I time this version vs the first one, it is slower. So much for big-O. I'm not going to probe into it.

My Solution, #3

Instead of enumerating the strings, create a vector whose length is the number of strings in the enumeration. The vector elements can be either 0 or 1 (so you could use a vector of ints and use their bits if you really wanted). Start with all of them equal to zero. Each element of the vector represents a string, but we don't store the strings.

Instead, enumerate all of the substrings of size l in sequence. For each substring, turn it into an index into the vector, and set that vector element to one.

At the end, if any vector's element is 0, then you've discovered a string that is not a substring of sequence. If no vector's element is 0, then you need to increment l and try again.

Let's do an example, because this might be confusing. We'll do example 1, where sequence is "AGGTCTA". We start with l = 1. There are four strings with length one, and when we generated them above with the enumeration, we had:

i  string
-  ------
0    A
1    C
2    G
3    T
So, we'll create a four-element vector v and set each element to 0. We then go through "AGGTCTA", and for each one-letter substring, we'll calculate the index of v and set it to one:
substring    index in v       v after setting it to one
---------    ----------       -------------------------
    A            0                 { 1, 0, 0, 0 }
    G            2                 { 1, 0, 1, 0 }
    G            2                 { 1, 0, 1, 0 }
    T            3                 { 1, 0, 1, 1 }
    C            1                 { 1, 1, 1, 1 }
    T            3                 { 1, 1, 1, 1 }
    A            0                 { 1, 1, 1, 1 }
You'll note, there are no 0's in v, so we need to try l = 2. There are 16 strings with length two -- here's how we would generate them with an enumeration (again, we don't do it here. I'm just showing you what they would be):
     i:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
string: AA CA GA TA AC CC GC TC AG CG GG TG AT CT GT TT
We create v as a vector of 16 ones, and go through "AGGTCTA", and for each two-letter substrings, we'll calculate the index of v and set it to one:
substring    index in v       v after setting it to one
---------    ----------       -------------------------
    AG           8                 { 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 }
    GG          10                 { 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0 }
    GT          14                 { 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0 }
    TC           7                 { 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0 }
    CT          13                 { 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0 }
    TA           3                 { 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0 }
Now, there are a bunch of entries of v whose values are zero. Any of them will correspond to a substring not in sequence.

The running time of this one is O(4l + nl). The resizing of v is what takes O(4l). Running through the substrings is O(nl), because there are O(n) of them, and calculating their index takes O(l) time.

You can get rid of that last factor of l to make the overall running time O(4l + n). The way to do that it to realize that if the index of the substring starting at character i is index, then the index of the substring starting at character i+1 is:

(index * 4) % 4l + s[index+l]

I'm going to let you ponder on that for a little bit...

When you time this on tests.sh, it is more than twice as fast as the first solution!