Challenge 07: Repeated DNA sequences

Problem overview

DNA sequences are composed of chemicals abbreviated as 'A', 'C', 'G', and 'T'. There are many cases where repeated sequences within a DNA sequence are interesting. For example, certain repetitive elements called retrotransposons (similar to retroviruses) are flanked by long terminal repeats (LTRs). Similar sequences in the human genome control how mammalian placentas develop and are therefore essential for successful pregnancy.

A DNA sequence can be a very simple string or a complex one. For example, the following is a DNA sequence:

AAAAAAAAAAC

Given a string s that represents a specific DNA sequence, return all the 9-letter-long sequences (substrings) that occur more than once in the DNA molecule. Occurrences may overlap. You must return these repeats in lexicographical order, e.g., AAAAAAAAA comes before CCCCCCCCC.

Inspiration

This is inspired by LeetCode problem #187 and discussions with your TAs.

If you take an advanced class in bioinformatics, you will learn that a special data structure enables locating all repeats of arbitrary length in linear time, i.e., O(n). This data structure is fundamental to all genome sequencing efforts that will enable more personalized medicine.

Input / Output

You will be given a series of DNA strings from standard input in the following format:

DNA1
DNA2
DNA3

Each DNA string appears on its own line and may be of arbitrary length.
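
As a minimal sketch of reading this input format, the helper below collects one DNA string per non-blank line. The name read_sequences is hypothetical (it is not part of the assignment), and the code assumes each sequence contains no internal whitespace. It tolerates trailing spaces and blank lines.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the assignment): returns one DNA string
// per non-blank input line, ignoring surrounding whitespace.
std::vector<std::string> read_sequences(std::istream& in) {
    std::vector<std::string> sequences;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string dna;
        if (fields >> dna) {        // fails on blank lines, skips stray spaces
            assert(!dna.empty());   // a successful extraction is never empty
            sequences.push_back(dna);
        }
    }
    return sequences;
}
```

In a complete program you would call read_sequences(std::cin) and then process each returned string in turn.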

Example

You are to find repeated substrings. Given the following input:

AAAAACCCCCAAAAACCCCCCAAAAAGGGTT
AAAAAAAAAAA

Your program should print the repeats found in each string, one per line, and follow each string's repeats with a -1 on its own line. So you should output the following:

AAAAACCCC
AAAACCCCC
CCCCAAAAA
CCCCCAAAA
-1
AAAAAAAAA
-1

Requirements

On top of being interview prep and related to my research area, this specific assignment is intended to both review and assess your understanding of STL containers. Your solution must do the following:

  1. Enumerate all 9-character substrings in each input DNA string
  2. Insert these substrings into an STL container of your choice
  3. (optional) Apply an algorithm on the STL container
  4. Iterate through all of the elements of the container to generate the correct/desired output
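
The steps above can be sketched with one possible container choice, std::map, which keeps its keys in sorted order. This is only an illustration under that assumption, not the required or reference solution; the function name repeated_kmers and the parameter k are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: returns every length-k substring of dna that occurs
// more than once, in lexicographical order.
std::vector<std::string> repeated_kmers(const std::string& dna, std::size_t k = 9) {
    assert(k > 0);
    std::map<std::string, int> counts;  // keys are kept in sorted order

    // Steps 1 and 2: enumerate each substring and insert it into the map.
    // The condition i + k <= size() also handles strings shorter than k.
    for (std::size_t i = 0; i + k <= dna.size(); ++i) {
        ++counts[dna.substr(i, k)];
    }

    // Step 4: walk the map in key order and keep substrings seen twice or more.
    std::vector<std::string> repeats;
    for (const auto& entry : counts) {
        if (entry.second > 1) {
            repeats.push_back(entry.first);
        }
    }
    return repeats;
}
```

Each map insertion costs O(log n) comparisons of fixed-length keys, so processing a string of length n takes O(n log n) overall. To produce the expected output, print each returned string on its own line and then print -1.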

Hints (same as before)

  1. You may wish to read the DNA sequences in as std::strings.

  2. Different solutions are inspired by different challenges (and projects) you completed for CS302.


Rubric

We will grade your submission relative to the rubric below.

+2    Code is well formatted, commented (inc. name, assignment, and overview), with reasonable variable names
+4    Follows the rules, i.e., the solution should have an average run-time complexity of O(n log n), where n is the length of the DNA string
+14   Test cases successfully solved (2 points each)
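
To illustrate what an O(n log n) approach can look like, here is a sketch that uses a std::vector with std::sort (the optional "apply an algorithm" step) instead of an ordered container. It is one possibility among several, and the name repeated_kmers_sorted is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: sorts all length-k substrings, then reports each value
// that appears in two or more adjacent positions.
std::vector<std::string> repeated_kmers_sorted(const std::string& dna, std::size_t k = 9) {
    assert(k > 0);
    std::vector<std::string> kmers;
    for (std::size_t i = 0; i + k <= dna.size(); ++i) {
        kmers.push_back(dna.substr(i, k));
    }

    // O(n log n) comparisons; each comparison looks at no more than k characters.
    std::sort(kmers.begin(), kmers.end());

    std::vector<std::string> repeats;
    for (std::size_t i = 1; i < kmers.size(); ++i) {
        // Equal neighbors mean a repeat; report each repeated value only once.
        if (kmers[i] == kmers[i - 1] &&
            (repeats.empty() || repeats.back() != kmers[i])) {
            repeats.push_back(kmers[i]);
        }
    }
    return repeats;
}
```

Because the vector is sorted before the scan, the repeats come out already in lexicographical order.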

Testing your code prior to submission

To facilitate testing, you were previously asked to clone the course GitHub repository as follows:

git clone https://github.com/semrich/cs302-23.git cs302

For this assignment, update this clone by using the following:

git pull

We'll discuss this in class, but note that your program must be named "solution.cpp" and must compile using make. To test your solution against ours, type:

make test