DNA sequences are composed of chemicals abbreviated as 'A', 'C', 'G', and 'T'. There are many examples where repeated sequences within a DNA sequence are interesting. For example, certain repetitive elements called retrotransposons (similar to retroviruses) are flanked by long terminal repeats (LTRs). Similar sequences in the human genome control how mammalian placentas develop and therefore are essential for successful pregnancy.
DNA sequences can range from very simple strings to complex ones. For example, the following is a DNA sequence:
AAAAAAAAAAC
Given a string s that represents a specific DNA sequence, return all the 9-letter-long sequences (substrings) that occur more than once in s. You must return these repeats in lexicographical order, i.e., AAAAAAAAA comes before CCCCCCCCC.
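As a rough illustration of the task, the sketch below counts every 9-letter window in a std::map and keeps those seen more than once. The function name findRepeats and its parameter k are hypothetical, not part of any course starter code, and this is only one possible approach rather than the reference solution.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (name and signature are assumptions, not course code):
// returns every length-k substring of dna that occurs more than once,
// in lexicographical order.
std::vector<std::string> findRepeats(const std::string& dna, std::size_t k = 9) {
    std::vector<std::string> repeats;
    if (dna.size() < k) return repeats;  // too short to contain any window

    // std::map keeps its keys sorted, so iterating it yields
    // lexicographical order without a separate sort.
    std::map<std::string, int> counts;
    for (std::size_t i = 0; i + k <= dna.size(); ++i) {
        ++counts[dna.substr(i, k)];
    }

    for (const auto& entry : counts) {
        if (entry.second > 1) repeats.push_back(entry.first);
    }
    return repeats;
}
```

Because the window length is fixed at 9, each map insertion costs O(log n) comparisons of constant-length keys, so the whole pass should run in roughly O(n log n) for a string of length n.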
This is inspired by LeetCode problem #187 and discussions with your TAs.
If you take an advanced class in bioinformatics, you will learn that a special data structure enables locating all repeats of arbitrary length in linear time (aka O(n)). This data structure is fundamental to all genome sequencing efforts that will enable more personalized medicine.
You will be given a series of DNA strings from standard input in the following format:
DNA1 DNA2 DNA3
Each DNA string will be of arbitrary length.
You are to find repeated substrings. Given the following input:
AAAAACCCCCAAAAACCCCCCAAAAAGGGTT AAAAAAAAAAA
Your program should output the repeats found in each string and follow each string's repeats with a -1. So you should output the following:
AAAAACCCC AAAACCCCC CCCCAAAAA CCCCCAAAA -1 AAAAAAAAA -1
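The input and output format above could be handled along the following lines. This is a minimal sketch: the function name processSequences is hypothetical, and the exact separators (a space after each repeat, a newline after each -1) are an assumption that should be checked against the expected output used by make test.

```cpp
#include <cstddef>
#include <iostream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical driver (name and output separators are assumptions):
// reads whitespace-separated DNA strings from `in` and writes each
// string's repeated 9-letter substrings, followed by -1, to `out`.
void processSequences(std::istream& in, std::ostream& out) {
    const std::size_t k = 9;
    std::string dna;
    while (in >> dna) {
        std::set<std::string> seen;      // windows encountered so far
        std::set<std::string> repeated;  // windows seen at least twice, kept sorted
        for (std::size_t i = 0; i + k <= dna.size(); ++i) {
            std::string window = dna.substr(i, k);
            // insert() reports via .second whether the key was newly added;
            // false means this window has appeared before.
            if (!seen.insert(window).second) repeated.insert(window);
        }
        for (const auto& r : repeated) out << r << ' ';
        out << "-1\n";
    }
}
```

A main function would simply call processSequences(std::cin, std::cout). Taking the streams as parameters also makes it easy to exercise the routine with std::istringstream and std::ostringstream.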
On top of being interview prep and related to my research area, this specific assignment is intended to both review and assess your understanding of STL containers. Your solution must do the following:
You may wish to read the DNA sequences in as std::strings.
Different solutions are inspired by different challenges (and projects) you completed for CS302.
We will grade your submission relative to the rubric below.
+2 Code is well formatted, commented (inc. name, assignment, and overview), with reasonable variable names
+4 follows the rules, i.e., solution should have an average run time complexity of O(n log n) where n is the size of the DNA string
+14 Test cases successfully solved (2 points each)
To facilitate testing, you were previously asked to clone the course GitHub repository as follows:
git clone https://github.com/semrich/cs302-23.git cs302
For this assignment, update this clone by using the following:
git pull
We'll discuss this in class, but note that your program must be named "solution.cpp" and must compile using make. To test your solution against ours, type:
make test