DNA Sequence
This question is based on real problems in text manipulation and bioinformatics. A DNA molecule is constructed from a sequence of four nucleic acids, noted as {A,C,G,T}. These encode all the information and genes stored.
In our context, the DNA is represented as a string sequence of the four letters and a gene is simply a substring in the DNA. Given a DNA sequence dnaSequence, and a gene geneSequence, create a function numOfGeneOccurences that calculates the number of times the gene geneSequence occurs in dnaSequence.
Examples
input: dnaSequence = "CATGCATGCATG", geneSequence = "TGC" output: 2 input: dnaSequence = "GTTTTAGAAAAG", geneSequence = "TT" output: 3 input: dnaSequence = "GTAGAATCATGCA", geneSequence = "GTGTGT" output: 0
Generally, interviewers prefer a finished-if-unoptimal answer vs. an unfinished, optimal answer, as long as you can explain how you'd optimize with more time. If you're getting bogged down in the details of a sophisticated answer, take a step back and write a slower naive solution, then optimize
Many calculations in the naive approach are redundant. Can you preprocess to speed things up?
Many efficient algorithms today are based on Finite State Machines, such as regular expressions. Can you implement something similar here?
This is not an easy or quick question! You might take 45 minutes to solve it. If you're running out of time, you may implement a simpler solution that runs asymptotically slower, if you explain how you'd optimize further.
The most optimal solutions are functions that run in O(N+M) runtime, where N and M are the lengths of the dnaSequence and geneSequence, respectively. Were you able to hit these targets?
Were you tempted to use a hash function for the gene, hashing every subsequence of the DNA, and checking whether the hashes are the same? Can you see why this would become problematic for large genes? Explain.
Our solution is based on using a mechanism called a finite state machine, but to demonstrate how it is better, we begin by showing the naive approach; Iterating through the text, and checking every candidate substring:
function numOfGeneOccurnces(dnaSequence, geneSequence): counter = 0 geneSeqLen = geneSequence.length dnaSeqLen = dnaSequence.length lenDiff = dnaSeqLen - geneSeqLen for i from 0 to lenDiff: if (dnaSequence.substring(i, i + geneSeqLen) == geneSequence): counter++ return counter
Since we iterate through O(N) indexes (in the worst case) and compare two strings of length M for every iteration, the time complexity of the naive solution is O(N∙M). We can do better than this.
A different approach to this is preprocessing the gene sequence, into a state machine, and reading through the DNA sequence linearly.
A finite state machine is a computational model that contains a number of predefined states - which symbolizes what letters were seen so far. The machine simulates an iterator going through the input, one letter at a time, and on each letter, it switches the machine state according to predefined rules. A state that indicates the last few letters were in fact the gene, is called an Accepting State, and when the machine switches to that state, the counter for the occurrences of the gene is incremented.
In our specific algorithm, the states in our machine are the Mprefixes in the gene sequence, and the switching rule is determined by checking what is the maximum overlap between the state + next letter,
and the gene.
For example, a state machine that checks for the gene 'AAT' is built as follows:

When going through the DNA sequence ‘AAGAAATG’, the states are switched as follows (the underscore indicates which letter is checked at every step):
Step 0: _AAGAAATG; state: s0
Step 1: AAGAAATG; state: sA
Step 2: AAGAAATG; state: sAA
Step 3: AAGAAATG; state: s0
Step 4: AAGAAATG; state: sA
Step 5: AAGAAATG; state: sAA
Step 6: AAGAAATG; state: sAA
Step 6: AAGAAATG; state: AAAT - an accepting state. Counter is incremented.
Step 7: AAGAAATG; state: s0
All state machines work in linear time to the texts length, so if the
function simulates a state machine before running through the text, the
iteration itself is done in O(N) asymptotic runtime.
A basic subroutine for building the machine:
# Input: str1 - string of length l1, and str2 - string of length l2. # Output: the length of the maximal suffix of str2 which is a prefix of str1. # Example: getMaxOverlapLength(‘AACG’, ‘GCGCAA’) = 2. # since ‘AA’ is a prefix in str1 - ’AACG’, and a suffix in ‘GCGCAA’ - str2 # (It isn’t 3 or more since there suffixes of length 3 or longer in # ‘‘GCGCAA’’ are not prefixes in ‘AACG’). function getMaxOverlapLength(str1, str2): str2Len = str2.length for i from str2Len to 1: if (str1.substring(1, i) == str2.substring(str2Len - i, str2Len)): return i return 0 # Input: a gene sequence of length m. # Output: an (m+1)×{A,C,G,T} array representing the states, # and switching rules. e.g. If the machine is in state 1, # and sees ‘G’, the next state is in array[1][G]. function buildStateMachine(geneSequence): letters = ['A', 'C', 'G' ,'T'] machine = Array(geneSequence.length+1, 4) for i from 0 to geneSequence.length: for j from 0 to 3: current = gene.substeing(1, i) + letters[j] machine[i][j] = getMaxOverlapLength(geneSequence, current)
Pseudocode for the iteration on the DNA sequence:
# Input: dnaSequnce of length n, geneSequence of length m. # Output: number of times the geneSequence occurs in dnaSequnce. function numOfGeneOccurences(dnaSequence, geneSequence): machine = buildStateMachine(geneSequence) state = 0 for i from 0 to dnaSequence.length: char letter = dna[i] state = machine[state][letter] if (state == geneSequence.length+1): counter++ return counter
Complexity of FSM solution: As we can see, the function numOfGeneOccurences itself runs in O(N), since each iteration in the for loop takes constant time. The preprocessing function runs m loops, that each take up to O(M2) time, so the summarized time is O(M3). Overall, the whole process takes O(N+M3) time, using O(M) space for the machine.
We could improve O(M3) by using Dynamic Programming methods. The algorithm presented below is called Knuth-Morris-Pratt algorithm. It is based on the fact that if str1.substring(1, i) == str2.substring(str2.length - i, str2.length), then in our previous calculation, we already concluded that str1.substring(1, (i-1)) == str2.substring(str2.length - (i-1)), str2.length). Because of this, the overlapping function needs to checks overlapping from the index of the last overlap.
# Input: str1 - string for prefix, str2 - string for suffix, # machine - the array which was calculated in previous steps., # Output: the length of the maximal suffix of str2 which is a prefix of str1. function getMaxOverlapLength(str1, str2, machine): str1Len = str1.length str2Len = str2.length char lastLetter = str1[str1Len - 1] for i from (str1Len - 1) to 0: if (str1.substring(1, i) == str2.substring(str2Len - i, str2Len)): return i return 0
Complexity of KMP solution: Although in the worst case, the function still checks k = O(M) times when called, since long iterations backwards are initiated only after the same amount k calls where the function succeeded in the first time, then the average, amortized time for this function is still O(1), Which gives us a preprocessing function with O(M) asymptotic time complexity.
Currently, there's no option to write Python code for the solution to this problem. Is it possible to add it?