Skip to main content

DNA Sequence

HardPremium

This question is based on real problems in text manipulation and bioinformatics. A DNA molecule is constructed from a sequence of four nucleic acids, noted as {A,C,G,T}. These encode all the information and genes stored.

In our context, the DNA is represented as a string sequence of the four letters and a gene is simply a substring in the DNA. Given a DNA sequence dnaSequence, and a gene geneSequence, create a function numOfGeneOccurences that calculates the number of times the gene geneSequence occurs in dnaSequence.

Examples

input: dnaSequence = "CATGCATGCATG", geneSequence = "TGC" output: 2 input: dnaSequence = "GTTTTAGAAAAG", geneSequence = "TT" output: 3 input: dnaSequence = "GTAGAATCATGCA", geneSequence = "GTGTGT" output: 0

Generally, interviewers prefer a finished-if-unoptimal answer vs. an unfinished, optimal answer, as long as you can explain how you'd optimize with more time. If you're getting bogged down in the details of a sophisticated answer, take a step back and write a slower naive solution, then optimize

Many calculations in the naive approach are redundant. Can you preprocess to speed things up?

Many efficient algorithms today are based on Finite State Machines, such as regular expressions. Can you implement something similar here?

This is not an easy or quick question! You might take 45 minutes to solve it. If you're running out of time, you may implement a simpler solution that runs asymptotically slower, if you explain how you'd optimize further.

The most optimal solutions are functions that run in O(N+M) runtime, where N and M are the lengths of the dnaSequence and geneSequence, respectively. Were you able to hit these targets?

Were you tempted to use a hash function for the gene, hashing every subsequence of the DNA, and checking whether the hashes are the same? Can you see why this would become problematic for large genes? Explain.