Question: High score DNA motif search
0
8 months ago by
svetlana.lockwood20 wrote:

I have a DNA sequence (of a chromosome) where each nucleotide has a score. For example (na indicates missing scores):

``````T  C  G   A  T  T  A  G  G  A  T  C
0  1  10  20 5  na na 0  5  2  0  0
``````

My goal is to select all subsequences with high energy scores. No any other a priori information is known except that subsequences are not expected to be long, approximately 5-15 nucleotides. Could someone suggest some ideas how to approach this?

Edit 1: The presented sequence of 12 positions is a snippet of a much longer nucleotide sequence, ~230K positions.

Edit 2: Answer found in "A linear time algorithm for finding all maximal scoring subsequences." Ruzzo-Tompa algorithm.

python dna sequence R genome • 372 views
modified 8 months ago • written 8 months ago by svetlana.lockwood20

My first question would be: high relative to what? For instance, in your example, is 5 or 10 high?

That's one of the things which is not defined. That's why I'm looking for ideas/pointers. One thing is definite - there is some noise in the measurements. For example, score '1' or '2' could be a noise but noise level is not known.

If you can just read the 2 lines in as 2 arrays, I would first get some ballpark figures. If you can get the usual 5 part summaries, you can set some estimates about what are going to be reasonable thresholds.

I'd treat the whole thing as 2 arrays, or better yet probably a dict, and loop over it finding which index positions correspond to a particular value. You'd get sequences of positions which are, say > 10, or maybe >= the mean of the data, and the key of the key-value pair of the dict could be returned to get the sequences.

How do you expect to deal with `na` values? Do you intend to ignore/skip/impute them?

For now I treat them the same as 0 to downplay the importance of those positions but I wonder if there are better ways to deal with them.

0
8 months ago by
Alex Reynolds23k
Seattle, WA USA
Alex Reynolds23k wrote:

What you might to be looking for are solutions to the maximum subarray problem for subarrays between 5 and 15 elements in length. It's not really clear what data you're working with, however, and this doesn't make much sense for your 12-base sequence. Maybe you're just showing a small snippet of data. It's not really clear from your question.

In any case, you need to decide how you deal with `na` values. If `0` is a score with specific meaning, then applying `0` to `na` values may not make sense. You might impute the value by taking the mean of values over nearby windows, or replace them with mean or median values over background. Or you might skip over such regions entirely. This really depends on your score data and its meaning.

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.