Question: High score DNA motif search
0
gravatar for svetlana.lockwood
3 months ago by
svetlana.lockwood20 wrote:

I have a DNA sequence (of a chromosome) where each nucleotide has a score. For example (na indicates missing scores):

T  C  G   A  T  T  A  G  G  A  T  C
0  1  10  20 5  na na 0  5  2  0  0

My goal is to select all subsequences with high energy scores. No any other a priori information is known except that subsequences are not expected to be long, approximately 5-15 nucleotides. Could someone suggest some ideas how to approach this?

Edit 1: The presented sequence of 12 positions is a snippet of a much longer nucleotide sequence, ~230K positions.

Edit 2: Answer found in "A linear time algorithm for finding all maximal scoring subsequences." Ruzzo-Tompa algorithm.

python dna sequence R genome • 239 views
ADD COMMENTlink modified 3 months ago • written 3 months ago by svetlana.lockwood20

My first question would be: high relative to what? For instance, in your example, is 5 or 10 high?

ADD REPLYlink written 3 months ago by jrj.healey2.6k

That's one of the things which is not defined. That's why I'm looking for ideas/pointers. One thing is definite - there is some noise in the measurements. For example, score '1' or '2' could be a noise but noise level is not known.

ADD REPLYlink written 3 months ago by svetlana.lockwood20

If you can just read the 2 lines in as 2 arrays, I would first get some ballpark figures. If you can get the usual 5 part summaries, you can set some estimates about what are going to be reasonable thresholds.

I'd treat the whole thing as 2 arrays, or better yet probably a dict, and loop over it finding which index positions correspond to a particular value. You'd get sequences of positions which are, say > 10, or maybe >= the mean of the data, and the key of the key-value pair of the dict could be returned to get the sequences.

ADD REPLYlink written 3 months ago by jrj.healey2.6k

How do you expect to deal with na values? Do you intend to ignore/skip/impute them?

ADD REPLYlink modified 3 months ago • written 3 months ago by Alex Reynolds21k

For now I treat them the same as 0 to downplay the importance of those positions but I wonder if there are better ways to deal with them.

ADD REPLYlink written 3 months ago by svetlana.lockwood20
0
gravatar for Alex Reynolds
3 months ago by
Alex Reynolds21k
Seattle, WA USA
Alex Reynolds21k wrote:

What you might to be looking for are solutions to the maximum subarray problem for subarrays between 5 and 15 elements in length. It's not really clear what data you're working with, however, and this doesn't make much sense for your 12-base sequence. Maybe you're just showing a small snippet of data. It's not really clear from your question.

In any case, you need to decide how you deal with na values. If 0 is a score with specific meaning, then applying 0 to na values may not make sense. You might impute the value by taking the mean of values over nearby windows, or replace them with mean or median values over background. Or you might skip over such regions entirely. This really depends on your score data and its meaning.

ADD COMMENTlink modified 3 months ago • written 3 months ago by Alex Reynolds21k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 967 users visited in the last hour