Find highly similar regions of specific lengths to a query in a genome
0
0
Entering edit mode
7.4 years ago
Jenez ▴ 540

I have a collection of gene sequences, for each I would like to identify any highly similar 150bp regions within a specific assembled genome.

I have initially tried megablast, where I blasted each query gene against the genome and looked for regions with a sequence identity > 85%. However, it became apparent that even if this worked for most genes, it may sometimes prove that an aligned region of 200bp may have an identity of 84% to my query gene, whereas in reality had the alignment been shortened by 40bp, the identity would rise above 85%.

Tweaking the blast parameters such as gap penalties and the like is not a perfect solution, and searching biostars for this particular problem proved difficult.

Do you have any suggestions on how to proceed?

Thank you...

blast genome alignment • 1.8k views
ADD COMMENT
0
Entering edit mode

I want to make sure I understand the question: So you would like to take a query sequence longer than 150bp, and then try to find high percent identity matches to any subsequence 150bp long? So, then, there would be 51 possible subsequences of length 150 in a query that is 200 bp long.

ADD REPLY
0
Entering edit mode

I suppose I should go one step further. If that is the case, would querying against all possible subsequences solve the problem? Two short PERL or Python scripts would do the trick. One to make the new queries with informative query names, and then one to summarize all the results to find matches meeting the criteria. Sometimes, we ignore the ends of shotgun reads. You could trim off 25 bases from the front and back. That's sloppy, but fast.

ADD REPLY
0
Entering edit mode

You've summarized it accurately.

I did consider writing a python script for this comparison, but I fear that it will run incredibly slow. I might just time a subset and estimate how long it will run for.

Thank you for your input

ADD REPLY
0
Entering edit mode

In my opinion blastn is an ideal choice for this purpose. Have you considered bit score? It's gives fair idea of alignment length and sequence identity.

ADD REPLY
0
Entering edit mode

What I fear might happen is as I expressed in the question; that the settings can not be tweaked enough to ensure that all 150bp regions of all genes are checked against the query's 150bp regions put in.

'Fair idea' is unfortunately not good enough in this case.

Thank you for your input

ADD REPLY

Login before adding your answer.

Traffic: 3007 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6