Question: Find highly similar regions of specific lengths to a query in a genome
gravatar for Jenez
4.1 years ago by
Jenez520 wrote:

I have a collection of gene sequences, for each I would like to identify any highly similar 150bp regions within a specific assembled genome.

I have initially tried megablast, where I blasted each query gene against the genome and looked for regions with a sequence identity > 85%. However, it became apparent that even if this worked for most genes, it may sometimes prove that an aligned region of 200bp may have an identity of 84% to my query gene, whereas in reality had the alignment been shortened by 40bp, the identity would rise above 85%.

Tweaking the blast parameters such as gap penalties and the like is not a perfect solution, and searching biostars for this particular problem proved difficult.

Do you have any suggestions on how to proceed?

Thank you...

blast alignment genome • 1.0k views
ADD COMMENTlink written 4.1 years ago by Jenez520

I want to make sure I understand the question: So you would like to take a query sequence longer than 150bp, and then try to find high percent identity matches to any subsequence 150bp long? So, then, there would be 51 possible subsequences of length 150 in a query that is 200 bp long.

ADD REPLYlink written 4.1 years ago by jeremy.cox.290

I suppose I should go one step further. If that is the case, would querying against all possible subsequences solve the problem? Two short PERL or Python scripts would do the trick. One to make the new queries with informative query names, and then one to summarize all the results to find matches meeting the criteria. Sometimes, we ignore the ends of shotgun reads. You could trim off 25 bases from the front and back. That's sloppy, but fast.

ADD REPLYlink written 4.1 years ago by jeremy.cox.290

You've summarized it accurately.

I did consider writing a python script for this comparison, but I fear that it will run incredibly slow. I might just time a subset and estimate how long it will run for.

Thank you for your input

ADD REPLYlink written 4.1 years ago by Jenez520

In my opinion blastn is an ideal choice for this purpose. Have you considered bit score? It's gives fair idea of alignment length and sequence identity.

ADD REPLYlink written 4.1 years ago by ramesh.8v210

What I fear might happen is as I expressed in the question; that the settings can not be tweaked enough to ensure that all 150bp regions of all genes are checked against the query's 150bp regions put in.

'Fair idea' is unfortunately not good enough in this case.

Thank you for your input

ADD REPLYlink written 4.1 years ago by Jenez520
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2167 users visited in the last hour