Question

Percent identity in Blast

2.4 years ago

pooryamb • 0

Dear All,

In my project, I want to find genes that share a stretch of at least 100 amino acids, at more than 90 percent identity, with genes in my database. Can I run BLAST with the -perc_identity option set as high as I want? I am concerned about cases where a query and a hit align over a stretch of more than 500 amino acids with a percent identity near 60 percent, yet a shorter sub-alignment of the same proteins has a high enough percent identity (for example, alignment length = 150 amino acids, percent identity > 90). In such a case, even though the highest-scoring alignment (500 amino acids, 60 percent identity) is not of interest to me, it contains a sub-alignment that is exactly what I am looking for. So my question is: in this hypothetical case, if I set -perc_identity to 90, will BLAST report the hit? Or will it miss the hit because the percent identity of the highest-scoring alignment is below 90?

If BLAST is not suited to my application, can you suggest an alternative?

Best wishes,

blast • 2.4k views

ADD COMMENT • link 2.4 years ago by pooryamb • 0

BLAST returns the local alignments that maximize the alignment score (and therefore have the lowest E-values); you can't force it to align all 100 amino acids of your query sequence in a global manner. So the answer is yes, it will return that high-scoring local alignment (what you call a "subalignment") because that's what BLAST is for.
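Since each local alignment is reported as its own row in tabular output, one way to apply both criteria from the question is to run the search without an identity cutoff and filter afterwards. The following is only a sketch: it assumes the search was run with `-outfmt 6` and the default column order, the function name `filter_hsps` is made up for illustration, and the three example rows are invented numbers, not real results.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hsps(handle, min_length=100, min_pident=90.0):
    """Yield alignments (as dicts) with length >= min_length and pident > min_pident.

    `filter_hsps` is a hypothetical helper, not part of BLAST.
    """
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines (the latter appear with -outfmt 7).
        if not row or row[0].startswith("#"):
            continue
        hsp = dict(zip(COLUMNS, row))
        if int(hsp["length"]) >= min_length and float(hsp["pident"]) > min_pident:
            yield hsp


# Invented rows: a long 60% alignment, a 150-residue 93% alignment,
# and a short 98% alignment.
example = (
    "q1\ts1\t60.0\t500\t190\t10\t1\t500\t1\t495\t1e-150\t520\n"
    "q1\ts2\t93.3\t150\t10\t0\t20\t169\t5\t154\t2e-90\t300\n"
    "q2\ts3\t98.0\t50\t1\t0\t1\t50\t1\t50\t1e-25\t100\n"
)
kept = list(filter_hsps(io.StringIO(example)))
print([(h["qseqid"], h["sseqid"]) for h in kept])  # [('q1', 's2')]
```

Note that this only keeps alignments that BLAST reported as separate rows; it does not look inside a single long alignment for a high-identity stretch.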

ADD REPLY • link 2.4 years ago by Jeremy Leipzig 22k

BLAT may be a better tool for this case. BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. So if those limitations work for you, take a look at LINK.

ADD REPLY • link 2.4 years ago by GenoMax 142k

score 0 · Answer 1 · 2021-12-20

I don't think what you want can be done with BLAST, or with any other sequence searching and/or alignment tool. It goes against the purpose of these tools. They are not primarily meant to identify relatively short alignments with high identity, which seems to be what you want. Quite the contrary: they are meant to identify all the relatives using E-value statistics. The idea is to include as many distant relatives as possible, and to do so in a reliable fashion.

With all that said, there may be a way to get this done, but it will require some luck. You may restrict the search to closely related sequences by setting a high percent identity threshold, but that setting is still unlikely to control the alignment length. However, if you have a short region with high alignment identity and a longer region with lower identity, setting high gap penalties (both for opening and extension) may force BLAST to drop the low-identity portion of the alignment. This assumes that your high-identity part is mostly gapless, and that the low-identity portion has more gaps. This is usually a safe assumption, but you may still need some luck (and lots of experimenting) to get exactly what you want.
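If tuning gap penalties does not isolate the high-identity region, a possible fallback is to take the long alignment as reported and scan it yourself. The sketch below is not a BLAST feature: `best_identity_window` is a hypothetical helper that assumes you have the two aligned strings (with `-` for gaps) and slides a fixed-width window over the alignment columns, counting identity as identical columns divided by window width. The toy sequences are invented and use a window of 10 to keep the example short.

```python
def best_identity_window(aligned_query, aligned_subject, window=100):
    """Return (start_column, identity_fraction) of the best window, or None.

    Identity is the fraction of alignment columns in the window where both
    residues are identical; gap columns count as mismatches.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    n = len(aligned_query)
    if n < window:
        return None  # alignment is shorter than the requested stretch

    # 1 for an identical, non-gap column; 0 otherwise.
    matches = [int(a == b and a != "-")
               for a, b in zip(aligned_query, aligned_subject)]

    # Running sum, so each window is evaluated in constant time.
    current = sum(matches[:window])
    best_start, best = 0, current
    for start in range(1, n - window + 1):
        current += matches[start + window - 1] - matches[start - 1]
        if current > best:
            best_start, best = start, current
    return best_start, best / window


# Toy alignment: 10 mismatched columns followed by 10 identical columns.
query = "MNPQRSTVWY" + "ACDEFGHIKL"
subject = "YWVTSRQPNM" + "ACDEFGHIKL"
print(best_identity_window(query, subject, window=10))  # (10, 1.0)
```

With `window=100` and a cutoff of 0.9 on the returned fraction, this would flag long alignments that contain at least one stretch matching the criteria in the question. The returned start is an alignment column, not a sequence coordinate, so gaps before it must be accounted for when mapping back to the proteins.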