Question

Finding short, highly similar matches to long sequences using BLASTN

4.1 years ago

CC ▴ 50

When I compare a 331bp sequence (JQ749729.1) to nt using megablast (via the 'Run BLAST' button on the right-hand side), I get only one match, with 98.43% identity for the query region.

However, when I do the reverse and compare the matching 11,612bp sequence (MN733821.1) to nt using megablast, I get many hits, however none of them are the highly similar JQ749729.1 sequence. They are all long sequences with only 70-80% identity.

I assume this is because the blast algorithm scores longer, less similar matches higher than shorter, more similar matches. I have tried changing the settings (scoring higher reward for matches and penalty for mismatches; increasing the word size; increasing the gap cost etc.) but I cannot get blast to find that short, highly similar match. I also tried doing this via command line so I could try other parameters, such as -perc_identity, which I set to 95, but this ended up with 0 matches to nt.

Is there a way to adjust blast's parameters so that it will find that short, highly similar sequence when using the long sequence as a query and nt as the database? Or is there a different method more suited to this task? Thank you for your help.

blast alignment • 1.2k views

Have you tried changing the e-value threshold? It might be that the hit you are talking about does not pass the e-value threshold.

Otherwise, you could also increase the number of hits returned. If the query sequence has lots of hits, yours might not be among the best 250(?) reported ones.

4.1 years ago by lieven.sterck 15k
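For the command line, the two suggestions above could look roughly like the sketch below. It only builds the argument list for a remote blastn search with a looser e-value cutoff and a larger hit limit; it does not run anything. The file names `MN733821.fasta` and `hits.tsv` and the value 5000 are placeholders chosen for illustration, not values from this thread.

```python
import shlex


def build_blastn_command(query_fasta, task="megablast", evalue=10.0,
                         max_target_seqs=5000, out_path="hits.tsv"):
    """Return an argv list for a remote blastn search against nt.

    All default values are illustrative assumptions; tune them to your case.
    """
    return [
        "blastn",
        "-task", task,                       # "megablast" or "blastn"
        "-query", query_fasta,               # FASTA file with the long query
        "-db", "nt",
        "-remote",                           # search on NCBI servers
        "-evalue", str(evalue),              # e-value cutoff
        "-max_target_seqs", str(max_target_seqs),  # keep more subjects
        "-outfmt", "6",                      # tabular output
        "-out", out_path,
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected or pasted into a shell.
    print(shlex.join(build_blastn_command("MN733821.fasta")))
```

The list can be passed to `subprocess.run` once BLAST+ is installed. Raising the hit limit is the part most likely to matter here, since the short match is ranked below the long ones.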

score 0 · Answer 1 · 2020-03-11

However, when I do the reverse and compare the matching 11,612bp sequence (MN733821.1) to nt using megablast, I get many hits, however none of them are the highly similar JQ749729.1 sequence. They are all long sequences with only 70-80% identity.

Keep in mind that BLAST does a local alignment and ranks hits by alignment score, which grows with the length of the aligned region. Query coverage does not enter the score directly, but a 331 bp subject can cover at most 331/11,612, or about 2.85%, of an 11,612 bp query, even at 100% identity. Its score is therefore far below that of the long, less similar matches, and I think that is why it does not show up among the reported hits.
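To make the numbers concrete, here is a small sketch of the arithmetic. The coverage uses the two lengths from the question. The score comparison is a simplification: it assumes ungapped alignments scored with +1 per match and -2 per mismatch (which I believe are the megablast defaults), and the "long hit" of 3,000 bp at 80% identity is a made-up example, not a hit from the actual search.

```python
def query_coverage(aligned_query_bases, query_length):
    """Percentage of the query covered by the aligned region."""
    return 100.0 * aligned_query_bases / query_length


def ungapped_raw_score(matches, mismatches, reward=1, penalty=-2):
    """Raw score of an ungapped alignment (assumed +1/-2 scoring)."""
    return matches * reward + mismatches * penalty


# Best case for the short sequence: all 331 bp align perfectly.
print(f"{query_coverage(331, 11612):.2f}%")   # prints 2.85%
short_hit = ungapped_raw_score(331, 0)        # 331

# Hypothetical long hit: 3,000 bp aligned at 80% identity.
long_hit = ungapped_raw_score(2400, 600)      # 2400 - 1200 = 1200

print(short_hit, long_hit, long_hit > short_hit)
```

Even under this crude model the long, 80% identical alignment outscores a perfect 331 bp match several times over, so the short hit ends up far down the ranked list.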

increasing the word size

You need to lower the word size, not increase it. Try blastn instead of megablast (the default); blastn uses a default word size of 11, versus 28 for megablast. In the web BLAST you can find that option in the Program Selection section. You will get many more hits this way. You will not get more 99% hits, but that is not a problem with BLAST; it reflects the lack of closely related reference sequences.
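With many more hits coming back, it may help to re-rank them yourself instead of relying on the score order. The sketch below assumes the default 12-column tabular output (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and keeps only short, highly identical alignments. The thresholds and the function names are my own choices, and the two rows in the demo are invented placeholders, not real search results.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject: str
    pident: float
    length: int
    bitscore: float


def parse_tabular(lines):
    """Parse default -outfmt 6 lines into Hit records."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append(Hit(fields[1], float(fields[2]),
                        int(fields[3]), float(fields[11])))
    return hits


def short_high_identity(hits, min_pident=95.0, max_length=500):
    """Keep short, near-identical alignments, best identity first."""
    keep = [h for h in hits
            if h.pident >= min_pident and h.length <= max_length]
    return sorted(keep, key=lambda h: h.pident, reverse=True)


if __name__ == "__main__":
    # Invented placeholder rows, only to show the expected column layout.
    demo = [
        "query1\tsubjA\t78.50\t3000\t600\t45\t1\t3000\t10\t3009\t0.0\t1500",
        "query1\tsubjB\t98.43\t319\t5\t0\t5000\t5318\t1\t319\t1e-150\t540",
    ]
    for hit in short_high_identity(parse_tabular(demo)):
        print(hit.subject, hit.pident, hit.length)
```

This only works if the short hit is in the output to begin with, so it goes together with a generous hit limit in the search itself.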