Only one hit per query on blastn?
0
0
Entering edit mode
3.5 years ago
bioguy ▴ 50

Am trying to run blastn to check if a sequence has a similar match within a database within a certain percent identity– to speed things up, I'd like blast to stop searching after finding a single hit (it doesn't matter what the match is, just that it exists).

Is there a parameter that does this? From my understanding, max target seqs does not do what I want...

blast genomics • 3.8k views
2
Entering edit mode

I think it does exactly what you need it to do. What it does not do is get you the "best" n hits. It gets you the first n hits, which may or may not be the best.

See this paper:

A common strategy involves using the ‘-max_target_seqs’ parameter of the NCBI BLAST+ suite. According to the BLAST documentation itself (2008), this parameter represents the ‘number of aligned sequences to keep’. This statement is commonly interpreted as meaning that BLAST will return the top N database hits for a sequence query if the value of max_target_seqs is set to N. For example, in a recent article (Wang et al., 2016) the authors explicitly state ‘Setting “max target seqs” as “1,” only the best match result was considered.’

To our surprise, we have recently discovered that this intuition is incorrect. Instead, BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits. The invocation using the parameter ‘-max_target_seqs 1’ simply returns the first good hit found in the database, not the best hit as one would assume.

Further reading: BLAST, setting maximum number of hits

0
Entering edit mode

Not exactly:

cat qry.fa
>qryseq
aatctcgcgcttacgctgtgctcgtcgttcccgcgtgccccgcgctgcccgcgctcgggctcgcccgcgctcgcg

cat sbj.fa
>sbjseq (1st and 3rd line are the same than qryseq except 1st line ends ccg instead of gcg)
aatctcgcgcttacgctgtgctcgtcgttcccgcgtgccccgcgctgcccgcgctcgggctcgcccgcgctcccg
aaaaaatctctcggggggcggcgcgcgcgcgcgcgctttcgagaggagatctcgcgcgccggcgctttgcgcgct
aatctcgcgcttacgctgtgctcgtcgttcccgcgtgccccgcgctgcccgcgctcgggctcgcccgcgctcgcg
cgcgcgcggggggggggccgcgcggcgaaagaggagaggaaggcgcgcgcggaggaaattctcctcttcttcttc

blastn -query qry.fa -subject sbj.fa -outfmt 6 -max_target_seqs 1
qryseq  sbjseq  100.000 75      0       0       1       75      151     225     1.83e-38        139
qryseq  sbjseq  100.000 72      0       0       1       72      1       72      8.50e-37        134

blastn -query qry.fa -subject sbj.fa -outfmt 6 -max_target_seqs 1 -max_hsps 1
qryseq  sbjseq  100.000 75      0       0       1       75      151     225     1.83e-38        139


max_target_seqs for how many maximum target sequences and max_hsps for how many maximum hits within target sequences, but still in this case it's picking the latter hit although presumably it has encountered the other hit first. AFAIK there's no grep -m 1 equivalent to blast although perhaps it's possible to disable search within hsps by changing the source code. In other words, it may not return the optimal hit, but it seems to return the optimal hit within the first good enough target sequence. If subject sequences are long, this may have significant speed implications..

Edit.

cat sbj.fa GCF_002055395.1_ASM205539v1_genomic.fna > newSbj.fa
time blastn -query qry.fa -subject sbj.fa -outfmt 6 -max_target_seqs 1 -max_hsps 1 > /dev/null

real    0m0.030s
user    0m0.000s
sys     0m0.000s

time blastn -query qry.fa -subject newSbj.fa -outfmt 6 -max_target_seqs 1 -max_hsps 1 > /dev/null

real    0m0.255s
user    0m0.000s
sys     0m0.000s

cat newSbj.fa newSbj.fa newSbj.fa newSbj.fa > bigger.fa

time blastn -query qry.fa -subject bigger.fa -outfmt 6 -max_target_seqs 1 -max_hsps 1 > /dev/null

real    0m0.926s
user    0m0.000s
sys     0m0.000s


Looks to me like it's going through the entire sequence space despite -max_target_seqs 1. This was with v. 2.7.1+