I am having trouble getting blast to give me "correct" results.
I am trying to retrieve as many hits with e-value better than 1. I query the database with a sequence that should have several thousand hits in the database. However, at best, using tblastn, I get more or less 1000 matches (~250 independent hits). I am at a loss understanding what is wrong with my command:
tblastn -query protein.fasta -db nucl.blastdb -out results.tblastnout -evalue 1 -outfmt 7 -num_descriptions 100000
I get several matches within one same sequence hit, all with e-values better than 1. But what confuses me is that the best e-value of the worst hit (sorry if this is confusing ;) is nowhere near the -evalue limit, and is usually lower than 1E-60... Obviously, even including the redundancy of matches within a hit, I certainly do not reach the 100000 limit I asked for.
So I have three questions:
- Is it possible to only list one (the best) "match" per "hit"?
- Any idea why I do not get a larger number of descriptions, considering that I expect to have close to 30000 positives in my database?
- Any comments/suggestions to improve my search?
EDIT:
After investigation and testing (a bit)...
For question 1)
In order to limit the number of matches (HSPs) in a sequence hit, I tried the "Best-Hits filtering algorithm". As a result, this brought my newly found 50000 hits (with -max_target_seqs
) down to 253. In reality however, it increased the number of positive hits by 2. So I'm guessing that the number of "hits" listed in the header of the output is only the number of HSPs.
In any case, this seems to help with the multiple hit per target sequence. But is it the full answer? Or are there other ways?
For question 2)
With the argument -num_descriptions
, changing the value doesn't really change the result. It turns out that with tabular formatting, one should use the argument -max_target_seqs
instead of the previous one (see latest BLAST user manual). As a result, I get an interestingly different output: the number of "hits" announced in the output header dramatically increases, in this case to >50000. This is the result that I'd expect... However, the actual list of results is exactly the same as with -num_descriptions
:(
I am having the same problem with tblastx (version 2.2.28+). I am comparing two bacterial genomes, and when I run tblastx, I only get 400 hits, independently of
max_target_seqs
andnum_alignments
values...I believe that there is a bug, because when I compare a genome with itself, I only get 400 hits...