Question: Obtaining A Maximum Number Of Blast Hits: Problem...
3
gravatar for Nicojo
6.1 years ago by
Nicojo1.1k
Kyoto, Japan
Nicojo1.1k wrote:

I am having trouble getting blast to give me "correct" results. I am trying to retrieve as many hits with e-value better than 1. I query the database with a sequence that should have several thousand hits in the database. However, at best, using tblastn, I get more or less 1000 matches (~250 independent hits). I am at a loss understanding what is wrong with my command:

tblastn -query protein.fasta -db nucl.blastdb -out results.tblastnout -evalue 1 -outfmt 7 -num_descriptions 100000

I get several matches within one same sequence hit, all with e-values better than 1. But what confuses me is that the best e-value of the worst hit (sorry if this is confusing ;) is nowhere near the -evalue limit, and is usually lower than 1E-60... Obviously, even including the redundancy of matches within a hit, I certainly do not reach the 100000 limit I asked for.

So I have three questions:

  1. Is it possible to only list one (the best) "match" per "hit"?
  2. Any idea why I do not get a larger number of descriptions, considering that I expect to have close to 30000 positives in my database?
  3. Any comments/suggestions to improve my search?

EDIT:

After investigation and testing (a bit)...

For question 1)

In order to limit the number of matches (HSPs) in a sequence hit, I tried the "Best-Hits filtering algorithm". As a result, this brought my newly found 50000 hits (with -max_target_seqs) down to 253. In reality however, it increased the number of positive hits by 2. So I'm guessing that the number of "hits" listed in the header of the output is only the number of HSPs. In any case, this seems to help with the multiple hit per target sequence. But is it the full answer? Or are there other ways?

For question 2)

With the argument "-num_descriptions", changing the value doesn't really change the result. It turns out that with tabular formatting, one should use the argument "-max_target_seqs" instead of the previous one (see latest BLAST user manual). As a result, I get an interestingly different output: the number of "hits" announced in the output header dramatically increases, in this case to >50000. This is the result that I'd expect... However, the actual list of results is exactly the same as with "-num_descriptions" :(

blast • 17k views
ADD COMMENTlink modified 20 months ago by CB10 • written 6.1 years ago by Nicojo1.1k
2
gravatar for Nicojo
6.1 years ago by
Nicojo1.1k
Kyoto, Japan
Nicojo1.1k wrote:

NOTE: This question and answer concerns BLAST 2.2.25+... I do not know if any of this is valid for other versions of BLAST.

Answer to question #2:

My initial (main) problem was that the following blast+ command...

tblastn -query protein.fasta -db nucl.blastdb -out results.tblastnout -evalue 1 -outfmt 7 -num\_descriptions 100000

...did not give me the result that I was expecting.

Normally, in BLAST (whether the old or the new), you can say how many descriptions and alignments you want (depending on the version):

  • v or num_descriptions determine the number of descriptions that are output. By default these are 500.
  • b or num_alignments determine the number of alignments that are output. By default these are 250.

Indeed, with these default settings, and provided there are at least 500 hits, a blast will produce 500 descriptions and 250 alignments. This should imply that each argument is independent of the other.

When it comes to the tabular output formats (flag "-outfmt 6" and above) the output is devoid of alignments. In BLAST+, the documentation even suggests that one should use the flag "-max_target_seqs" instead of "-num_descriptions". This should imply that the flag "-num_alignments" has no impact on tabular format outputs.

Oddly enough, it seems that neither of these assumptions are true, when requesting tabular output formatting: although the num_alignment flag can be ignored for pairwise output, it will limit the num_descriptions output.

In order to retrieve the maximum number of hits in a tabular format, with BLAST+, one should use a command similar to the following:

tblastn -query protein.fasta -db nucl.blastdb -out results.tblastnout -evalue 1 -outfmt 7 -num\_descriptions 100000 -num\_alignments 100000

Why this would be the case beats me. I will ask the NCBI BLAST team and report back if I get an answer...

Update:

I have contacted the BLAST Help team, and received a reply. However, they only pointed to the use of -maxtargetseqs when using outfmt flags >4. I emailed back to point out that in version 2.2.25+, that this does not return the expected result. I'm still waiting to hear back from them...

So, for anybody having issues, I highly recommend to use the latest version of BLAST: 2.2.27+. I have verified: the use of the option "maxtargetseqs" works as expected!

Update 2:

No news from the BLAST Help team... I guess they don't like it when people point out bugs :(

Conclusion: USE THE LATEST VERSION ;)

ADD COMMENTlink modified 6.0 years ago • written 6.1 years ago by Nicojo1.1k

I am having the same problem with tblastx (version 2.2.28+). I am comparing two bacterial genomes, and when I run tblastx, I only get 400 hits, independently of max_target_seqs and num_alignments values...

I believe that there is a bug, because when I compare a genome with itself, I only get 400 hits... 

ADD REPLYlink written 3.5 years ago by fhsantanna410
1
gravatar for Ketil
6.1 years ago by
Ketil3.9k
Germany
Ketil3.9k wrote:

There is a limit on the number of alignments that will be produced. From the (old) BLAST documentation:

-v  Number of database sequences to show one-line descriptions for (V) [Integer]
    default = 500
-b  Number of database sequence to show alignments for (B) [Integer]
    default = 250

I'm not familiar with BLAST+ (which you seem to be using), but I suppose it has the same limits, and similar options to modify them?

ADD COMMENTlink written 6.1 years ago by Ketil3.9k

@Ketil, thanks for the answer! However, I'm not quite sure that the "default" means "max number limit"... In any case, changing the number to e.g. 5000 does not improve the result. Since I already reach >1000, I don't think there is such a limit. Any other idea?

ADD REPLYlink written 6.1 years ago by Nicojo1.1k

Which of these numbers did you try to change? Looking at some of my files (i.e. looking at Hit_num in BLAST XML output), there is a cap on hits at 500, I imagine increasing -v would lift that.

Anyway, if that doesn't work, I'm out of ideas.

ADD REPLYlink written 6.1 years ago by Ketil3.9k

@Ketil: my command line calls for "-outfmt 7", which in BLAST+ is a tabular format. Therefore no alignments are output. The only arguments that influence the number of hits are "-num_descriptions" and, as I just found out, "-max_target_seqs". Initially I changed the first. Now I've tested both: check out the edit to the question, as it gets even stranger...

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Nicojo1.1k

thanks for following up - many of these tools make use of a number of tacit assumptions when they are operated outside of the most common parameter range - I'm learning a lot from it!

ADD REPLYlink written 6.1 years ago by Istvan Albert ♦♦ 77k

Indeed, the documentation is good, but not extensive... I'm still hesitating between obscureness of the tool options or bad assumptions on my part. I am in the process of testing my dataset to be sure that it contains what I believe it does. Will update as I go!

ADD REPLYlink written 6.1 years ago by Nicojo1.1k

can i limit the number of hits to 10 best ones

the command line i use is

./blastn -query /home/Downloads/esembl/EDL36652.1d.fasta -db /home/Desktop/ncbi-blast-2.2.30+/output2/mart -outfmt 7 -max_target_seqs 1 -out /home/CCMB/Desktop/ncbi-blast-2.2.30+/outformat/result

but i end up with 31 hits

ADD REPLYlink written 3.7 years ago by vigneshprbh3720
0
gravatar for AGS
6.1 years ago by
AGS230
Brooklyn, ny
AGS230 wrote:

@Nicojo,

"Is it possible to only list one (the best) "match" per "hit"?"

If I remember correctly, using the -K 1 flag will limit output to the best hit.

ADD COMMENTlink written 6.1 years ago by AGS230

please add this as as comment not as a new answer

ADD REPLYlink written 6.1 years ago by Istvan Albert ♦♦ 77k

This is an answer to question #1 that the OP asked.

ADD REPLYlink written 6.1 years ago by AGS230

ok I see, my bad, since you had the @Nicojo there I thought that you are referring to his comment above.

ADD REPLYlink written 6.1 years ago by Istvan Albert ♦♦ 77k

@AGS: thanks for your answer. Unfortunately, that flag seems to be from the old NCBI BLAST, and not the current NCBI BLAST+. I have just found a recent version of the BLAST Command Line Applications Manual, with the description of all arguments. Unfortunately, there is no "K" argument. The only thing I could imagine were the "-best_hit_overhang" and "-best_hit_score_edge". See question edit for details.

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Nicojo1.1k
0
gravatar for Rm
6.1 years ago by
Rm7.8k
Danville, PA
Rm7.8k wrote:

To control number of top hits from blast+ : For descriptions: [-num_descriptions int_value] ; for alignments: [-num_alignments int_value]

which are equivalent to -v and -b options from older blast

ADD COMMENTlink written 6.1 years ago by Rm7.8k

@Rm: thanks. Unfortunately those do not do the job, which is why I'm asking the question. Please read the full post for the details...

ADD REPLYlink written 6.1 years ago by Nicojo1.1k
1

My two cents: Output all possible hits in default (pair wise) format using very large values for the [-numdescriptions intvalue] and [-numalignments intvalue] and see how many hits you will actually get; then you can play with the "-outfmt 7" options , or you can write a parser script to get desired output from default output format.

ADD REPLYlink written 6.1 years ago by Rm7.8k

@Rm: thanks again... I tried the default output format (only with -num_descriptions, not -num_alignments), and that gave me the result I was expecting. But that doesn't explain why I do not get that result with the tabular output format. In any case, your suggestion put my on the right track. I'll post a complete answer for more details. So, thanks!

ADD REPLYlink written 6.1 years ago by Nicojo1.1k
0
gravatar for CB
20 months ago by
CB10
US
CB10 wrote:

I had the same issue using blast 2.5.0 I used the option max_target_seqs 10, and I had many cases with more than 10 hits. For some genes I had 10 hits, but for many I had >50 hits. I used the tabular output (-outfmt '7 qseqid sseqid pident evalue stitle qcovs length sstart send' )

ADD COMMENTlink modified 20 months ago • written 20 months ago by CB10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1279 users visited in the last hour