Question: Cannot reproduce BLAST results returned by the WEB API on the shell-based BLAST+ 2.7.1
17 months ago, xerygi wrote:

I am trying to BLAST several sequences using the latest command line-based BLAST+ on a Linux machine. Whatever I do, I get the unfortunate "No hits found" message.

I randomly chose some of my sequences and tried the web-based API. To my surprise I received dozens of results for almost all of them with identities higher than 92%!

Here is one example sequence: CAGTTTNCATTTTATAACT

On the web, I simply navigate to 'blastn' (https://blast.ncbi.nlm.nih.gov/Blast.cgi) and I leave all settings unchanged (it uses megablast and the nt database collection), and I receive many hits for the sequence above.

When I try the shell-based BLAST on the same sequence above, I get nothing!

I have tried both a remote search:

blastn -task megablast -db nt -remote -query test_sequence.fasta -out remote_blast_test_megablast_output.txt

blastn -task blastn-short -db nt -remote -query test_sequence.fasta -out remote_blast_test_blastnshort_output.txt

...And also a local search on my own computer:

blastn -task megablast -db /home/blast_databases/nt -query test_sequence.fasta -out blast_test_megablast_output_localdatabase.txt

blastn -task blastn-short -db /home/blast_databases/nt -query test_sequence.fasta -out blast_test_blastnshort_output_localdatabase.txt

I have also tried adding the -outfmt and -perc_identity 92 parameters, but neither helped.

How can I reproduce the same results that BLAST's web API returns?


UPDATE #1

I have so far gotten closer to the web results by specifying:

  1. -word_size 7
  2. -evalue 1000 (the command-line version defaults to 10!)
  3. -task blastn-short

Similar topic: Why the local blast and online blast produce different results?


Please do not delete posts, especially once they have received answers/comments.

written 17 months ago by genomax
17 months ago, genomax (United States) wrote:

Even though you may not have changed anything, NCBI changes the search parameters based on the length of the sequence. For your example sequence, it appears to use word size 7, gap costs of existence 5 and extension 2, and match/mismatch scores of 1,-3. I am not sure if you can see this link, but try those on the command line for a start.
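
The parameters suggested above could be spelled out on the command line roughly as follows. This is only a sketch, not NCBI's actual web configuration: the flags are standard BLAST+ blastn options, the -evalue 1000 setting comes from the question's update rather than from the web report, the database and query paths are the ones used in the question, and the output file name is made up. To my knowledge, -task blastn-short already implies most of these values, so listing them explicitly mainly documents the intent.

```shell
# Options mirroring the web settings described above: word size 7,
# gap costs 5 (existence) / 2 (extension), match/mismatch 1/-3.
# The -evalue value is an assumption taken from the question's update.
SHORT_OPTS="-task blastn-short -word_size 7 -gapopen 5 -gapextend 2 -reward 1 -penalty -3 -evalue 1000"

# Only run the search if blastn is installed and the query file exists.
if command -v blastn >/dev/null 2>&1 && [ -f test_sequence.fasta ]; then
  # $SHORT_OPTS is left unquoted on purpose so it splits into separate options.
  blastn $SHORT_OPTS \
    -db /home/blast_databases/nt \
    -query test_sequence.fasta \
    -out blast_test_short_params.txt   # hypothetical output name
fi
```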


I tried the same thing in the meantime, after posting this thread. I changed word_size and, to my surprise, got almost the same results, though not all of them. Is there a way to do this auto-optimization (which the Web API does) with the command-line version of BLAST? I have thousands of sequences, so how can I find the optimal parameters myself? Besides, my search on my computer (using the database on my hard drive) takes an insanely long time for just that one sequence above, whereas the Web API returns it in 20 seconds!!! Is there something there as well that needs to be optimized to give me better speed?
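
As far as I know, standalone BLAST+ has no switch that reproduces the web page's automatic adjustment for short queries, so one option is to choose the task per query length yourself. The sketch below assumes a 50-base cutoff, borrowed from the blastn help text, which describes blastn-short as optimized for sequences shorter than 50 bases; it is not NCBI's documented web rule. The function names choose_task and seq_length are made up for illustration.

```shell
# choose_task LENGTH
# Prints the blastn task to use for a query of the given length.
# The 50-base cutoff is an assumption, not the rule the web service applies.
choose_task() {
  if [ "$1" -lt 50 ]; then
    echo blastn-short
  else
    echo megablast
  fi
}

# seq_length FASTA_FILE
# Prints the number of sequence characters in a single-record FASTA file
# (header lines and whitespace are not counted).
seq_length() {
  grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

# Example use (not executed here):
#   blastn -task "$(choose_task "$(seq_length test_sequence.fasta)")" \
#     -db /home/blast_databases/nt -query test_sequence.fasta
```

For a file with thousands of sequences of similar length, it would be enough to check one representative sequence and use the same task for the whole file.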

written 17 months ago by xerygi

Besides, my search on my computer (using the database on my hard drive) takes an insanely long time for just that one sequence above, whereas the Web API returns it in 20 seconds!!!

Talk about comparing apples to oranges :-) You can never match compute resources at NCBI locally.

written 17 months ago by genomax

Supposing I have 200,000 sequences which are all the same length as the example I have posted and require the same sort of parameters, how much computing power would I need to finish the search in a reasonable amount of time (say 48 hours)?

written 17 months ago by xerygi

You know how long it took for one search locally so you can extrapolate the time from there (be sure to use multiple threads if they are available). If you are willing to look at something like AWS then try a few searches and then extrapolate to estimate resources you will need to stay under a certain time.
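
The extrapolation can be done with a few lines of shell arithmetic. The sketch below assumes every query takes about the same time and that throughput scales linearly with the number of searches running in parallel, which is optimistic for a disk-bound nt search. The function name workers_needed is made up, and the 30 seconds per query in the example is a placeholder, not a measurement from this thread.

```shell
# workers_needed N_QUERIES SECS_PER_QUERY DEADLINE_HOURS
# Prints how many searches must run in parallel to finish before the deadline,
# assuming identical per-query time and perfectly linear scaling.
workers_needed() {
  total=$(( $1 * $2 ))        # total compute time in seconds
  budget=$(( $3 * 3600 ))     # wall-clock time available in seconds
  echo $(( (total + budget - 1) / budget ))   # integer ceiling of total/budget
}

# Example with a hypothetical 30 s per query, 200,000 queries, 48 hours:
workers_needed 200000 30 48   # prints 35
```

Replace the 30 with the time you actually measured for one search. Within a single blastn process, the -num_threads option is the standard way to use several cores.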

You may be able to do the search remotely at NCBI (using the command-line option -remote) if you are careful to send small batches and are patient.
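
One way to send small batches is to split the multi-FASTA file first and then submit the pieces one at a time with a pause in between. This is a rough sketch under assumptions: the function names, batch size, and pause length are all made up, and suitable values depend on NCBI's current usage guidelines, which are not covered in this thread.

```shell
# split_fasta FASTA_FILE BATCH_SIZE PREFIX
# Writes PREFIX_1.fasta, PREFIX_2.fasta, ... with at most BATCH_SIZE records each.
split_fasta() {
  awk -v size="$2" -v prefix="$3" '
    /^>/ {
      # Start a new output file every "size" records.
      if (n % size == 0) {
        if (out != "") close(out)
        out = prefix "_" (n / size + 1) ".fasta"
      }
      n++
    }
    out != "" { print > out }
  ' "$1"
}

# remote_batches PREFIX PAUSE_SECONDS
# Runs each batch against nt at NCBI, sleeping between submissions.
remote_batches() {
  for f in "$1"_*.fasta; do
    blastn -task blastn-short -db nt -remote \
      -query "$f" -out "${f%.fasta}.txt"
    sleep "$2"
  done
}

# Example use (not executed here; 20 and 60 are placeholder values):
#   split_fasta all_sequences.fasta 20 batch
#   remote_batches batch 60
```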

written 17 months ago by genomax