Question

Cannot reproduce BLAST results returned by the WEB API on the shell-based BLAST+ 2.7.1

1

6.2 years ago

xerygi ▴ 10

I am trying to BLAST several sequences using the latest command line-based BLAST+ on a Linux machine. Whatever I do, I get the unfortunate "No hits found" message.

I randomly chose some of my sequences and tried the web-based API. To my surprise I received dozens of results for almost all of them with identities higher than 92%!

Here is one example sequence: CAGTTTNCATTTTATAACT

On the web, I simply navigate to 'blastn' (https://blast.ncbi.nlm.nih.gov/Blast.cgi) and I leave all settings unchanged (it uses megablast and the nt database collection), and I receive many hits for the sequence above.

When I try the shell-based BLAST on the same sequence above, I get nothing!

I have tried both a remote search:

blastn -task megablast -db nt -remote -query test_sequence.fasta -out remote_blast_test_megablast_output.txt

blastn -task blastn-short -db nt -remote -query test_sequence.fasta -out remote_blast_test_blastnshort_output.txt

...And also a local search on my own computer:

blastn -task megablast -db /home/blast_databases/nt -query test_sequence.fasta -out blast_test_megablast_output_localdatabase.txt

blastn -task blastn-short -db /home/blast_databases/nt -query test_sequence.fasta -out blast_test_blastnshort_output_localdatabase.txt

I have also tried adding the -outfmt and -perc_identity 92 parameters, but those did not help either.

How can I reproduce the same results that BLAST's web API returns?

UPDATE #1

I have so far gotten closer to the web results by specifying:

-word_size 7
-evalue 1000 (on the command line version, it was set to 10 by default!!)
-task blastn-short

Similar topic: Why the local blast and online blast produce different results?

blast shell bash ncbi • 1.8k views

ADD COMMENT • link 6.2 years ago by xerygi ▴ 10

0

Please do not delete posts, especially once they have received answers/comments.

ADD REPLY • link 6.2 years ago by GenoMax 141k

score 1 · Answer 1 · 2018-02-02

1

6.2 years ago

GenoMax 141k

Even though you may not have changed anything, NCBI changes the search parameters based on the length of the query sequence. For your example sequence they seem to be using word size 7; gap costs of existence 5, extension 2; and match/mismatch scores of 1,-3. I am not sure if you can see this link, but try those for a start on the command line.
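As a rough sketch, the parameters listed above map onto blastn options as follows. The E-value comes from UPDATE #1 in the question rather than from this answer, and the output file name is a hypothetical placeholder. The script only prints the assembled command so it can be checked before anything is sent to NCBI.

```shell
# Scoring parameters reported in the answer above for the example 19-mer.
SHORT_QUERY_OPTS="-word_size 7 -gapopen 5 -gapextend 2 -reward 1 -penalty -3"
# E-value threshold taken from UPDATE #1 in the question (not from this answer).
EVALUE=1000

# Assemble the full command. The query file name follows the question;
# the output file name is a made-up placeholder.
CMD="blastn -task blastn-short $SHORT_QUERY_OPTS -evalue $EVALUE -db nt -remote -query test_sequence.fasta -out remote_blast_short_params_output.txt"

# Print the command for inspection; run it with: eval "$CMD"
echo "$CMD"
```

For a local search, replace `-db nt -remote` with the path to the local database, as in the question's own commands.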

ADD COMMENT • link 6.2 years ago by GenoMax 141k

0

I tried the same thing myself shortly after posting this thread. I changed the word_size and, to my surprise, got almost the same results, though not all of them. Is there a way to do this auto-optimization (that the Web API does) on the command line version of BLAST? I have thousands of sequences, so how can I find the optimal parameters myself? Besides, my search on my computer (using the database on my hard drive) takes an insanely long time for just that one sequence above, whereas the Web API returns it in 20 seconds!!! Is there something there as well that needs to be optimized to give me a better speed?

ADD REPLY • link 6.2 years ago by xerygi ▴ 10

0

Besides, my search on my computer (using the database on my hard drive) takes an insanely long time for just that one sequence above, whereas the Web API returns it in 20 seconds!!!

Talk about comparing apples to oranges :-) You can never match compute resources at NCBI locally.

ADD REPLY • link 6.2 years ago by GenoMax 141k

0

Supposing I have 200,000 sequences which are all the same length as the example I have posted and require the same sort of parameters, how much computing power would I need to finish the search in a reasonable amount of time (say 48 hours)?

ADD REPLY • link 6.2 years ago by xerygi ▴ 10

0

You know how long it took for one search locally so you can extrapolate the time from there (be sure to use multiple threads if they are available). If you are willing to look at something like AWS then try a few searches and then extrapolate to estimate resources you will need to stay under a certain time.
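The extrapolation can be sketched as simple arithmetic. The 200,000 sequences and the 48-hour target come from the question above. SECS_PER_QUERY is an assumed placeholder, not a measured value, and should be replaced with the wall-clock time of one real local search. The estimate also assumes that cost scales linearly with the number of queries, which is probably pessimistic when many queries share one blastn run.

```shell
N_QUERIES=200000     # number of sequences, from the question
DEADLINE_HOURS=48    # target turnaround, from the question
SECS_PER_QUERY=60    # ASSUMED placeholder; measure one search with `time blastn ...`

total_secs=$((N_QUERIES * SECS_PER_QUERY))
budget_secs=$((DEADLINE_HOURS * 3600))

# Ceiling division: how many searches would have to run at the same time.
workers=$(( (total_secs + budget_secs - 1) / budget_secs ))

echo "Serial time: $((total_secs / 3600)) hours"
echo "Concurrent searches needed to finish in ${DEADLINE_HOURS}h: $workers"
```

The blastn option for using multiple threads is -num_threads. How much it helps for queries this short is something to measure, not assume.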

You may be able to do the search remotely at NCBI (using the command line option -remote) if you send small batches through and are patient.
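One way to prepare small batches is to split the query file into chunks of a fixed number of records, as in the sketch below. The function names, the batch_ prefix and the 60-second pause are all hypothetical choices for illustration. The pause in particular is an arbitrary value, not a limit stated by NCBI.

```shell
# Split a multi-FASTA file into numbered batch files.
# $1 = input FASTA, $2 = records per batch, $3 = output file prefix
split_fasta() {
  awk -v size="$2" -v prefix="$3" '
    /^>/ {
      # Start a new output file every `size` header lines.
      if (n % size == 0) {
        if (out != "") close(out)
        out = sprintf("%s%04d.fasta", prefix, n / size + 1)
      }
      n++
    }
    out != "" { print > out }
  ' "$1"
}

# Submit each batch in turn with the question's blastn-short settings.
# Defined here but not called; invoke it once the batch files exist.
run_remote_batches() {
  for f in batch_*.fasta; do
    [ -e "$f" ] || continue
    blastn -task blastn-short -db nt -remote -query "$f" -out "${f%.fasta}.out"
    sleep 60   # arbitrary pause between submissions (assumption)
  done
}
```

For example, `split_fasta all_sequences.fasta 100 batch_` would write batch_0001.fasta, batch_0002.fasta and so on, after which `run_remote_batches` works through them one at a time. Here all_sequences.fasta stands in for the real query file.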

ADD REPLY • link 6.2 years ago by GenoMax 141k