How to filter BLAST database before search?
5 weeks ago
asumani ▴ 70

Hi,

I want to exclude certain sequences from a BLAST search. The search takes longer when the db is huge, and I guess bigger db might inflate Hit score. My solution was to use the -entrez_query option with the nt database available on our server. But -entrez_query requires the -remote option, which is incompatible with searching a database stored on the server. To get around this, I query the NCBI nt database remotely instead of the copy downloaded to our server.

Here is the command. It fails on our server, although it should work, because it works on my laptop:

nohup tblastn -query NP_040593.1.fa \
-db nt \
-remote \
-entrez_query "Viruses[ORGN] NOT (SYNTHETIC[TI] OR ENVIRONMENTAL[TI] OR PATENT[TI]) NOT (UNVERIFIED[KYWD] OR STANDARD_DRAFT[KYWD] OR VIRUS_LOW_COVERAGE[KYWD] OR VIRUS_AMBIGUITY[KYWD])" \
-outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qframe sframe qcovs qcovhsp' \
-out tblastn_allFiltered.out \
-export_search_strategy export.txt &> nohup.out &

When I run this exact code in our server, I get this error:

Error: NCBI C++ Exception:
    T0 "/home/ross/ncbi-blast-2.10.0+-src/c++/src/serial/rpcbase.cpp", line 233: Error: ncbi::CRPCClient_Base::x_Ask() - Failed to receive reply after 3 tries

I use an NCBI API key, which is supposed to allow 10 requests/second. I obtained the key and exported it in ~/.bash_profile.

So, why is this not running in the cluster? Hope someone will help me here!

Thanks a million,

Asuman

BLAST Entrez EntrezQuery • 263 views

Do you get the error right away or after some time?


Here is the timing. When I run it on the server, BLAST exits with the error above and nohup reports Exit 255:

real 1m39.167s
user 0m0.192s
sys 0m0.014s

When I run the same command on my local machine, I get this:

real 4m12.736s
user 0m0.307s
sys 0m0.059s

Hence, it is likely a server connection problem. If there is a way to filter the database search without EntrezQuery, I guess it would solve the problem. But how else can I filter?


If you are running on a cluster, is it possible to submit this as a job to a job scheduler, without the nohup?


I corrected it: I meant server, not cluster. Some old habits. It should be much easier to filter a database locally than to go through Entrez, which requires remote access and gives the connection error. But what are the alternatives?

5 weeks ago
Mensur Dlakic ★ 20k

I think you could make your life easier by compiling a custom local database. Most viral sequences are short, and it is relatively fast to download them despite there being thousands of entries. This program can do that in under an hour, provided that you have a good internet connection:

https://github.com/pirovc/genome_updater

This would be my second recommendation:

https://github.com/kblin/ncbi-genome-download

When all the sequences are downloaded, the next step is to concatenate them and make a BLAST-indexed database using makeblastdb.
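
A minimal sketch of those last two steps, under a few assumptions that are not from the thread: the download tool left gzipped FASTA files named *.fna.gz under a directory called viral_genomes, and the names viral_custom.fna and viral_custom are placeholders. The helper functions combine_fasta and build_viral_db are hypothetical names used only for illustration.

```shell
# Merge every gzipped FASTA file found under a directory into one plain FASTA file.
# Assumes the downloads end in .fna.gz; adjust the pattern to match your files.
combine_fasta() {
    local src_dir="$1" out_fasta="$2"
    find "$src_dir" -type f -name '*.fna.gz' -exec gzip -dc {} + > "$out_fasta"
}

# Index the merged FASTA as a nucleotide BLAST database (tblastn searches nucleotide dbs).
# -parse_seqids keeps the original accessions; drop it if makeblastdb rejects an identifier.
build_viral_db() {
    local fasta="$1" db_name="$2"
    makeblastdb -in "$fasta" -dbtype nucl -parse_seqids \
        -title "Custom viral sequences" -out "$db_name"
}

# Example run, executed only if the (assumed) download directory and BLAST+ are present.
if [ -d viral_genomes ] && command -v makeblastdb >/dev/null 2>&1; then
    combine_fasta viral_genomes viral_custom.fna
    build_viral_db viral_custom.fna viral_custom
fi
```

The search can then point at the local database with -db viral_custom, with no need for -remote or -entrez_query. Any filtering the Entrez query did beyond "viruses only" would have to be applied when choosing what to download.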

"I guess bigger db might inflate Hit score"

You may be thinking about E-values getting larger with increased database size. Bit scores, like raw scores, are the same regardless of database size.
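
To make the scaling concrete: in the standard BLAST statistics the E-value is roughly E = m * n * 2^(-S'), where S' is the bit score and m and n are the query and database lengths. The sketch below is only a rough illustration with made-up lengths; it ignores the effective-length corrections BLAST applies, and the function name evalue is hypothetical.

```shell
# Expected number of chance hits for a given bit score: E = m * n * 2^(-S')
# $1 = query length (m), $2 = total database length (n), $3 = bit score (S')
# The lengths are uncorrected, illustrative values, not what BLAST computes internally.
evalue() {
    awk -v m="$1" -v n="$2" -v s="$3" 'BEGIN { printf "%.3e\n", m * n * 2 ^ (-s) }'
}

evalue 300 1000000000 50      # smaller database
evalue 300 100000000000 50    # database 100 times larger, same bit score
```

For the same alignment and bit score, the second E-value is 100 times larger than the first, purely because of database size. If E-values need to stay comparable between a custom database and nt, BLAST+ has a -dbsize option that sets the effective database length used in this calculation.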
