Question

remote tblastn database name(s)


11 months ago

timothy.kirkwood ▴ 140

Hello,

I have ~2000-20000 protein queries, and I wish to find homologs of them in NCBI's microbial genomes. From what I can tell, vanilla BLASTP doesn't do this: it just gives hits in a protein DB, and these are not linked to a specific nucleotide DB entry. Please let me know if this is incorrect. Also, given the number of queries, I'm doing this with BLAST+ v2.12.0+ (not the web form).

I've considered (i) downloading genomes from ftp, making a local db (assuming the genomes are annotated), and BLASTP-ing that or (ii) getting a list of protein accessions and trying to link them to a genome entry with an entrez pipeline (similar to here, but different use case), but the easiest option seems to be a simple TBLASTN of my queries with the -remote flag.

Assuming there isn't a better option for achieving the above, can I clarify what the database name should be for TBLASTN? Is it still nr? The web service gives the DB as Representative genomes (ref_prok_rep_genomes). I have tried ref_prok_rep_genomes, nr and nt, and all are very slow. I've ended up cancelling jobs for a single query because they were taking much longer than I would expect, which leads me to suspect an error on my end. Note: I assume the spaces mean that Representative genomes is not a valid name.

Any suggestions? My cmds:

tblastn -task tblastn-fast -query "path/to/single_query.txt" -remote -db nr -out "path/to/results.xml" -outfmt 5 -evalue 0.00005

tblastn -task tblastn-fast -query "path/to/single_query.txt" -remote -db ref_prok_rep_genomes -out "path/to/results.xml" -outfmt 5 -evalue 0.00005

tblastn -task tblastn-fast -query "path/to/single_query.txt" -remote -db nt -out "path/to/results.xml" -outfmt 5 -evalue 0.00005
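The three commands above differ only in the `-db` value, so one way to compare databases is to generate the command from a single helper. This is a minimal sketch, not a recommended pipeline: the function name `build_tblastn_cmd` is made up for illustration, and only flags that already appear in the commands above are used. Note that tblastn searches a translated nucleotide database, while nr is NCBI's protein collection, so nt and ref_prok_rep_genomes are the likelier candidates here.

```python
import shlex

def build_tblastn_cmd(query, db, out, evalue=0.00005, remote=True):
    """Build the tblastn argument list from the post for one query file.

    Hypothetical helper: it only assembles the arguments, it does not run BLAST.
    """
    cmd = [
        "tblastn",
        "-task", "tblastn-fast",
        "-query", query,
        "-db", db,
        "-out", out,
        "-outfmt", "5",          # XML output, as in the original commands
        "-evalue", str(evalue),
    ]
    if remote:
        cmd.append("-remote")    # send the search to NCBI instead of a local DB
    return cmd

cmd = build_tblastn_cmd(
    "path/to/single_query.txt",
    "ref_prok_rep_genomes",
    "path/to/results.xml",
)
print(shlex.join(cmd))
```

With BLAST+ installed, the list can be handed to `subprocess.run(cmd, check=True)`; building it as a list avoids shell-quoting problems with paths that contain spaces.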

Cheers!

blast database tblastn • 771 views

ADD COMMENT • link updated 11 months ago by Mensur Dlakic ★ 27k • written 11 months ago by timothy.kirkwood ▴ 140

score 1 · Answer 1 · 2023-05-10


11 months ago

Istvan Albert 100k

You should try a faster BLAST replacement like DIAMOND:

https://github.com/bbuchfink/diamond

There might be other options as well.



Hello, thanks for your reply. Isn't diamond protein to protein alignment? I am trying to find protein homologs that are linked to cognate nucleotide sequences (ideally without having to download huge databases from FTP), so I need something that can translate proteins for alignment to nucleotide sequences (unless one of the other approaches in my post makes more sense!).

Reply • 11 months ago by timothy.kirkwood ▴ 140


DIAMOND is a sequence aligner for protein and translated DNA searches.

But you are correct that it needs a protein database, so it runs blastp and blastx, not tblastn.

You could still look up the nucleotides against the proteins; if it is indeed 10,000x faster, then it is worth it.

The remote NCBI BLAST is not designed to align a large number of sequences against NR. The process is very time-consuming, and they can't provide that much computing power with no restrictions.

If you want to run batch programs, your best bet is to run them locally with an accelerated tool.
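Whichever direction the local search runs, the hits still have to be collapsed into a per-query list before any accessions can be linked back to genome records. A minimal sketch of that step follows, assuming the aligner writes BLAST-style tabular output with the 12 default columns (format 6). The function name and the accessions in the example are made-up placeholders, not real results.

```python
import csv
import io
from collections import defaultdict

# Assumed column order: the 12 default fields of BLAST-style tabular output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def hits_by_query(handle, max_evalue=5e-5):
    """Map each query ID to the subject IDs whose e-value passes the cutoff."""
    hits = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(COLUMNS, row))
        if float(rec["evalue"]) <= max_evalue:
            hits[rec["qseqid"]].append(rec["sseqid"])
    return dict(hits)

# Placeholder rows for illustration only.
example = io.StringIO(
    "q1\tWP_000001.1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t400\n"
    "q1\tWP_000002.1\t30.0\t80\t50\t2\t5\t84\t10\t90\t0.5\t35\n"
    "q2\tWP_000003.1\t75.0\t150\t37\t1\t1\t150\t1\t150\t2e-60\t210\n"
)
result = hits_by_query(example)
print(result)
```

The default cutoff mirrors the `-evalue 0.00005` used in the original commands; the resulting subject IDs are what an Entrez-style lookup would take as input.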

Reply • 11 months ago by Istvan Albert 100k

score 1 · Answer 2 · 2023-05-10

Regardless of how this is done, it requires a large amount of resources. NCBI's remote searches are not primarily intended to support queries of thousands of proteins against nr/nt, so they will not let you query many proteins at the same time. I don't know of a way to do this with NCBI without spending a massive amount of time.

It won't work locally either, unless you have a dedicated cluster that will simultaneously run hundreds of these jobs just for you over several weeks.

I can't help but think that there must be a better way to get the information you want without doing all these searches. For example, why not cluster these proteins, make HMMs from the clusters, and search databases that way? If the number of clusters is an order of magnitude smaller than your total number of proteins, that reduces the number of searches by a similar factor.

Another possibility is to download only the bacteria+archaea entries from UniProt and cluster them at 90% identity. All of UniProt clustered at 90% is ~171 million sequences, so I suspect a similar prokaryote database should be no more than 100 million sequences.

https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/taxonomic_divisions/
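To see how much the clustering idea would actually save, the reduction factor can be computed from a cluster table before committing to any searches. The sketch below assumes a two-column, tab-separated table of representative ID and member ID, which is a layout some clustering tools can emit. The function names and the IDs are hypothetical.

```python
from collections import defaultdict

def load_clusters(lines):
    """Group member IDs under their cluster representative.

    Assumes each non-empty line is 'representative<TAB>member'.
    """
    clusters = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rep, member = line.split("\t")
        clusters[rep].append(member)
    return dict(clusters)

def reduction_factor(clusters):
    """Proteins per cluster on average, i.e. how many times fewer searches."""
    n_members = sum(len(members) for members in clusters.values())
    return n_members / len(clusters)

# Toy input: six proteins falling into three clusters.
example = ["repA\tp1", "repA\tp2", "repA\tp3", "repB\tp4", "repB\tp5", "repC\tp6"]
clusters = load_clusters(example)
print(len(clusters), reduction_factor(clusters))
```

In the toy input, six proteins collapse into three clusters, so searching one profile per cluster would halve the number of searches. The order-of-magnitude saving mentioned above would correspond to a factor near 10.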

Lastly:

https://www.ncbi.nlm.nih.gov/protfam