Most efficient way to run Diamond against a very large database (e.g., NCBI's NR)?
1
0
Entering edit mode
12 months ago
O.rka ▴ 710

I have downloaded the entire NR from NCBI and created one giant DIAMOND database that I query. I'm wondering whether it would be more computationally efficient to break NR into about 100 smaller databases and query them individually.

Would this help with the resource requirements and compute time?
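
For reference, a minimal sketch of the current single-database approach, assuming nr.gz was downloaded from https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ (file names here are placeholders):

# Build one large DIAMOND database from the NR FASTA
diamond makedb --in nr.gz -d nr

# Query the combined database
diamond blastp -d nr -q queries.fasta -o matches.tsv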

protein annotation alignment diamond nr • 2.0k views
ADD COMMENT
1
Entering edit mode

Keep in mind the potential effect on e-values of splitting a database into chunks and then combining the results, discussed here: Blast E-Value To Database Size. E-values scale with database size, so a hit against a 1/100th chunk gets an e-value roughly 100x smaller (i.e., better-looking) than the same hit against the full database. While that discussion is focused on NCBI BLAST, I assume the same is true for DIAMOND.
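
If you do split, one way to keep e-values comparable is to tell each chunk search the effective size of the full database. A sketch, assuming DIAMOND's --dbsize option (the analogue of BLAST's -dbsize; the letter count below is a placeholder, not the real size of nr):

# Search one chunk but compute e-values as if against the full database
diamond blastp -d nr_chunk_01 -q queries.fasta --dbsize 200000000000 -o matches_01.tsv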

ADD REPLY
1
Entering edit mode

then I create a giant diamond database that I query.

That is no longer needed. Recent DIAMOND versions can now use pre-formatted NCBI databases.
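
A minimal sketch of that workflow, assuming DIAMOND >= 2.0.9 and a pre-formatted BLAST nr database already extracted under the prefix nr:

# One-time preparation of the BLAST database for use with DIAMOND
diamond prepdb -d nr

# Then search it directly, no makedb needed
diamond blastp -d nr -q queries.fasta -o matches.tsv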

NCBI now offers a clustered nr database for web searches, though it is not yet downloadable for local use.

ADD REPLY
0
Entering edit mode

This will save me a lot of time and compute resources! As long as it contains the taxonomy info, I should be good to go. Any word on when it will be available to the public?

ADD REPLY
1
Entering edit mode

The normal nr database contains taxonomy info; I don't know if the clustered DB will include it. It is available via the web interface now, so you can try a sequence to confirm.


ADD REPLY
0
Entering edit mode

In terms of using pre-formatted NCBI databases, would we just download all of the NR db files individually, from https://ftp.ncbi.nlm.nih.gov/blast/db/nr.00.tar.gz and https://ftp.ncbi.nlm.nih.gov/blast/db/nr.01.tar.gz through https://ftp.ncbi.nlm.nih.gov/blast/db/nr.66.tar.gz, and then give diamond the prefix? For example,

mkdir -p ncbi_nr/
wget -P ncbi_nr/ https://ftp.ncbi.nlm.nih.gov/blast/db/nr.00.tar.gz
wget -P ncbi_nr/ https://ftp.ncbi.nlm.nih.gov/blast/db/nr.01.tar.gz
...
wget -P ncbi_nr/ https://ftp.ncbi.nlm.nih.gov/blast/db/nr.66.tar.gz

# Decompress the archives
for f in ncbi_nr/nr.*.tar.gz; do tar -xzf "$f" -C ncbi_nr/; done

diamond blastp -d ncbi_nr/nr -q queries.fasta -o matches.tsv

Would it be that type of usage?

I also see the prepdb command, but I'm not sure whether it has to be run on each component of nr: https://github.com/bbuchfink/diamond/wiki

ADD REPLY
0
Entering edit mode

Is there a way to run DIAMOND against the online NR database without downloading it to the local computer?

ADD REPLY
0
Entering edit mode

No, there is not.

ADD REPLY
0
Entering edit mode

I think it depends on the speed of your local disks and the amount of memory. On a single node, breaking up the database doesn't sound like a good idea, and might not even be feasible, as you would likely run into I/O problems. If you have access to a cluster with fast disks and can run these processes on independent nodes without worrying about memory and disk I/O, I suspect there could be some speed-up. I would still think that breaking it into 5-10 parts would be more productive and would help avoid the I/O bottleneck. A sketch of that scenario is below.
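
A hedged sketch of that cluster scenario, with one DIAMOND job per database chunk (nr_part_1 through nr_part_8 are hypothetical chunk prefixes; on a real cluster each search would be a separate scheduler job rather than a backgrounded process):

# Launch one search per chunk, then wait for all of them to finish
for i in 1 2 3 4 5 6 7 8; do
    diamond blastp -d nr_part_${i} -q queries.fasta -o matches_part_${i}.tsv &
done
wait

# Note: e-values from the smaller chunks need correcting for database size,
# as discussed in the e-value comment above.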

ADD REPLY
2
Entering edit mode
12 months ago
Asaf 10k

The data structure used by DIAMOND is a table of k-mers and, for each k-mer, a list of the sequences it appears in. Since k-mers in the nr database are shared across many sequences, you can expect a big overlap between any two 1/100th chunks of nr, so the resulting indexes will not be 1/100 the size of the complete index but much, much bigger.

In addition, it's a good idea to run all your queries together, as the queries are also indexed and search time does not scale linearly with the number of queries. A sketch of this is below.
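
A minimal sketch of batching, assuming DIAMOND's -b/--block-size and -c/--index-chunks options, which trade memory for speed (the values below are illustrative, not recommendations):

# Combine all query files into a single run
cat sample_*.fasta > all_queries.fasta

# A larger block size and fewer index chunks use more RAM but run faster
diamond blastp -d nr -q all_queries.fasta -b 8 -c 1 -o matches.tsv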

ADD COMMENT
