"No space left on device" error when making DIAMOND database, alternative download method?
1
0
Entering edit mode
8 weeks ago

The computer I am using has 32 GB RAM and 12 cores, with about 200 GB of empty space remaining on the C drive.

I downloaded "nr.gz" from the NCBI BLAST FTP site and DIAMOND from GitHub. From the folder containing diamond.exe, I ran the command "diamond makedb --in [path to nr.gz] -d [path to my downloads]".

The process ran fine for a while, but eventually I got the error, "No space left on device." Since I still had hundreds of GB of remaining empty space on the C drive, I assume this is an issue with the RAM.

So is there a different way to get an nr database pre-built for DIAMOND? Or a way to get around the RAM issue without just using a computer that has higher RAM?

Thanks in advance.

database diamond • 807 views

diamond makedb --in [path to nr.gz] -d [path to my downloads].

Did you get the nr.gz that is in FASTA format?

Did you see this note in the README at the NCBI FTP site? The file you downloaded is from April 2024. For your use case it may not matter, but ...

In April 2024, the BLAST FASTA files in this directory will no longer be available. You can easily generate FASTA files yourself from the formatted BLAST databases by using the BLAST utility blastdbcmd that comes with the standalone BLAST programs.
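
For reference, a minimal sketch of what that README note describes, assuming the standalone BLAST+ programs are installed and a preformatted `nr` BLAST database has already been downloaded and unpacked. The database name and output file name are placeholders, and the helper function is hypothetical. The sketch only builds and prints the command; nothing is executed unless you pass the list to `subprocess.run` yourself. The resulting FASTA file is very large, so check free disk space first.

```python
import shlex

def blastdbcmd_dump_command(db_name, out_fasta):
    """Build the argument list that dumps every entry of a BLAST database as FASTA.

    db_name and out_fasta are placeholders chosen for illustration.
    """
    return ["blastdbcmd", "-db", db_name, "-entry", "all", "-out", out_fasta]

cmd = blastdbcmd_dump_command("nr", "nr.fasta")
# Print the command as it would be typed in a terminal.
print(shlex.join(cmd))
```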

As for this

Or a way to get around the RAM issue without just using a computer that has higher RAM?

No, because even if you get a pre-formatted database from somewhere, the actual alignments will again require more than 32 GB of RAM.

8 weeks ago
Mensur Dlakic ★ 30k

The process ran fine for a while, but eventually I got the error, "No space left on device." Since I still had hundreds of GB of remaining empty space on the C drive, I assume this is an issue with the RAM.

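I don't think you should assume that. The error message is clear: it is about disk space, not memory. DIAMOND could have tried to write a file, failed, and then deleted the partial file, which would leave the free space looking unchanged afterwards.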
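
One way to test this is to look at free space on every volume the run could write to, not only the C drive as a whole. Below is a minimal sketch using only the Python standard library. Which locations matter is an assumption: it checks the current directory (standing in for wherever the `-d` output goes) and the system temporary directory. Replace these with the real paths from your command.

```python
import shutil
import tempfile

def free_space_gib(path):
    """Return the free space, in GiB, on the volume that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3

def report_free_space(paths):
    """Map each path to the free space (GiB, one decimal) of its volume."""
    return {p: round(free_space_gib(p), 1) for p in paths}

# Assumed locations: "." stands in for the database output folder,
# and the system temp directory is checked in case temporary files land there.
locations = [".", tempfile.gettempdir()]
for path, gib in report_free_space(locations).items():
    print(f"{path}: {gib} GiB free")
```

If one of these reports far less free space than expected (a different drive, a quota, or a small temp partition), that would be consistent with the error message.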

It appears that you are still going on with your attempts to BLAST your large database against the NR. I don't mean to be pushy, but it won't work with the resources you have. That's even with DIAMOND being slightly faster than BLAST. You will just end up opening one thread after another with various problems you are likely to encounter along the way.

If you absolutely feel like you want to do this, you were already given a suggestion to use the cluster_nr database. That would be nr clustered at 90% identity, which for practical purposes has the same functionality as nr at 60-70% of its size.

I feel like giving it one more try: with the resources you have (both memory and disk space, and the fact that you are doing this on a Windows computer), this will not work in a reasonable amount of time if you are still trying to search 23,000 proteins against nr, or really any database similar in size to nr. I implore you to read through the other suggestions that were given to you, as that will save you a lot of time while also resulting in a better outcome. This is all assuming your goal is to annotate a genome, which you have yet to confirm.


Alright, I will use the cluster_nr database. I understand that it will still take weeks. About my goal, I don't know if this is what is considered annotation of a genome, but I am trying to find the closest protein sequences from the database for each gene.


I understand that it will still take weeks.

You are again reading the information selectively. It would take on the order of weeks if you had something like 512 GB RAM and 50+ cores, which you don't have. For your setup the estimate was on the order of many months to years. It is highly unlikely that you would keep a computer running without interruption for that long, not to mention that you wouldn't be able to do almost anything else on it. Even if it were weeks, are you prepared to completely surrender your computer and forget about any other activity for weeks?

I am trying to find the closest protein sequences from the database for each gene.

As suspected, you are going about it the wrong way. Presumably you know what organism you have. If you find several related species at NCBI, download their proteomes, and concatenate them into a single protein database, you would be guaranteed to get the same information in orders of magnitude less time. The beauty of this approach is that it would likely work even with your computer setup, as the protein database would be significantly smaller.
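
A minimal sketch of that workflow, assuming the proteomes have already been downloaded as protein FASTA files (plain or gzipped). All file names, the function names, and the choice of reporting a single best hit per query are placeholders for illustration. The DIAMOND commands are only built as argument lists and are not run here.

```python
import gzip
from pathlib import Path

def concatenate_proteomes(fasta_paths, combined_path):
    """Append each FASTA file (plain or .gz) to one combined, uncompressed FASTA."""
    with open(combined_path, "wb") as out:
        for path in fasta_paths:
            opener = gzip.open if str(path).endswith(".gz") else open
            last = b"\n"
            with opener(path, "rb") as src:
                # Copy in 1 MiB chunks so large proteomes are never fully in memory.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            # Make sure the next file's first header starts on its own line.
            if last != b"\n":
                out.write(b"\n")
    return combined_path

def diamond_commands(combined_fasta, db_prefix, query_fasta, out_tsv, threads=12):
    """Build (not run) the makedb and blastp argument lists for DIAMOND."""
    makedb = ["diamond", "makedb", "--in", str(combined_fasta), "-d", db_prefix]
    blastp = [
        "diamond", "blastp",
        "-d", db_prefix,
        "-q", query_fasta,
        "-o", out_tsv,
        "--threads", str(threads),
        "--max-target-seqs", "1",  # assumption: keep only the closest hit per query
    ]
    return makedb, blastp
```

To use it, call `concatenate_proteomes` on the downloaded files and then pass each list from `diamond_commands` to `subprocess.run(..., check=True)`.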

I will stop here because it might feel like I am pestering you, even though that's not the intent.
