Trouble with Standalone BLAST Tool (running for 2 hours and no output)
Asked 9 weeks ago

I'm trying to blastx a FASTA file with a little over 23,000 genes in it. I have downloaded the full nr database and unzipped all the tar.gz files, and I have downloaded the standalone BLAST+ tool. This is the command I am running:

blastx -query "C:\Program Files\NCBI\blast-2.17.0+\bin\Ha_Trinity_denovo_assembly.fasta" -db nr -outfmt 10 -out "Ha_blastx_results.csv" -max_target_seqs 100 -num_threads 4

I am using the correct path for the FASTA file, and I am running it in a command line from the folder that contains the database. But after waiting for 2 hours, nothing has appeared in the .csv file. There is no error message in the command prompt window; it just shows that the command is still running. I also tried this with a single gene (362 bp) and waited 10 minutes, and with a single gene writing the output to a text file instead of a CSV. The problem is the same every time. Any idea what I might be doing wrong?

Sorry if this is a very basic question; I have looked everywhere online and can't find an answer.

Tags: genome fasta blastx blast
Comment:

How much memory does your Windows machine have? With the nr database you will need at least 50-60 GB of RAM.
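
If you're not sure, you can check the installed RAM from a command prompt; on a standard Windows setup this prints it directly:

systeminfo | findstr /C:"Total Physical Memory"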

Reply:

Hi, thanks for the reply. I currently have the nr database on a hard drive which has about 450 GB remaining. My computer says I have 16 GB of installed RAM, and the local disk has about 160 GB remaining. I'm not great with computers, so I don't know if you're asking about my installed RAM or my disk space, sorry about that.

Also, some additional information that I forgot to include in the original post: my computer has 4 cores, and I'm using all of them for this.

Answer (GenoMax 154k, 9 weeks ago):

With 16 GB of RAM I don't think that blastx is going to work with nr. You could wait and see if a result is eventually produced, but with ~23,000 input sequences you will run out of patience first. You may want to find alternate hardware to do this search.
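
If you want to confirm that your installation works end to end before hunting for bigger hardware, one option is a quick test against a much smaller protein database such as swissprot. A sketch, assuming Perl is available for the update_blastdb.pl script that ships with BLAST+, and using a hypothetical single_gene.fasta test file:

update_blastdb.pl --decompress swissprot
blastx -query single_gene.fasta -db swissprot -outfmt 10 -out test.csv -num_threads 4

If that returns hits within a few minutes, your setup is fine and nr itself is the bottleneck.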

Comment:

Alternate hardware as in a cluster or the cloud. There is no single machine that can do this in a sensible time frame for OP, at least not in a budget-friendly way.

Comment:

With alternate hardware/cloud, using the cluster_nr database (which will drop the memory requirement somewhat) may be another option. At that point, using DIAMOND may make more sense as well.
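
For reference, a typical DIAMOND run looks roughly like this (a sketch, not drop-in commands: nr.gz stands in for the NCBI protein FASTA download, and --outfmt 6 gives tabular output similar to BLAST's):

diamond makedb --in nr.gz --db nr_diamond
diamond blastx --db nr_diamond --query Ha_Trinity_denovo_assembly.fasta --out matches.tsv --outfmt 6 --max-target-seqs 100 --threads 4

DIAMOND is on GitHub at https://github.com/bbuchfink/diamond.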

Answer (Mensur Dlakic ★ 30k, 9 weeks ago):

Any idea what I might be doing wrong?

Chances are you are not doing anything wrong; you simply have inadequate resources. Since this seems to work with a single short sequence, I would guess that with longer sequences and your resources it will take a very long time. I don't want to be gloomy, but it is very unlikely that you can do this project on your computer in a reasonable period of time. It would probably take days to weeks to search 23,000 sequences even if you had 512 GB of RAM and 50+ cores. I can't even predict how long it would take with your resources, but I think years.

I think you will have to split your sequences into several groups and run them simultaneously on a computer cluster with much greater resources. Better yet, translate your ORFs into proteins first and search those; that will speed things up at least to some degree. One possible toolchain is sketched below.
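
This sketch uses TransDecoder (commonly used to extract and translate ORFs from Trinity assemblies) and seqkit for splitting; both tool choices are assumptions on my part, and other tools work just as well:

TransDecoder.LongOrfs -t Ha_Trinity_denovo_assembly.fasta
TransDecoder.Predict -t Ha_Trinity_denovo_assembly.fasta
seqkit split2 -p 10 Ha_Trinity_denovo_assembly.fasta.transdecoder.pep

Each resulting protein chunk can then be searched with blastp instead of blastx.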

Comment:

512 GB of RAM and 50+ cores would help significantly; I'd say weeks is a better scale. But yeah, 16 GB of RAM will crawl its way to completion, getting through maybe 3-5 sequences in ~48 hours. As far as 23K sequences go, you're right, one might as well give up.

Reply:

If I used 3 virtual machines and split the FASTA file between them, and translated the full file from nucleotides to proteins before starting, how much time do you think it would take?

Reply:

how much time do you think it would take?

You will need the same resources, RAM and cores, for all three machines. We don't know the length of the sequences in your dataset or the configuration of the VMs you will use, so you will need to experiment with a smaller subset yourself to get an idea.

That said, what is your aim here? Based on the file name, it sounds like this is some sort of assembled transcriptome. Do you need to blast against the entire nr? You could use a smaller set of sequences and/or a close relative to do this, and things should go much quicker (see the sketch below).
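
As a sketch of the close-relative route (relative_proteins.fasta is a placeholder for a protein set you would download yourself, e.g. a sequenced relative's proteome from NCBI):

makeblastdb -in relative_proteins.fasta -dbtype prot -out relative_db
blastx -query Ha_Trinity_denovo_assembly.fasta -db relative_db -outfmt 10 -out Ha_vs_relative.csv -max_target_seqs 5 -num_threads 4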

Also consider using DIAMOND instead of blast+, which is more efficient for this sort of thing. You should expect similar compute needs, and since you appear to have access to cloud compute this would be a viable option. DIAMOND can use pre-formatted BLAST indexes, so you can re-use the ones you already downloaded.
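
If I remember right, recent DIAMOND versions can prepare an existing BLAST database for their own use; a sketch, run from the folder holding the nr files:

diamond prepdb --db nr
diamond blastx --db nr --query Ha_Trinity_denovo_assembly.fasta --out matches.tsv --outfmt 6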

Reply:

what is your aim here

Good point - this does seem like an XY problem. If they're going for an assembled transcriptome, they're using the wrong tools; velvet/oases, as far as I know (from more than a decade ago), might help.

Reply:

If your goal is to annotate the genome, BLAST is inefficient both in terms of time and in its ability to do the job well. Most modern approaches to this problem use databases of profiles or hidden Markov models (HMMs). For example, RPS-BLAST can annotate 23,000 proteins against the NCBI profile database in a matter of hours, and it is almost guaranteed to produce better results than a brute-force BLAST search. Similarly, you can download an HMM database from InterPro and annotate all your proteins with HMMER in a matter of hours. I suggest you think hard about whether BLASTing this many proteins is actually needed to achieve your final goal before you invest more effort.
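
Rough sketches of both options (proteins.faa is a placeholder for your translated sequences; the CDD profile database and Pfam-A.hmm have to be downloaded first):

rpsblast -query proteins.faa -db Cdd -evalue 1e-5 -outfmt 6 -out cdd_hits.tsv
hmmpress Pfam-A.hmm
hmmscan --cpu 4 --domtblout pfam_hits.domtbl Pfam-A.hmm proteins.faa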

