Running blastp locally through the NCBI nr database is taking weeks to run.
2
1
3.2 years ago

Hi all

I think I already know the answer, but I am finding it impossible to find a concrete answer online anywhere. And I want to ask anyway.

I am running a eukaryotic proteome (15 MB, >30,000 sequences) through the most recent NCBI nr database (370 GB unzipped) locally using blastp. It has been running for 8 days so far, with no end in sight. Everything appears to be running, as far as I can tell from the system monitor, and the output file is slowly filling up (15 MB so far, after 8 days).
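
For reference, the whole run boils down to a single blastp call of roughly this shape (the file names, E-value cutoff and output format below are illustrative placeholders, not necessarily the exact flags I used):

    # quad-core box, preformatted nr database on the BLASTDB path
    blastp -query my_proteome.fasta -db nr -num_threads 4 \
        -evalue 1e-5 -outfmt 6 -max_target_seqs 25 -out blastp_vs_nr.tsv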

I am not running this on a powerful university computer. My home desktop is a Linux machine (Linux Mint) with 16 GB of RAM and a quad-core CPU.

Is this normal? (I know, I have read in many places that at least 50 GB of RAM may be needed for this to complete in about an hour.) What further time frame might I be looking at here? Should I stop it and beg some university to allow me to use their systems for a day?

I really don't want to kill it after 8 days; the information that will come back (if I don't kill it) is very important to me. But I cannot wait weeks.

Thanks in advance

assembly • 2.2k views
ADD COMMENT
0

Thanks, good to hear Diamond is a viable alternative for me. I have it loaded up and ready for use, as I was already thinking of using it. All my output files will go through Alienness, and Diamond is an option for that. I knew as soon as I pressed "go" on the local blastp search that I would regret it. Cheers
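
For anyone who lands on this thread later, the DIAMOND workflow I have queued up is roughly the following (nr.gz here is the protein FASTA download, not the preformatted BLAST database, and exact option spellings may differ between DIAMOND versions):

    # one-off: build a DIAMOND database from the nr protein FASTA
    diamond makedb --in nr.gz -d nr
    # the search itself, with tabular (format 6) output
    diamond blastp -d nr -q my_proteome.fasta -o diamond_vs_nr.tsv \
        --outfmt 6 --evalue 1e-5 --max-target-seqs 25 --threads 4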

ADD REPLY
0

Less than a month!! I have ten more proteomes to run!! I've killed it. Still, there was enough information in the output file to tell me I am on the right track. I am going to try the Diamond aligner. Thanks for replying with advice, guys.

ADD REPLY
1

What is your justification for using NR as your reference database, i.e. why do you need to compare your eukaryotic proteome against "all known" protein sequences? What is the question that you're trying to answer?

ADD REPLY
0

You are going to have the same exact problem with DIAMOND if you are planning to use nr as your reference.

ADD REPLY
0

@5heikki This work is part of an HGT discovery pipeline, so I do sort of need to run it through the nr database; it's standard practice for extrinsic HGT characterization. Genomax is right, DIAMOND is as slow as BLAST+ on my machine, anyway. I need a new machine.

ADD REPLY
1

Bacteria_forever: Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to the original question.

ADD REPLY
0

Is the HGT specific to your organisms? You could reduce your proteome a lot by excluding all the proteins that get a good hit against the proteome of their closest sequenced relative, no? If you want something that is many orders of magnitude faster than BLAST, then check out Mash. Creating a reference database would take quite a while though. I recently did a Mash all-vs-all of ~210k bacterial genomes; that took about 24 hours with 128 threads.
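
As a rough sketch of that filtering step (file names are placeholders, and seqkit is just one way to do the FASTA filtering):

    # find queries that already have a strong hit in the closest sequenced relative
    makeblastdb -in relative_proteome.fasta -dbtype prot -out relative
    blastp -query my_proteome.fasta -db relative -num_threads 4 -evalue 1e-10 \
        -outfmt "6 qseqid" -max_target_seqs 1 | sort -u > has_close_hit.txt
    # keep only the queries without such a hit, and send those against nr
    seqkit grep -v -f has_close_hit.txt my_proteome.fasta > reduced_proteome.fasta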

ADD REPLY
0

Thanks for all the advice, guys. Very useful.

ADD REPLY
2
3.2 years ago
Mensur Dlakic ★ 27k

For your computer configuration, this is normal. Sadly, it may take many more days.

You need a computer with enough memory to hold the whole NR database for this to complete in a reasonable time. I don't know exactly how much memory that is, but it should definitely be at least 64 GB, and I would say more likely 128 GB. And it won't be super fast even if you manage this memory upgrade, because >30,000 sequences is a lot. Think about it this way: even if a single protein search takes only 10 seconds (and it takes longer), that would be 300,000 seconds, or more than 83 hours. Honestly, I don't think there is much chance your present search will complete in less than a month.

You can do this much faster on your present configuration by running hmmscan from the HMMER package against the whole Pfam database. It is not the same as comparing to NR, but most of your proteins will still be annotated after the Pfam comparison. Another option is to use rpsblast from the BLAST package against the Conserved Domain Database (CDD), which should still be orders of magnitude faster than a plain BLAST search against NR.
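
As a rough sketch of both routes (database and file names are whatever you downloaded them as, and the cutoffs are just examples):

    # Pfam route: press the profile database once, then scan the proteome
    hmmpress Pfam-A.hmm
    hmmscan --cpu 4 --domtblout pfam_hits.domtbl Pfam-A.hmm my_proteome.fasta > /dev/null
    # CDD route: rpsblast against the preformatted Cdd profile database
    rpsblast -query my_proteome.fasta -db Cdd -num_threads 4 -evalue 1e-5 \
        -outfmt 6 -out cdd_hits.tsv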

ADD COMMENT
1
3.2 years ago
buchfink ▴ 250

Given the size of this database, running BLAST on a 4-core machine with 16 GB RAM can take a very long time. You should check the output file for how many queries have been processed. Alternatively, use the fast protein aligner Diamond (https://github.com/bbuchfink/diamond) or consider renting a cloud instance.
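
For example, depending on the output format you chose, one of these gives a rough progress count (file names are placeholders):

    # default pairwise output: count query headers written so far
    grep -c "^Query= " blast_output.txt
    # tabular output (-outfmt 6): count distinct query IDs with at least one hit so far
    cut -f1 blast_output.tsv | sort -u | wc -l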

ADD COMMENT
