Question: Blastn, need help to increase speed
10 days ago, chiachoong_leong93 wrote:

Hi all, I am facing some difficulties in blasting my de novo assembled unigenes.

I have about 85000 unigenes and was planning to blast them against the preformatted nt database from the NCBI FTP site.

I used this command to download the database and check whether it is up to date:

  # update_blastdb.pl --passive --decompress nt

Then, to blast my query (query3.fa, which has only 4 sequences) against the 90+ GB nt database, I used:

 # blastn -db nt -query query3.fa -task blastn -dust no -outfmt "6 delim=, qacc stitle sacc evalue bitscore qcovus pident" -max_target_seqs 1 -num_threads 4 -out results.txt

My CPU is an Intel i5-7300HQ, which has 4 cores and 4 threads, and I have 8 GB of RAM.

However, blasting just these 4 sequences took about 30 minutes, and my full set is about 85000 sequences. At this rate it would probably take about 1.5 years to blast all of them.
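For reference, an estimate like this can be reproduced with a simple linear extrapolation. This is only a rough sketch: it assumes run time scales linearly with the number of query sequences, which ignores the fixed cost of loading the database, and the function name `estimate_days` is made up for illustration.

```python
def estimate_days(n_test, minutes_for_test, n_total):
    """Linearly extrapolate a timed test run to the full query set."""
    minutes_per_seq = minutes_for_test / n_test
    return n_total * minutes_per_seq / (60 * 24)


# 4 sequences took about 30 minutes; the full set is about 85000 sequences.
days = estimate_days(n_test=4, minutes_for_test=30, n_total=85000)
print(f"{days:.0f} days, about {days / 365:.1f} years")
```

Under that assumption the figure comes out at a bit over a year, the same order of magnitude as the rough number quoted above.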

Is there no other way to speed this up other than using a more powerful CPU?

Would reformatting my query file, or even using the FASTA version of the nt database, help?

This is what my query file looks like (I have already deleted a big portion of each sequence to show it here):

>H42_1_(paired,_trimmed_pairs)_contig_1_consensus
CATCACCTCCAAGATCCGGCTTGTGAATTCAACTTGTCGCCCGGAGGCTTCCCAAATTCT
TAGACTGCGCGCCTGCCTAAGCCAGCTACCTAACAATATACCACTCTCATTGCACTCAAT
GATGTCTGCAGAGTCGGCGCGCTG

>H42_1_(paired,_trimmed_pairs)_contig_2_consensus
GCAGAACCGAGCTTCAAGCTCCAAGATCCGGCTTTTGAATTCAACTTGTCGCCTGGAGGC
TTCCCAAATTCTTAGACTGCGCGCCTGCCTGAGCCAGCTACTTAACAATATACCACCCCC
ATTGAACTCAATGATGTCTCAATCGAACGTGTAAGGCTTGGAGCTTGGAGCTTGAAGCTC
GGTTC

>H42_1_(paired,_trimmed_pairs)_contig_3_consensus
GAGGAATATGAATCCGGATAACAATATTACAATGATGCGATGTTTAACTGCTACTGCCTC
TTAACTATCAACGTCTACATAC

>H42_1_(paired,_trimmed_pairs)_contig_4_consensus
ACCGCCGGATGGGTCTGCAGAGAGGTTAACGAAAGTCGGTGCGGAGACGCCTTTCTCGCC
GCCGATA

Thank you very much in advance!

rna-seq blast+ blastn • 136 views

Your contigs look rather short. I almost had to look up "unigene", but these are de novo assembled transcripts, correct? You might want to check whether you can improve the assembly to increase the length of the contigs and reduce their number. This might not save you that much time, but it would make your results more informative. You should then ask whether BlastN is very informative here, because you will only catch very similar sequences, and waiting 1.5 years for that is maybe not worth it :) I would prefer getting alignments at the amino-acid level, but then you need BlastX, or DIAMOND, against NR. With DIAMOND you might even be able to finish the job on your hardware; for Blast you need either a cluster, or at least a multi-core machine, or a large cloud instance (which will be expensive).

Reply written 10 days ago by Michael Dondrup

"Your contigs look rather short,"

@Michael, the OP has said:

"I have already deleted a big portion of the sequence to show here"

Reply written 10 days ago by GenoMax

Ok, sorry, I didn't get that. Still it might be better to use blastx, diamond or a pipeline like trinotate.

Reply written 10 days ago by Michael Dondrup

Have you tried reducing your query set by clustering it at some high threshold to see how many representatives remain?

Reply written 10 days ago by 5heikki
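To illustrate the clustering idea in the comment above, here is a toy sketch of greedy clustering on shared k-mers. It is an assumption-laden illustration, not what a dedicated tool (such as CD-HIT-EST) does internally: the similarity measure (fraction of a sequence's k-mers found in a representative), the default `k=8` and `threshold=0.9`, and the function names are all chosen for the example, and the all-against-representatives loop would be slow for 85000 sequences.

```python
def kmers(seq, k=8):
    """Return the set of all k-length substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_cluster(seqs, threshold=0.9, k=8):
    """Greedy clustering, longest sequences first.

    A sequence joins the first representative that contains at least
    `threshold` of its k-mers; otherwise it becomes a new representative.
    Returns a dict mapping representative name -> list of member names.
    """
    reps = []      # (name, k-mer set) of each representative so far
    clusters = {}  # representative name -> member names
    ordered = sorted(seqs.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in ordered:
        ks = kmers(seq, k)
        for rep_name, rep_ks in reps:
            if ks and len(ks & rep_ks) / len(ks) >= threshold:
                clusters[rep_name].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            reps.append((name, ks))
            clusters[name] = [name]
    return clusters
```

The number of keys in the returned dictionary is the number of representatives that would remain to be searched.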
10 days ago, Michael Dondrup (Bergen, Norway) wrote:

So, I think there are a few steps you can take anyway:

  1. Identify the right search strategy for your application, likely BlastN or BlastX. One could argue that it is required to use both, but if your resources are limited, you might get more from a BlastX run in this case, even though it might run for even longer.

  2. If your search strategy is BlastX, then you can use DIAMOND or GhostX as a replacement. This is the only approach that will work on desktop hardware.

  3. Even if you have enough resources, like a 50+ CPU cluster, and want to run NCBI blast, it still pays off to optimize the search: database size matters, so if you have a eukaryote you can at least throw out bacterial taxa (and vice versa), or even more. Of course that also has its drawbacks, like not detecting contaminants.

  4. For BlastN, you can further use the task "megablast" that will speed up your search but only find highly similar matches.

  5. Using GNU parallel might further speed up your query over simply using -num_threads; see "Truly Parallel Blasts With Blast+" for further links, but your mileage may vary.
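As a rough illustration of the preparation step behind point 5, the query file can be split into smaller FASTA files that are then searched as separate jobs. This is a minimal sketch under stated assumptions: the function names, the `chunk` file prefix, and the chunk size are hypothetical, and it only does the splitting, not the job scheduling.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def chunked(records, size):
    """Group records into lists of at most `size` entries."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def write_chunks(fasta_path, size, prefix="chunk"):
    """Write chunks to prefix_0001.fa, prefix_0002.fa, ... and return the paths."""
    paths = []
    with open(fasta_path) as handle:
        for i, chunk in enumerate(chunked(read_fasta(handle), size), start=1):
            path = f"{prefix}_{i:04d}.fa"
            with open(path, "w") as out:
                for header, seq in chunk:
                    out.write(f">{header}\n{seq}\n")
            paths.append(path)
    return paths
```

Each resulting file could then be handed to its own blastn job, for example through GNU parallel. On a machine with little RAM, several simultaneous jobs against nt will compete for memory, so this may not help at all.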

Finally, your estimate of 1.5 years to complete does not take into account the significant startup time required for loading the NT/NR Blast database. So, in the end, the whole search might be a bit more efficient, but it will definitely still take too long.


If a complete/annotated genome of a close relative is available then it may still be possible for OP to use their own hardware and stay with blastn searches.

Does DIAMOND run with 8 GB of RAM? Even if it does, would it be possible for the OP to create DIAMOND indexes for nr with this hardware?

Reply written 10 days ago by GenoMax

Not sure how much RAM is required for that. Maybe a pre-built database is lying around somewhere?

Reply written 10 days ago by Michael Dondrup
10 days ago, GenoMax (United States) wrote:

Unfortunately, there is no way to speed this up with the hardware you have. If you have 85K sequences, you should find alternate hardware. If you are working with a specific species, then find the genome of a close relative to cut down on the search space.

10 days ago, Mensur Dlakic (USA) wrote:

With your hardware, specifically with your low memory, there is no way to make this substantially faster - see a recent discussion here on a similar topic. This will hold regardless of which program you use, because nt is a gigantic database and it will not fit into the memory you have, which translates into lots of disk swapping.

You already have many good suggestions, so I will add a couple that were not mentioned.

  • Use something like average nucleotide identity (ANI) to quickly compare your sequences with a collection of genomes. See an example here. It will not give you an answer on a per-sequence basis, but it will identify what collections of sequences are most similar to yours on a global level.
  • Use hashing algorithms for the comparison - see here and here for details.
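As a rough illustration of the hashing idea in the second bullet, here is a minimal MinHash-style sketch. It is an assumed, simplified scheme (a bottom-k sketch of hashed k-mers with a Jaccard estimate) and not necessarily what the linked tools implement; the function names and the defaults `k=8` and `size=64` are illustrative only.

```python
import hashlib


def kmers(seq, k=8):
    """Return the set of all k-length substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer):
    """Deterministic 64-bit hash of a k-mer."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def minhash_sketch(seq, k=8, size=64):
    """Bottom-k sketch: the `size` smallest k-mer hash values, sorted."""
    return sorted({_hash(km) for km in kmers(seq, k)})[:size]


def estimate_jaccard(sketch_a, sketch_b, size=64):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

Because only the small sketches are compared, many sequences or genomes can be screened quickly before running any alignment.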

To give you better suggestions - and there are other options - you would need to provide more details of what your sequences are and what is the minimum amount of information you are hoping to get. For example, there are other strategies to employ if you predict genes from your sequences and search with proteins instead.
