Hi all, I am having difficulty BLASTing my de novo assembled unigenes.
I have about 85,000 unigenes and was planning to BLAST them against the preformatted nt database from the NCBI FTP site.
I used this command to download the database and check whether it is up to date:
# update_blastdb.pl --passive --decompress nt
and this one to BLAST my query (query3.fa, which has only 4 sequences) against the 90+ GB nt database:
# blastn -db nt -query query3.fa -task blastn -dust no -outfmt "6 delim=, qacc stitle sacc evalue bitscore qcovus pident" -max_target_seqs 1 -num_threads 4 -out results.txt
My CPU is an Intel i5-7300HQ, which has 4 cores and 4 threads, and I have 8 GB of RAM.
However, BLASTing just these 4 sequences took about 30 minutes, and my full set is about 85,000 sequences. At this rate it would probably take about 1.5 years to BLAST all of them.
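For reference, the extrapolation can be checked with a few lines of Python. This is only a back-of-envelope sketch: it assumes run time grows linearly with the number of query sequences, which may be pessimistic, and the function name is made up for illustration.

```python
def extrapolate_runtime_days(total_seqs, sample_seqs, sample_minutes):
    """Linearly extrapolate run time from a small sample to the full query set."""
    minutes = total_seqs / sample_seqs * sample_minutes
    return minutes / (60 * 24)  # minutes -> days

# Numbers from the test run above: 4 sequences in about 30 minutes, 85,000 in total.
days = extrapolate_runtime_days(total_seqs=85_000, sample_seqs=4, sample_minutes=30)
print(f"{days:.0f} days, i.e. about {days / 365:.1f} years")
# prints: 443 days, i.e. about 1.2 years
```

So a bit over a year under this assumption, which is in the same ballpark as the rough figure above.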
Is there no way to speed this up other than using a more powerful CPU?
Would reformatting my query file, or even using the FASTA version of the nt database, help?
This is what my query file looks like (I have deleted a large portion of the sequence data to show it here):
>H42_1_(paired,_trimmed_pairs)_contig_1_consensus
CATCACCTCCAAGATCCGGCTTGTGAATTCAACTTGTCGCCCGGAGGCTTCCCAAATTCT
TAGACTGCGCGCCTGCCTAAGCCAGCTACCTAACAATATACCACTCTCATTGCACTCAAT
GATGTCTGCAGAGTCGGCGCGCTG
>H42_1_(paired,_trimmed_pairs)_contig_2_consensus
GCAGAACCGAGCTTCAAGCTCCAAGATCCGGCTTTTGAATTCAACTTGTCGCCTGGAGGC
TTCCCAAATTCTTAGACTGCGCGCCTGCCTGAGCCAGCTACTTAACAATATACCACCCCC
ATTGAACTCAATGATGTCTCAATCGAACGTGTAAGGCTTGGAGCTTGGAGCTTGAAGCTC
GGTTC
>H42_1_(paired,_trimmed_pairs)_contig_3_consensus
GAGGAATATGAATCCGGATAACAATATTACAATGATGCGATGTTTAACTGCTACTGCCTC
TTAACTATCAACGTCTACATAC
>H42_1_(paired,_trimmed_pairs)_contig_4_consensus
ACCGCCGGATGGGTCTGCAGAGAGGTTAACGAAAGTCGGTGCGGAGACGCCTTTCTCGCC
GCCGATA
Thank you very much in advance!
Your contigs look rather short. I almost had to look up "unigene", but these are de novo assembled transcripts, correct? You might want to check whether you can improve the assembly to increase the length of the contigs and reduce their number. This might not save you much time, but it would make your results more informative. Then you should ask whether BLASTN is really informative here, because you will only catch very similar sequences, and waiting 1.5 years for that is maybe not worth it :) I would prefer getting alignments at the amino-acid level, but for that you need BLASTX (or Diamond) against NR. With Diamond you might even be able to finish the job on your hardware; for BLAST you need either a cluster, or at least a multi-core machine, or a large cloud instance (which will be expensive).
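A minimal sketch of what the Diamond route could look like, written as a small Python helper that only assembles and prints the command lines rather than running them. The file names (nr.faa, unigenes.fa, diamond_hits.tsv) and the helper names are placeholders chosen for illustration; keeping one target per query and using 4 threads simply mirrors the BLASTN command in the question. Check the options against `diamond help` for the installed version before running anything.

```python
import shlex

def diamond_makedb_cmd(protein_fasta, db_prefix):
    """Command to build a Diamond database from a protein FASTA file."""
    return ["diamond", "makedb", "--in", protein_fasta, "--db", db_prefix]

def diamond_blastx_cmd(query_fasta, db_prefix, out_tsv, threads=4):
    """Command to search translated nucleotide queries against a protein database."""
    return [
        "diamond", "blastx",
        "--db", db_prefix,
        "--query", query_fasta,
        "--out", out_tsv,
        "--outfmt", "6",            # BLAST-style tabular output
        "--max-target-seqs", "1",   # one target per query, as in the question
        "--threads", str(threads),
    ]

# Placeholder file names (assumptions, adjust to your own files).
print(shlex.join(diamond_makedb_cmd("nr.faa", "nr")))
print(shlex.join(diamond_blastx_cmd("unigenes.fa", "nr", "diamond_hits.tsv")))
```

The printed lines can be pasted into a shell, or the lists can be passed to `subprocess.run(..., check=True)`.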
@Michael OP has said.
OK, sorry, I missed that. Still, it might be better to use BLASTX, Diamond, or a pipeline like Trinotate.
Have you tried reducing your query set by clustering it at some high identity threshold to see how many representatives remain?
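One way to try this, as a hedged sketch: CD-HIT-EST is one tool commonly used for this kind of nucleotide clustering (that choice is an assumption here, since no specific tool is named above). The Python helper below assembles a cd-hit-est command and counts the clusters in the resulting .clstr file, which gives the number of representatives. The 0.95 identity threshold, the file names, and the function names are illustrative assumptions.

```python
import shlex

def cd_hit_est_cmd(in_fasta, out_fasta, identity=0.95, threads=4):
    """Command to cluster nucleotide sequences at the given identity threshold."""
    return [
        "cd-hit-est",
        "-i", in_fasta,
        "-o", out_fasta,       # representatives; clusters go to <out_fasta>.clstr
        "-c", str(identity),
        "-T", str(threads),
    ]

def count_clusters(clstr_lines):
    """Count clusters in CD-HIT .clstr output (each cluster starts with '>Cluster')."""
    return sum(1 for line in clstr_lines if line.startswith(">Cluster"))

# Placeholder file names (assumptions, adjust to your own files).
print(shlex.join(cd_hit_est_cmd("unigenes.fa", "unigenes_c95.fa")))

# After running the command, count the representatives, e.g.:
# with open("unigenes_c95.fa.clstr") as handle:
#     print(count_clusters(handle))
```

Comparing that count with the original 85,000 sequences shows how much redundancy the query set contains.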