How to Speed Up BLASTp
12 months ago
twangxxx • 0

Hello,

I have a FASTA file containing 140 protein sequences from distinct viruses, and I would like to identify which protein comes from which virus.

I am using a Linux cluster. BLAST is available as a cluster module, and the viral and NCBI nr databases are stored in my own directory (correct me if I used the wrong terminology) on the cluster.

I set up my blastp as below:

 blastp -db nr -query proteins.fa -outfmt 6 -out ./output.txt  -num_threads 10 -max_target_seqs 1


and requested the resources from cluster as:

#PBS -l mem=64gb,nodes=10:ppn=1,walltime=10:00:00


It has been running for around 10 hours and no results have been written to output.txt yet. I am wondering if there is a better way to set up RAM, nodes, or processes per node to speed up the BLASTp run. Thank you so much!

Here is the info about the Linux cluster:

66 compute nodes. Each node has two 14-core Intel processors (2.40GHz) sharing 128 GB of memory.

blastp linux-cluster BLAST nr-database • 1.3k views

Have you downloaded all files for the nr database from NCBI and uncompressed them in your directory? If you take a single sequence and run a quick search against this database, do you see results in under 30 minutes? (It will take a while to read the database files.)
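
One way to set up the single-sequence test suggested above is to pull the first record out of the query file. This is only a sketch: the function name first_record and the file names one.fa and one.txt are made up for illustration, and the blastp line in the comments simply reuses the options from the original post.

```shell
# first_record FILE: print only the first FASTA record of FILE.
# awk counts header lines (">") and stops as soon as the second one is seen.
first_record() {
    awk '/^>/ { if (++n > 1) exit } n == 1' "$1"
}

# Hypothetical usage on the cluster (file names are examples only):
#   first_record proteins.fa > one.fa
#   time blastp -db nr -query one.fa -outfmt 6 -max_target_seqs 1 \
#        -num_threads 10 -out one.txt
```

If this one-sequence search does not finish in a reasonable time, the problem is more likely the database setup or the resource request than the number of query sequences.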


Thank you so much for the reply. I did download and uncompress all of the nr database files from NCBI in my directory. Following your suggestion and the suggestions below, I am now running blastp with -num_threads 10 to search a single sequence against the full nr database, using mem=120gb,nodes=1:ppn=14. I hope this will run faster.

Also, can you suggest a way to limit the protein sequence database to sequences that come only from viruses?


You can use the -taxids 10239 option (10239 is the taxID for viruses) in your blastp command to limit your local search to viruses. This requires you to download the taxonomy file from the same location where you downloaded the nr indexes and keep it in the same directory as your BLAST indexes.
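
As a rough sketch of what that could look like, the wrapper below adds -taxids 10239 to the command from the original post. The function name blastp_viruses and the output file name are made up for illustration. It assumes a version 5 nr database and a BLAST+ release that supports -taxids. Depending on the BLAST+ version, a high-level taxID such as 10239 may not be expanded to its descendant species automatically, so the comments show the get_species_taxids.sh helper that ships with BLAST+ as a fallback.

```shell
# blastp_viruses DB QUERY OUT: search QUERY against DB, keeping only viral hits.
# 10239 is the NCBI taxonomy ID for viruses, as given in the reply above.
blastp_viruses() {
    blastp -db "$1" -query "$2" -taxids 10239 \
        -outfmt 6 -max_target_seqs 1 -out "$3"
}

# Hypothetical usage:
#   blastp_viruses nr proteins.fa viruses_only.txt
#
# If the search returns nothing, the taxID may need expanding first
# (get_species_taxids.sh needs NCBI EDirect installed):
#   get_species_taxids.sh -t 10239 > viruses.txids
#   blastp -db nr -query proteins.fa -taxidlist viruses.txids \
#          -outfmt 6 -max_target_seqs 1 -out viruses_only.txt
```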


It has been over two hours since I started the single-sequence blastp against the full nr database, as mentioned in my previous reply, and it still has not completed:

"I am running blastp with -num_threads 10 to search a single sequence against the full nr database, using mem=120gb,nodes=1:ppn=14. I hope this will run faster."

So, I am considering building a local database only including protein sequences from viruses.

I found a website here, but I am not sure how to download all the FASTA files from the command line or with any available tool.


"considering building a local database only including protein sequences from viruses."

I think you are best off getting the viral proteins from the UniProt link that Mensur Dlakic provided below.

That said, you can also download the sequences with the Download button on the NCBI page you linked above.


To load the nr DB into memory (especially with the newest binaries), you will need to request all of the node's memory. 120 GB should be enough to use the DB; the 64 GB you requested will likely not work.

12 months ago
h.mon 34k

You are requesting 10 nodes with 1 processor per node; however, blastp can only use one node. You should use:

#PBS -l mem=128gb,nodes=1:ppn=14,walltime=10:00:00


There are ways of splitting the input fasta file and submitting to several nodes, but with 140 sequences as input, it is not necessary.
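
For larger inputs, the splitting mentioned above could be sketched as follows. This is an illustration only, not something needed for 140 sequences: the function name split_fasta and the chunk naming scheme are made up, and each chunk would still have to be submitted as its own single-node job.

```shell
# split_fasta FILE N PREFIX: distribute the FASTA records of FILE round-robin
# into N files named PREFIX.1.fa ... PREFIX.N.fa.
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        # Each header line switches the output to the next chunk in turn.
        /^>/ { out = sprintf("%s.%d.fa", prefix, (count++ % n) + 1) }
        # Lines before the first header (if any) are skipped.
        out != "" { print > out }
    ' "$1"
}

# Hypothetical usage: four chunks, one blastp job per chunk.
#   split_fasta proteins.fa 4 chunk
#   # then submit one job each for chunk.1.fa ... chunk.4.fa
```

Round-robin assignment keeps the chunks close to the same number of sequences, although the run time per chunk still depends on sequence length.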

You should contact the cluster administrators for instructions on how to properly use Torque / PBS resource manager. And before downloading NT / NR, you should also ask if these databases are already available at a centrally managed location - as they are widely used, this is commonly the case.


Thank you so much, I changed my PBS setting as you suggested. I am afraid there is no database available in a shared location in the cluster, so I downloaded and uncompressed the whole NCBI nr database in my directory.

Also, I am wondering how to set -num_threads in the blastp command so that it makes the best use of this PBS request.


You can try the variable $PBS_NUM_PPN (the number of CPUs per node):

 blastp -db nr -query proteins.fa -outfmt 6 -out ./output.txt -num_threads $PBS_NUM_PPN -max_target_seqs 1
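
Putting the resource request and the thread count together, a job script could look roughly like the sketch below. The function names pick_threads and run_blastp are made up for illustration, the memory value follows the 120 GB suggested elsewhere in this thread, and the fallback to one thread is an assumption for schedulers that do not set PBS_NUM_PPN.

```shell
#PBS -l mem=120gb,nodes=1:ppn=14,walltime=10:00:00

# Torque usually exports PBS_NUM_PPN inside a job; fall back to 1 thread
# when the variable is missing (for example, when testing on a login node).
pick_threads() {
    echo "${PBS_NUM_PPN:-1}"
}

# run_blastp QUERY OUT: same options as the original command, with the
# thread count tied to the number of cores requested per node.
run_blastp() {
    blastp -db nr -query "$1" -outfmt 6 -max_target_seqs 1 \
        -num_threads "$(pick_threads)" -out "$2"
}

# In the job script, after loading the BLAST module:
#   cd "$PBS_O_WORKDIR" && run_blastp proteins.fa output.txt
```

Tying -num_threads to the ppn value avoids asking BLAST for more threads than the scheduler has actually reserved.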


12 months ago
Mensur Dlakic ★ 20k

Another thing that may help is searching against a virus-only database, since at least 99.5% of nr are non-viral entries. Specific taxonomic entries can be downloaded from this link:

https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/

There are two files (sprot and trembl) for each group, and you would need the .dat.gz files. Those are in EMBL format, so you will need a program to convert them to FASTA. I know that a little utility called esl-reformat from the HMMer package can do it, and there are likely to be others.
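
If esl-reformat is not at hand, the conversion can also be approximated with a few lines of awk. To be clear, this is a hand-rolled stand-in and not esl-reformat: it assumes the usual UniProt flat-file layout (ID, AC and SQ lines, with each record ending in //), keeps only the first accession and the entry name in the header, and drops everything else. The function name and the file names in the comments are illustrative.

```shell
# uniprot_dat_to_fasta: read UniProt flat-file (.dat) records on stdin and
# write minimal FASTA records (">ACCESSION|ENTRY_NAME") to stdout.
uniprot_dat_to_fasta() {
    awk '
        /^ID /             { id = $2 }                        # entry name
        /^AC / && ac == "" { ac = $2; sub(/;$/, "", ac) }     # first accession
        /^SQ /             { inseq = 1; print ">" ac "|" id; next }
        /^\/\//            { inseq = 0; ac = ""; next }       # end of record
        inseq              { gsub(/ /, ""); print }           # sequence lines
    '
}

# Hypothetical usage (check the actual file names in the directory linked above):
#   gunzip -c uniprot_sprot_viruses.dat.gz | uniprot_dat_to_fasta > viruses.fasta
#   makeblastdb -in viruses.fasta -dbtype prot -out viruses
#   blastp -db viruses -query proteins.fa -outfmt 6 -max_target_seqs 1 -out output.txt
```

A dedicated converter such as esl-reformat is the safer choice for real work, since it also handles edge cases this sketch ignores.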


Thank you for the reply. I read the manual of the HMMer package and found that the esl-reformat utility is for nucleotide sequence format conversion, so it probably won't work for protein sequences. Can you recommend any other tools?


esl-reformat works for protein sequences. In fact, it automatically figures out the type of sequence, although the type can be specified on the command line if needed. It is easy enough; why don't you give it a try?