Question: NR BLAST on AWS
4 weeks ago, Chirag Parsania (University of Macau) wrote:

Hi,

Does anyone know how much it would cost to run BLAST on AWS for 10,000 sequences against the NR database? Which Amazon instance should we use for an NR BLAST?

Thanks, Chirag.

cloud blast amazone aws • 150 views
modified 12 days ago by Biostar • written 4 weeks ago by Chirag Parsania

This is tricky to price out! More cores make BLAST run faster (to an extent) but cost more per hour. Fewer cores are cheaper per hour but need more time to finish the run, which also affects the price.

To partially answer your question: you would want an EC2 instance running Ubuntu. Then you would need to install BLAST, download nr and format the db (this takes several hours to a day). The number of cores you use will correlate with the speed of the BLAST run, but it's not 1:1: 2 cores give you something like a 1.5x speedup, 8 cores a 6x speedup, etc. You will need at least 120 GB of storage to fit nr and your sequences.

If price (AWS instance time) is a motivator, you can look into PLAST: https://plast.inria.fr/ It's a BLAST-like program that is much faster than BLAST and handles multi-threading more efficiently. I use it for all my large-scale nr alignments.
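
The trade-off between core count and run time can be put into rough numbers. The following is only a back-of-the-envelope sketch: the function name `estimate_cost`, the single-core runtime and the hourly prices are hypothetical placeholders, not real AWS prices or benchmarks. Only the speedup factors (about 1.5x for 2 cores, about 6x for 8 cores) come from the estimate given in this reply.

```shell
#!/bin/bash
# Rough cost estimate:
#   wall-clock hours = single-core hours / speedup
#   cost             = wall-clock hours * hourly instance price
# All inputs are placeholders, NOT real AWS prices or measured runtimes.
estimate_cost() {
    local single_core_hours="$1"   # assumed runtime on one core
    local speedup="$2"             # e.g. 1.5 for 2 cores, 6 for 8 cores
    local price_per_hour="$3"      # hypothetical instance price in USD
    awk -v h="$single_core_hours" -v s="$speedup" -v p="$price_per_hour" \
        'BEGIN { hours = h / s; printf "%.1f hours, %.2f USD\n", hours, hours * p }'
}

# Compare a small and a large (hypothetical) instance for the same job.
estimate_cost 600 1.5 0.10
estimate_cost 600 6   0.40
```

With these made-up inputs the larger instance finishes sooner for a similar total; with real prices and measured speedups the balance may well shift, so it is worth timing a small batch of queries first.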

written 4 weeks ago by Jacob Warner

download nr and format the db (this takes several hours to a day)

Downloading the data takes time, but no formatting is needed if you use the pre-made BLAST indexes. Those can be found here.
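
As a rough sketch of that download step: BLAST+ ships a helper script, `update_blastdb.pl`, that fetches the pre-formatted databases. The wrapper name `fetch_nr`, the `DRY_RUN` switch and the target directory below are assumptions made for illustration only.

```shell
#!/bin/bash
# Download the pre-formatted nr database into a target directory using
# update_blastdb.pl (shipped with BLAST+). No makeblastdb step is needed.
# fetch_nr and DRY_RUN are illustrative names, not part of BLAST+.
fetch_nr() {
    local dest="$1"
    (
        # Work in a subshell so the caller's working directory is unchanged.
        mkdir -p "$dest" && cd "$dest" || exit 1
        # With DRY_RUN set, only print the command instead of downloading.
        ${DRY_RUN:+echo} update_blastdb.pl --decompress nr
    )
}

# Preview the command without downloading anything:
DRY_RUN=1 fetch_nr "${TMPDIR:-/tmp}/blastdb_demo"
```

Leaving `DRY_RUN` unset runs the real download, which is large and can take hours depending on the connection.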

modified 4 weeks ago • written 4 weeks ago by genomax

Hi Jacob,

Thanks for the useful suggestion. BLAST has already been installed on Amazon by NCBI; you can check it here. They also provide a Perl script to run BLAST on an AWS server. I am not sure about the database: users may need to download the BLAST database to their Amazon instance, or NCBI may provide that as well. I am using an HPC with 24 cores as of now, and it's taking a bit long (it can only finish ~1000 sequences in 24 hrs against NR database with 24 cores and 132 GB memory). I can try PLAST.

written 4 weeks ago by Chirag Parsania

it can only finish ~1000 sequences in 24 hrs against NR database with 24 cores

You have to qualify how long those sequences are. If they are several kb long then that would not be unusual. Are you sure you are using all cores for your blast? Can you provide the command line you are using?
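
Query length is quick to check before answering. A small sketch, assuming the queries are stored in a FASTA file; the function name `fasta_length_stats` is illustrative.

```shell
#!/bin/bash
# Print the number of sequences, mean length and maximum length
# of the records in a FASTA file.
fasta_length_stats() {
    awk '
        function add_record() { n++; total += len; if (len > max) max = len }
        /^>/ { if (seen) add_record(); seen = 1; len = 0; next }  # header line
             { gsub(/[ \t\r]/, ""); len += length($0) }           # sequence line
        END  {
            if (seen) add_record()
            if (n) printf "sequences=%d mean=%.1f max=%d\n", n, total / n, max
            else   print "sequences=0"
        }
    ' "$1"
}

# Example call (the file name is a placeholder):
# fasta_length_stats query.fa
```

Reporting these three numbers alongside the command line makes it much easier to judge whether the observed throughput is unusual.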

modified 4 weeks ago • written 4 weeks ago by genomax

I am using the sbatch command to submit the job on the HPC through a shell script. See the script below:

#!/bin/bash
#SBATCH --job-name          NR_BLAST
#SBATCH --partition         FHS_LONG
#SBATCH --nodes             1
#SBATCH --tasks-per-node        24
#SBATCH --mem               128g
#SBATCH --time              124:00:00
#SBATCH --output            job.%j.out
#SBATCH --error             job.%j.err
#SBATCH --mail-type         FAIL
#SBATCH --mail-user         user@umac.mo



blastp -db <database> -query query.fa  -num_threads 24  -outfmt 6 -out blastout.txt
modified 4 weeks ago • written 4 weeks ago by Chirag Parsania

Please use ADD COMMENT/ADD REPLY when responding to existing questions to keep threads logically organized.

You are using 24 tasks per node (not 24 cores per job). You can check with your HPC support, but -n <number> (--ntasks) is the option you need to look into.

(This option advises the Slurm controller that job steps run within the allocation will launch a maximum of number tasks and to provide for sufficient resources)

Note that --cpus-per-task=<ncpus> changes the default, which is generally 1.

You did not answer the question about the length of the query though.
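
One way to act on this advice is to request a single task with 24 CPUs and pass that count on to blastp. The sketch below writes such a job script to a file. The file name `nr_blast.sbatch` and the `BLAST_DB` environment variable (standing in for the database path, which was not given) are assumptions; the other directives are copied from the script posted above. It also assumes the submitting environment is passed on to the job, so your HPC support should confirm that this layout suits the cluster.

```shell
#!/bin/bash
# Write a Slurm job script that asks for ONE task with 24 CPUs, so that
# all blastp threads belong to the same task.
cat > nr_blast.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=NR_BLAST
#SBATCH --partition=FHS_LONG
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --mem=128g
#SBATCH --time=124:00:00
#SBATCH --output=job.%j.out
#SBATCH --error=job.%j.err

# BLAST_DB is an assumed variable; it must be set to the database path.
db="${BLAST_DB:?set BLAST_DB to the BLAST database path}"

# Use the CPU count Slurm granted to this task (falls back to 1).
threads="${SLURM_CPUS_PER_TASK:-1}"

blastp -db "$db" -query query.fa -num_threads "$threads" \
       -outfmt 6 -out blastout.txt
EOF

# Submit with: sbatch nr_blast.sbatch
```

Reading the thread count from `SLURM_CPUS_PER_TASK` keeps the `-num_threads` value and the Slurm request from drifting apart when one of them is changed later.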

modified 4 weeks ago • written 4 weeks ago by genomax