Question

NR BLAST on AWS

0

6.9 years ago

Chirag Parsania ★ 2.0k

Hi,

Does anyone know how much it would cost to run BLAST on AWS for 10,000 sequences against the NR database? Which Amazon instance should we choose for an NR BLAST?

Thanks, Chirag.

blast amazone aws cloud • 1.7k views

ADD COMMENT • link updated 6.9 years ago by Biostar 20 • written 6.9 years ago by Chirag Parsania ★ 2.0k

2

This is tricky to price out! More cores make the BLAST run faster (up to a point) but cost more per hour. Fewer cores are cheaper per hour but need more time to finish, which also affects the price.

To partially answer your question: you would want an EC2 instance running Ubuntu. Then you would need to install BLAST, download nr and format the db (this takes several hours to a day). The number of cores you use correlates with the speed of the BLAST run, but it's not 1:1: 2 cores give you something like a 1.5x speedup, 8 cores a 6x speedup, and so on. You will need at least 120 GB of storage to fit nr and your sequences.
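
As a rough illustration of that trade-off, here is a minimal sketch of the arithmetic. The speedups are the approximate figures quoted above; the single-core runtime and the hourly prices are made-up placeholders, not real AWS prices, and the function name `estimate_cost` is hypothetical. Storage and data-transfer charges are ignored.

```shell
#!/bin/bash
# Rough cost model: wall_hours = single_core_hours / speedup,
#                   cost       = wall_hours * price_per_hour.
# All inputs are placeholders; look up current instance prices before budgeting.
estimate_cost() {
    local single_core_hours=$1   # estimated runtime on one core (assumption)
    local speedup=$2             # e.g. 1.5 for 2 cores, 6 for 8 cores
    local price_per_hour=$3      # hypothetical USD per instance-hour
    awk -v h="$single_core_hours" -v s="$speedup" -v p="$price_per_hour" \
        'BEGIN { wall = h / s; printf "%.1f hours, %.2f USD\n", wall, wall * p }'
}

# Made-up example: 240 single-core hours,
# a 2-core instance at 0.10 USD/h versus an 8-core instance at 0.40 USD/h.
estimate_cost 240 1.5 0.10
estimate_cost 240 6 0.40
```

With these invented numbers the two options cost the same and differ only in wall time, which is why the per-hour price alone does not settle the choice.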

If price (AWS instance time) is a motivator, you can look into plast: https://plast.inria.fr/ It's a BLAST-like program that is much faster than BLAST and handles multi-threading more efficiently. I use it for all my large-scale nr alignments.

ADD REPLY • link 6.9 years ago by Jake Warner ▴ 830

1

download nr and format the db (this takes several hours to a day)

Downloading the data takes time, but no formatting is needed if you use the pre-made BLAST indexes. Those can be found here.
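
One possible way to fetch a pre-formatted database is the `update_blastdb.pl` script that ships with BLAST+. The sketch below assumes that script is on the PATH; the wrapper name `fetch_blast_db`, the `DRY_RUN` switch, and the destination directory are made up for illustration, and this may not be the same source as the link referred to above.

```shell
#!/bin/bash
# Download and unpack a pre-formatted BLAST database into a target directory.
# Assumes update_blastdb.pl (shipped with BLAST+) is installed and on the PATH.
# Set DRY_RUN=1 to print the command instead of starting a large download.
fetch_blast_db() {
    local db_name=$1    # e.g. nr
    local dest_dir=$2   # directory with enough free space (hypothetical path)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "cd $dest_dir && update_blastdb.pl --decompress $db_name"
        return 0
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    ( mkdir -p "$dest_dir" && cd "$dest_dir" && update_blastdb.pl --decompress "$db_name" )
}

# Usage (not run here):
#   fetch_blast_db nr /data/blastdb
```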

ADD REPLY • link 6.9 years ago by GenoMax 141k

0

Hi Jacob,

Thanks for the useful suggestion. BLAST has already been installed on Amazon by NCBI; you can check it here. They also provide a Perl script to run BLAST on an AWS server. I am not sure about the database: users may need to download the BLAST database to their own Amazon instance, or it may also be provided by NCBI. I am using an HPC cluster with 24 cores for now, and it's taking a bit long (it can only finish ~1000 sequences in 24 hrs against NR database with 24 cores and 132 GB memory). I can try plast.

ADD REPLY • link 6.9 years ago by Chirag Parsania ★ 2.0k

0

it can only finish ~1000 sequences in 24 hrs against NR database with 24 cores

You have to qualify how long those sequences are. If they are several kb long then that would not be unusual. Are you sure you are using all cores for your blast? Can you provide the command line you are using?
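
A quick way to answer the length question is to summarize the sequence lengths in the query file. This is a minimal sketch using awk; the function name `fasta_length_summary` is hypothetical, and it assumes a plain-text FASTA file such as the `query.fa` used in this thread.

```shell
#!/bin/bash
# Print the number of sequences and their min/max/mean length for a FASTA file.
# Sequence lines may be wrapped; whitespace and carriage returns are ignored.
fasta_length_summary() {
    awk '
        function record(l) {
            n++; total += l
            if (n == 1 || l < min) min = l
            if (l > max) max = l
        }
        /^>/ { if (seen) record(len); seen = 1; len = 0; next }   # new header: close previous record
        { gsub(/[ \t\r]/, ""); len += length($0) }                # accumulate residues
        END {
            if (seen) record(len)
            if (n) printf "sequences=%d min=%d max=%d mean=%.1f\n", n, min, max, total / n
        }
    ' "$1"
}

# Usage (not run here):
#   fasta_length_summary query.fa
```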

ADD REPLY • link 6.9 years ago by GenoMax 141k

0

I am using the sbatch command to submit the job on the HPC through a shell script. See the script below:

#!/bin/bash
#SBATCH --job-name          NR_BLAST
#SBATCH --partition         FHS_LONG
#SBATCH --nodes             1
#SBATCH --tasks-per-node        24
#SBATCH --mem               128g
#SBATCH --time              124:00:00
#SBATCH --output            job.%j.out
#SBATCH --error             job.%j.err
#SBATCH --mail-type         FAIL
#SBATCH --mail-user         user@umac.mo



blastp -db <database> -query query.fa  -num_threads 24  -outfmt 6 -out blastout.txt

ADD REPLY • link 6.9 years ago by Chirag Parsania ★ 2.0k

0

Please use ADD COMMENT/ADD REPLY when responding to existing questions to keep threads logically organized.

You are requesting 24 tasks per node (not 24 cores per job). Check with your HPC support, but -n <number> (--ntasks) is the option you need to look into.

(This option advises the Slurm controller that job steps run within the allocation will launch a maximum of number tasks and to provide for sufficient resources)

Note that --cpus-per-task=<ncpus> changes the default, which is generally 1.
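
One way to express this in the job script is sketched below: request a single task with 24 CPUs and pass the allocated CPU count to `-num_threads`. The `--ntasks` and `--cpus-per-task` options and the `SLURM_CPUS_PER_TASK` variable are standard Slurm; the helper name `blast_threads` is made up for illustration, and whether this layout suits a particular cluster is an assumption to confirm with HPC support.

```shell
#!/bin/bash
#SBATCH --job-name          NR_BLAST
#SBATCH --nodes             1
#SBATCH --ntasks            1
#SBATCH --cpus-per-task     24
#SBATCH --mem               128g
#SBATCH --time              124:00:00
#SBATCH --output            job.%j.out
#SBATCH --error             job.%j.err

# Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task is requested;
# fall back to a single thread when the variable is missing.
blast_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# In the real job script, the BLAST call would then read (left commented out
# here so the sketch can be checked without BLAST installed):
#   blastp -db <database> -query query.fa -num_threads "$(blast_threads)" -outfmt 6 -out blastout.txt
```

Reading the thread count from the allocation keeps the `#SBATCH` request and the `-num_threads` value from drifting apart when one of them is edited.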

You did not answer the question about the length of the query though.

ADD REPLY • link 6.9 years ago by GenoMax 141k