7.4 years ago
Chirag Parsania
Hi,
Does anyone know how much it would cost to run BLAST on AWS for 10,000 sequences against the NR database? Which Amazon instance should we buy to run the NR BLAST?
Thanks, Chirag.
This is tricky to price out! More cores make the BLAST run faster (to an extent) but cost more per hour. Fewer cores are cheaper per hour but need more time to finish, which also affects the price.
To partially answer your question: you would want an EC2 instance running Ubuntu. Then you would need to install BLAST, download nr and format the db (this takes several hours to a day). The number of cores you use will correlate with the speed of the BLAST run, but it's not 1:1. 2 cores give you something like a 1.5x speedup, 8 cores a 6x speedup, etc. You will need at least 120 GB of storage to fit nr and your sequences.
If price (AWS instance time) is a motivator, you can look into plast: https://plast.inria.fr/ It's a BLAST-like program that is much faster than BLAST and handles multi-threading more efficiently. I use it for all my large-scale nr alignments.
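To make the scaling figures quoted above a bit more concrete, here is a minimal sketch that turns them into parallel efficiency and a relative cost. It assumes the hourly price grows in proportion to the core count, which is only an approximation of real instance pricing; the function names are made up for illustration.

```python
# Speedups quoted in the post above: cores -> approximate speedup.
quoted = {2: 1.5, 8: 6.0}

def efficiency(cores, speedup):
    """Fraction of ideal (linear) scaling actually achieved."""
    return speedup / cores

def relative_cost(cores, speedup):
    """Cost relative to a 1-core run.

    ASSUMPTION: hourly price is proportional to core count, so
    cost ~ cores * (runtime / speedup).
    """
    return cores / speedup

for cores, speedup in quoted.items():
    print(cores, efficiency(cores, speedup), round(relative_cost(cores, speedup), 2))
```

Under that pricing assumption, both quoted data points come out at 75% efficiency, so the run gets faster with more cores but costs about a third more in total than a single-core run.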
Downloading the data still takes time, but no formatting is needed if you use the pre-made BLAST indexes. Those can be found here.
Hi Jacob,
Thanks for the useful suggestion. BLAST has already been installed on Amazon by NCBI; you can check it here. They also provide a Perl script to run BLAST on an AWS server. I am not sure about the database: the user may need to download the BLAST database to their Amazon instance, or it may also be provided by NCBI. I am using an HPC cluster with 24 cores for now, and it is taking rather long: it finishes only ~1000 sequences in 24 hrs against the NR database with 24 cores and 132 GB of memory. I can try plast.
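As a rough sketch of how that throughput translates into a price: the run time follows from the reported rate (~1000 sequences per 24 hrs), and the cost is run time multiplied by the hourly instance rate. The hourly rate below is a placeholder, not a real AWS quote, and the estimate assumes a cloud instance would behave like the 24-core HPC node, which may not hold.

```python
def estimate_cost(n_queries, queries_per_hour, usd_per_hour):
    """Return (hours, cost in USD) for a run at constant throughput."""
    hours = n_queries / queries_per_hour
    return hours, hours * usd_per_hour

# Throughput reported above: ~1000 sequences in 24 hrs on 24 cores.
rate = 1000 / 24

# PLACEHOLDER price: substitute the current on-demand rate for the
# instance type you are actually considering.
hours, cost = estimate_cost(10_000, rate, usd_per_hour=1.00)
print(f"{hours:.0f} h, ${cost:.2f}")
```

At the reported rate, 10,000 sequences would take about 240 hours (ten days) of instance time, so the total scales directly with whatever hourly price applies.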
You have to qualify how long those sequences are. If they are several kb long then that would not be unusual. Are you sure you are using all cores for your blast? Can you provide the command line you are using?
I am using the sbatch command to submit the job on the HPC through an sh script. See the script below.
Please use ADD COMMENT/ADD REPLY when responding to existing questions to keep threads logically organized.
You are using 24 tasks per node (not 24 cores per job). You can check with your HPC support, but n= Number is what you need to check into. Note that --cpus-per-task=<ncpus> changes the default, which is generally 1.
You did not answer the question about the length of the query though.
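One way to keep the BLAST thread count in step with what Slurm actually allocated is to read SLURM_CPUS_PER_TASK (set by Slurm when --cpus-per-task is given) and pass it to -num_threads. The sketch below only builds the command line. The file names, the choice of blastp and the tabular output format are assumptions for illustration, since the original script and query type are not shown.

```python
import os
import shlex

def build_blast_cmd(query, db, out, program="blastp", env=None):
    """Build a BLAST+ command whose thread count matches the Slurm allocation."""
    env = os.environ if env is None else env
    # Falls back to 1 thread when --cpus-per-task was not requested.
    threads = env.get("SLURM_CPUS_PER_TASK", "1")
    return [
        program,
        "-query", query,
        "-db", db,
        "-num_threads", str(threads),
        "-outfmt", "6",  # tabular output (an assumption, pick what you need)
        "-out", out,
    ]

# Hypothetical file names, for illustration only.
cmd = build_blast_cmd("queries.fasta", "nr", "hits.tsv")
print(shlex.join(cmd))
```

Inside a job submitted with --cpus-per-task=24 this prints a command with -num_threads 24; the list can then be handed to subprocess.run(cmd, check=True). If it prints -num_threads 1 inside your job, the allocation is probably still being requested as tasks rather than CPUs per task.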