Question

Looking for AWS EC2 advice for a metagenomic pipeline

3.9 years ago

jonathon481 ▴ 10

Hi all,

I am looking to set up one or more EC2 instances to run my virus discovery pipeline. I have ~20 paired-end libraries (100 bp, HiSeq 2500), each likely around 19.4M spots (3.9G bases).

Basic pipeline:

1. Trinity to assemble the paired reads
2. Estimate abundance using RSEM
3. blastn the assembled contigs against the nucleotide database
4. diamond blastx the assembled contigs against the nr database
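
To make the steps concrete, here is a rough sketch of how one library might go through them. It only builds the command lines and does not run anything. The file names, thread count, memory limit, database names, and the use of Trinity's `align_and_estimate_abundance.pl` helper for the RSEM step are assumptions for illustration, not necessarily the exact commands I ran.

```python
from pathlib import Path


def build_pipeline_commands(sample, left, right, threads=32, max_memory="200G",
                            nt_db="nt", nr_db="nr.dmnd"):
    """Return the four pipeline steps as argv lists (nothing is executed here)."""
    # Trinity expects the word 'trinity' in the output directory name.
    out_dir = Path(f"{sample}_trinity")
    # Assumed contig location; some Trinity releases write
    # <out_dir>.Trinity.fasta next to the directory instead.
    contigs = str(out_dir / "Trinity.fasta")
    n = str(threads)
    return [
        # 1. De novo assembly of the paired reads
        ["Trinity", "--seqType", "fq", "--left", left, "--right", right,
         "--CPU", n, "--max_memory", max_memory, "--output", str(out_dir)],
        # 2. Abundance estimation with RSEM (via Trinity's helper script, an assumption)
        ["align_and_estimate_abundance.pl", "--transcripts", contigs,
         "--seqType", "fq", "--left", left, "--right", right,
         "--est_method", "RSEM", "--aln_method", "bowtie2", "--trinity_mode",
         "--prep_reference", "--thread_count", n,
         "--output_dir", f"{sample}_rsem"],
        # 3. Nucleotide search of the contigs (tabular output)
        ["blastn", "-query", contigs, "-db", nt_db, "-outfmt", "6",
         "-num_threads", n, "-out", f"{sample}.blastn.tsv"],
        # 4. Translated search of the contigs against nr with DIAMOND
        ["diamond", "blastx", "--query", contigs, "--db", nr_db,
         "--outfmt", "6", "--threads", n, "--out", f"{sample}.diamond.tsv"],
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` in order. Since the libraries are independent of each other, the same function could be called once per library on separate instances.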

I have run this on an r5.8xlarge (256 GiB memory, 32 vCPUs) before, but I am wondering:

Is there a better instance type to run this on, or would it be more efficient to run it on multiple instances in parallel?

Is the On-Demand pricing model the way to go, or should I try Spot Instances (I have never used them before)?

My time frame is flexible, but faster is always better. I have a budget of ~$1300 (US) to work with and will likely use a server in the Asia Pacific region. I'm not certain this will cover all samples (any remaining samples will be completed on a local server).
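
As a rough way to reason about the budget, the arithmetic could be sketched as below. The hourly and storage rates are placeholders that I made up for the example, not actual AWS prices; real On-Demand or Spot rates for the chosen region would need to be substituted.

```python
def affordable_hours(budget_usd, hourly_rate_usd, storage_gb=700,
                     storage_rate_usd_per_gb_month=0.0, months=1.0):
    """Instance-hours that fit in the budget after paying for storage."""
    storage_cost = storage_gb * storage_rate_usd_per_gb_month * months
    remaining = budget_usd - storage_cost
    if remaining <= 0:
        return 0.0
    return remaining / hourly_rate_usd


def hours_per_library(budget_usd, hourly_rate_usd, n_libraries=20, **storage):
    """Average instance-hours available per library within the budget."""
    return affordable_hours(budget_usd, hourly_rate_usd, **storage) / n_libraries


# Placeholder rate of 2.00 USD per instance-hour (hypothetical, not an AWS quote).
print(hours_per_library(1300, 2.00))
```

With the placeholder rate of 2.00 USD per hour and storage cost ignored, this works out to 32.5 instance-hours per library across 20 libraries. If a library takes longer than that at the real rate, the budget will not cover all samples.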

To store the files and databases, I estimate I would need about 700 GB of storage. Is General Purpose SSD storage recommended for this?

I am familiar with AWS, so I would most likely use it for this, but I am always willing to look into other options.

Thank you!

assembly blast aws • 807 views
