I am looking to set up one or more EC2 instances to run my virus discovery pipeline. I have ~20 paired-end libraries (100 bp, HiSeq 2500), each likely around 19.4M spots and 3.9G bases.

Basic pipeline:

  1. Trinity to assemble paired reads

  2. Estimate abundance using RSEM

  3. blastn assembled contigs against Nucleotide database

  4. diamond blastx assembled contigs against nr database
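To make the four steps concrete, here is a rough sketch of how they could be chained per library. It only builds and prints the command lines; it does not run anything. The sample name, file paths, database names (`nt`, `nr.dmnd`), the `200G` memory cap, and the use of Trinity's `align_and_estimate_abundance.pl` wrapper for the RSEM step are all assumptions for illustration, not details of my actual setup.

```python
import shlex
from pathlib import Path


def build_pipeline_commands(left, right, sample, threads=32, max_memory="200G",
                            nt_db="nt", nr_db="nr.dmnd"):
    """Return the four pipeline steps as argument lists for one library.

    All paths and database names are hypothetical placeholders.
    """
    out_dir = Path(f"{sample}_trinity")
    # Assumed contig location; it differs between Trinity versions.
    contigs = str(out_dir / "Trinity.fasta")
    n = str(threads)
    return [
        # 1. Trinity assembly of the paired reads
        ["Trinity", "--seqType", "fq", "--left", left, "--right", right,
         "--CPU", n, "--max_memory", max_memory, "--output", str(out_dir)],
        # 2. Abundance estimation with RSEM (via Trinity's wrapper script)
        ["align_and_estimate_abundance.pl", "--transcripts", contigs,
         "--seqType", "fq", "--left", left, "--right", right,
         "--est_method", "RSEM", "--aln_method", "bowtie2",
         "--trinity_mode", "--prep_reference", "--thread_count", n,
         "--output_dir", f"{sample}_rsem"],
        # 3. blastn of contigs against the nucleotide database
        ["blastn", "-query", contigs, "-db", nt_db, "-outfmt", "6",
         "-num_threads", n, "-out", f"{sample}.blastn.tsv"],
        # 4. diamond blastx of contigs against nr
        ["diamond", "blastx", "--query", contigs, "--db", nr_db,
         "--outfmt", "6", "--threads", n, "--out", f"{sample}.diamond.tsv"],
    ]


if __name__ == "__main__":
    # Hypothetical input file names for a single library.
    for cmd in build_pipeline_commands("S1_R1.fastq.gz", "S1_R2.fastq.gz", "S1"):
        print(shlex.join(cmd))
```

Since each library is independent, a list like this could be generated per sample and dispatched either sequentially on one large instance or across several smaller ones.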

I have run this on an r5.8xlarge (256 GiB memory, 32 vCPUs) before, but I am wondering:

Is there a better instance type to run this on, or would it be more efficient to run it on multiple instances in parallel?

Is the On-Demand pricing model the way to go, or should I try to make use of Spot Instances (I have never tried them before)?

My time frame is flexible, but faster is always better. I have a budget of ~US$1,300 to work with and will likely use a region in Asia Pacific. I'm not certain this will cover all samples (any remaining samples will be completed on a local server).

To store the files and databases, I estimate I would need about 700 GB of storage. Is General Purpose SSD storage recommended for this?
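For reference, this is the kind of back-of-the-envelope calculation I am using to guess how many samples the budget covers once storage is subtracted. The hourly rate, per-sample runtime, and storage price in the example call are made-up placeholders, not real AWS prices or measured runtimes; only the $1,300 budget and 700 GB figure come from above.

```python
import math


def samples_within_budget(budget_usd, hourly_rate_usd, hours_per_sample,
                          storage_gb=0.0, storage_usd_per_gb_month=0.0,
                          months=1.0):
    """Estimate how many whole samples fit in the budget.

    Storage cost is subtracted first; the remainder is divided by the
    compute cost of one sample (hourly rate x hours per sample).
    """
    if hourly_rate_usd <= 0 or hours_per_sample <= 0:
        raise ValueError("hourly rate and hours per sample must be positive")
    storage_cost = storage_gb * storage_usd_per_gb_month * months
    compute_budget = budget_usd - storage_cost
    if compute_budget <= 0:
        return 0
    return math.floor(compute_budget / (hourly_rate_usd * hours_per_sample))


if __name__ == "__main__":
    # Placeholder inputs: replace with the current price for the chosen
    # region/instance type and a runtime measured on one real library.
    n = samples_within_budget(
        budget_usd=1300,
        hourly_rate_usd=2.0,           # hypothetical On-Demand or Spot rate
        hours_per_sample=24,           # hypothetical end-to-end runtime
        storage_gb=700,
        storage_usd_per_gb_month=0.1,  # hypothetical volume price
    )
    print(f"Samples covered by budget: {n}")
```

Running it once with an On-Demand rate and once with a Spot rate would give a quick comparison of how many of the ~20 libraries each pricing model covers.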

I am familiar with AWS, so I would most likely use it for this, but I am always willing to look into other options.

Thank you!

