I am looking to set up one or more EC2 instances to run my virus discovery pipeline. I have ~20 paired-end libraries (100 bp, HiSeq 2500), each likely around 19.4M spots and 3.9G bases. The pipeline steps are:
1. Trinity to assemble paired reads
2. Estimate abundance using RSEM
3. blastn assembled contigs against Nucleotide database
4. diamond blastx assembled contigs against nr database
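To make the steps above concrete, here is a rough sketch of how one library could be pushed through them. It only builds and prints the command lines (a dry run); nothing is executed. The sample name, read file names, thread count, memory limit, database names (`nt`, `nr.dmnd`), and the location of the assembled contigs are all placeholder assumptions, and the contig path in particular depends on the Trinity version. The RSEM step is assumed to go through Trinity's `align_and_estimate_abundance.pl` helper, which would need to be on the `PATH`.

```python
import shlex


def build_commands(sample, left, right, threads=32, max_memory="200G",
                   nt_db="nt", nr_dmnd="nr.dmnd"):
    """Return the four pipeline steps for one library as argument lists."""
    outdir = f"{sample}_trinity"
    # Assumption: contigs end up here; newer Trinity releases may write
    # them next to the output directory instead.
    contigs = f"{outdir}/Trinity.fasta"

    trinity = ["Trinity", "--seqType", "fq",
               "--left", left, "--right", right,
               "--CPU", str(threads), "--max_memory", max_memory,
               "--output", outdir]

    rsem = ["align_and_estimate_abundance.pl",
            "--transcripts", contigs, "--seqType", "fq",
            "--left", left, "--right", right,
            "--est_method", "RSEM", "--aln_method", "bowtie2",
            "--trinity_mode", "--prep_reference",
            "--thread_count", str(threads),
            "--output_dir", f"{sample}_rsem"]

    blastn = ["blastn", "-query", contigs, "-db", nt_db,
              "-outfmt", "6", "-num_threads", str(threads),
              "-out", f"{sample}.blastn.tsv"]

    diamond = ["diamond", "blastx", "--query", contigs, "--db", nr_dmnd,
               "--outfmt", "6", "--threads", str(threads),
               "--out", f"{sample}.diamond.tsv"]

    return [trinity, rsem, blastn, diamond]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for cmd in build_commands("lib01", "lib01_1.fq.gz", "lib01_2.fq.gz"):
        print(shlex.join(cmd))
```

Since each library is independent, the same function could be called once per library on separate instances, which is what the parallel option would amount to.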
I have run this on an r5.8xlarge (32 vCPUs, 256 GiB memory) before, but I am wondering:
Is there a better instance type to run this on, or would it be more efficient to run it on multiple instances in parallel?
Is the on-demand pricing model the way to go, or should I try to make use of Spot Instances (I have never tried them before)?
My time frame is flexible, but faster is always better. I have a budget of ~$1300 (US) to work with and will likely use a server in the Asia Pacific region. I'm not certain this will cover all samples (any remaining samples will be completed on a local server).
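The budget question comes down to simple arithmetic, sketched below. The hourly rates, the hours per library, and the fixed cost are made-up placeholders, not real AWS prices or measured runtimes; the actual values would have to come from the pricing page for the chosen region and from timing one library.

```python
def libraries_within_budget(budget_usd, hourly_rate_usd, hours_per_library,
                            fixed_costs_usd=0.0):
    """Number of whole libraries that fit in the budget.

    fixed_costs_usd covers anything not tied to a single library,
    such as storage for the databases (placeholder, assumed flat).
    """
    cost_per_library = hourly_rate_usd * hours_per_library
    remaining = budget_usd - fixed_costs_usd
    if remaining <= 0 or cost_per_library <= 0:
        return 0
    return int(remaining // cost_per_library)


if __name__ == "__main__":
    budget = 1300.0
    # Hypothetical numbers for illustration only.
    on_demand_rate = 2.0   # USD per hour, placeholder
    spot_rate = 1.0        # USD per hour, placeholder
    hours = 24.0           # hours per library, placeholder
    storage = 100.0        # flat storage cost, placeholder

    for label, rate in [("on-demand", on_demand_rate), ("spot", spot_rate)]:
        n = libraries_within_budget(budget, rate, hours, storage)
        print(f"{label}: about {n} libraries within ${budget:.0f}")
```

With ~20 libraries, whatever does not fit under the computed number would be the part that goes to the local server.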
To store the files and databases, I estimate I will need 700 GB of storage. Is General Purpose SSD storage recommended for this?
I am familiar with AWS, so I would most likely use it in this case, but I am always willing to look into other options.