Hi all,
I am looking to set up one or more EC2 instances to run my virus discovery pipeline. I have ~20 paired-end (100 bp, HiSeq 2500) libraries, each likely around 19.4M spots and 3.9G bases.
Basic pipeline:
Trinity to assemble paired reads
Estimate abundance using RSEM
blastn assembled contigs against Nucleotide database
diamond blastx assembled contigs against nr database
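The steps above can be sketched as command lines. This is a minimal sketch, not a tested workflow: the file names, the `nt` and `nr` database names, the thread and memory values, and the helper `pipeline_commands` are all placeholders, and it assumes RSEM is run through Trinity's `align_and_estimate_abundance.pl` wrapper with bowtie2. It only builds and prints the commands; nothing is executed.

```python
import shlex


def pipeline_commands(sample, left, right, threads=32, max_memory="200G"):
    """Return the four pipeline steps as argument lists (nothing is run)."""
    out_dir = f"trinity_{sample}"  # Trinity expects "trinity" in the output name
    # Assumption: contigs are written here. Newer Trinity releases write
    # <out_dir>.Trinity.fasta next to the directory instead.
    contigs = f"{out_dir}/Trinity.fasta"
    return [
        # 1. Assemble the paired reads
        ["Trinity", "--seqType", "fq", "--left", left, "--right", right,
         "--CPU", str(threads), "--max_memory", max_memory,
         "--output", out_dir],
        # 2. Estimate abundance with RSEM (via the Trinity wrapper script)
        ["align_and_estimate_abundance.pl", "--transcripts", contigs,
         "--seqType", "fq", "--left", left, "--right", right,
         "--est_method", "RSEM", "--aln_method", "bowtie2",
         "--trinity_mode", "--prep_reference",
         "--thread_count", str(threads),
         "--output_dir", f"rsem_{sample}"],
        # 3. blastn contigs against the nucleotide database (placeholder: nt)
        ["blastn", "-query", contigs, "-db", "nt", "-outfmt", "6",
         "-num_threads", str(threads), "-out", f"{sample}.blastn.tsv"],
        # 4. diamond blastx contigs against the protein database (placeholder: nr)
        ["diamond", "blastx", "--query", contigs, "--db", "nr",
         "--outfmt", "6", "--threads", str(threads),
         "--out", f"{sample}.diamond.tsv"],
    ]


if __name__ == "__main__":
    # Hypothetical sample and read file names, for illustration only
    for cmd in pipeline_commands("lib01", "lib01_R1.fastq.gz", "lib01_R2.fastq.gz"):
        print(shlex.join(cmd))
```

Keeping one library per call makes it straightforward to hand different libraries to different instances, should running in parallel turn out to be worthwhile.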
I have run this on an r5.8xlarge (256 GiB memory, 32 vCPUs) before, but I am wondering:
Is there a better instance type to run this on, or would it be more efficient to run it on multiple instances in parallel?
Is the on-demand pricing model the way to go, or should I try to make use of spot instances (I have never tried them before)?
My time frame is flexible, but faster is always better. I have a budget of ~$1300 (US) to work with and will likely use a region in Asia Pacific. I'm not certain this will cover all samples; any remaining samples will be completed on a local server.
To store the files and databases, I estimate I would need about 700 GB of storage. Is general purpose SSD storage recommended for this?
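The budget and storage figures above can be turned into a rough number of instance-hours per library. This is only a back-of-the-envelope sketch: the helper `hours_per_library` is hypothetical, and the hourly and storage rates in the example are made-up placeholders, not real AWS prices, so the current rates for the chosen region and pricing model need to be substituted.

```python
def hours_per_library(budget, hourly_rate, n_libraries,
                      storage_gb=0.0, storage_rate=0.0, months=1.0):
    """Instance-hours affordable per library after paying for storage.

    budget        total budget in USD
    hourly_rate   instance price in USD per hour (on-demand or spot)
    storage_rate  volume price in USD per GB-month
    """
    storage_cost = storage_gb * storage_rate * months
    compute_budget = budget - storage_cost
    if compute_budget <= 0:
        raise ValueError("storage alone exceeds the budget")
    return compute_budget / hourly_rate / n_libraries


if __name__ == "__main__":
    # Placeholder rates for illustration only; NOT actual AWS pricing.
    hours = hours_per_library(
        budget=1300, hourly_rate=2.0, n_libraries=20,
        storage_gb=700, storage_rate=0.125, months=1,
    )
    print(f"{hours:.2f} instance-hours per library")
```

If the measured run time per library on the r5.8xlarge is above the number this gives, the budget will not cover all 20 libraries at that rate, which is where a cheaper spot rate or a smaller instance would change the answer.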
I am familiar with AWS, so I would most likely use it for this, but I am always willing to look into other options.
Thank you!