Buying computational resources vs. renting allocation on cluster
2
1
7.2 years ago
ebrown1955 ▴ 320

My lab is about to receive 30X WGS data (FASTQs) for roughly 400-1,500 samples (depending on how many we get at a time) that need to be processed and analyzed. We are trying to figure out the best route for computational resources. Here are some options:

1. We can buy an allocation from Google Cloud Platform for $1,146 per month (assuming it would be on 24/7, since Google charges by the minute), which would include:
• 32 vCPUs with 120GB of RAM
• 1.5TB of SSD space
2. We can purchase a node through our university for about $700 per month; however, we would be limited to 14,000 core hours per month.
• 20 cores
• 256GB RAM
3. We can purchase a system through Dell for about $15,000-20,000, which has 100+ cores, a few TB of HDD space (we already have NAS storage), and a 3-year warranty.

Now, it seems as though there are pros and cons to each of these choices. For example, purchasing a server directly from Dell would have hidden costs such as upkeep, and it would depreciate over time, whereas resources rented from GCP or our university would be constantly upgraded. On the other hand, in the long term, purchasing a system seems more economical (provided our computational needs don't rise significantly over the next 3 years).

Are there any options that I may be missing? Also, am I over/underestimating our computational needs for this type of analysis?

cluster cloud computing server processing power

2

I think you are seriously underestimating the scale of your analysis. Going from FASTQ to VCF for one 30x WGS sample may take days on a single (multi-core) node, depending on hardware and pipeline (GATK is particularly resource hungry). Multiply that by 1000 and you see that your options 2 and 3 are not really feasible.

Same for storage. Assuming 150 GB per 30x WGS sample (lower end), you need 150 TB just to hold the raw FASTQ data. I would multiply that number by at least 3 to store intermediate data and results.

For the initial heavy lifting, I would either go for the cloud or, if you have access to one, a local computing cluster with at least several dozen (better, hundreds of) computing nodes at unlimited disposal. After that, a high-end server as in your option 3 _could_ suffice, although it would be better to keep cloud/cluster access to speed up parallel processing of your samples.

1
7.2 years ago
matted 7.7k

There are cloud-based providers that automate alignment and sample processing, including DNAnexus and SevenBridges.
I don't know what your specific needs are, but that could be useful if you want to run "standard" pipelines and get to variant calls quickly. The pricing is per sample and may or may not be competitive, depending on how much other processing you want to do and how long you'd leave a homegrown EC2 cluster running. I should point out that I've never used these services myself, so I can't vouch for them.

The GCP/EC2 costs are hard to predict, since usage will probably be very elastic. Presumably you need/want a lot of computing early on, but after the heavy tasks (alignment and variant calling) are done, you can scale back. So one frugal option may be to use a cloud solution for ~1 month to do the heavy lifting, then analyze the processed data (e.g. VCF files) on more modest servers that you have locally (some shared existing server, or maybe purchasing something in the $5K range instead of a 100-core monster).
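To make the rent-versus-burst trade-off concrete, here is a rough Python sketch. The monthly prices are the ones quoted in the question; the 36-month horizon matches the 3-year warranty mentioned there. The number of cloud instances in the burst scenario and the local server price are placeholders (assumptions for illustration only, not estimates of what the workload actually needs), and upkeep, storage, and data transfer are ignored.

```python
# Prices quoted in the question (USD). Everything else is an assumption.
GCP_MONTHLY = 1146         # one always-on 32-vCPU instance
UNIVERSITY_MONTHLY = 700   # one 20-core node, capped at 14,000 core hours


def rental_cost(monthly_usd, months, n_instances=1):
    """Total cost of renting n_instances for a number of months."""
    return monthly_usd * months * n_instances


def burst_then_local(cloud_monthly, burst_months, n_instances, local_upfront):
    """Cost of a short cloud burst followed by a one-off local purchase."""
    return rental_cost(cloud_monthly, burst_months, n_instances) + local_upfront


months = 36  # horizon matching the 3-year warranty in the question

gcp_always_on = rental_cost(GCP_MONTHLY, months)
university_node = rental_cost(UNIVERSITY_MONTHLY, months)

# Placeholder scenario: 10 instances for 1 month, then a 5,000 USD server.
# Both numbers are hypothetical and should come from a benchmark.
hybrid = burst_then_local(GCP_MONTHLY, burst_months=1, n_instances=10,
                          local_upfront=5000)

print(f"GCP, one instance, {months} months: ${gcp_always_on:,}")
print(f"University node, {months} months:   ${university_node:,}")
print(f"Burst + local server (placeholder): ${hybrid:,}")
```

The useful part is the shape of the comparison, not the numbers: once you know how many instance-months the heavy lifting needs, you can plug that in and compare it against the flat monthly options.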

Storage will be complicated, especially at the scale you're discussing (unless you discard the raw reads). Leaving that in the cloud, even on something like Amazon Glacier, will be expensive after a few years. However, keeping it locally means you have to really trust whatever backup solution you have around you. If your local NAS storage is well-managed and can handle the size requirements, that may be the simplest option. I guess my main point here is to think carefully about both computing and storage, and try to optimize each one (and not to neglect the cost of long-term storage, including adequate backup/recovery).
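As a back-of-the-envelope check on storage, the sketch below uses the figures mentioned in this thread (roughly 150 GB of raw FASTQ per 30x sample, and a factor of about 3 to cover intermediate files and results). Both are rough estimates rather than measurements, the function name is made up for illustration, and 1 TB is taken as 1000 GB.

```python
def storage_estimate_tb(n_samples, gb_per_sample=150, overhead_factor=3):
    """Return (raw_tb, total_tb) for a batch of WGS samples.

    gb_per_sample and overhead_factor default to the rough figures quoted
    in this thread; replace them with numbers measured on your own data.
    """
    raw_tb = n_samples * gb_per_sample / 1000  # decimal TB
    return raw_tb, raw_tb * overhead_factor


# Sample counts from the question (400-1500), plus 1000 as a midpoint.
for n in (400, 1000, 1500):
    raw, total = storage_estimate_tb(n)
    print(f"{n:>5} samples: {raw:6.1f} TB raw, {total:6.1f} TB with intermediates")
```

If you discard or heavily compress the raw reads after processing, the overhead factor changes a lot, so it is worth deciding on a retention policy before sizing the storage.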

1
7.2 years ago

I suggest you generate synthetic data, or download some public data, and run through your planned pipeline to calculate how much computation it actually needs per unit of data, how much memory, disk, etc. You could do that on rented/leased compute resources. To determine whether you should buy or rent, and what sort of system you should buy if you are buying, you must know (at least approximately) how much data you will get over what timeframe, what resources you will need per unit of data (this depends specifically on your chosen pipeline, and can vary by orders of magnitude), and what kind of latency you (or your customers) are willing to accept.
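Once you have a benchmark, the extrapolation is simple arithmetic. Here is a minimal Python sketch, assuming the pipeline scales roughly linearly with the number of samples. The 500 core hours per sample is a placeholder, not a measurement; the 14,000 core hours per month and the 32-core node come from the options in the question, and the function names are made up for illustration.

```python
import math


def total_core_hours(n_samples, core_hours_per_sample):
    """Total compute, assuming cost scales linearly with sample count."""
    return n_samples * core_hours_per_sample


def months_on_capped_allocation(n_samples, core_hours_per_sample,
                                core_hours_per_month):
    """Months needed when the allocation caps core hours per month."""
    total = total_core_hours(n_samples, core_hours_per_sample)
    return total / core_hours_per_month


def nodes_needed(n_samples, core_hours_per_sample, cores_per_node,
                 deadline_days, utilization=1.0):
    """Nodes required to finish by a deadline at a given utilization."""
    per_node = cores_per_node * 24 * deadline_days * utilization
    total = total_core_hours(n_samples, core_hours_per_sample)
    return math.ceil(total / per_node)


# Placeholder: replace with the value measured in your own benchmark.
BENCHMARK_CORE_HOURS_PER_SAMPLE = 500

total = total_core_hours(1000, BENCHMARK_CORE_HOURS_PER_SAMPLE)
months = months_on_capped_allocation(1000, BENCHMARK_CORE_HOURS_PER_SAMPLE,
                                     core_hours_per_month=14_000)
nodes = nodes_needed(1000, BENCHMARK_CORE_HOURS_PER_SAMPLE,
                     cores_per_node=32, deadline_days=30)

print(f"Total: {total:,} core hours")
print(f"Capped allocation: {months:.1f} months")
print(f"32-core nodes to finish in 30 days: {nodes}")
```

Real pipelines rarely keep every core busy, so a utilization below 1.0 is probably more realistic; the benchmark should tell you that too.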

What kind of system has 100+ cores, anyway? If you buy a cluster, you will probably also need someone to configure and manage it... I'm not aware of any single computer with 100+ cores.

0

Yeah, good answer. I was thinking about taking some 1000 Genomes data with similar coverage, running it on one of our current systems at different levels of parallelization, and checking how long it takes.

I was actually surprised to see a single machine with 100+ threads. This Dell system had 4 very-core-dense processors. http://www.dell.com/us/business/p/poweredge-r930/pd?~ck=anav

I'd imagine that with a single system (as opposed to a cluster) management wouldn't be as complex, and we could just have the university IT department manage it for us (for a fee, of course). The thing that worries me about external cloud is storage space and data transfer. If we choose cloud, we will have to leave the data on the cloud systems, and 100+ TB on cloud systems will be pricey. If we stay local, we can buy our own NAS systems (we already have about 100TB of space), add in a tape archive system for another $3,000-4,000, and we should have a good, stable backup and storage solution.

0

If you want to avoid the complexity of managing a cluster altogether, you could consider in-memory supercomputers like the SGI UV. They deliver terabytes of RAM and hundreds of computing nodes on a single Linux instance, but come at a price compared to "standard" servers.