How Much Would You Spend For Bioinformatics HPC Infrastructure With IaaS Capability?
5
7
Entering edit mode
12.4 years ago
Hanif Khalak ★ 1.3k

Let's say you were involved in, or responsible for, a bioinformatics HPC infrastructure project: you would have to come up with a solution in the current market that met the functional, budgetary, and technical requirements.

Assume the infrastructure was going to serve a standard biomedical research institution with various biology labs and projects, including next-gen sequencing for DNA/RNA/etc., proteomic and metabolic profiling, microarrays, microscopy, and possibly medical imaging, plus various related databases and applications. Let's also include some IaaS capability for easier provisioning of general and even HPC requests. That hopefully gives a rough scope for the functional aspects.

To accommodate large-memory tasks like de novo assembly, RAM is a priority, with a good number of cores accessing shared system memory. Of course, storage scaling and performance are salient as well.

Using budget as a surrogate for scale, let's define 3 approximate levels:

  • small: $50K
  • medium: $250K
  • large: $1M

Keeping in mind issues related to computational and environmental power, density, scaling, configuration, and maintenance - how would you go about spec'ing your solution and spending your budget at each level? Also assume that additional costs for installation, service, etc. will not need to come out of this budget. Some HPC server options I've seen out there include Dell, Transtec, and Supermicro.

Please share thoughts and experiences from the wise to the wary!

hardware cloud • 11k views
ADD COMMENT
1
Entering edit mode

I am obviously biased, but some clarifying questions: have you considered facilities costs (do you need to build a data center)? What are your RAS costs (replacements, servicing)? How long do you plan to amortize your facility and your equipment? What is your networking infrastructure (IB, 10GigE, GigE) and network topology? All of these are things you should be thinking about if you want to do anything reasonably serious and sustainable.

I would argue that today 3 years is an eternity for a cluster, given that Intel fits two ticks and a tock into that timeframe.

ADD REPLY
0
Entering edit mode

Excellent and relevant question.

ADD REPLY
0
Entering edit mode

I currently have the luxury of adequate space in an existing enterprise data center; for simplicity I'd focus on hardware/software, but I included environmental issues in the question as a factor in the specifications.

ADD REPLY
3
Entering edit mode
12.4 years ago

My thoughts on how I would manage this:

  • Use multiple small/medium-budget resources for different groups and scale them over the years
  • Resources should be grouped according to OS and other requirements
  • Typical turnaround time is 3-5 years for the clusters that I have used; after this they can be recycled as data storage devices
  • For every computing cluster, there should be a corresponding storage cluster

My approximation is based on pricing of roughly $10K or less for 24 cores, 64 GB RAM, and 10 TB of space. So your $50K budget for a small cluster will fetch you about 120 cores, 320 GB of RAM, and 50 TB of space.
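If it helps to see the arithmetic, here is a minimal Python sketch of that sizing rule; the ~$10K unit (24 cores, 64 GB RAM, 10 TB) is the rough estimate above, not a vendor quote, and linear scaling ignores switches, racks, and storage controllers.

    # Back-of-the-envelope cluster sizing from a budget, using the rough
    # ~$10K building block quoted above (24 cores, 64 GB RAM, 10 TB disk).
    # These unit figures are estimates, not vendor quotes.
    UNIT_PRICE_USD = 10_000
    UNIT_CORES = 24
    UNIT_RAM_GB = 64
    UNIT_DISK_TB = 10

    def size_cluster(budget_usd):
        """Return (units, cores, ram_gb, disk_tb) purchasable for a budget."""
        units = budget_usd // UNIT_PRICE_USD
        return units, units * UNIT_CORES, units * UNIT_RAM_GB, units * UNIT_DISK_TB

    for label, budget in [("small", 50_000), ("medium", 250_000), ("large", 1_000_000)]:
        units, cores, ram, disk = size_cluster(budget)
        print(f"{label:>6}: {units:3d} units -> {cores} cores, {ram} GB RAM, {disk} TB disk")

For the $50K case this reproduces the 120 cores / 320 GB / 50 TB figure above.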

  • NGS analysis: Medium computing cluster. If you are comfortable with Galaxy, deploy Galaxy on top of it until your custom workflows are designed and optimized on the new system. Genome, RNA, exome, targeted resequencing, bisulfite sequencing, etc. should go here. Assign a small data cluster, in collaboration with your sequencing facility, for saving .fastq / SRA format files as local storage of the institute's raw data.

  • Chip analysis: Small computing cluster. All chip-based data could go here: microarray, genotyping, ChIP-chip, etc. Make sure you have all the generic data files. RAM should be assigned depending on the average number of samples your labs generally deal with. Think about the steep computing power required for imputation (database-wide, genome-wide), etc.

  • Imaging & microscopy: Medium storage cluster and computing cluster. From my experience with genome-wide RNAi screening, imaging usually takes a lot of storage, so finding a single optimal format will be key here. Labs may already have legacy programs/code for image analysis tied to the vendor, and these mostly run on Windows. Find corresponding Linux versions or use open-source tools.

  • In a similar way, you can group other resources and organize people who use similar applications into different HPC groups.

In my opinion, it is better to assemble your HPC using vendors with experience in big-data biology, rather than buying it off the shelf from generic vendors. I recently had a good experience with PSSC Labs.

PS: This is based on my personal experience working in my graduate lab (mostly high-throughput computational analysis, imaging, metabolomics, proteomics, etc.) and recent experience (mostly expression, genotyping, sequencing) in my post-doc lab. I was part of purchase decisions on HPC solutions for local computing of up to 200 quad-core nodes with InfiniBand switches, an HPC solution with 40 TB of usable space, etc. I am not associated with the vendor(s) mentioned in this answer.

ADD COMMENT
0
Entering edit mode

Thanks for the detailed and helpful input - I'm still digesting the various suggestions re: the storage options, esp. for the medium-range solution. BTW, PSSC Labs looks very interesting - their workstation is the only 8-slot option I've seen other than the CX1(?).

ADD REPLY
0
Entering edit mode

Glad to know that my input is useful. Please contact PSSC Labs; they will configure it as you require. Also check this discussion on vendors: http://biostar.stackexchange.com/questions/168/looking-for-linux-vendors-to-order-a-centos-ubuntu-based-webserver-to-host-bioinf

ADD REPLY
0
Entering edit mode

Thanks for the tip - for my own build-out, I'm working with an integrator; they could go with PSSC as the OEM.

ADD REPLY
3
Entering edit mode
12.3 years ago
lh3 33k

Just to give you an idea of how much memory is needed to de novo assemble a mammalian genome, here is the configuration used by the ALLPATHS-LG group (Gnerre et al., published at the end of 2010):

[ALLPATHS-LG hardware configuration table not reproduced]

ALLPATHS-LG is one of the best state-of-the-art assemblers. It is really worth trying if you have the right type of data. SGA uses the least memory among the published assemblers: it reportedly uses <64 GB of memory for 30X human data, but you may still need >64 GB for deeper coverage or noisier data. Also, considering caches and other programs running on the same machine, you probably want to have at least 128 GB of shared memory.
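To make that planning concrete, here is a rough Python sketch that turns a run's read count into an approximate shared-RAM target. The 64 GB at ~30X anchor is the SGA figure quoted above; the linear scaling, the 2x headroom, and the example read numbers are assumptions for illustration, not published formulas.

    # Rough memory-planning heuristic for de novo human assembly, based on the
    # figures quoted above (SGA: <64 GB at ~30X human; plan for >=128 GB to
    # cover deeper/noisier data, caches, and co-running jobs). Illustrative only.
    HUMAN_GENOME_BP = 3.1e9

    def coverage(n_reads, read_len_bp, genome_bp=HUMAN_GENOME_BP):
        """Average sequencing depth = total bases / genome size."""
        return n_reads * read_len_bp / genome_bp

    def suggested_ram_gb(depth):
        # Assumption: scale the 64 GB @ 30X figure linearly and double it for
        # headroom; a planning guess, not a published formula.
        return max(128, 2 * 64 * depth / 30)

    depth = coverage(n_reads=1.2e9, read_len_bp=100)   # hypothetical run, ~39X
    print(f"~{depth:.0f}X coverage -> plan for ~{suggested_ram_gb(depth):.0f} GB shared RAM")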

EDIT: Just read about the Dell machine you mentioned in your post. Note that you do not need a GPU; little bioinformatics software can take advantage of one. I would certainly drop the Dell one from your list. Also, my limited experience is that Intel CPUs are visibly faster than AMD at a similar clock speed, but this is a minor point.

ADD COMMENT
0
Entering edit mode

Thanks for the very relevant point re: de novo assembly - I've gotten similar recommendations re: big RAM. Re: GPU, I've also been told the 16-GPU chassis is not practical or proven in the field; however, GPGPU is not a side story anymore, and for research purposes, especially image analysis, it's something we want to use.

ADD REPLY
0
Entering edit mode

Yes, GPGPU may deliver a dramatic speedup for image analysis, protein folding (I have heard), and matrix operations, but just do not put too much money into it if your focus is sequence analysis.

ADD REPLY
0
Entering edit mode

In sequence analysis, GPU is a side story and probably will be for the next few years. There are too few production-quality GPU algorithms, and the few that exist are so tied to specific hardware that they are less useful when you do not have the hardware the developers had at hand. Really investigate thoroughly before investing in GPUs.

ADD REPLY
2
Entering edit mode
12.3 years ago

I will assume that you don't have to pay for the server room, power, and cooling.

If you are scaling horizontally, then hardware is relatively cheap nowadays. Roughly $350K of hardware costs should be good enough for NGS data analysis. It would also be a good idea to spend $250K initially and then $100K on an upgrade in 1 or 2 years.

I maintained an HPC cluster for genomics research. The cluster had a low-latency InfiniBand network and a high-end storage system (SAN); these components cost over $100K each. I built up and maintained the cluster for 4 years.

The most important part of a cluster handling NGS data is the person who will build and maintain it. I would suggest hiring a rock star with relevant senior system-administration experience (i.e. big data, big read/write load) and senior-level software engineering skills. The software engineering skills will be needed for troubleshooting, monitoring, and automation, not for actual software engineering. The salary would be no less than $100K/year.
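As a rough sketch only, combining the figures in this answer ($250K initial hardware, a $100K upgrade after a couple of years, a $100K/year admin), the cost of ownership works out something like the following; facilities, power, and storage growth are deliberately left out, and the amortization periods are arbitrary.

    # Rough total-cost-of-ownership sketch using the numbers in this answer:
    # $250K initial hardware, a $100K upgrade after 2 years, >= $100K/year
    # sysadmin salary. Facilities and power are intentionally excluded.
    def tco(years, initial_hw=250_000, upgrade_hw=100_000, upgrade_year=2,
            admin_salary=100_000):
        hardware = initial_hw + (upgrade_hw if years > upgrade_year else 0)
        staff = admin_salary * years
        return hardware + staff

    for years in (3, 4, 5):
        total = tco(years)
        print(f"{years} years: ${total:,} total (~${total // years:,}/year)")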

ADDENDUM: I just noticed that you mentioned shared memory. I don't have much experience with shared-memory clusters; however, I estimate that such a requirement would be very expensive.

ADD COMMENT
0
Entering edit mode

Thanks for the valuable input - I agree with the hiring recommendation, though that is accounted for separately. The big [shared-]memory nodes are indeed part of the requirement; fortunately, several vendors (see my answer) now have options that don't totally break the budget.

ADD REPLY
1
Entering edit mode
12.4 years ago
Gjain 5.8k

Hi Hanif,

Recently I read somewhere that:

Computation is becoming the "third leg of science", along with theory and experimentation.

Assuming this to be a standalone HPC center, many things should be taken into consideration. You mentioned power, scaling, configuration, and maintenance, which more or less cover those aspects. Taking them one by one:

1) Power:

GREEN is the buzzword here.

For this part I would strongly urge you to read the article published in Linux Magazine; it gives good insight into the hidden costs involved in these types of operations.

2) Configuration:

Considering Dell, a typical node can cost somewhere between $3K and $4K. Assuming each node includes 2 CPUs with 8 cores in total, a 64-node cluster (128 processors, 512 cores) costs about $192K-$256K.

3) Scaling and Maintenance:

The time and resources required to manage a cluster are significant, so leaving room in the budget for parallel-computing management systems will definitely help.

Taking all of these points into account, I would spec out the budget as follows:

+---------+-------+---------------+-------------------------+
|         | Power | Configuration | Scaling and Maintenance |
+---------+-------+---------------+-------------------------+
| Level 1 |   $5K |      $30K     |           $15K          |
| Level 2 |  $30K |     $180K     |           $40K          |
| Level 3 | $200K |     $650K     |          $150K          |
+---------+-------+---------------+-------------------------+
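For a quick sanity check on the Configuration column, here is a small Python sketch using the $3K-$4K-per-node estimate above (2 CPUs, 8 cores per node); these node prices are rough figures, not quotes.

    # Sanity check on the "Configuration" budgets above, using the rough
    # $3K-$4K per-node estimate (2 CPUs, 8 cores per node). Not vendor quotes.
    NODE_PRICE_RANGE = (3_000, 4_000)   # USD per node
    CORES_PER_NODE = 8

    for level, config_budget in [("Level 1", 30_000), ("Level 2", 180_000), ("Level 3", 650_000)]:
        min_nodes = config_budget // NODE_PRICE_RANGE[1]   # pessimistic: expensive nodes
        max_nodes = config_budget // NODE_PRICE_RANGE[0]   # optimistic: cheap nodes
        print(f"{level}: {min_nodes}-{max_nodes} nodes "
              f"({min_nodes * CORES_PER_NODE}-{max_nodes * CORES_PER_NODE} cores)")

Level 2 comes out at roughly 45-60 nodes (360-480 cores), consistent with the 64-node example costing $192K-$256K above.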

I would also like to point to another paper, Formulating the Real Cost of DSM-Inherent Dependent Parameters in HPC Clusters.

I hope this helps.

ADD COMMENT
0
Entering edit mode

The 10-20% for power and 15% for scaling & maintenance seem reasonable - the 2-CPU nodes are a bit thin for big-data requirements, though.

ADD REPLY
1
Entering edit mode
12.3 years ago
Hanif Khalak ★ 1.3k

This presentation by Transtec matches well with my current ideas for the small and medium solutions - the actual HPC spec starts at slide 18. I wish I'd known about the November MEW22 meeting!

Summary

  • small = deskside superworkstation: 12+ cores, 48GB+ RAM
  • medium = ~300 cores in 24 nodes (Supermicro Twin2), total ~1TB memory, 100TB+ storage (pNFS)
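As a quick check, here are the per-node numbers implied by that medium target; the 2 x 6-core / 48 GB per-node split is an assumption that happens to fit the Twin2 form factor, not a vendor configuration.

    # Per-node figures implied by the "medium" target above: 24 Supermicro
    # Twin2 nodes, ~300 cores, ~1 TB aggregate RAM. The per-node split is an
    # assumption for illustration, not a vendor configuration.
    NODES = 24
    TARGET_CORES = 300
    TARGET_RAM_TB = 1.0

    cores_per_node = TARGET_CORES / NODES            # ~12.5 -> e.g. dual 6-core sockets
    ram_per_node_gb = TARGET_RAM_TB * 1024 / NODES   # ~43 GB -> round up to 48 GB

    print(f"{cores_per_node:.1f} cores/node, {ram_per_node_gb:.0f} GB RAM/node; "
          f"2 x 6-core CPUs and 48 GB per node gives {NODES * 12} cores "
          f"and {NODES * 48 / 1024:.3f} TB in total")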

Comments

  • Small - get a bigger deskside option (i.e. more cores and RAM), e.g. PSSC or Cray CX1

  • Medium - Supermicro nodes make a great $/perf choice, allowing budget for costly Panasas storage

  • Medium - need a good, scalable storage alternative for the 200-300TB range

  • Large - consider vendor-direct options (Dell, HP, IBM, Cisco), or guys like T-Platforms

  • Medium-Large - consider budgeting for HPC-cloud management tools like Bright and Techila (whitepaper) to maximize and simplify utilization of infrastructure

  • Medium-Large - consider budgeting for a system integrator like ClusterVision or Transtec, even if you've got a large team (in which case you're probably thinking about a big solution) -- complete hardware spec, fit, burn-in testing, etc. can be a lot of work/risk

  • Medium - open-source HPC-cloud software options: Proxmox, Eucalyptus, OpenStack, OpenNebula (see the provisioning sketch below)
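If the IaaS layer ended up on OpenStack, for example, programmatic provisioning of an analysis node would look roughly like the following openstacksdk sketch; the cloud, image, flavor, and network names are placeholders for whatever the local private cloud exposes, and this is an illustration rather than a tested recipe.

    # Minimal illustration of self-service IaaS provisioning with the
    # OpenStack SDK. All names/IDs below are placeholders, not real resources.
    import openstack

    # Credentials and region come from clouds.yaml or OS_* environment variables.
    conn = openstack.connect(cloud="institute-private-cloud")   # hypothetical cloud name

    image = conn.compute.find_image("centos-hpc-node")          # hypothetical image
    flavor = conn.compute.find_flavor("m1.xlarge")              # hypothetical flavor
    network = conn.network.find_network("lab-net")              # hypothetical network

    server = conn.compute.create_server(
        name="galaxy-worker-01",
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )
    conn.compute.wait_for_server(server)
    print(f"Provisioned {server.name} ({server.id})")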

Disclaimer: I am not affiliated in any way with Transtec or any other vendor.

ADD COMMENT
