Question: What are the IT requirements for high-throughput sequence analysis?
Davy (United States) wrote, 6.6 years ago:

Hi All,

So my department is considering spending some money on upgrading our computing facilities, which are pretty under-powered for any kind of serious sequencing analysis. I've been asked to come up with a rough idea of what we're going to need for the next 5 to 10 years.

The obvious points are

  1. lots of CPUs
  2. lots of RAM
  3. lots of Storage

but I was hoping someone might know of a resource I could read that would let me come up with something that isn't a complete guess.

I was thinking somewhere in the region of 8 or 12 cores per node, 100 nodes total, 96-128 GB RAM per node, and (probably ridiculous, but) 5,000 TB of storage. We have a lot of samples that will be sequenced (probably not whole genome; exome, I would imagine), plus various other sequencing activities like RNA-seq and ChIP-seq. Things I'm woefully ignorant of are the architecture of these systems. Should we be building a distributed system (all the tech will likely be housed in one place)? What kind of hardware do we need to run the right software, so that I can make full use of it for mapping, variant calling, etc.? And what about power, cooling, and space requirements?
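For the storage line item, a back-of-envelope calculation helps turn the guess into a defensible number. The per-sample size, sample count, and replication factor below are illustrative assumptions, not figures from this thread:

```python
# Back-of-envelope storage estimate. Every number here is an assumed,
# illustrative figure, not data from the thread.

GB = 10**9
TB = 10**12

def storage_needed_tb(samples_per_year, years, per_sample_gb, replication=3):
    """Total storage in TB, keeping `replication` copies of everything."""
    raw_bytes = samples_per_year * years * per_sample_gb * GB
    return raw_bytes * replication / TB

# Assuming ~25 GB per exome (FASTQ + BAM + VCF combined),
# 2,000 samples/year over 5 years, everything kept in triplicate:
print(storage_needed_tb(2000, 5, 25))  # → 750.0 (TB)
```

Even fairly generous assumptions land well under 5,000 TB, which is one way to sanity-check a proposal before talking to vendors.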

Since it will be mostly me setting up the pipelines and pushing the data through it, I want to come up with some concrete numbers that will ensure that we can get analyses done quickly, and that we will have a system that we can scale as our needs increase (future-proofing).

Hope someone knows of something, Cheers, Davy.

Tags: bioinformatics, next-gen • 3.9k views
modified 6.2 years ago by William • written 6.6 years ago by Davy

Here is a somewhat older but still relevant post: Big Ass Servers & Storage

written 6.6 years ago by Istvan Albert

Would you consider a cloud solution that wouldn't require extra hardware and is more easily scalable?

written 6.5 years ago by shadowsage5540
Daniel Swan (Aberdeen, UK) wrote, 6.6 years ago:

How many samples is 'a lot'? I'm just a bit worried that you state you're ignorant of the 'architecture of the systems' (do you mean the analysis or the infrastructure?). If you're ignorant of the systems, you're not the best person to be pricing one up or speccing one out. You need to speak to someone (preferably a large range of vendors) with a concrete set of requirements. Work from there. And from long experience I can tell you that I/O is likely to be more critical to your choices than how many cores you have available. There has been some discussion on this site before: NGS data centers: storage, backup, hardware etc..

There seem to be some presentations from a workshop last year on HPC infrastructure for NGS:

http://bioinfo.cipf.es/courses/hpc4ngs

And another from 2011:

http://www.bsc.es/marenostrum-support-services/hpc-events-trainings/res-scientific-seminars/next-gen-sequencing-2011

There's a good primer here:

http://www.slideshare.net/cursoNGS/pablo-escobar-ngs-computational-infrastructure

And vendors with interest in the space:

http://www.emc.com/industry/healthcare.htm

modified 6.6 years ago by Istvan Albert • written 6.6 years ago by Daniel Swan

"And from long experience I can tell you that I/O is likely to be more critical to your choices than how many cores you have available." - just wanted to stress this part of your answer, no point having badass HPC if it takes hours (or days!) to pull up files from archive for analysis.

written 6.6 years ago by zx8754

Thanks for the info. As you say, I know full well I am not the best person to be doing this. I'm plenty familiar with the analysis of a small number of samples (20 to 100) in targeted regions (usually about 10 Mb in total), but the architecture is difficult to get my head around, which is why I've been chosen. An impetus to learn, I suppose.

written 6.6 years ago by Davy
William (Europe) wrote, 6.2 years ago:

Just based on my own experience:

1) A high-performance Sun Grid Engine cluster. Most NGS analysis (mapping with bwa, SNP calling with GATK, etc.) can easily be run in parallel by just splitting the data. You need a high-performance head node and a number of cluster nodes. I wouldn't worry too much about the size or speed of the cluster nodes; just make sure the cluster is upgradable and extendable for when newer, faster machines become cheaper or you get more money (because other groups also want to chip in). Something to start with, for example, would be 10 nodes with 8 cores and 32 GB of memory each, or a multiple of this.

2) High-performance shared data storage that the cluster can read from and write to over network shares. This is an important part where spending money makes sense. Most NGS compute clusters are I/O limited (moving input and output between a central server and the compute nodes is a bottleneck; NGS is not just a compute problem but also a data problem). Look at the high-end solutions that the big NGS centers have and see if you can buy the same.

3) A hardware-agnostic, massively scalable object storage system for archiving (long-term storage) of raw NGS and derived data. Once the computing is done you want to move the data to a less expensive storage system. I have good experience with commercial hardware-agnostic, massively scalable object storage systems. These are software based and run on an operating system derived from Linux, so you can buy and use any hardware as long as it can run Linux. You put all the nodes in a LAN, they boot from a USB key with the software, and together they form a storage cluster. Every file is stored in duplicate or triplicate on different nodes. Read and write requests are broadcast to the cluster, and the node with the most free resources executes the request. If a node breaks down you can throw it out of the window, and the missing data is automatically replicated from the other copies onto the remaining nodes, back up to the specified redundancy level. If you want to upgrade or extend your storage cluster, you buy new machines, put them in the network, and throw away the old ones; you don't need to do any administration.

4) For analyses that can't be run in parallel, or when you don't want to invest the work to make them parallel, you need a big-ass server: something like 48 CPUs and 1 TB of memory, plus a smaller variant, so you can have the big one run de novo assembly for weeks and still have a smaller big-ass server for other work.

5) A fast network between all your machines.
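To make point 1 concrete: the split-and-scatter pattern is scheduler-agnostic, so it can be sketched without Grid Engine at all. The chunking below is a minimal illustration; the file names and the qsub command in the comment are hypothetical placeholders, not part of any real pipeline:

```python
# Split a list of FASTQ files (or read chunks) into per-node work units,
# one unit per cluster job. Scheduler-agnostic: in practice each chunk
# would become one qsub/array-job task (hypothetical command below).

def make_chunks(items, n_jobs):
    """Distribute items across at most n_jobs chunks, as evenly as possible."""
    chunks = [[] for _ in range(n_jobs)]
    for i, item in enumerate(items):
        chunks[i % n_jobs].append(item)
    return [c for c in chunks if c]  # drop empty chunks if items < n_jobs

samples = [f"sample_{i:03d}.fastq.gz" for i in range(25)]
for job_id, chunk in enumerate(make_chunks(samples, 8)):
    # e.g. qsub -N map_job_{job_id} map_with_bwa.sh <chunk files...>
    print(job_id, len(chunk))
```

The same pattern works whether the unit of work is a sample, a lane, or a chunk of reads, which is why an embarrassingly parallel cluster handles most mapping and calling workloads well.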
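On point 2, it is worth measuring I/O rather than guessing. A crude sequential-throughput probe like the one below, pointed at the shared storage mount rather than local disk, shows whether the storage could feed the planned core count; the test size and file name are arbitrary choices for illustration:

```python
# Crude sequential write-throughput probe. Point `path` at the shared
# storage mount under test; numbers from local disk or /tmp can mislead.
import os
import tempfile
import time

def sequential_write_mb_s(path, size_mb=256, block=1 << 20):
    """Write size_mb of random data in 1 MiB blocks, fsync, return MB/s."""
    data = os.urandom(block)
    fname = os.path.join(path, "io_probe.bin")
    t0 = time.perf_counter()
    with open(fname, "wb") as f:
        for _ in range(size_mb):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())  # include the flush-to-disk cost
    elapsed = time.perf_counter() - t0
    os.remove(fname)
    return size_mb / elapsed

print(f"{sequential_write_mb_s(tempfile.gettempdir(), size_mb=64):.0f} MB/s")
```

Running it from several compute nodes at once shows how the shared storage behaves under concurrent load, which is the scenario that actually matters for a cluster.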
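The replicate-and-heal behaviour described in point 3 can be modelled in a few lines. This is a toy simulation of triplicate placement and automatic re-replication after a node failure, under assumed placement rules; it is not a description of any particular vendor's product:

```python
# Toy model of a replicated object store: each object lives on
# `replicas` distinct nodes; when a node fails, its objects are
# re-copied to surviving nodes until the replica count is restored.

class ObjectStore:
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.nodes = {n: set() for n in nodes}

    def put(self, obj):
        # Place copies on the `replicas` least-loaded nodes.
        targets = sorted(self.nodes, key=lambda m: len(self.nodes[m]))
        for n in targets[:self.replicas]:
            self.nodes[n].add(obj)

    def fail_node(self, node):
        lost = self.nodes.pop(node)
        for obj in lost:
            # Re-replicate to the least-loaded survivor lacking a copy.
            for n in sorted(self.nodes, key=lambda m: len(self.nodes[m])):
                if obj not in self.nodes[n]:
                    self.nodes[n].add(obj)
                    break

    def copies(self, obj):
        return sum(obj in held for held in self.nodes.values())

store = ObjectStore(["n1", "n2", "n3", "n4", "n5"])
for i in range(10):
    store.put(f"run_{i}.bam")
store.fail_node("n1")
print(all(store.copies(f"run_{i}.bam") == 3 for i in range(10)))  # → True
```

The point of the model: as long as replicas of an object never share a node, losing one machine only triggers copying, never data loss, which is why such systems tolerate whole-node failures with no administration.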

modified 6.2 years ago • written 6.2 years ago by William
Powered by Biostar version 2.3.0