We are creating a specification for a new server for our research scientists in applied genetics and breeding (I am a computer scientist).
Any advice regarding the number of processors/cores and RAM would be most helpful. Here is some background on the work being carried out, kindly written by a colleague of mine:
We work in applied genetics, breeding and plant pathology-related research on crops and their pathogens. In this context we will increasingly use next-generation sequencing (NGS) data, mostly Illumina HiSeq data but occasionally 454 data. We work with diploid and polyploid crops, e.g. wheat, with medium to large genome sizes (up to ~17 Gbases). We deal with genomic as well as transcriptomic sequencing data. The sequencing itself is contracted out, but we'd prefer to do the downstream analyses in-house.
What we do with NGS data:
- Trimming and assembly to reference genomes or de novo assembly, starting from genomic or RNAseq reads. Some assemblies would be done from multiple datasets; the biggest project proposed so far would involve assembly and RNASeq analyses with 24 × 30 GB files containing approx. 2.6 bn reads, but most projects would be considerably smaller, up to 50% of that volume of data.
- Mapping of reads to a reference (genomes and sets of contigs) and variant calling, incl. detection of new variants (SNPs, insertions/deletions, SSRs), e.g. for marker development.
- RNAseq expression analysis and related stats.

We are likely to use several open source and commercial software packages as user preferences/needs vary. The number of people working on NGS data is small (<10), so usually only one assembly or mapping job is likely to be run at any given time. Having to run two small-ish ones in parallel may be required on the odd occasion.
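For sizing purposes, the figures quoted for the biggest proposed project can be turned into a quick back-of-envelope estimate. This is only a sketch using the numbers above; per-read record size will vary with read length and FASTQ formatting:

```python
# Back-of-envelope sizing for the largest proposed project,
# using only the figures quoted above (24 files of ~30 GB, ~2.6 bn reads).

n_files = 24
gb_per_file = 30
total_gb = n_files * gb_per_file          # raw FASTQ volume on disk

reads = 2.6e9
bytes_per_read = total_gb * 1e9 / reads   # rough record size incl. header + quality lines

print(f"raw input: {total_gb} GB")
print(f"~{bytes_per_read:.0f} bytes per read record")
```

That is roughly 720 GB of raw input for the single biggest project, before intermediates and assembly output, which is worth keeping in mind against the 24TB SAN.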
Unfortunately, the budget is small for the server itself - approximately £10,000 (~ 16,000 USD), though it could be increased if it is not enough to work with. Some of the servers I have looked at offer 256GB of RAM. Would this be enough to run the tools cited above? Do they typically page to disk if memory is not available or just fall over?
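On the paging question: behaviour varies by tool. Many assemblers simply abort with an out-of-memory error rather than degrading gracefully, though some designs (e.g. those built on memory-mapped files) effectively page on their own. A quick pre-flight check of installed RAM before launching a job can be scripted; this sketch assumes a Linux target, where the POSIX `sysconf` names below are available:

```python
import os

# Report installed physical RAM via POSIX sysconf.
# Assumes Linux; the sysconf names may differ on other platforms.
page_size = os.sysconf("SC_PAGE_SIZE")
phys_pages = os.sysconf("SC_PHYS_PAGES")
total_ram_gb = page_size * phys_pages / 1e9

print(f"installed RAM: {total_ram_gb:.1f} GB")
if total_ram_gb < 256:
    print("warning: below the 256 GB being considered for the new server")
```

A wrapper like this around job submission at least turns a cryptic crash hours in into an up-front warning.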
On the data storage side, I think we would be ok to start with as there will be a dedicated SAN of 24TB with spare bays.
Finally, we would like to use VMware, with the server acting as a host for just one virtual server. Would this present any issues?
Thank you for any advice that can be provided.
Agreed, these specs are what we have for some of our servers (costing about £9k) and more than adequate for an RNASeq project. I recommend RAID6 for good use of disk space with decent performance. Some server builders might provide a ZFS filesystem with the machine which, in addition to a useful 'snapshot' utility, can almost double the effective size of the storage with compression at little CPU overhead. But proper BACKUPs must also be budgeted for, e.g. a mirrored server or tapes.
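The compression claim is easy to sanity-check, since FASTQ is plain text. A rough illustration with zlib (ZFS would typically use lz4, so the exact ratio is only indicative, and the synthetic reads below are an assumption, not real data — real reads are usually less uniform and compress somewhat better):

```python
import random
import zlib

# Generate synthetic FASTQ-like records.
# This is an assumption for illustration only -- real sequencing data
# has non-uniform base and quality distributions.
random.seed(1)
records = []
for i in range(2000):
    seq = "".join(random.choice("ACGT") for _ in range(100))
    qual = "".join(random.choice("IJHGF") for _ in range(100))
    records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
data = "".join(records).encode()

compressed = zlib.compress(data, level=6)
ratio = len(data) / len(compressed)
print(f"compression ratio: {ratio:.1f}x")
```

Even on uniformly random bases, a ratio around 2-3x falls out, which is consistent with the "almost double the effective storage" observation.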
Alastair, we tend to use Dell for our servers as this is what our support staff have always used. Do you have any recommendations for UK vendors? We are a not-for-profit research institution so can usually use academic partners.
In the past we have used http://www.dnuk.com/ and http://www.eclipsecomputing.co.uk/ and found both of them very competitive in price and very helpful.
Thank you both for the information; it's very useful. From what you have both suggested, as we've just purchased a new NAS with about 12TB of space (50% of bays used), we'll probably use this for backup to start with and use RAID6 with a similar amount of disk space on the server. We would save on the cost of the NAS, but lose that and a bit more on the server; still, it seems prudent.
I would be comfortable building a server myself, but I'm not sure our support team would like this.
It's the same here, so we usually order a custom rig from a vendor that can build and test it for us. This saves us a lot of time and provides some peace of mind.
12TB seems a bit on the small side long term, but it will obviously depend on how much sequencing you are doing. If you do complete hierarchical backups, the 12TB will vanish quickly. For perspective, we have just had to buy another 250TB to keep us going for another few years, and only selected data is hierarchically backed up. We support ~20 groups, with most of the data coming from ~6 of them.
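To put numbers on how quickly 12TB can vanish, here is a minimal runway projection. Every input is an assumption for illustration — substitute your own project sizes and backup policy:

```python
# Rough storage-runway projection. All inputs are assumptions for
# illustration -- substitute your own sequencing plans.
capacity_tb = 12.0        # the backup NAS discussed above
tb_per_project = 1.0      # assumed: a medium project, raw data + intermediates
projects_per_year = 6     # assumed throughput
backup_copies = 2         # full/hierarchical backups multiply usage

tb_per_year = tb_per_project * projects_per_year * backup_copies
years = capacity_tb / tb_per_year
print(f"~{tb_per_year:.0f} TB/year -> {years:.1f} years of runway")
```

Under these (modest) assumptions the NAS fills in about a year, which supports the point that 12TB is a starting position rather than a long-term plan.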