Hardware for Crop NGS Data Tools
10.3 years ago
Justin Dyson ▴ 40

We are creating a specification for a new server for our research scientists in applied genetics and breeding (I am a computer scientist).

Any advice regarding the number of processors/cores and RAM would be most helpful. Here is a background to the work being carried out, kindly written by a colleague of mine:

We work in applied genetics, breeding and plant pathology related research on crops and their pathogens. In this context we will increasingly use next generation sequence (NGS) data, mostly Illumina HiSeq data but occasionally 454 data. We work with diploid and polyploid crops, e.g. wheat, with medium to large genome sizes (up to ~17Gbases). We deal with genomic as well as transcriptomic sequencing data. The sequencing itself is contracted out, but the downstream analyses we'd prefer to do in-house.

What we do with NGS data:

  • Trimming and assembly to reference genomes or de-novo assembly, starting from genomic or RNAseq reads. Some assemblies would be done from multiple datasets. The biggest project proposed so far would involve assembly and RNAseq analyses with 24 files of 30Gb each, with approx. 2.6 bn reads, but most projects would be considerably smaller, up to 50% of that volume of data.

  • Mapping of reads to references (genomes and sets of contigs) and variant calling, incl. detection of new variants (SNPs, insertions/deletions, SSRs), e.g. for marker development.

  • RNAseq expression analysis and related stats.

We are likely to use several open source and commercial software packages, as user preferences/needs vary. The number of people working on NGS data is small (<10), so usually only one assembly or mapping job is likely to be running at any given time. Having to run two small-ish ones in parallel may be required on the odd occasion.
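For a rough feel of the data volumes described above, the figures can be turned into a quick back-of-envelope estimate. This is only a sketch: it assumes "30Gb" means 30 gigabytes per file, and the scratch-space multiplier is a hypothetical placeholder, not a measured value.

```python
# Back-of-envelope sizing for the projects described above.
# Assumption: "30Gb" is read as 30 gigabytes per file.
N_FILES = 24
GB_PER_FILE = 30

# Hypothetical multiplier for intermediate files (trimmed reads, alignments).
# Replace it with a value measured on your own pipeline.
SCRATCH_MULTIPLIER = 3


def raw_input_gb(n_files=N_FILES, gb_per_file=GB_PER_FILE):
    """Total size of the raw input files, in GB."""
    return n_files * gb_per_file


def typical_upper_bound_gb(raw_gb, fraction=0.5):
    """Most projects are said to be up to 50% of the largest one."""
    return raw_gb * fraction


def scratch_estimate_gb(raw_gb, multiplier=SCRATCH_MULTIPLIER):
    """Very rough working-space estimate: raw size times an assumed multiplier."""
    return raw_gb * multiplier


if __name__ == "__main__":
    raw = raw_input_gb()
    print("largest project, raw input (GB):", raw)
    print("typical project, upper bound (GB):", typical_upper_bound_gb(raw))
    print("scratch space, assumed multiplier (GB):", scratch_estimate_gb(raw))
```

Even the raw input of the largest project is a noticeable fraction of a 24TB store once intermediate files are kept, so it is worth measuring the real multiplier on a small dataset first.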

Unfortunately, the budget is small for the server itself - approximately £10,000 (~ 16,000 USD), though it could be increased if it is not enough to work with. Some of the servers I have looked at offer 256GB of RAM. Would this be enough to run the tools cited above? Do they typically page to disk if memory is not available or just fall over?

On the data storage side, I think we would be ok to start with as there will be a dedicated SAN of 24TB with spare bays.

Finally, we would like to use VMware, with the server as a host (but just for one virtual server). Would this present any issues?

Thank you for any advice that can be provided.

ngs hardware illumina 454 server • 5.4k views
10.3 years ago
IV ★ 1.3k

For your initial question: yes, 256GB RAM is adequate for most jobs (especially RNA-Seq). For genetic studies and for analyses of multiple samples I really cannot say, since it depends on the study design, tools, etc.

OSs usually utilize virtual memory (HDD space) if physical RAM does not suffice but for NGS analysis this is not an option. We kill any process that utilizes more than the available RAM, since it will never (ever) end.
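One way to get this "fail instead of swapping" behaviour automatically is to cap a job's address space before it starts. The following is a minimal sketch, assuming a Linux host and Python's standard `resource` module; the helper name `run_with_memory_cap` and the 1 GiB figure are illustrative, not something we use as-is.

```python
import resource
import subprocess
import sys


def run_with_memory_cap(cmd, max_bytes):
    """Run cmd with its virtual address space capped at max_bytes.

    Allocations beyond the cap fail inside the child, so the job
    normally exits with an error instead of pushing the machine into swap.
    Returns the child's exit code.
    """
    def apply_limit():
        # Executed in the child process just before the command starts.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    return subprocess.run(cmd, preexec_fn=apply_limit).returncode


if __name__ == "__main__":
    one_gib = 1024 ** 3
    # A tiny child process that fits comfortably under the cap.
    code = run_with_memory_cap([sys.executable, "-c", "print('ok')"], one_gib)
    print("exit code:", code)
```

Note that `RLIMIT_AS` limits virtual address space rather than resident memory, so tools that memory-map large files or reserve big address ranges may need a more generous cap than their actual RAM use suggests.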

For NGS analysis there are three major bottlenecks: available cores, memory, and I/O speed. To maximize the workload you can analyze, you have to invest in all three, i.e. you need as many cores as possible, plenty of RAM, and fast local disks (at least for the analysis; for storage, a SAN is ok).

For that budget I would've gone for a custom built server and not a brand-name option (e.g. Dell, IBM, HP, etc).

£10,000 could get you a long way in a custom-built rig. There should be many dealers eager to build it for you and also offer support.

First you should decide if this is ok, since many facilities require next day service and so on and you might not find a dealer offering such support where you are.

Second, decide whether you want a rack-mount solution (4U) or a normal tower.

There are great cases for both options. Many towers offer multiple (20+) HDD bays, which matters because you will see significant differences in speed using local disks for alignment compared to drives over the network. For towers I like cases such as the Corsair Obsidian 9000 or cases from Lian Li offering more than 20 HDD bays. For rack-mount options, Norco and Supermicro are also fantastic.

For your budget you could build a 2 X Xeon 2680v2, 384-512GB RAM, offering 20 real cores and 40 threads, which should be great for alignment. Depending on the prices you get for the parts you could always opt for a lower Xeon. You can see some nice benchmarks here http://www.cpubenchmark.net/high_end_cpus.html
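The core and thread counts above are simple arithmetic. Here is a small sketch, assuming 10 physical cores per CPU and two hardware threads per core (which is what gives the 20 cores and 40 threads quoted); the function names are illustrative. It also counts how many memory modules each RAM target needs, assuming 32GB modules.

```python
import math


def total_cores(sockets, cores_per_socket):
    """Physical cores across all CPU sockets."""
    return sockets * cores_per_socket


def total_threads(sockets, cores_per_socket, threads_per_core=2):
    """Hardware threads, assuming two threads per core by default."""
    return total_cores(sockets, cores_per_socket) * threads_per_core


def modules_needed(target_gb, module_gb=32):
    """Number of memory modules needed to reach target_gb."""
    return math.ceil(target_gb / module_gb)


if __name__ == "__main__":
    print("cores:", total_cores(2, 10))
    print("threads:", total_threads(2, 10))
    for target in (256, 384, 512):
        print(target, "GB needs", modules_needed(target), "x 32GB modules")
```

The module count is worth checking against the number of RAM slots on the motherboard before ordering, since it decides whether there is room to add memory later.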

Please keep in mind that the number of available cores is crucial for the CPU selection and not only benchmark results.

Brands such as Asus and Supermicro offer great motherboards for such a setup, with lots of RAM slots, dual-CPU support, multiple high-speed LAN controllers, multiple PCIe slots for GPU setups, etc. I would also add some SSDs for the OS or for some I/O-intensive tasks. Putting everything on RAID (10, 5 or 6) is a must.
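To compare the RAID levels mentioned here, the usable capacity can be estimated from the number of disks. This is a rough sketch using the standard capacity rules (RAID 5 gives up one disk to parity, RAID 6 two, RAID 10 half the disks to mirroring); it ignores filesystem overhead and hot spares, and the 8 x 4TB example is hypothetical.

```python
def usable_disks(level, n_disks):
    """Number of disks' worth of usable space for a given RAID level."""
    if level == 5:
        if n_disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return n_disks - 1          # one disk's worth of parity
    if level == 6:
        if n_disks < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        return n_disks - 2          # two disks' worth of parity
    if level == 10:
        if n_disks < 4 or n_disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        return n_disks // 2         # every disk is mirrored
    raise ValueError("unsupported RAID level: %r" % (level,))


def usable_tb(level, n_disks, disk_tb):
    """Usable capacity in TB, before filesystem overhead."""
    return usable_disks(level, n_disks) * disk_tb


if __name__ == "__main__":
    # Hypothetical array: 8 disks of 4TB each.
    for level in (5, 6, 10):
        print("RAID", level, "->", usable_tb(level, 8, 4), "TB usable")
```

RAID 6 costs one more disk of capacity than RAID 5 but survives two disk failures, which is the usual reason to prefer it on larger arrays.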

With a brand-name solution, multiple HDDs are an absolute no-no, since SSDs and HDDs will push you over your budget. I would go for a 2 x Xeon setup (again v2 chips, with as many cores per chip as possible) and RAM in 32GB sticks, so that it can be increased later on. A dual-CPU setup should be configured from the beginning, since it's not advisable to add a second CPU later on.

Finally, using a VM is great in many ways (administration, maintenance, etc.), but we've experienced a noticeable speed reduction in VMs vs physical machines. The difference was approximately 20% when comparing rigs with similar settings. I don't know if this is a special case.

Cheers,

IV

Agreed, these specs are what we have for some of our servers (costing about £9k), and they are more than adequate for an RNASeq project. I recommend RAID6 for good use of disk space with decent performance. Some server builders might provide a ZFS filesystem with the machine, which, in addition to a useful 'snapshot' utility, can almost double the effective size of the storage through compression, with little CPU overhead. But proper backups must also be budgeted for, e.g. a mirrored server or tapes.

Alastair, we tend to use Dell for our servers as this is what our support staff have always used. Do you have any recommendations for UK vendors? We are a not-for-profit research institution so can usually use academic partners.

In the past we have used http://www.dnuk.com/ and http://www.eclipsecomputing.co.uk/ and found both of them very competitive in price and very helpful.

Thank you both for the information; it's very useful. From what you have both suggested, as we've just purchased a new NAS with about 12TB of space (50% of bays used), we'll probably use this for backup to start with and use RAID6 with a similar amount of disk space on the server. We would save on the cost of the NAS, but lose that and a bit more on the server, but it seems prudent.

I would be comfortable building a server myself, but I'm not sure our support team would like this.

It's the same here, so we usually order a custom rig from a vendor that can build and test it for us. This saves us a lot of time and provides some peace of mind.

12TB seems a bit on the small side long term, but it will obviously depend on how much sequencing you are doing. If you do complete hierarchical backups, the 12TB will vanish quickly. For perspective, we have just had to buy another 250TB to keep us going for another few years, and only selected data is hierarchically backed up. We support ~20 groups, with most of the data coming from ~6 of them.
