We are creating a specification for a new server for our research scientists in applied genetics and breeding (I am a computer scientist).
Any advice regarding the number of processors/cores and RAM would be most helpful. Here is some background on the work being carried out, kindly written by a colleague of mine:
We work in applied genetics, breeding and plant pathology-related research on crops and their pathogens. In this context we will increasingly use next-generation sequencing (NGS) data, mostly Illumina HiSeq data but occasionally 454 data. We work with diploid and polyploid crops, e.g. wheat, with medium to large genome sizes (up to ~17 Gbases). We deal with genomic as well as transcriptomic sequencing data. The sequencing itself is contracted out, but we'd prefer to do the downstream analyses in-house.
What we do with NGS data:
- Trimming and assembly to reference genomes or de novo assembly, starting from genomic or RNAseq reads. Some assemblies would be done from multiple datasets; the biggest project proposed so far would involve assembly and RNASeq analyses with 24 × 30 GB files containing approx. 2.6 bn reads, but most projects would be considerably smaller, up to 50% of that volume of data.
- Mapping of reads to a reference (genomes and sets of contigs) and variant calling, incl. detection of new variants (SNPs, insertions/deletions, SSRs), e.g. for marker development.
- RNAseq expression analysis and related stats.

We are likely to use several open source and commercial software packages as user preferences/needs vary. The number of people working on NGS data is small (<10), so usually only one assembly or mapping job is likely to be run at any given time. Having to run two small-ish ones in parallel may be required on the odd occasion.
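For sizing purposes, the figures quoted for the biggest proposed project can be turned into a quick back-of-envelope estimate. This is only a sketch using the numbers above; per-read record size will vary with read length and FASTQ formatting:

```python
# Back-of-envelope sizing for the largest proposed project,
# using only the figures quoted above (24 files of ~30 GB, ~2.6 bn reads).

n_files = 24
gb_per_file = 30
total_gb = n_files * gb_per_file          # raw FASTQ volume on disk

reads = 2.6e9
bytes_per_read = total_gb * 1e9 / reads   # rough record size incl. header + quality lines

print(f"raw input: {total_gb} GB")
print(f"~{bytes_per_read:.0f} bytes per read record")
```

That is roughly 720 GB of raw input for the single biggest project, before intermediates and assembly output, which is worth keeping in mind against the 24TB SAN.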
Unfortunately, the budget is small for the server itself - approximately £10,000 (~ 16,000 USD), though it could be increased if it is not enough to work with. Some of the servers I have looked at offer 256GB of RAM. Would this be enough to run the tools cited above? Do they typically page to disk if memory is not available or just fall over?
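On the paging question: behaviour varies by tool. Many assemblers simply abort with an out-of-memory error rather than degrading gracefully, though some designs (e.g. those built on memory-mapped files) effectively page on their own. A quick pre-flight check of installed RAM before launching a job can be scripted; this sketch assumes a Linux target, where the POSIX `sysconf` names below are available:

```python
import os

# Report installed physical RAM via POSIX sysconf.
# Assumes Linux; the sysconf names may differ on other platforms.
page_size = os.sysconf("SC_PAGE_SIZE")
phys_pages = os.sysconf("SC_PHYS_PAGES")
total_ram_gb = page_size * phys_pages / 1e9

print(f"installed RAM: {total_ram_gb:.1f} GB")
if total_ram_gb < 256:
    print("warning: below the 256 GB being considered for the new server")
```

A wrapper like this around job submission at least turns a cryptic crash hours in into an up-front warning.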
On the data storage side, I think we would be ok to start with as there will be a dedicated SAN of 24TB with spare bays.
Finally, we would like to use VMware, with the server acting as a host for just one virtual server. Would this present any issues?
Thank you for any advice that can be provided.
Agreed, these specs are what we have for some of our servers (costing about £9k) and more than adequate for an RNASeq project. I recommend RAID6 for good use of disk space with decent performance. Some server builders might provide a ZFS filesystem with the machine which, in addition to a useful 'snapshot' utility, can almost double the effective size of the storage with compression at little CPU overhead. But proper BACKUPs must also be budgeted for, e.g. a mirrored server or tapes.
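The compression claim is easy to sanity-check, since FASTQ is plain text. A rough illustration with zlib (ZFS would typically use lz4, so the exact ratio is only indicative, and the synthetic reads below are an assumption, not real data — real reads are usually less uniform and compress somewhat better):

```python
import random
import zlib

# Generate synthetic FASTQ-like records.
# This is an assumption for illustration only -- real sequencing data
# has non-uniform base and quality distributions.
random.seed(1)
records = []
for i in range(2000):
    seq = "".join(random.choice("ACGT") for _ in range(100))
    qual = "".join(random.choice("IJHGF") for _ in range(100))
    records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
data = "".join(records).encode()

compressed = zlib.compress(data, level=6)
ratio = len(data) / len(compressed)
print(f"compression ratio: {ratio:.1f}x")
```

Even on uniformly random bases, a ratio around 2-3x falls out, which is consistent with the "almost double the effective storage" observation.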
Alastair, we tend to use Dell for our servers as this is what our support staff have always used. Do you have any recommendations for UK vendors? We are a not-for-profit research institution so can usually use academic partners.
In the past we have used http://www.dnuk.com/ and http://www.eclipsecomputing.co.uk/ and found both of them very competitive in price and very helpful.
Thank you both for the information; it's very useful. From what you have both suggested, as we've just purchased a new NAS with about 12TB of space (50% of bays used), we'll probably use this for backup to start with and use RAID6 with a similar amount of disk space on the server. We would save on the cost of the NAS, but lose that and a bit more on the server; still, it seems prudent.
I would be comfortable building a server myself, but I'm not sure our support team would like this.
It's the same here, so we usually order a custom rig from a vendor that can build and test it for us. This saves us a lot of time and provides some peace of mind.
12TB seems a bit on the small side long term, but it will obviously depend on how much sequencing you are doing. If you do complete hierarchical backups, the 12TB will vanish quickly. For perspective, we have just had to buy another 250TB to keep us going for another few years, and only selected data is hierarchically backed up. We support ~20 groups, with most of the data coming from ~6 of them.
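To put numbers on how quickly 12TB can vanish, here is a minimal runway projection. Every input is an assumption for illustration — substitute your own project sizes and backup policy:

```python
# Rough storage-runway projection. All inputs are assumptions for
# illustration -- substitute your own sequencing plans.
capacity_tb = 12.0        # the backup NAS discussed above
tb_per_project = 1.0      # assumed: a medium project, raw data + intermediates
projects_per_year = 6     # assumed throughput
backup_copies = 2         # full/hierarchical backups multiply usage

tb_per_year = tb_per_project * projects_per_year * backup_copies
years = capacity_tb / tb_per_year
print(f"~{tb_per_year:.0f} TB/year -> {years:.1f} years of runway")
```

Under these (modest) assumptions the NAS fills in about a year, which supports the point that 12TB is a starting position rather than a long-term plan.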