Hard disk capacity
Just to archive the sequences from a single run, you'll need on the order of 100 GB (1-2 copies of the basic gzipped FASTQ data). Temporary disk space for working with the data will average on the order of 500 GB (uncompressing, copying, moving, temporary data files, index data files, etc.). That space can be re-used once the basic analysis has been done. Assuming you generate 4-8 data sets each month with a single Illumina machine, I would guesstimate that about 1 TB a month of permanent archival space, and 4 TB of working disk space, would be good.
So, 100 GB disk per data set, permanent, and 500 GB working disk, per data set per month, for 8 data sets, + fudge factor: 1 TB of disk space a month, permanent, and 4 TB of working disk space. Cost? Negligible, \$200/mo plus \$2000 a year.
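The arithmetic behind those figures can be sketched in a few lines. The function name is made up for illustration; the per-data-set numbers are the ones quoted above, and the fudge factor is what rounds 800 GB up to 1 TB.

```python
def monthly_disk_needs(data_sets_per_month, archive_gb=100, working_gb=500):
    """Return (archive GB, working GB) per month, before any fudge factor.

    The defaults are the rough per-data-set estimates from the text.
    """
    archive = data_sets_per_month * archive_gb
    working = data_sets_per_month * working_gb
    return archive, working


archive, working = monthly_disk_needs(8)
print(f"permanent archive: {archive} GB/month, working space: {working} GB")
```

Changing the first argument to 4 gives the lower end of the range for a machine producing fewer data sets.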
Oddly enough, CPU is rarely a huge concern (in my experience). Unless you're doing things like really, really large BLASTs against unassembled short reads (which is inadvisable on pretty much any planet, not just ours), the medium-sized computers your data center already has will probably provide enough CPU. 4 to 8 cores, for 1 month per data set, are probably enough to do the basic mapping or assembly analyses, although of course more is better. Mapping to a reference genome/transcriptome is particularly parallelizable, so you can take advantage of as many cores as you have. Bottom line: if you have one reasonably sized dedicated computer per data set, you should be OK. I would suggest 8-16 GB of RAM minimum (but see next section) on a 2 to 4 CPU machine, with each CPU having 2 to 4 cores. You can easily buy this kind of thing for way less than \$5000 -- it's what a lot of kids have at home for gaming these days, I think.
So, 1 medium sized computer (2-4 multicore CPUs, 8 GB of RAM) for 1 month, per data set, for 8 data sets: 8 computers. Cost: let's say \$40,000/year.
Memory is sort of the big bugaboo for me. I've been focusing on de novo assembly, which is a memory hog; I've just put in an order for a 500 GB machine, and I'm writing a 1 TB machine into my next grant. Mapping is much less memory intensive, requiring at most a few GB (although performance can always be improved by buying more memory, of course).
Many de novo assemblers scale with the number of unique k-mers in the data set, which means that for big, deeply sequenced data sets with lots of sequencing errors, you are going to need lots of memory. For bacterial genomes, you only need a few GB. For anything more challenging, you will need 100s of GBs. I would recommend a 512 GB machine, and strongly suggest a 1 TB machine (because who really wants to run only one analysis at a time, anyway?)
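To see why sequencing errors inflate the unique k-mer count, here is a toy sketch. It is not how any real assembler stores k-mers (they use far more memory-efficient structures); the sequence, the value of k, and the function name are all invented for illustration.

```python
def unique_kmers(reads, k):
    """Return the set of distinct k-mers found across all reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


genome = "ACGTTGCATGGACCTAGGCTA"  # made-up 21-base "genome"
k = 5

# Ten error-free reads covering the same sequence: extra depth adds no new k-mers.
clean_reads = [genome] * 10

# Same depth, but one read carries a single substitution (G -> T at index 10).
noisy_read = genome[:10] + "T" + genome[11:]
noisy_reads = [genome] * 9 + [noisy_read]

print("unique k-mers, clean reads:", len(unique_kmers(clean_reads, k)))
print("unique k-mers, one error:  ", len(unique_kmers(noisy_reads, k)))
```

A single substitution can introduce up to k new k-mers, so in a deep data set the erroneous k-mers can easily outnumber the real ones.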
The only published machine estimate I've seen for assembly, BTW, is from the Broad Institute, in the ALLPATHS-LG paper, where they estimate that they can assemble a human genome de novo in about two weeks with under 512 GB of RAM.
If Amazon Web Services wants to be really, really friendly to me, they can start providing 512 GB RAM machines for rent... and then give them to me for free, hint hint.
Note that I haven't said much about CPU power. That's because by the time you get a machine that has 512 GB of RAM, it probably has enough CPU power to run the assembly just fine. Some assemblers can make use of multiple CPUs: ABySS does, and Velvet recently released an update supporting it. I assume ALLPATHS-LG, SOAPdenovo, and others are keeping pace.
But the overall problem is it only takes ~1 week to generate a data set that can require 2-4 weeks to assemble in 512 GB of RAM. And these machines are expensive: figure \$20-40k for something robust, with decent CPU and memory performance. And you need one of these babies per de novo assembly project, dedicated for 1-3 months (because de novo assembly is slow and data intensive).
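A back-of-the-envelope way to look at that mismatch, assuming (pessimistically) that every data set coming off the sequencer needs its own de novo assembly; the function name is hypothetical:

```python
import math


def machines_to_keep_up(weeks_per_data_set, weeks_per_assembly):
    """Bigmem machines needed so assemblies finish as fast as data sets arrive.

    Assumes a steady stream of data sets, each needing one dedicated machine
    for the whole assembly.
    """
    return math.ceil(weeks_per_assembly / weeks_per_data_set)


for assembly_weeks in (2, 4):
    n = machines_to_keep_up(1, assembly_weeks)
    print(f"{assembly_weeks}-week assemblies: {n} bigmem machines per sequencer")
```

With fewer machines than that, the queue of unassembled data sets grows every week.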
If you're an HPC admin sitting there, sweating, you might think you don't need to worry, because biologists will tell you that they're going to be resequencing lots of genomes, and doing lots of transcriptomes, etc., so de novo assembly isn't going to be required much. They'll tell you it'll mostly be mapping.
Unfortunately I think they're wrong. De novo assembly is going to be a big challenge going forward, as we sequence more and more odd genomes. I think humanity is going to sequence between 10**3 and 10**6 more novel genomes in the next 5 years than we have to date, and many of these genomes will have no reasonably close reference. (Don't believe me? Check out the Tree of Life from Norm Pace's Web site. Humans and corn are the two little teeny branches over on the upper left of the Eucarya branch; I believe we have fewer than 20 draft genomes from the non-plant/animal/fungi segments of the Eucarya branch, i.e. it's completely unsampled!)
In sum, at least 1 bigmem computer (512 GB RAM) available for dedicated use by biologists doing assembly, preferably more. Cost? \$50k/year for one.
Hardware requirements. The hardware requirements vary with the size of the genome project. Both Intel and AMD x64 architectures are supported. The general guidelines for hardware configuration are as follows:
* Bacteria (up to 10 Mb): 16 GB RAM, 8+ cores, 10 GB disk space
* Insect (up to 500 Mb): 128 GB RAM, 16+ cores, 1 TB disk space
* Avian/small plant genomes (up to 1 Gb): 256 GB RAM, 32+ cores, 1 TB disk space
* Mammalian genomes (up to 3 Gb): 512 GB RAM, 32+ cores, 3 TB disk space
* Plant genomes (up to 30 Gb): 1 TB RAM, 64+ cores, 10 TB disk space
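If you want to script against these guidelines, the list can be turned into a small lookup table. This is only a sketch: the names `TIERS` and `recommend` are invented here, genome sizes are in megabases, and 1 TB is treated as 1000 GB.

```python
# (max genome size in Mb, label, RAM in GB, minimum cores, disk in GB)
TIERS = [
    (10, "Bacteria", 16, 8, 10),
    (500, "Insect", 128, 16, 1000),
    (1000, "Avian/small plant", 256, 32, 1000),
    (3000, "Mammalian", 512, 32, 3000),
    (30000, "Plant", 1000, 64, 10000),
]


def recommend(genome_mb):
    """Return the first guideline tier whose size limit covers genome_mb."""
    for max_mb, label, ram_gb, cores, disk_gb in TIERS:
        if genome_mb <= max_mb:
            return {"label": label, "ram_gb": ram_gb,
                    "min_cores": cores, "disk_gb": disk_gb}
    raise ValueError(f"no guideline for a {genome_mb} Mb genome")


print(recommend(450))   # falls in the "Insect" tier
print(recommend(2900))  # falls in the "Mammalian" tier
```

Genomes larger than 30 Gb raise an error rather than silently falling back to the top tier, since the guidelines say nothing about them.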
Minimum hardware specs are hard to give without further information about what running time is acceptable. Most analyses would run on old or cheap hardware, but would simply take terribly long. The most important spec is memory size: your genome, as indexed by the read mapper, should fit in memory. That is about all that can be said without further details of genome size and software. The post "Hardware Suitable For Generic Nextgen Sequencing Processing?" still seems to be valid; just double the RAM figures. Otherwise, get as many CPU cores and the fastest IO you can afford. You will need a lot of disk space as well.
Other things to consider:
- Support and system administration
- Disk space
- Do you really need a single new server for this? E.g., is cloud an option, or can you use existing large servers?
To echo what Michael posted: you'll need to be quite clear on the size of your genomes (you'll need something with a lot of RAM and storage for plant genomes) and on what you want to do at your workstation (will you be doing transcriptome assembly or just SNP calling?). Once you have an exact idea of what you'll be doing and the time frame you need for your analysis, you can plan the specs of your workstation. The post "Storage For Miseq In-House" may be of some help, in addition to the one that Michael posted.