Question: System Configuration For Plant Genome Data Analysis Using Illumina Reads
6.2 years ago by
vaibhavbarot30 wrote:


I want to purchase a workstation to analyse Illumina reads of a plant genome (read mapping and SNP calling). What is the minimum hardware configuration required?

illumina hardware • 2.5k views
3.3 years ago by
Saeid Kadkhodaei90 wrote:

Hard disk capacity

Just to archive the sequences from a single run, you'll need on the order of 100 GB (1-2 copies of the basic gzipped FASTQ data). Temporary disk space for working with the data will average on the order of 500 GB (uncompressing, copying, moving, temporary data files, index data files, etc.) That space can be re-used once the basic analysis has been done. Assuming you generate 4-8 data sets each month with a single Illumina machine, I would guesstimate that about 1 TB a month of permanent archival space, and 4 TB of working disk space, would be good.

So, 100 GB disk per data set, permanent, and 500 GB working disk, per data set per month, for 8 data sets, + fudge factor: 1 TB of disk space a month, permanent, and 4 TB of working disk space. Cost? Negligible, $200/mo plus $2000 a year.
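The arithmetic behind those figures can be sketched as follows. This is only an illustration: the function name is made up, the per-data-set sizes are the rough estimates quoted above, and 1 TB is taken as 1000 GB.

```python
def monthly_disk_budget(datasets_per_month, archive_gb_per_set=100, working_gb_per_set=500):
    """Return (archive TB added per month, working TB needed).

    Defaults are the rough per-data-set estimates from the post;
    1 TB is assumed to be 1000 GB.
    """
    archive_tb = datasets_per_month * archive_gb_per_set / 1000
    working_tb = datasets_per_month * working_gb_per_set / 1000
    return archive_tb, working_tb


archive_tb, working_tb = monthly_disk_budget(8)
print(f"archive: {archive_tb} TB/month, working: {working_tb} TB")
```

For 8 data sets this gives 0.8 TB of archive per month before the fudge factor, which the post rounds up to 1 TB, and 4 TB of working space.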

Compute capacity

Oddly enough, CPU is rarely a huge concern (in my experience). Unless you're doing things like really, really large BLASTs against unassembled short reads (which is inadvisable on pretty much any planet, not just ours), you will probably have enough CPU on the medium-sized computers your data center has. 4 to 8 cores, for 1 month per data set, are probably enough to do the basic mapping or assembly analyses, although of course more is better. Mapping to a reference genome/transcriptome is particularly parallelizable, so you can take advantage of as many cores as you have. Bottom line, if you have one reasonably sized dedicated computer per data set, you should be OK. I would suggest 8-16 GB of RAM minimum (but see next section) on a 2 to 4 CPU machine, with each CPU having 2 to 4 cores. You can easily buy this kind of thing for way less than $5000 -- it's what a lot of kids have at home for gaming these days, I think.

So, 1 medium-sized computer (2-4 multicore CPUs, 8 GB of RAM) for 1 month, per data set, for 8 data sets: 8 computers. Cost: let's say $40,000/year.

Memory


Memory is sort of the big bugaboo for me. I've been focusing on de novo assembly, which is a memory hog; I've just put in an order for a 500 GB machine, and I'm writing a 1 TB machine into my next grant. Mapping is much less memory intensive, requiring at most a few GB (although performance can always be improved by buying more memory, of course).

Many de novo assemblers scale with the number of unique k-mers in the data set, which means that for big, deeply sequenced data sets with lots of sequencing errors, you are going to need lots of memory. For bacterial genomes, you only need a few GB. For anything more challenging, you will need 100s of GBs. I would recommend a 512 GB machine, and strongly suggest a 1 TB machine (because who really wants to run only one analysis at a time, anyway?)
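To make the k-mer point concrete, here is a toy sketch that counts distinct k-mers in a handful of reads, with and without a single sequencing error. The 20 bp sequence, the reads, and the function name are invented for illustration; real assemblers use far more compact data structures than a Python set.

```python
def unique_kmers(reads, k):
    """Return the set of distinct k-mers found across all reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


genome = "ACGTTGCATGCCGATAGCTA"  # made-up 20 bp toy sequence
k = 5

# Three overlapping error-free reads covering the toy sequence
clean_reads = [genome[0:12], genome[4:16], genome[8:20]]

# Same reads, but the middle read carries one substitution error (C -> A)
bad_read = clean_reads[1][:6] + "A" + clean_reads[1][7:]
noisy_reads = [clean_reads[0], bad_read, clean_reads[2]]

print("distinct k-mers, clean reads:", len(unique_kmers(clean_reads, k)))
print("distinct k-mers, noisy reads:", len(unique_kmers(noisy_reads, k)))
```

A single wrong base can create up to k k-mers that exist nowhere in the genome, so deep coverage with many errors inflates the number of distinct k-mers, and with it the memory needed.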

The only published machine estimate I've seen for assembly, BTW, is from the Broad Institute, in the ALLPATHS-LG paper, where they estimate that they can assemble a human genome de novo in about two weeks with under 512 GB of RAM.

If Amazon Web Services wants to be really, really friendly to me, they can start providing 512 GB RAM machines for rent... and then give them to me for free, hint hint.

Note that I haven't said much about CPU power. That's because by the time you get a machine that has 512 GB of RAM, it probably has enough CPU power to run the assembly just fine. Some assemblers can make use of multiple CPUs: ABySS does, and Velvet recently released an update supporting it. I assume ALLPATHS-LG, SOAPdenovo, and others are keeping pace.

But the overall problem is it only takes ~1 week to generate a data set that can require 2-4 weeks to assemble in 512 GB of RAM. And these machines are expensive: figure $20-40k for something robust, with decent CPU and memory performance. And you need one of these babies per de novo assembly project, dedicated for 1-3 months (because de novo assembly is slow and data intensive).

If you're an HPC admin sitting there, sweating, you might think you don't need to worry, because biologists will tell you that they're going to be resequencing lots of genomes, and doing lots of transcriptomes, etc., so de novo assembly isn't going to be required much. They'll tell you it'll mostly be mapping.

Unfortunately I think they're wrong. De novo assembly is going to be a big challenge going forward, as we sequence more and more odd genomes. I think humanity is going to sequence between 10**3 and 10**6 more novel genomes in the next 5 years than we have to date, and many of these genomes will have no reasonably close reference. (Don't believe me? Check out the Tree of Life from Norm Pace's Web site. Humans and corn are the two little teeny branches over on the upper left of the Eucarya branch; I believe we have fewer than 20 draft genomes from the non-plant/animal/fungi segments of the Eucarya branch, i.e. it's completely unsampled!)

In sum, at least 1 bigmem computer (512 GB RAM) available for dedicated use by biologists doing assembly, preferably more. Cost? $50k/year for one.


Hardware requirements. The hardware requirements vary with the size of the genome project. Both Intel and AMD x64 architectures are supported. The general guidelines for hardware configuration are as follows:
* Bacteria (up to 10 Mb): 16 GB RAM, 8+ cores, 10 GB disk space
* Insect (up to 500 Mb): 128 GB RAM, 16+ cores, 1 TB disk space
* Avian/small plant genomes (up to 1 Gb): 256 GB RAM, 32+ cores, 1 TB disk space
* Mammalian genomes (up to 3 Gb): 512 GB RAM, 32+ cores, 3 TB disk space
* Plant genomes (up to 30 Gb): 1 TB RAM, 64+ cores, 10 TB disk space
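
As a rough sketch, the tiers above can be turned into a lookup by genome size. The table values are copied from the list; the function name and the unit conversions (1 TB of RAM as 1024 GB, 1 TB of disk as 1000 GB) are assumptions made for the example.

```python
# (max genome size in Mb, RAM in GB, minimum cores, disk in GB), from the list above
GUIDELINES = [
    (10, 16, 8, 10),            # bacteria
    (500, 128, 16, 1000),       # insect
    (1000, 256, 32, 1000),      # avian / small plant
    (3000, 512, 32, 3000),      # mammalian
    (30000, 1024, 64, 10000),   # plant
]


def suggest_hardware(genome_mb):
    """Return the first tier whose size limit covers a genome of genome_mb megabases."""
    for max_mb, ram_gb, cores, disk_gb in GUIDELINES:
        if genome_mb <= max_mb:
            return {"ram_gb": ram_gb, "cores": cores, "disk_gb": disk_gb}
    raise ValueError("genome is larger than the largest tier in the list")


print(suggest_hardware(5100))  # e.g. a 5.1 Gb plant genome
```

Note that these tiers appear to describe assembly-scale work; mapping and SNP calling need much less memory, as discussed elsewhere in this thread.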



6.2 years ago by
Bergen, Norway
Michael Dondrup45k wrote:

Minimum hardware specs are hard to give without further information about what running time is acceptable. Most analyses would run on old or cheap hardware, but would simply take terribly long. The most important spec is memory size: your genome, as indexed by the read mapper, should fit in memory. That is all that can be said without further details of genome size and software. The thread "Hardware Suitable For Generic Nextgen Sequencing Processing?" still seems to be valid; just double the RAM figures. Otherwise, get as many CPU cores and the fastest I/O you can afford. You will need a lot of disk space as well.

Other things to consider:

  • Support, system admin
  • Backup
  • Disk space
  • Do you really need a single new server for this? For example, is cloud an option, or can you use existing large servers?

Thanks for responding,

I want the minimum hardware configuration for mapping and SNP calling. Can you suggest an ideal hardware configuration for Illumina plant genome data? I don't have an existing server but am planning to buy one.

written 6.2 years ago by vaibhavbarot30

Minimum or ideal? That makes a difference. Also, what software exactly are you going to run? What is the size of the largest genome you are working with? How many users will use the machine in parallel? You should use a Linux-based server; will you use it interactively, or via ssh/telnet?

To find the minimum/optimal RAM size: get hold of or borrow a high-memory server, e.g. on the Amazon cloud, run a typical large job of the alignment step, and monitor the process and its memory usage (e.g. 8 GB). Double the maximum memory required by that process and buy a decent computer that has this amount of RAM installed (e.g. 16 GB) and can host at least double this amount (e.g. 32 GB, better up to 128 GB) for later upgrades.
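
A minimal sketch of that procedure, assuming a Linux machine (where `ru_maxrss` is reported in kilobytes) and a fresh Python process per measurement. Both function names are hypothetical, and the command you pass in would be your own alignment job.

```python
import resource
import subprocess


def peak_child_rss_gb(cmd):
    """Run cmd to completion and return the peak resident memory of child processes in GB.

    Assumes Linux, where ru_maxrss is in kilobytes. RUSAGE_CHILDREN reports the
    largest child waited for so far, so use a fresh Python process per job.
    """
    subprocess.run(cmd, check=True)
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return peak_kb / 1024 ** 2


def ram_to_buy(peak_gb):
    """Apply the doubling rule: install 2x the peak, with room to upgrade to 2x that."""
    installed_gb = 2 * peak_gb
    upgrade_capacity_gb = 2 * installed_gb
    return installed_gb, upgrade_capacity_gb


print(ram_to_buy(8))  # the 8 GB example from the post
```

With the 8 GB example this yields 16 GB installed and room for at least 32 GB. To measure a real job, call `peak_child_rss_gb` with the argument list of your own mapping command.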

written 6.2 years ago by Michael Dondrup45k

Thanks Michael

I want a general configuration for mapping Illumina reads from plant genomes and SNP calling. Can I do this using a computer with 4 quad-core CPUs, 16 GB RAM and 1 TB storage?

written 6.2 years ago by vaibhavbarot30

Depends exactly on what you want to do (read Michael's post above on finding minimal RAM size). The specs you have listed above are not nearly enough for what you want to do.

written 6.2 years ago by Josh Herr5.6k

Well, RAM and CPU might be sufficient, but the system would very soon run out of storage for the read data. The memory might be just sufficient. From the BWA manual: "With bwtsw algorithm, 2.5GB memory is required for indexing the complete human genome sequences. For short reads, the ‘aln’ command uses ~2.3GB memory and the ‘sampe’ command uses ~3.5GB." The largest sequenced plant genome is barley (5.1 Gb); if memory use scales linearly, 6 GB should be enough. If it is one of the smaller plant genomes, it might even work with less. All of this assumes the intended pipeline uses BWA for read mapping.
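
The linear scaling can be checked with a few lines. This is a back-of-the-envelope sketch: the memory figures are those quoted from the BWA manual above, the 3.1 Gb human genome size is an assumption, and linear scaling itself is only a guess about how BWA behaves.

```python
HUMAN_GENOME_GB = 3.1  # assumed human genome size in gigabases

# Memory in GB for the human genome, as quoted from the BWA manual above
BWA_HUMAN_MEMORY_GB = {"index": 2.5, "aln": 2.3, "sampe": 3.5}


def scaled_memory_gb(genome_gb):
    """Scale the quoted human-genome figures linearly to another genome size."""
    factor = genome_gb / HUMAN_GENOME_GB
    return {step: round(mem * factor, 1) for step, mem in BWA_HUMAN_MEMORY_GB.items()}


estimate = scaled_memory_gb(5.1)  # barley
print(estimate, "-> peak:", max(estimate.values()), "GB")
```

For barley the largest step comes out just under 6 GB, consistent with the estimate in the post.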

written 6.2 years ago by Michael Dondrup45k

Thanks Michael. Can I call SNPs using the configuration stated in my previous post?

written 6.2 years ago by vaibhavbarot30

No warranty, but most likely the analysis would work; you will just have no space to store it. You will need to buy additional storage very soon, because your 1 TB disk can be full after a few runs, assuming 100 GB per run. Even the smallest compute solution offered by Illumina has 20 TB of disk space.
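
A quick way to see how soon the disk fills up, as a sketch only. The function name is invented, 100 GB per run is the assumption stated above, and the optional 500 GB working-space reservation is the figure given in another answer in this thread.

```python
def runs_until_full(disk_gb, gb_per_run=100, reserved_working_gb=0):
    """Return how many runs fit on the disk after setting aside working space."""
    usable_gb = max(disk_gb - reserved_working_gb, 0)
    return usable_gb // gb_per_run


print(runs_until_full(1000))                           # 1 TB disk, archive only
print(runs_until_full(1000, reserved_working_gb=500))  # keep 500 GB free for working files
print(runs_until_full(20000))                          # a 20 TB system
```

Once working space is reserved, a 1 TB disk holds only about half as many runs as the raw capacity suggests.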

written 6.2 years ago by Michael Dondrup45k

Thanks. So from the configuration I mentioned above, are the RAM and processors (16 GB and 4 quad-core) enough for SNP calling? Please advise. I will increase the storage capacity.

written 6.2 years ago by vaibhavbarot30

Average plant genome size right now is in the range of 6Gb to 8Gb. I might have a skewed view of RAM & CPU since the plant genomes I work with are in the range from 4Gb to 20Gb. If we knew what plant we were talking about and had an estimate of the genome size, we could give a little more information to you, vaibhavbarot.

written 6.2 years ago by Josh Herr5.6k
6.2 years ago by
University of Nebraska
Josh Herr5.6k wrote:

To echo what Michael posted: you'll need to be quite clear on the size of your genomes (you'll need something with a lot of RAM and storage for plant genomes) and what you want to do at your workstation (will you be doing transcriptome assembly or just SNP calling?). Once you have an exact idea of what you'll be doing and the time frame you need for your analysis, you can plan the specs of your workstation. The thread "Storage For Miseq In-House" may be of some help in addition to the one that Michael posted.


I want the minimum hardware configuration for mapping and SNP calling. Can you suggest an ideal hardware configuration for Illumina plant genome data? I don't have an existing server but am planning to buy one.

written 6.2 years ago by vaibhavbarot30