Question: How much storage is needed for whole genome sequencing?
8 weeks ago by
star170 wrote:

I am going to start a whole genome study and would like an estimate of the storage needed for the analysis. I know it depends on several things, such as coverage and single-end versus paired-end sequencing, but if anyone can give me an estimate it would be great!

I would like to have paired-end WGS from 100 individuals at 50X coverage.

modified 8 weeks ago by ATpoint19k • written 8 weeks ago by star170

It may be helpful to look through your own datasets to see what you have. Extrapolate from those numbers and add 10% to account for variation.

written 8 weeks ago by genomax70k

Mentioning which genome/species you're working on might be the crucial bit of info here.

For viral genomes a few GB will do; for conifer genomes, for instance, think in the range of several TB.

written 8 weeks ago by lieven.sterck5.5k

I am working on the Human Genome.

written 8 weeks ago by star170
8 weeks ago by
ATpoint19k wrote:

If memory serves, in a cohort I analyzed in the past of 20 matched-normal WGS samples (human, roughly 30-50x, 2x100bp), one BAM file at compression level 5 was roughly 100GB. Doubling that to account for the raw data, be it FASTQ or unaligned BAM, gives roughly 200GB per individual; times 100 individuals, that comes to roughly 20TB. Definitely manageable on a decent HPC with a parallel file system, given you have access to such a cluster. Make sure you have the space, the CPU and memory resources and, most importantly, that your scripts are well tested before starting the "big run".
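The arithmetic in this answer can be written as a small sketch. The function name and parameters are made up for illustration. It assumes the raw data takes about as much space as the aligned BAM (one reading of the figures above), and the optional margin is the 10% suggested in the earlier comment. These are rough numbers from memory, not measurements.

```python
# Back-of-the-envelope storage estimate from the rough figures in this thread.
# estimate_storage_tb and its parameters are hypothetical names for illustration.

def estimate_storage_tb(n_samples,
                        bam_gb_per_sample=100.0,   # ~100 GB per BAM (quoted from memory)
                        raw_to_bam_ratio=1.0,      # assumption: raw data is about as large as the BAM
                        safety_margin=0.0):        # e.g. 0.10 for the suggested 10% buffer
    """Return the estimated total storage in TB (using 1 TB = 1000 GB)."""
    raw_gb = bam_gb_per_sample * raw_to_bam_ratio
    per_sample_gb = bam_gb_per_sample + raw_gb
    total_gb = per_sample_gb * n_samples * (1.0 + safety_margin)
    return total_gb / 1000.0


if __name__ == "__main__":
    # 100 individuals, no buffer: roughly the 20 TB quoted above
    print(round(estimate_storage_tb(100), 1))
    # Same cohort with a 10% buffer
    print(round(estimate_storage_tb(100, safety_margin=0.10), 1))
```

Intermediate files (sorted or deduplicated BAMs, temporary files) are not counted here, so plugging in per-sample sizes measured on your own pilot data will give a better estimate than these defaults.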

written 8 weeks ago by ATpoint19k

Thanks for the reply. Do you have any idea about the output size of variant calling with GATK?

written 8 weeks ago by star170

Genome Variant Call Format file (gVCF). gVCF was developed to store sequencing information for both variant and non-variant positions, which is required for human clinical applications. gVCF is a set of conventions applied to the standard variant call format (VCF) 4.1 as documented by the 1000 Genomes Project. These conventions allow representation of genotype, annotation, and other information across all sites in the genome in a compact format. Typical human whole-genome sequencing results expressed in gVCF with annotation are less than 1 Gbyte, or about 1/100 the size of the BAM file used for variant calling. If you are performing targeted sequencing, gVCF is also an appropriate choice to represent and compress the results.
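Applied to the cohort in this thread, the "about 1/100 the size of the BAM" rule of thumb can be sketched as below. The function name is hypothetical, the ~100GB BAM size is the rough figure given earlier in the thread, and the ratio comes from the quoted text rather than from measured GATK output, so treat the result as an order-of-magnitude guess only.

```python
# Rough gVCF storage estimate from the "about 1/100 of the BAM" rule of thumb.
# estimate_gvcf_total_gb is a hypothetical helper, not part of any tool.

def estimate_gvcf_total_gb(n_samples,
                           bam_gb_per_sample=100.0,  # rough BAM size mentioned in the thread
                           gvcf_to_bam_ratio=0.01):  # "about 1/100 the size of the BAM"
    """Return the estimated total gVCF storage in GB for the whole cohort."""
    return n_samples * bam_gb_per_sample * gvcf_to_bam_ratio


if __name__ == "__main__":
    # 100 individuals: on the order of 100 GB of gVCFs in total
    print(round(estimate_gvcf_total_gb(100)))
```

Under these assumptions the variant-calling output is small compared with the BAM and raw data, so the alignment files dominate the storage budget.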

written 8 weeks ago by genomax70k

No, never used it.

written 8 weeks ago by ATpoint19k