Question: How much storage is needed for whole genome sequencing?
8 weeks ago by
star170 wrote:

I am about to start a whole genome study and would like an estimate of the storage needed for the analysis. I know it depends on several things, such as coverage and whether the reads are single-end or paired-end, but if anyone can give me an estimate, that would be great!

I plan to have paired-end WGS data from 100 individuals at 50X coverage.

It may be helpful to look through your own datasets to see what you have. Extrapolate from those numbers and add 10% to account for variation.
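The extrapolation suggested above can be sketched in a few lines of Python. This is only an illustration: the function name `extrapolate_gb` and the per-sample sizes are made up for the example, so substitute the sizes you measure on your own datasets.

```python
def extrapolate_gb(observed_sizes_gb, n_planned, margin=0.10):
    """Scale the mean per-sample size to n_planned samples and add a safety margin."""
    mean_gb = sum(observed_sizes_gb) / len(observed_sizes_gb)
    return mean_gb * n_planned * (1 + margin)


# Hypothetical per-sample sizes (GB) taken from existing datasets.
observed = [90, 100, 110]
estimate = extrapolate_gb(observed, n_planned=100)
print(f"Estimated storage: {estimate / 1000:.1f} TB")
```

The 10% margin is the figure from the comment above; raise it if your existing samples vary a lot in coverage or read length.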

Mentioning which genome/species you're working on might be the crucial bit of info here.

For viral genomes a few GBs will do; for conifer genomes, for instance, think in the range of several TBs.

I am working on the Human Genome.

8 weeks ago by
ATpoint19k wrote:

If memory serves, in a cohort of 20 matched-normal WGS samples I analyzed in the past (human, roughly 30-50x, 2x100bp), one BAM file at compression level 5 was roughly 100GB. Roughly doubling that to account for the raw data, be it fastq or unaligned BAM, gives about 200GB per individual; times 100, that is roughly 20TB. Definitely manageable on a decent HPC with a parallel file system, given you have access to such a cluster. Make sure you have the space, the CPU and memory resources, and most importantly that your scripts are well-tested before starting the "big run".
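The back-of-the-envelope arithmetic in this answer can be written out as a short Python sketch. The function name and parameters are hypothetical, and the defaults (100GB per BAM, raw data adding about the same again) are just the rough figures recalled above, not measured values.

```python
def estimate_storage_tb(n_individuals, bam_gb=100.0, raw_factor=1.0):
    """Rough total storage in TB: aligned BAM plus raw data per individual.

    raw_factor is the size of the raw data relative to the BAM; 1.0 means
    the raw data takes about as much space again (assumption from the answer).
    """
    per_individual_gb = bam_gb * (1 + raw_factor)
    return n_individuals * per_individual_gb / 1000.0


total_tb = estimate_storage_tb(100)
print(f"Roughly {total_tb:.0f} TB for 100 individuals")
```

Intermediate files from the analysis (for example temporary sorted or deduplicated BAMs) are not included here, so treat the result as a lower bound.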

Thanks for the reply, do you have any clue about the output size of variant calling using GATK?

Genome Variant Call Format file (gVCF). gVCF was developed to store sequencing information for both variant and non-variant positions, which is required for human clinical applications. gVCF is a set of conventions applied to the standard variant call format (VCF) 4.1 as documented by the 1000 Genomes Project. These conventions allow representation of genotype, annotation, and other information across all sites in the genome in a compact format. Typical human whole-genome sequencing results expressed in gVCF with annotation are less than 1 Gbyte, or about 1/100 the size of the BAM file used for variant calling. If you are performing targeted sequencing, gVCF is also an appropriate choice to represent and compress the results.

No, never used it.

