Question: How much storage is needed for whole genome sequencing?
12 months ago by
star230 wrote:

I am about to start a whole genome study and would like an estimate of the storage needed for the analysis. I know it depends on several things, such as coverage and single-end versus paired-end sequencing, but any estimate would be a great help!

I plan to run paired-end WGS on 100 individuals at 50X coverage.

modified 12 months ago by ATpoint34k • written 12 months ago by star230

It may be helpful to look through your own datasets to see what you have. Extrapolate from those numbers and add 10% to account for variation.
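As a rough sketch of that extrapolation: average the per-sample sizes you already have, scale to the planned number of samples, and add the 10% margin. The function name and the example sizes below are hypothetical placeholders, not measurements; substitute the numbers from your own datasets.

```python
from statistics import mean


def extrapolate_storage_gb(observed_sizes_gb, n_samples, margin=0.10):
    """Scale the mean observed per-sample size to n_samples, plus a safety margin."""
    per_sample_gb = mean(observed_sizes_gb)
    return per_sample_gb * n_samples * (1 + margin)


# Hypothetical per-sample sizes in GB from an earlier project; replace with your own.
example_sizes_gb = [280.0, 310.0, 295.0]
estimate_gb = extrapolate_storage_gb(example_sizes_gb, n_samples=100)
print(f"Estimated storage: {estimate_gb:.0f} GB")
```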

written 12 months ago by genomax83k

Mentioning which genome/species you're working on might be the crucial bit of info here.

For viral genomes a few GB will do; for conifer genomes, for instance, think in the range of several TB.

written 12 months ago by lieven.sterck7.8k

I am working on the Human Genome.

written 12 months ago by star230
12 months ago by
ATpoint34k wrote:

If memory serves, in a cohort of 20 matched-normal WGS samples that I analyzed in the past (human, roughly 30-50x, 2x100bp), one BAM file at compression level 5 was roughly 100 GB. The raw data, be it fastq or unaligned BAM, is roughly double that size, so about 200 GB per individual; times 100 individuals, that comes to roughly 20 TB. That is definitely manageable on a decent HPC with a parallel file system, provided you have access to such a cluster. Make sure you have the space, the CPU and the memory resources and, most importantly, that your scripts are well tested before starting the "big run".
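The arithmetic behind that estimate can be written out as a quick sanity check. This is only a sketch using the rough figures quoted above; the decimal GB-to-TB conversion and the extra total for keeping raw data and BAMs side by side are assumptions added for illustration.

```python
# Rough figures taken from the answer above (recalled from memory, not measured here).
BAM_GB_PER_SAMPLE = 100   # one BAM at compression level 5
RAW_TO_BAM_RATIO = 2      # raw fastq / unaligned BAM is roughly double the BAM
N_SAMPLES = 100           # cohort size from the question

GB_PER_TB = 1000          # assumption: decimal units

raw_gb_per_sample = BAM_GB_PER_SAMPLE * RAW_TO_BAM_RATIO
raw_total_tb = raw_gb_per_sample * N_SAMPLES / GB_PER_TB
bam_total_tb = BAM_GB_PER_SAMPLE * N_SAMPLES / GB_PER_TB

# Assumption: raw data and aligned BAMs are both kept on disk at the same time.
both_total_tb = raw_total_tb + bam_total_tb

print(f"Raw data: {raw_total_tb:.0f} TB")
print(f"BAMs: {bam_total_tb:.0f} TB")
print(f"Raw + BAMs: {both_total_tb:.0f} TB")
```

Intermediate files (unsorted or unmarked BAMs, temporary files) are not counted here and can add to the peak usage during the run.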

written 12 months ago by ATpoint34k

Thanks for the reply. Do you have any idea of the output size of variant calling with GATK?

written 12 months ago by star230

Genome Variant Call Format file (gVCF). gVCF was developed to store sequencing information for both variant and non-variant positions, which is required for human clinical applications. gVCF is a set of conventions applied to the standard variant call format (VCF) 4.1 as documented by the 1000 Genomes Project. These conventions allow representation of genotype, annotation, and other information across all sites in the genome in a compact format. Typical human whole-genome sequencing results expressed in gVCF with annotation are less than 1 Gbyte, or about 1/100 the size of the BAM file used for variant calling. If you are performing targeted sequencing, gVCF is also an appropriate choice to represent and compress the results.
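Applying the "about 1/100 the size of the BAM" rule of thumb to the cohort in question gives a ballpark figure. This is a sketch only: the function name is hypothetical, the 100 GB BAM size is borrowed from the other answer in this thread, and the ratio is the one quoted above rather than anything measured on GATK output, so check it against a test sample.

```python
def gvcf_total_gb(bam_gb_per_sample, n_samples, bam_to_gvcf_ratio=100):
    """Estimate total gVCF storage, assuming each gVCF is ~1/ratio of its BAM."""
    gvcf_gb_per_sample = bam_gb_per_sample / bam_to_gvcf_ratio
    return gvcf_gb_per_sample * n_samples


# 100 GB per BAM (figure from the other answer), 100 individuals (from the question).
total_gb = gvcf_total_gb(bam_gb_per_sample=100, n_samples=100)
print(f"Estimated gVCF storage: {total_gb:.0f} GB")
```

Even if the real files turn out several times larger than this, the variant calls stay small next to the raw data and BAMs.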

written 12 months ago by genomax83k

No, never used it.

written 12 months ago by ATpoint34k