Ideal practice for managing huge NGS datasets/intermediate files of whole-genome human sequencing
8.6 years ago
ravi.uhdnis ▴ 220

Hi, I am working with whole-genome sequencing data of human samples from the HiSeq 2500 platform at approx. 30X coverage. Every day it gets harder for me to handle the huge files: raw read data (.bcl/.fastq.gz), alignment files (.sam/.bam), the subsequent .bam files after "Mark Duplicates", "Local realignment around InDels" and "Base Quality Score Recalibration", and finally the variants called by GATK's HaplotypeCaller. I want to know the best practice for handling such huge data files. Should I delete the older file once I get the next-stage file? It would be great if someone could suggest the best practice for handling datasets/files in WGS of human samples. Thank you.
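For context, the chain of commands I am running looks roughly like this (a sketch only; file names, the dbSNP resource, and thread counts are placeholders):

    # Alignment (produces a large SAM, later converted to a sorted BAM by Picard):
    bwa mem -t 8 GRCh38.fa sample_R1.fastq.gz sample_R2.fastq.gz > sample.sam
    java -jar picard.jar SortSam I=sample.sam O=sample.sorted.bam SORT_ORDER=coordinate
    # Mark duplicates:
    java -jar picard.jar MarkDuplicates I=sample.sorted.bam O=sample.dedup.bam M=sample.dup_metrics.txt
    samtools index sample.dedup.bam
    # Local realignment around indels (GATK 3.x):
    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R GRCh38.fa -I sample.dedup.bam -o targets.intervals
    java -jar GenomeAnalysisTK.jar -T IndelRealigner -R GRCh38.fa -I sample.dedup.bam -targetIntervals targets.intervals -o sample.realn.bam
    # Base quality score recalibration:
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R GRCh38.fa -I sample.realn.bam -knownSites dbsnp.vcf -o recal.table
    java -jar GenomeAnalysisTK.jar -T PrintReads -R GRCh38.fa -I sample.realn.bam -BQSR recal.table -o sample.recal.bam
    # Variant calling:
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R GRCh38.fa -I sample.recal.bam -o sample.vcf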

next-gen Assembly genome alignment
8.6 years ago

We typically delete files from a step once they're no longer needed, though fastq files are usually kept (as for bcl files, no one at the institute except me ever sees them, so I'm excluding them here).

If you're afraid that you may need to recall variants and/or remap things in the future, then just keep the first BAM file generated and use bamHash to verify that it fully captures all of the original reads; the fastq files can then be deleted.
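For example, with the two checksum programs that BamHash ships (names as in its README; double-check your install):

    # Hash all reads in the BAM:
    bamhash_checksum_bam sample.bam
    # Hash the reads in the original fastq pair:
    bamhash_checksum_fastq sample_R1.fastq.gz sample_R2.fastq.gz
    # If the two hashes are identical, the BAM contains every read
    # and the fastq files can safely be deleted.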

BTW, things like bcl files can just be written to tape or some other "able to be moved off-site and recovered in 10 years" medium and then deleted. We keep them around for a few months after writing to tape and then remove them.

8.6 years ago
William ★ 5.3k

I would only store the original fastq files and the final bam and vcf files.

You can use something like bcbio to run these relatively common (especially for human) bioinformatics pipelines without much hands-on work.

bcbio will put the intermediate data in a work directory and the final bam and vcf files in a final directory. Afterwards you can delete the work directory. Having a pipeline that does not need much hands-on time also means it's not much work to just rerun the whole analysis all the way from the fastq files.

You only need to specify a high-level workflow file like the one below and a list of fastq files with sample names added.

https://github.com/chapmanb/bcbio-nextgen/blob/master/config/templates/freebayes-variant.yaml
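A run then looks roughly like this (a sketch; the template name and flags follow the bcbio docs linked below and may differ between versions):

    # Create a project configuration from the template plus your samples:
    bcbio_nextgen.py -w template freebayes-variant project1.csv sample1_R1.fastq.gz sample1_R2.fastq.gz
    # Run the pipeline; intermediates land in work/, results in final/:
    cd project1/work
    bcbio_nextgen.py ../config/project1.yaml -n 16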

Do validate the results and adjust the bcbio pipelines/parameters if needed.

https://bcbio-nextgen.readthedocs.org/en/latest/contents/introduction.html

https://github.com/chapmanb/bcbio-nextgen

8.6 years ago
  • Don't merge your fastqs for the same sample at the beginning; map each fastq in parallel
  • for each fastq, split the bam by chromosome or contig and sort
  • merge the bams for the same chromosome
  • mark duplicates...
  • merge the chrom_bams into the sample_bam
  • for calling, use the GATK GVCF strategy (a sketch follows this list)
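A minimal sketch of that flow for two lanes and one chromosome (assuming bwa, samtools, Picard, and GATK 3.x; every file name is a placeholder):

    # 1. Map each fastq pair independently (run these jobs in parallel):
    bwa mem -t 4 -R '@RG\tID:lane1\tSM:sample1' GRCh38.fa lane1_R1.fastq.gz lane1_R2.fastq.gz | samtools sort -o lane1.bam -
    samtools index lane1.bam
    # 2. Split each lane bam by chromosome:
    samtools view -b lane1.bam chr1 > lane1.chr1.bam
    # 3. Merge the per-lane bams for the same chromosome:
    samtools merge sample1.chr1.bam lane1.chr1.bam lane2.chr1.bam
    # 4. Mark duplicates per chromosome:
    java -jar picard.jar MarkDuplicates I=sample1.chr1.bam O=sample1.chr1.dedup.bam M=chr1.metrics
    # 5. Merge the chromosome bams into the sample bam:
    samtools merge sample1.bam sample1.chr1.dedup.bam sample1.chr2.dedup.bam
    # 6. Call with the GVCF strategy:
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R GRCh38.fa -I sample1.bam --emitRefConfidence GVCF -o sample1.g.vcf
    java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R GRCh38.fa -V sample1.g.vcf -V sample2.g.vcf -o cohort.vcf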

"Should I delete the older file once I get the next stage file?"

If you use GNU make: no. Delete everything once you're happy with the results (see the Makefile sketch below).
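A toy Makefile along those lines (file names are placeholders; recipe lines must start with a tab). Because each intermediate is a target, make rebuilds only what is missing or out of date, so a deleted intermediate is simply regenerated on demand:

    REF = GRCh38.fa

    # Final target; `make sample.vcf` rebuilds the whole chain as needed.
    sample.vcf: sample.dedup.bam
    	java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R $(REF) -I $< -o $@

    sample.dedup.bam: sample.sorted.bam
    	java -jar picard.jar MarkDuplicates I=$< O=$@ M=$@.metrics

    sample.sorted.bam: sample_R1.fastq.gz sample_R2.fastq.gz
    	bwa mem $(REF) $^ | samtools sort -o $@ -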

Usually I merge all fastq.gz files of forward and reverse reads from all lanes of a sample to form one final forward (R1) and one final reverse (R2) read file. Then I use BWA-MEM to align them to the human reference genome (GRCh38), and the output .sam file is usually 250-350 GB, which eventually gets converted (by Picard) into a .bam file of 45-50 GB. Over the next 3-4 steps, similar .bam files of 40-50 GB each are generated. The final variant calling file (.vcf) is normally 120-135 GB. So overall I generate 500-700 GB of data from start to end for one WGS human sample. It would be great if you could let me know how to map each fastq.gz file in parallel (a possible approach is sketched below). Usually I get [8 lanes * 2 (forward and reverse) * 8 (files of each type)] * 2 = 256 fastq.gz files per sample. Also, I have access to a SUN cluster with 16 nodes of 8 cores each. Thank you.
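One way to avoid the 250-350 GB intermediate SAM entirely, and to map the fastq pairs in parallel on an SGE cluster, is sketched below (the parallel environment name, file-name patterns, and paths are assumptions; adapt them to your site):

    #!/bin/bash
    # map_one.sh -- map a single fastq pair straight to a sorted BAM,
    # so no intermediate SAM is ever written to disk.
    R1=$1; R2=$2; OUT=$3
    bwa mem -t 8 GRCh38.fa "$R1" "$R2" | samtools sort -@ 4 -o "$OUT" -

Submission loop, one job per pair:

    # Submit one mapping job per fastq pair (SGE, 8 slots each):
    for R1 in *_R1_*.fastq.gz; do
        R2=${R1/_R1_/_R2_}
        qsub -cwd -pe smp 8 map_one.sh "$R1" "$R2" "${R1%.fastq.gz}.bam"
    done
    # Merge the per-lane BAMs once all jobs have finished:
    samtools merge sample.bam *_R1_*.bam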

see https://github.com/lindenb/ngsxml and the option -j of GNU make. Some cluster managers, like SGE, also ship a parallelized version of make (qmake).
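For example (the qmake invocation is a guess at a typical SGE setup; check your local documentation):

    # Run up to 16 independent make recipes at once on one machine:
    make -j 16
    # On SGE, qmake can distribute the same recipes across the cluster:
    qmake -cwd -V -- -j 32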
