Question

CombineGVCFs with GATK4.0

0

Entering edit mode

16 months ago

Yong ▴ 10

Hi All, I have ~70 gVCF files from Whole exome sequencing (WES), and 500 gVCF files from Whole genome sequencing (WGS). Now I want to combine them for the post analysis with the CombineGVCFs in GATK4.0 . Since the limit of memory, I split the data into 11 batches. The elapsed time for combining the files from WGS is acceptable (about 3 days for all 22 chromosomes per 50 files), but it is too slow to combine the files from WES (more than 10 days for a chromosome). I guess that there may be some problems with the gVCF files from WES. The following is the scripts for generating and combining the gVCF files. The scripts for generating the gVCF file from fasta file:

GENOME=Reference/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta

DBSNP=Reference/resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf

SNP=Reference/resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf.gz

INDEL=Reference/resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

## Generate the bam files
bwa mem -t 6 -M -Y -R '@RG\tID:s0101\tSM:s0101\tLB:WES\tPL:Illumina' $GENOME V300050587_L03_690_1.fq.gz V300050587_L03_690_2.fq.gz | \
samtools view -Sb - > s0101_V300050587_L03_690.bam

bwa mem -t 6 -M -Y -R '@RG\tID:s0101\tSM:s0101\tLB:WES\tPL:Illumina' $GENOME V300050682_L03_690_1.fq.gz V300050682_L03_690_1.fq.gz | \
samtools view -Sb - > s0101_V300050682_L03_690.bam

## Sinince the samples has 2 pairs of fastq files, merge them togother
samtools merge -@ 6 -h s0101_V300050587_L03_690.bam s0101.bam s0101_V300050587_L03_690.bam s0101_V300050682_L03_690.bam


picard AddOrReplaceReadGroups \
-I s0101.bam \
-O s0101_ReplaceReadGroups.bam \
-RGID s0101 \
-RGLB WES \
-RGSM s0101 \
-RGPL Illumina \
-RGPU s0101_L0_690

## Sort using samtools
samtools sort -@ 30 -o s0101.sorted.bam s0101_ReplaceReadGroups.bam

picard -Xmx128g MarkDuplicates -I s0101.sorted.bam -O s0101.sorted.markdup.bam -M s0101.sorted.markdup.txt -REMOVE_DUPLICATES true

picard -Xmx128g BuildBamIndex -I s0101.sorted.markdup.bam

samtools flagstat -@ 30 s0101.sorted.markdup.bam >s0101.sorted.markdup.stat

gatk --java-options "-Xmx128G" BaseRecalibrator -R $GENOME -I s0101.sorted.markdup.bam \
--known-sites $DBSNP \
--known-sites $INDEL \
--known-sites $SNP \
-O recal_data_s0101.table

gatk --java-options "-Xmx128G" ApplyBQSR --bqsr-recal-file recal_data_s0101.table \
-R $GENOME \
-I s0101.sorted.markdup.bam \
-O s0101.sorted.markdup.BQSR.bam

gatk --java-options "-Xmx128G" HaplotypeCaller  -R $GENOME \
-I s0101.sorted.markdup.BQSR.bam \
-ERC GVCF \
-O /public/home/hpc196501013/WES/Raw_fastq/gVCFs/s0101_sorted_markdup_realigned_recal.g.vcf.gz

The scripts for combining the gVCF files:

GENOME=Reference/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta
gatk --java-options "-Xmx100G" CombineGVCFs  -R $GENOME \
--variant  Vraint_list_WES.list \
-L chr1 \
-O CombinedgVCFs_cases_chr1.g.vcf.gz

Is there any problem in above codes and how can I solve it ? Thanks a lot!

CombineGVCFs GATK4.0 • 1.3k views

ADD COMMENT • link 16 months ago by Yong ▴ 10

1

Entering edit mode

you don't need AddOrReplaceReadGroups as the groups were specified in the bwa mem line.

ADD REPLY • link 16 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thanks for your suggestions! I guess that I found the problem. Among all WES subjects, someone have only 1 pair of Fastq files, and the others have 2 or 3 pairs. For the subjects with 2 or more pairs of Fastq files, I used bwa to generate bam files for per pair of Fastq files and then used samtools merge to merge them together. When I included these subjects, it will takes a very long time to run CombineGVCFs. If I exclude theses subjects, the elapsed time will be acceptable. So my question is that why those subjects have 2 or more pairs of Fastq files, and how can I process this data correctly. Thank you!

ADD REPLY • link 16 months ago by Yong ▴ 10

0

Entering edit mode

are you using a bash script for this ? you'd better use Snakemake or Nextflow.

ADD REPLY • link 16 months ago by Pierre Lindenbaum 161k

score 1 · Answer 1 · 2022-11-29

as you're working with WES, you could combine the GVCF per BED set of, say 100kb, instead of a whole chromosome

gatk --java-options "-Xmx100G" CombineGVCFs  -R $GENOME \
--variant  Vraint_list_WES.list \
-L my.000103.bed \
-O CombinedgVCFs_cases_000103.g.vcf.gz

and then GenotypeGVC each chunk and then concatenate the vcfs with 'bcftools concat'