Hi all, I have ~70 gVCF files from whole-exome sequencing (WES) and 500 gVCF files from whole-genome sequencing (WGS). I now want to combine them for downstream analysis with CombineGVCFs in GATK 4.0. Because of memory limits, I split the data into 11 batches. The elapsed time for combining the WGS files is acceptable (about 3 days for all 22 chromosomes per 50 files), but combining the WES files is far too slow (more than 10 days for a single chromosome). I suspect there is some problem with the WES gVCF files. Below are the scripts for generating and combining the gVCF files.

The script for generating a gVCF file from the fastq files:
GENOME=Reference/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta
DBSNP=Reference/resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf
SNP=Reference/resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf.gz
INDEL=Reference/resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
## Generate the bam files
bwa mem -t 6 -M -Y -R '@RG\tID:s0101\tSM:s0101\tLB:WES\tPL:Illumina' $GENOME V300050587_L03_690_1.fq.gz V300050587_L03_690_2.fq.gz | \
samtools view -Sb - > s0101_V300050587_L03_690.bam
bwa mem -t 6 -M -Y -R '@RG\tID:s0101\tSM:s0101\tLB:WES\tPL:Illumina' $GENOME V300050682_L03_690_1.fq.gz V300050682_L03_690_2.fq.gz | \
samtools view -Sb - > s0101_V300050682_L03_690.bam
## Since the sample has 2 pairs of fastq files, merge them together
samtools merge -@ 6 -h s0101_V300050587_L03_690.bam s0101.bam s0101_V300050587_L03_690.bam s0101_V300050682_L03_690.bam
picard AddOrReplaceReadGroups \
-I s0101.bam \
-O s0101_ReplaceReadGroups.bam \
-RGID s0101 \
-RGLB WES \
-RGSM s0101 \
-RGPL Illumina \
-RGPU s0101_L0_690
## Sort using samtools
samtools sort -@ 30 -o s0101.sorted.bam s0101_ReplaceReadGroups.bam
picard -Xmx128g MarkDuplicates -I s0101.sorted.bam -O s0101.sorted.markdup.bam -M s0101.sorted.markdup.txt -REMOVE_DUPLICATES true
picard -Xmx128g BuildBamIndex -I s0101.sorted.markdup.bam
samtools flagstat -@ 30 s0101.sorted.markdup.bam >s0101.sorted.markdup.stat
gatk --java-options "-Xmx128G" BaseRecalibrator -R $GENOME -I s0101.sorted.markdup.bam \
--known-sites $DBSNP \
--known-sites $INDEL \
--known-sites $SNP \
-O recal_data_s0101.table
gatk --java-options "-Xmx128G" ApplyBQSR --bqsr-recal-file recal_data_s0101.table \
-R $GENOME \
-I s0101.sorted.markdup.bam \
-O s0101.sorted.markdup.BQSR.bam
gatk --java-options "-Xmx128G" HaplotypeCaller -R $GENOME \
-I s0101.sorted.markdup.BQSR.bam \
-ERC GVCF \
-O /public/home/hpc196501013/WES/Raw_fastq/gVCFs/s0101_sorted_markdup_realigned_recal.g.vcf.gz
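One quick sanity check for the suspected problem files: within one batch, a gVCF that is several times larger than its peers (for example because duplicated input data was aligned, or duplicate marking was missed) is a likely cause of slow merging. The helper below is hypothetical, not part of the pipeline above; it only compares file sizes against the batch median:

```shell
# Flag gVCFs whose size is more than 3x the batch median; an oversized file
# often points at duplicated input reads or missed duplicate marking.
check_gvcf_sizes() {
    local listing n median
    # one "size<TAB>name" line per file, sorted numerically by size
    listing=$(for f in "$@"; do printf '%s\t%s\n' "$(wc -c < "$f")" "$f"; done | sort -n)
    n=$(printf '%s\n' "$listing" | wc -l)
    median=$(printf '%s\n' "$listing" | awk -v n="$n" 'NR == int((n + 1) / 2) { print $1 }')
    printf '%s\n' "$listing" | awk -v m="$median" '$1 > 3 * m { print "OUTLIER", $2 }'
}
# usage: check_gvcf_sizes gVCFs/*.g.vcf.gz
```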
The script for combining the gVCF files:
GENOME=Reference/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta
gatk --java-options "-Xmx100G" CombineGVCFs -R $GENOME \
--variant Variant_list_WES.list \
-L chr1 \
-O CombinedgVCFs_cases_chr1.g.vcf.gz
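For a cohort of this size, GATK4 also provides GenomicsDBImport, which is the recommended alternative to CombineGVCFs for large numbers of samples: it imports gVCFs in batches and scales much better. The workspace path and sample-map file name below are placeholders, and the command is only echoed (dry run) so it can be inspected before running:

```shell
# Dry-run sketch of GenomicsDBImport for chr1. The sample map is a two-column
# file: "sample_name<TAB>/path/to/sample.g.vcf.gz", one line per sample.
SAMPLE_MAP=cohort_WES.sample_map     # placeholder file name
WORKSPACE=genomicsdb_chr1            # placeholder workspace (must not exist yet)
CMD="gatk --java-options -Xmx100G GenomicsDBImport \
  --sample-name-map $SAMPLE_MAP \
  --genomicsdb-workspace-path $WORKSPACE \
  --batch-size 50 \
  -L chr1"
echo "$CMD"    # inspect, then run with: eval "$CMD"
```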
Is there any problem in the code above, and how can I solve it? Thanks a lot!
You don't need AddOrReplaceReadGroups, since the read group was already specified in the bwa mem line.
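A sketch of that idea with one read group per fastq pair (lane): giving each lane a unique ID while keeping SM the same means the merged BAM still distinguishes lanes, and AddOrReplaceReadGroups can be dropped entirely. Lane names are taken from the post; the bwa call is only echoed here as a dry run:

```shell
SAMPLE=s0101
for LANE in V300050587_L03 V300050682_L03; do
    # unique ID per lane, same SM so both lanes still belong to one sample
    RG="@RG\tID:${SAMPLE}_${LANE}\tSM:${SAMPLE}\tLB:WES\tPL:Illumina"
    echo "bwa mem -t 6 -M -Y -R '$RG' \$GENOME" \
         "${LANE}_690_1.fq.gz ${LANE}_690_2.fq.gz" \
         "| samtools view -b - > ${SAMPLE}_${LANE}_690.bam"
done
# then merge the per-lane BAMs; their @RG lines are carried into the output:
# samtools merge -@ 6 s0101.bam s0101_V300050587_L03_690.bam s0101_V300050682_L03_690.bam
```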
Thanks for your suggestions! I think I found the problem. Among the WES subjects, some have only one pair of fastq files, while others have two or three pairs. For the subjects with two or more pairs, I ran bwa separately on each pair to generate a BAM file and then merged the BAMs with samtools merge. When I include these subjects, CombineGVCFs takes a very long time; if I exclude them, the elapsed time is acceptable. So my questions are: why do those subjects have two or more pairs of fastq files, and how can I process this data correctly? Thank you!

Are you using a bash script for this? You'd better use Snakemake or Nextflow.
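On the "how can I process this data correctly" part: multiple fastq pairs per sample usually just mean the same library was sequenced on several lanes or runs, which is normal; the per-lane BAMs only need distinct read-group IDs before merging. One defensive habit, shown as a hypothetical helper with a placeholder naming convention, is to derive each R2 path from its R1 so the two can never accidentally point at the same file, regardless of how many pairs a sample has:

```shell
# Print one "R1 R2" pair per line for every *_1.fq.gz in a directory,
# deriving R2 from R1 by suffix substitution. Assumes the common
# <prefix>_1.fq.gz / <prefix>_2.fq.gz naming convention.
pair_fastqs() {
    local r1
    for r1 in "$1"/*_1.fq.gz; do
        [ -e "$r1" ] || continue                        # directory had no matches
        printf '%s %s\n' "$r1" "${r1%_1.fq.gz}_2.fq.gz"
    done
}
# usage: pair_fastqs fastq/s0101
# each printed pair feeds one bwa mem call (with its own @RG ID), and the
# resulting per-lane BAMs then go to a single samtools merge
```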