Hi everyone. I tried to apply
GATK pipeline to
amplicon-based mtDNA sequencing data analyse and found some problems want to ask/discuss with you guys.
I sat up my pipeline according this GATK Best Practice. Here is my pipeline:
bwa mem -t 4 -R "@RG\tID..." hs_ref_GRCh38.p12_chrMT.fa sample.R1.fastq.gz sample.R2.fastq.gz > sample.sam
- SAM to BAM(samtools)
samtools view -bF 12 sample.sam > sample.bam
- sort BAM(GATK SortSam)
gatk SortSam -I sample.bam -O sample_sorted.bam -SO coordinate
- BAM index(samtools index)
- Validate BAM(GATK ValidateSamFile)
gatk ValidateSamFile -I sample_sorted.bam -M VERBOSE
- Call variant(GATK HaplotypeCaller)
gatk HaplotypeCaller -I samplt_sorted.bam -O sample.g.vcf -R hs_ref_GRCh38.p12_chrMT.fa -ploidy 1 -ERC GVCF
- Consolidate GVCFs(GATK GenomicsDBImport)
gatk GenomicsDBImport --genomicsdb-workspace-path ./workspace -L MT -V sample1.g.vcf -V sample2.g.vcf ...(many samples)
- Joint-Call Cohort(GATK GenotypeGVCFs)
gatk GenotypeGVCFs -O 18R111.vcf -R hs_ref_GRCh38.p12_chrMT.fa -ploidy 1 -V gendb://workspace
Here is something I want to explain first
- No mark/remove duplicates because this is amplicon-based(multi-PCR) sequencing data, most of them are duplicates normally(fastp reported about 96%).
- I set
-ploidy 1in step 7 and 9 because mtDNA should treated as haploid. However I can't set step 8 to
haploidbecause this only for diploid. Second, I find description below in this GATK article: > The one quantity that changes whether the GATK considers the possibility of a heterozygous genotype at all is the ploidy, which describes how many copies of each chromosome each individual in the species carries.
It seems this ploidy setting only works for nuclear genome, means the copies of chromosome. But this not quite right for mtDNA for copy number not sure(At least not 1 copy). Question1: What will this affect my result?
When I check my finally vcf result I find 2 problems, result example:
MT 214 . A G 2386.46 ... GT:AD:DP:GQ:PL 1:0,112:112:99:5445,0 0:136,0:136:99:0,1080...
There problem is AD and DP too low value(Important to me, I need to calculate each variant's frequency.). The real depth value is about 5000, I find many has the same issue(check here). This seems because
HaplotypeCaller not suite for amplicon data. Question2: Have anyone call such mtDNA variant using GATK? How?
Mutect2 work here?
I think about this because mtDNA heterogeneity are like tumor, while tumor has many other features that mtDNA don't like structure variants. Any one think this before? What should be
normal tissue data (-I:normal param in Mutect2) if I apply
Mutect2 to mtDNA? Reference sequence?