Low SNP Overlap with Michigan 1KG and TopMed reference panel
Entering edit mode
13 months ago
berndmann ▴ 10

I extracted three samples (HG02024 - HG02026) from the 1000 Genomes Project's 30x alignment files, employing the Genome Analysis Toolkit (GATK) best practice pipeline. This process involved performing base quality score recalibration, identifying and removing duplicate reads, utilizing the HaplotypeCaller to generate a genomic VCF (gVCF) file, and calling variants from the gVCF. The corresponding R code for these steps appears as follows

#generate base-score recalibration file  
call = glue("python3 $HOME/gatk/gatk BaseRecalibrator -I {i} -R {reference}/unsorted/GRCh38_full_analysis_set_plus_decoy_hla.fa  --known-sites {reference}/dbsnp/dbsnp_b151_GRCh38p7_chr.vcf.gz -O {out}.bsrq.txt ")
 #apply base-score recalibration to bam
 call = glue(" python3 $HOME/gatk/gatk ApplyBQSR -R {reference}/unsorted/GRCh38_full_analysis_set_plus_decoy_hla.fa -I {i} --bqsr-recal-file {out}.bsrq.txt -O {out}_bsrq.bam")

 #mark and remove duplicates
 call = glue("java -jar {tools}picard.jar MarkDuplicates -I {out}_bsrq.bam -O {out}_bsrq_rmdup.bam -M marked_dup_metrics.txt --REMOVE_DUPLICATES")

 #sort and index the file for the haplotypecaller
 call = glue("samtools index {out}_bsrq_rmdup.bam")

 #generate gvcf
 call = glue(" python3 $HOME/gatk/gatk --java-options '-Xmx28g' HaplotypeCaller --native-pair-hmm-threads 5 -R  {reference}/unsorted/GRCh38_full_analysis_set_plus_decoy_hla.fa -I {out}_bsrq_rmdup.bam -O {out}_bsrq_rmdup.gvcf.vcf.gz   -ERC GVCF")

 #get interval of the files
 L <- i  %>% str_extract('chr(\\d+|X|Y)')

#genotype gvcfs
call = glue("python3 $HOME/gatk/gatk GenotypeGVCFs -R  {reference}/unsorted/GRCh38_full_analysis_set_plus_decoy_hla.fa -V {out}_bsrq_rmdup.gvcf.vcf.gz  --include-non-variant-sites -O {out}_bsrq_rmdup.vcf.gz -L {L}  ")

I attempted to phase the three samples using both the Michigan Imputation Server and TopMed in order to compare the phasing results obtained with different reference panels. However, this procedure encountered issues even when employing the 1000 Genomes Project's 30x Hg38 reference panel. Instead of achieving a complete 100% overlap as expected, I observed a substantially lower overlap of 64.83%.

From Michigan Imputation Server:

1 valid VCF file(s) found.

Samples: 53
Chromosomes: 21
SNPs: 118947
Chunks: 3
Datatype: unphased
Build: hg38
Reference Panel: apps@1000g-phase3-deep@1.0.0 (hg38)
Population: all
Phasing: eagle
Mode: imputation

Quality Control

Calculating QC Statistics

Alternative allele frequency > 0.5 sites: 49,935
Reference Overlap: 64.83 %
Match: 62,365
Allele switch: 0
Strand flip: 0
Strand flip and allele switch: 0
A/T, C/G genotypes: 0
Filtered sites:
Filter flag set: 0
Invalid alleles: 17,968
Multiallelic sites: 3,775
Duplicated sites: 0
NonSNP sites: 0
Monomorphic sites: 0
Allele mismatch: 657
SNPs call rate < 90%: 0

Excluded sites in total: 22,400
Remaining sites in total: 62,365
See snps-excluded.txt for details
Typed only sites: 34,182
See typed-only.txt for details

Warning: 1 Chunk(s) excluded: reference overlap < 50.0% (see chunks-excluded.txt for details).
Remaining chunk(s): 2

I also noticed a smaller overlap when using TopMed or HLA reference panels, which is logical considering the likely reduced number of samples in those panels. Could there be any apparent error in my procedure that I might be overlooking??

Michigan-Server Phasing TopMed • 815 views
Entering edit mode

Most of the exclusion is the invalid alleles, look at what those are. iirc the imputation server only accepts biallelic single nucleotide variants


Login before adding your answer.

Traffic: 1971 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6