Small overlap of 1KG Sample file with dbSNP and same sample from phased file
0
0
Entering edit mode
13 months ago
berndmann ▴ 10

I'm looking into variant calling right now and still can't understand the results of my VCF output.

My steps so far are the following:

Step 1: Download all necessary files:

Download the sample from phase3 1KG : HG02024

Download the related reference genome file : hs37d5.fa.gz

Download the already phased samples file for chr1 : Phased file of phase 3 containing related individuals

Download a dbSNP version that is at least dbSNP 135. I used dbSNP 151

Step 2: Call the variants of chr1 of HG02024 I used the following simple calling pipeline in R with the HaploTypeCaller from GATK

call = str_glue(' python3 $HOME/gatk/gatk --java-options "-Xmx28g" HaplotypeCaller --native-pair-hmm-threads 24 -R  {reference}/GRCh37/references_hs37d5_hs37d5.fa -I HG02024.mapped.ILLUMINA.bwa.KHV.low_coverage.20130415_sort.bam -L 1 -O HG02024.mapped.ILLUMINA.bwa.KHV.low_coverage.20130415_sort.vcf -ERC GVCF')
system(call)

This resulted in a file that contains 57 million positions. I still believe that these are to much or at least those can't be all SNPs right? One would expect something around less then 9 million snps I would say. So roughly 8% (share of chr 1 on the full human chr) of 113,862,023 SNPs in dbSNP 151

Step 3: Compare the variants with dbsnp and the phased file

I wrote a short bash script to do this :

#!/bin/bash

#Check if the required number of arguments is provided
if [ "$#" -ne 3 ]; then
    echo "Usage: $0 <input_vcf> <dbsnp_vcf> <output_vcf>"
    exit 1
fi

input_vcf=$1
dbsnp_vcf=$2
output_vcf=$3

#Extract positions from the input VCF file
grep -v "^#" "$input_vcf" | awk '{print $1 "\t" $2}' > input_positions.txt

#Subset the dbSNP VCF file to overlap with the input VCF file
awk 'FNR==NR {a[$1,$2]; next} !($1,$2) in a' input_positions.txt "$dbsnp_vcf" > "$output_vcf"

#Clean up the temporary file
rm input_positions.txt

echo "Reduced dbSNP VCF file saved as $output_vcf"

When I compare the overlap between dbSNP and the called file I observe an overlap of 13,268,727 SNPS. The overlap between the called file and the phased file I observed was only 1,963,447 SNPs out of 6,468,094 SNPs.

What is the reasoning that the overlap between the called file and the phased file is that low?

EDIT: I also tried to call the 30x bam file of the sample and perform a liftover. Thereby the overlap between this called and lifted vcf and the phase file was even lower (932,179 SNPs).

HaploTypeCaller 1KG dbSNP • 416 views
ADD COMMENT

Login before adding your answer.

Traffic: 1996 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6