What is a bad GATK Genotype Quality?
8.8 years ago
devenvyas ▴ 740

I am preparing admixture analyses involving the Neanderthal and Denisovan genomes, and I have downloaded extended (i.e., generally not vcftools-friendly) VCF files (http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/VCF/ and http://cdna.eva.mpg.de/denisova/VCF/hg19_1000g/). The files were originally made with GATK, but the authors greatly modified the files; thus, they aren't standard VCFs.

I've got them down to just the sites that I have modern data for, so they are only a fraction as massive as the original files.

A few hundred sites have been marked LowQual and have been removed using grep -v; however, LowQual reflects more than just genotype quality (i.e., LowQual sites have GQs as high as 59, but for the Neanderthal 8,767 of 511,858 non-LowQual sites have GQ<60).

I was wondering what would be a good GQ cut-off to use for the non-LowQual lines?

Also, any suggestions on how to filter them? (Remember I cannot use VCFtools or GATK or anything similar due to the non-standard formatting)

Here is an example line; the FORMAT fields, including GQ, are broken out below it.

1    5031561    rs7518523    A    G    909.02    .    AC=2;AF=1.00;AN=2;DP=24;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.9665;MQ=39.43;MQ0=0;QD=37.88;1000gALT=G;AF1000g=0.34;AFR_AF=0.70;AMR_AF=0.28;ASN_AF=0.17;EUR_AF=0.27;UR;TS=HPGOMC;TSseq=A,G,A,G,G,A;CAnc=A;GAnc=A;OAnc=G;bSC=987;mSC=0.000;pSC=0.007;GRP=-2.27;Map20=1    GT:DP:GQ:PL:A:C:G:T:IR    1/1:24:72.23:942,72,0:0,0:0,0:10,14:0,0:0
GT 1/1
DP 24
GQ 72.23
PL 942,72,0
A 0,0
C 0,0
G 10,14
T 0,0
IR 0
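
Since the standard tools are off the table, the breakdown above can be reproduced in plain Python by pairing the FORMAT keys with the sample column. This is only a minimal sketch: it assumes the real files are tab-separated with a single sample in the tenth column, and `sample_fields` is a made-up helper name, not part of any library.

```python
def sample_fields(line):
    """Map FORMAT keys to the values in the single sample column."""
    cols = line.rstrip("\n").split("\t")
    fmt, sample = cols[8], cols[9]  # assumption: one sample, in column 10
    return dict(zip(fmt.split(":"), sample.split(":")))


# The example record from above, rebuilt as a tab-separated line.
example = "\t".join([
    "1", "5031561", "rs7518523", "A", "G", "909.02", ".",
    "AC=2;AF=1.00;AN=2;DP=24;Dels=0.00;FS=0.000;HRun=0;"
    "HaplotypeScore=0.9665;MQ=39.43;MQ0=0;QD=37.88;1000gALT=G;"
    "AF1000g=0.34;AFR_AF=0.70;AMR_AF=0.28;ASN_AF=0.17;EUR_AF=0.27;UR;"
    "TS=HPGOMC;TSseq=A,G,A,G,G,A;CAnc=A;GAnc=A;OAnc=G;bSC=987;"
    "mSC=0.000;pSC=0.007;GRP=-2.27;Map20=1",
    "GT:DP:GQ:PL:A:C:G:T:IR",
    "1/1:24:72.23:942,72,0:0,0:0,0:10,14:0,0:0",
])

fields = sample_fields(example)
print(fields["GT"], fields["DP"], float(fields["GQ"]))  # 1/1 24 72.23
```

Looking fields up by name rather than by position keeps this working if a record carries a different FORMAT string.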
quality SNP • 12k views

(Remember I can used VCFtools or GATK or anything similar due to the non-standard formatting)

can or cannot use GATK/vcftools?

I cannot use either of them
8.8 years ago
devenvyas ▴ 740

Since I didn't get any suggestions, I sought out how this dataset has been used recently in the literature.

I found papers by Qin and Stoneking (dx.doi.org/10.1093/molbev/msv141) and Lazaridis et al. (dx.doi.org/10.1038/nature13673) (the former cites the latter), which suggest filtering out the LowQual sites as well as sites with GQ < 30 or QUAL < 50.

Just thought to pass this along for anyone else using these datasets.
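
For anyone who wants to apply those cut-offs without VCFtools or GATK, here is a rough Python sketch. It rests on several assumptions that should be checked against the actual files: columns are tab-separated, LowQual appears in the FILTER column (column 7), there is one sample in column 10, and a record with a missing or non-numeric QUAL or GQ is treated as failing. The names `passes` and `filter_lines` are my own, and the demo records are toy data, not real sites.

```python
def passes(line, min_gq=30.0, min_qual=50.0):
    """Return True if a data line survives the LowQual, QUAL and GQ filters."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 10:
        return False  # blank or truncated line
    qual, filt, fmt, sample = cols[5], cols[6], cols[8], cols[9]
    if "LowQual" in filt.split(";"):
        return False
    fields = dict(zip(fmt.split(":"), sample.split(":")))
    try:
        return float(qual) >= min_qual and float(fields["GQ"]) >= min_gq
    except (KeyError, ValueError):
        # Assumption: missing or non-numeric QUAL/GQ counts as failing.
        return False


def filter_lines(lines, **thresholds):
    """Yield header lines unchanged and only the data lines that pass."""
    for line in lines:
        if line.startswith("#") or passes(line, **thresholds):
            yield line


def toy(pos, qual, filt, gq):
    """Build a toy record (illustration only, not real data)."""
    return "\t".join(["1", pos, ".", "A", "G", qual, filt, ".",
                      "GT:GQ", "1/1:" + gq]) + "\n"


demo = [
    "##fileformat=VCFv4.1\n",
    toy("100", "909.02", ".", "72.23"),       # kept
    toy("200", "60.00", "LowQual", "59.00"),  # dropped: LowQual
    toy("300", "40.00", ".", "80.00"),        # dropped: QUAL < 50
    toy("400", "500.00", ".", "12.50"),       # dropped: GQ < 30
]
kept = list(filter_lines(demo))
print([l.split("\t")[1] for l in kept if not l.startswith("#")])  # ['100']
```

For a real file, `filter_lines` can be fed an open file handle (or a `gzip.open(path, "rt")` handle) and the surviving lines written straight to an output file, so the whole file never has to sit in memory.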

8.8 years ago
vdauwera ★ 1.2k

The most important point here is to understand the difference between variant site-level (INFO) quality and sample-level (genotype/FORMAT) quality. Depending on what you're trying to learn from your data, the GQ may or may not matter. GQ describes how sure we are that we have the right genotype; for high-quality variant sites that have made it past INFO-level filtering, a low GQ just means we're confident there is variation at the site but not sure whether that variation is in the heterozygous or homozygous-variant form. Like I said, depending on what you're studying, that may or may not matter.
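
To make that concrete: in standard GATK output, GQ is, to my understanding, the gap between the two smallest PL values (the best genotype is normalized to PL 0), capped at 99. The sketch below applies that rule to the PL from the example line in the question. Whether these modified files follow exactly the same convention is an assumption, although the example's GQ of 72.23 against a PL of 942,72,0 is consistent with it. `gq_from_pl` and `runner_up` are illustrative names only.

```python
# Biallelic genotype order for PL values, per the VCF convention.
GENOTYPES = ["0/0", "0/1", "1/1"]


def gq_from_pl(pl, cap=99):
    """Gap between the two smallest PLs, capped (assumed GATK-style rule)."""
    ordered = sorted(pl)
    return min(ordered[1] - ordered[0], cap)


def runner_up(pl):
    """Genotype with the second-smallest PL, i.e. the main alternative call."""
    order = sorted(range(len(pl)), key=lambda i: pl[i])
    return GENOTYPES[order[1]]


pl = [942, 72, 0]  # PL from the example line: 0/0, 0/1, 1/1
print(gq_from_pl(pl), runner_up(pl))  # 72 0/1
```

Here the called genotype is 1/1 and the runner-up is 0/1, while 0/0 is far less likely: the site is almost certainly variant, and GQ only measures how confidently the homozygous call beats the heterozygous one.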

The second most important point is that it sounds like whoever prepared the files only used the built-in QUAL-based filtering, not a proper filtering method like variant recalibration (VQSR). Rather than focusing on GQ, you should look into applying proper filtering at the site level.

My recommendation would be to figure out how to MacGyver this poorly formatted VCF file into shape so you can use GATK or other well-designed tools for filtering, rather than putting effort into working with it as it stands.


They are not poorly formatted VCF files. They use an extended, non-standard format (i.e., non-standard != poor). For example, the standard GATK format is not amenable to triallelic sites; they had sites that are biallelic in modern humans, but where the Neanderthal/Denisovan was heterozygous with a third allele (http://www.sciencemag.org/content/suppl/2012/08/29/science.1224344.DC1/Meyer.SM.pdf pp. 16-20; http://www.nature.com/nature/journal/v505/n7481/extref/nature12886-s1.pdf p. 14). I've contacted one of the creators of the files in the past, and he said that trying to use GATK/vcftools with these files would not be a good idea and that python or pysam would be the best way to go.

You are jumping to a lot of conclusions about the filtering. Based on the Meyer link above, they used more than what you think, with multiple iterations of genotyping. (These files are from large-scale ancient DNA genome projects; they are not going to be that sloppy.)

Given that the VCFs are from ancient DNA, GQ is probably important (and there are some analyses in those supplemental docs indicating that lower GQ values have biased some dating analyses).
