Question: What is a bad GATK Genotype Quality?
devenvyas (Stony Brook) wrote, 3.8 years ago:

I am preparing admixture analyses involving the Neanderthal and Denisovan genomes, and I have downloaded extended (i.e., generally not vcftools-friendly) VCF files (http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/VCF/ and http://cdna.eva.mpg.de/denisova/VCF/hg19_1000g/). The files were originally made with GATK, but the authors greatly modified the files; thus, they aren't standard VCFs.

I've got them down to just the sites that I have modern data for, so they are only a fraction as massive as the original files.

A few hundred sites have been marked LowQual, and I have removed them using grep -v; however, LowQual reflects more than just genotype quality (i.e., LowQual sites have GQs as high as 59, while for the Neanderthal 8,767 of 511,858 non-LowQual sites have GQ < 60).

I was wondering what would be a good GQ cut-off to use for the non-LowQual sites?

Also, any suggestions on how to filter them? (Remember I cannot use VCFtools or GATK or anything similar due to the non-standard formatting)

Here is an example line; GQ is the third colon-separated value in the last column (72.23):

1    5031561    rs7518523    A    G    909.02    .    AC=2;AF=1.00;AN=2;DP=24;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.9665;MQ=39.43;MQ0=0;QD=37.88;1000gALT=G;AF1000g=0.34;AFR_AF=0.70;AMR_AF=0.28;ASN_AF=0.17;EUR_AF=0.27;UR;TS=HPGOMC;TSseq=A,G,A,G,G,A;CAnc=A;GAnc=A;OAnc=G;bSC=987;mSC=0.000;pSC=0.007;GRP=-2.27;Map20=1    GT:DP:GQ:PL:A:C:G:T:IR    1/1:24:72.23:942,72,0:0,0:0,0:10,14:0,0:0
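A minimal sketch of how the GQ could be pulled out of a line like this in plain Python, by pairing the FORMAT keys (column 9) with the sample values (column 10). The function name `sample_fields` is hypothetical, and splitting on any whitespace is an assumption that works for the line as pasted here; the real files are presumably tab-separated.

```python
# The example record from the post, rebuilt as one whitespace-separated string.
EXAMPLE = (
    "1 5031561 rs7518523 A G 909.02 . "
    "AC=2;AF=1.00;AN=2;DP=24;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.9665;"
    "MQ=39.43;MQ0=0;QD=37.88;1000gALT=G;AF1000g=0.34;AFR_AF=0.70;AMR_AF=0.28;"
    "ASN_AF=0.17;EUR_AF=0.27;UR;TS=HPGOMC;TSseq=A,G,A,G,G,A;CAnc=A;GAnc=A;"
    "OAnc=G;bSC=987;mSC=0.000;pSC=0.007;GRP=-2.27;Map20=1 "
    "GT:DP:GQ:PL:A:C:G:T:IR "
    "1/1:24:72.23:942,72,0:0,0:0,0:10,14:0,0:0"
)


def sample_fields(line):
    """Map FORMAT keys to the single sample's values for one record line.

    Assumes no column contains whitespace and that there is exactly one
    sample column, as in the example above.
    """
    cols = line.split()
    keys = cols[8].split(":")    # FORMAT column, e.g. GT:DP:GQ:...
    values = cols[9].split(":")  # sample column, same order as the keys
    return dict(zip(keys, values))


fields = sample_fields(EXAMPLE)
gq = float(fields["GQ"])
print(fields["GT"], gq)
```

Looking the value up by key rather than by fixed position keeps this working even if a record lists its FORMAT fields in a different order.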

 


RamRS commented, 3.8 years ago:

> (Remember I can used VCFtools or GATK or anything similar due to the non-standard formatting)

Can or cannot use GATK/vcftools?

devenvyas replied, 3.8 years ago:

I cannot use either of them.
devenvyas (Stony Brook) wrote, 3.8 years ago:

Since I didn't get any suggestions, I sought out how this dataset has been used recently in the literature.

I found sources by Qin and Stoneking dx.doi.org/10.1093/molbev/msv141 and Lazaridis et al. dx.doi.org/10.1038/nature13673 (the former cites the latter), which suggest filtering out the LowQual sites as well as sites with GQ < 30 and sites with QUAL < 50.

Just thought to pass this along for anyone else using these datasets.
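For anyone who wants to apply these cut-offs without VCFtools or GATK, here is a rough plain-Python sketch. It assumes tab-separated records with QUAL in column 6, FILTER in column 7, FORMAT in column 9, and a single sample in column 10. The names `passes` and `filter_vcf` are hypothetical, and dropping records whose QUAL or GQ cannot be parsed is my own choice, not something taken from the papers.

```python
MIN_GQ = 30.0    # drop sites with GQ below this
MIN_QUAL = 50.0  # drop sites with QUAL below this


def passes(line, min_gq=MIN_GQ, min_qual=MIN_QUAL):
    """Return True if a record is not LowQual and meets both thresholds."""
    cols = line.rstrip("\n").split("\t")
    try:
        if "LowQual" in cols[6].split(";"):
            return False
        qual = float(cols[5])
        fmt = dict(zip(cols[8].split(":"), cols[9].split(":")))
        gq = float(fmt["GQ"])
    except (IndexError, KeyError, ValueError):
        return False  # assumption: unparsable or incomplete records are dropped
    return qual >= min_qual and gq >= min_gq


def filter_vcf(lines, min_gq=MIN_GQ, min_qual=MIN_QUAL):
    """Yield header lines unchanged and only those records that pass."""
    for line in lines:
        if line.startswith("#") or passes(line, min_gq, min_qual):
            yield line


# Tiny made-up demonstration (only the first record mirrors the real example).
demo = [
    "##fileformat=VCFv4.1\n",
    "1\t5031561\trs7518523\tA\tG\t909.02\t.\tDP=24\tGT:DP:GQ\t1/1:24:72.23\n",
    "1\t100\t.\tA\tG\t909.02\tLowQual\tDP=24\tGT:DP:GQ\t1/1:24:72.23\n",
    "1\t200\t.\tA\tG\t909.02\t.\tDP=24\tGT:DP:GQ\t1/1:24:12.50\n",
    "1\t300\t.\tA\tG\t31.00\t.\tDP=24\tGT:DP:GQ\t1/1:24:72.23\n",
]
kept = list(filter_vcf(demo))
print(len(kept), "of", len(demo), "lines kept")
```

On a real file this could be run as `for line in filter_vcf(open(path)): out.write(line)`, with `path` and `out` supplied by the caller.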

 

 

vdauwera (Cambridge, MA) wrote, 3.8 years ago:

The most important point here is to understand the difference between variant site-level quality (INFO) and sample-level quality (genotype/FORMAT). Depending on what you're trying to learn from your data, the GQ may or may not matter. GQ describes how sure we are that we have the right genotype; for high-quality variant sites that have made it past INFO-level filtering, a low GQ just means we're confident there is variation at the site but not sure whether that variation is in the heterozygous or homozygous-variant form. Like I said, depending on what you're studying, that may or may not matter.
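To make the genotype-level idea concrete, here is a small sketch using the PL field from the example line in the question (942,72,0). It assumes a biallelic site with PLs in the usual 0/0, 0/1, 1/1 order, and it assumes that GQ is roughly the gap between the best and second-best PL, which is how GATK generally derives it; the exact rounding and capping vary between versions, and these extended files may differ. The name `rank_genotypes` is hypothetical.

```python
# PL order for a biallelic site: hom-ref, het, hom-var (lower PL = more likely).
GENOTYPES = ("0/0", "0/1", "1/1")


def rank_genotypes(pl_field):
    """Return (best genotype, runner-up genotype, PL gap between them)."""
    pls = [int(x) for x in pl_field.split(",")]
    order = sorted(range(len(pls)), key=pls.__getitem__)
    best, runner_up = order[0], order[1]
    return GENOTYPES[best], GENOTYPES[runner_up], pls[runner_up] - pls[best]


best, runner_up, margin = rank_genotypes("942,72,0")
print(best, runner_up, margin)
```

For the example record the gap comes out at 72, close to the reported GQ of 72.23, and the runner-up is the heterozygous call: the uncertainty GQ measures here is between 0/1 and 1/1, not about whether the site is variant at all.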

The second most important point is that it sounds like whoever prepared the files only used the built-in QUAL-based filtering, not a proper filtering method like variant recalibration (VQSR). Rather than focusing on GQ, you should look into applying proper filtering at the site level. 

My recommendation would be to figure out how to MacGyver this poorly formatted VCF file into shape so you can use GATK or other well-designed tools for filtering, rather than putting effort into working with it as it stands.

devenvyas replied, 3.8 years ago (modified 3.8 years ago):

They are not poorly formatted VCF files. They use an extended, non-standard format (i.e., non-standard != poor). For example, the standard GATK format is not amenable to triallelic sites; they had sites that are biallelic in modern humans, but the Neanderthal/Denisovan was heterozygous with a third allele. (http://www.sciencemag.org/content/suppl/2012/08/29/science.1224344.DC1/Meyer.SM.pdf pp. 16-20; http://www.nature.com/nature/journal/v505/n7481/extref/nature12886-s1.pdf p. 14). I've contacted one of the creators of the files in the past, and he has said that trying to use GATK/vcftools with these files would not be a good idea and that Python or pysam would be the best way to go.

You are jumping to a lot of conclusions about the filtering. Based on the Meyer link above, they used more than what you think, with multiple iterations of genotyping. (These files are from large-scale ancient DNA genome projects; they are not going to be that sloppy.)

Given the fact that the VCFs are from ancient DNA, GQ is probably important (and there are some analyses in those supplemental docs indicating that lower GQ values have biased some dating analyses).