For example, in the following record I don't understand some of the GT:AD:DP:GQ:PL information:
chr1 897723 rs6696911 C T 453.42 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=-0.479;DB;DP=36;Dels=0.00;FS=1.480;HRun=2;HaplotypeScore=0.0000;MQ=41.37;MQ0=0;MQRankSum=-0.578;QD=12.60;ReadPosRankSum=0.842
GT:AD:DP:GQ:PL 0/1:19,17:36:87.16:483,0,532
GT = 1/1: I'm pretty sure this means both alleles are T, whereas 1/0 would mean heterozygous for ref and SNP?
AD = 19,17 - I can't find an explanation of what AD means.
DP = 36 easy to understand
GQ = 87.16 Why are there two values in this field?
PL = 483,0,532 - I'm a bit baffled about this field?
For a biallelic site, the PL has three numbers, one per possible genotype. The first is the Phred-scaled likelihood that the sample is homozygous reference (0/0), the second that it is heterozygous (0/1), and the third that it is homozygous for the alternate allele (1/1). These are not probabilities: the values are normalized so that the most likely genotype gets 0, and the higher the number, the less likely it is that your sample is that genotype. So if your PL is 483,0,532, the software is quite sure that your sample is not homozygous reference or homozygous alternate; it's heterozygous. And the GT shows that, by being 0/1. If the first and last numbers had been lower, the quality of the SNP would be poorer and the genotype call would be less confident.
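To make the three numbers concrete, here is a minimal Python sketch, assuming a diploid sample and the genotype ordering given in the VCF spec (the index of genotype j/k with j <= k is k*(k+1)/2 + j). The function names are made up for illustration and are not part of any library.

```python
def genotype_order(n_alleles):
    """Diploid genotypes in VCF PL order: index of j/k (j <= k) is k*(k+1)/2 + j."""
    return [f"{j}/{k}" for k in range(n_alleles) for j in range(k + 1)]


def pl_to_relative_likelihoods(pl):
    """Undo the Phred scaling: PL = -10 * log10(likelihood relative to the best genotype)."""
    return [10 ** (-p / 10) for p in pl]


pl = [483, 0, 532]                      # PL values from the record in the question
genotypes = genotype_order(2)           # one REF + one ALT allele -> 3 genotypes
rel = pl_to_relative_likelihoods(pl)    # the best genotype has relative likelihood 1.0
best = genotypes[pl.index(min(pl))]     # genotype with the smallest PL

for gt, p, r in zip(genotypes, pl, rel):
    print(f"{gt}\tPL={p}\trelative likelihood={r:.3g}")
print("most likely genotype:", best)
```

With two alleles there are three genotypes, which is why PL has three values here; a site with two alternate alleles would carry six.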
I think that means you have 19 reads showing the reference allele, and 17 reads showing the alternate allele. Those do add up to 36, which is your total depth.
From a VCF file generated by GATK's UnifiedGenotyper:
##fileformat=VCFv4.1
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
AD is not described in the VCF v4.1 format specification, although the spec does mention: "Additional Genotype fields can be defined in the meta-information. However, software support for such fields is not guaranteed."
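Since tool-specific fields like this are only documented in the file's own header, it can help to read the ##FORMAT lines programmatically. Below is a rough Python sketch that assumes the header lines look exactly like the ones quoted above (ID, Number, Type, Description, in that order). The regex and function name are my own, not from any VCF library, and a real parser should handle more variation.

```python
import re

# Assumes the key order ID, Number, Type, Description, as in the header quoted above.
FORMAT_RE = re.compile(
    r'##FORMAT=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="(.*)">'
)


def parse_format_header(line):
    """Return the fields of one ##FORMAT line as a dict, or None if it does not match."""
    m = FORMAT_RE.match(line.strip())
    if m is None:
        return None
    field_id, number, value_type, description = m.groups()
    return {"ID": field_id, "Number": number, "Type": value_type, "Description": description}


header = [
    '##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">',
    '##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">',
    '##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">',
]
fields = {f["ID"]: f for f in map(parse_format_header, header)}
print(fields["PL"]["Number"], fields["GQ"]["Type"])
```

Number=G means one value per possible genotype, and Number=1 with Type=Float means GQ is a single decimal number.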
AD = Allelic Depth, which is the number of reads that have the reference vs. non-reference base. In this case, 19 ref and 17 alternate.
These two values will usually, but not always, sum to the DP value. Reads that are not used for calling are not counted in the DP measure, but are included in AD.
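If you want to check this on your own records, here is a small Python sketch that splits the FORMAT and sample columns from the record in the question and compares the sum of AD with DP. The helper name parse_sample is hypothetical, and the code assumes a single sample with no missing ('.') values.

```python
def parse_sample(format_col, sample_col):
    """Pair each FORMAT key with the sample value in the same position."""
    return dict(zip(format_col.split(":"), sample_col.split(":")))


sample = parse_sample("GT:AD:DP:GQ:PL", "0/1:19,17:36:87.16:483,0,532")

ad = [int(x) for x in sample["AD"].split(",")]   # [ref reads, alt reads]
dp = int(sample["DP"])                           # reads used for calling
gq = float(sample["GQ"])                         # one float value, not two
alt_fraction = ad[1] / sum(ad)                   # share of reads supporting ALT

print("AD sums to DP:", sum(ad) == dp)
print(f"alt allele fraction: {alt_fraction:.2f}")
```

An alt fraction near 0.5 is what you would expect for a heterozygous call like this one.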
For another reference, try the 1000 Genomes site. AD stands for allele depth, GQ is genotype quality (a single float value), and PL is the Phred-scaled genotype likelihood.
Thanks for the link, but I have been there also. Thanks for the AD definition. I understand that PL is the Phred-scaled genotype likelihood, but why are there three values? Thanks.
Also, why are there two genotype quality values? Thanks.
There is only one GQ value. It is 87.16 (87 and 16/100), a single decimal number.