I'm browsing the 1000 Genomes Project data and looking at the VCF files from phase 1 (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/).
Based on the genotype likelihoods (GL), I want to calculate a Phred-like score for the most likely genotype of each individual. But I'm a bit stuck on the interpretation of the genotype data in these files.
REF ALT FORMAT HG00099 ...
A C GT:DS:GL 0|0:0.000:-0.10,-0.69,-4.10 ...
As I read this, the most likely genotype is AA (0|0), and DS agrees with that (dosage of the ALT allele = 0.000). But I can't work out how to interpret the GLs. They look log10-scaled (as explained in the documentation) and normalized to the most likely genotype, but if that is the case, why is the first value -0.10 instead of 0? Can anybody help me out with this?
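To make the question concrete, here is a minimal sketch that converts the GL values from the example line back to linear scale. It assumes the GLs are plain log10 likelihoods in the usual biallelic order (0/0, 0/1, 1/1); the variable names are my own. For this sample the linear values add up to roughly 0.9986, which might mean they are scaled to sum to one rather than to put the best genotype at 0, but I may be misreading it.

```python
# Example FORMAT and sample columns copied from the VCF line above.
format_field = "GT:DS:GL"
sample_field = "0|0:0.000:-0.10,-0.69,-4.10"

# Pair each FORMAT key with its value for this sample.
values = dict(zip(format_field.split(":"), sample_field.split(":")))

# Assumption: GL holds log10-scaled likelihoods ordered 0/0, 0/1, 1/1.
log10_gls = [float(x) for x in values["GL"].split(",")]

# Undo the log10 scaling to get values on the linear scale.
linear = [10 ** g for g in log10_gls]
total = sum(linear)

print("linear likelihoods:", [round(p, 5) for p in linear])
print("sum:", round(total, 4))
```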
For my purpose, the GQ value would be ideal. Why isn't it in these VCFs? I suppose I could derive it by summing the likelihoods of the less likely genotypes and converting that sum to a Phred-like score?
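This is roughly the calculation I have in mind, as a sketch rather than a definitive method. It assumes a flat prior over genotypes, so the normalized likelihoods are treated as posteriors, and it takes the probability that the call is wrong to be the summed probability of the other genotypes. The function name `genotype_quality` is hypothetical and not taken from any existing tool.

```python
import math

def genotype_quality(log10_gls):
    """Return (index of best genotype, Phred-like quality) from log10 GLs.

    Assumption: flat prior over genotypes, so normalized likelihoods
    are used directly as posterior probabilities.
    """
    # Subtract the maximum before exponentiating to avoid underflow.
    max_gl = max(log10_gls)
    scaled = [10 ** (g - max_gl) for g in log10_gls]
    total = sum(scaled)
    posteriors = [s / total for s in scaled]

    # Most likely genotype and the probability that it is wrong.
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    p_error = sum(p for i, p in enumerate(posteriors) if i != best)

    if p_error <= 0.0:
        # The other genotypes underflowed to zero; quality is unbounded.
        return best, math.inf
    return best, -10.0 * math.log10(p_error)

# GL values for HG00099 from the example line.
best, gq = genotype_quality([-0.10, -0.69, -4.10])
print("best genotype index:", best)
print("Phred-like quality:", round(gq, 2))
```

For the example sample this gives a low score (just under 7), because the heterozygous genotype still carries about 20% of the probability.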