Question: How to interpret genotype likelihoods?
2
gravatar for philip
6.6 years ago by
philip10
Belgium
philip10 wrote:

I'm browsing the 1000 genomes project data and looking at the VCF files from phase 1 (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/)

Based on genotype likelihoods (GL) I want to calculate a phred-like score for the most likely genotype (per individual). But I'm a bit stuck on the interpretation of the genotype data in these files.

For instance:

REF   ALT   FORMAT     HG00099                           ...  
A       C       GT:DS:GL   0|0:0.000:-0.10,-0.69,-4.10   ...

As I read this, the observed (most likely) genotype is AA (0|0). DS has the appropriate score (dosage of ALT allele = 0.000). But I can't work out how to interpret the GLs. These look log10-scaled (as explained in the documentation) and normalized for the most likely genotype but if that is the case why is the first value -0.10 instead of 0? Can anybody help me out on this? 

For my purpose, the GQ value would be great. Why isn't it in these VCFs? I suppose you can get it by summing the likelihoods of the less likely genotypes and make a Phred-like score?

Cheers

 

 

ADD COMMENTlink modified 6.6 years ago by Gabriel R.2.8k • written 6.6 years ago by philip10

Where is the "documentation"? The VCF spec does not require GL to be normalized, and therefore not having 0 is okay.

ADD REPLYlink written 6.6 years ago by lh332k

I was only referring to the documentation (http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41) regarding the log10 scale. I assumed normalization by analogy with PL (Genotype Likelihoods. Thanks for the clarification 

ADD REPLYlink written 6.6 years ago by philip10
1
gravatar for Gabriel R.
6.6 years ago by
Gabriel R.2.8k
Danmarks Tekniske Universitet
Gabriel R.2.8k wrote:

> why is the first value -0.10 instead of 0? 

Without seeing the coverage and quality scores, it is hard to work back the calculation that was done. But If the likelihood of the sample being homozygous reference 10^-0.10 = 0.79 is greater than the heterozygous case (10^-0.69=0.20).  Now, why is it not 0 ? Maybe you have an alternative allele with a low base quality or the bases for the reference are few and have low quality themselves. Again, without seeing the reads per se, I am conjecturing. 

> For my purpose, the GQ value would be great. 

What is your purpose ? The GL should give you all the information you need. When I wish to retain sites that pass a certain quality criterion, I compute the difference between the log likelihoods (essentially, a likelihood ratio) and apply a threshold to it as to how much certainty I am willing to have.

 

ADD COMMENTlink modified 6.6 years ago • written 6.6 years ago by Gabriel R.2.8k

Thanks for your reply. Again, my questions were based on the false assumption the GLs are normalised values. So thanks for clarifying this

Indeed, a likelihood ratio will suffice for my purposes. For instance, transforming the GLs to the actual likelihoods and make the ratio: Likelihood ratio = Likelihood(most likely genotype)/(Likelihood(alternative genotype 1) + Likelihood(alternative genotype 2) for a biallelic variant.

Could you give me an idea of a good treshold value for this ratio? For the given example this ratio would be ~3.9 (= 0.79/(0.20 + 7.93E-05) showing this call isn't very reliable. What can be accepted statistically as a good treshold for this ratio?

Thanks!

ADD REPLYlink modified 6.6 years ago • written 6.6 years ago by philip10

Start from first principles, what would you be willing to tolerate without loosing too many sites ?

ADD REPLYlink written 6.6 years ago by Gabriel R.2.8k

Loosing sites is not an issue, definitely not at first. To start with, I want some idea about threshold value for these likelihood ratios that can be considered as statistically significant (ie a good call).

To give you some context: I'm setting up a 'dummy' genetic database for a clinical research project. I'm using the 1K Genomes project as a proxy for clinically relevant genetic data because we have no 'real' genetic data yet. My work is part of a proof-of-concept validation for a clinical/genetic project where clinical data is compared with genetic data and is coming from different databases (real life situation).

Since a lot of data is still low coverage in the 1000 Genomes Phase 1, I need some threshold (or several) to determine if the genotype calls can be viewed as reliable or not. I suppose in practice, that comes down to a threshold that is generally accepted to yield a good sensitivity/specificity. For instance (and correct me if I'm wrong) base calls with Phred scores higher than 30 are considered to be quite reliable calls. 

In that regard, I'm looking for a meaningful threshold value for these likelihood ratio which can be considered as good genotype calls, in analogy to Phred for base calling. As an example, using the same logic for Q30 (1/1000 that the call is wrong) a likelihood ratio of about 1000:1 can be considered a good call. Does this make any sense? :-)

What would be a good way to go about this? I greatly appreciate the input.

 

Cheers

ADD REPLYlink modified 6.5 years ago • written 6.5 years ago by philip10

A way you could do it is by downloading the phase III data from 1000g. To my dismay, there are no GL nor PL, they just have GT fields.  

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

It's a cope out but it could be a way to go forward with your project

 

ADD REPLYlink modified 6.5 years ago • written 6.5 years ago by Gabriel R.2.8k

This is a pre-release. I guess the final release will include GL/PL.

ADD REPLYlink written 6.5 years ago by lh332k

They released GL data for Phase III: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/genotype_likelihoods/

ADD REPLYlink written 4.8 years ago by vitor.aguiar10

The genotypes are imputed. Imputation adds a lot of power.

ADD REPLYlink written 6.5 years ago by lh332k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1147 users visited in the last hour