Question

Genotype Likelihoods

6

10.8 years ago

Varun Gupta ★ 1.3k

Hi Everyone

I am starting to look for SNPs in my dataset, and while reading online I came across the term "genotype likelihoods". Can you explain what it means?

Thanks

V

genotype • 21k views

ADD COMMENT • link updated 10.8 years ago by stolarek.ir ▴ 700 • written 10.8 years ago by Varun Gupta ★ 1.3k

score 5 · Answer 1 · 2013-07-11

5

10.8 years ago

stolarek.ir ▴ 700

@Dan Gaston

However, there are situations in VCF output where the most probable genotype differs from the one that is reported. For example:

Take a VCF file that looks like this:

CHROM POS ID Ref Alt Filter GT:AD:DP:GQ:PL

chr1 845668 . C T [CLIPPED] 0/1:1,3:4:25.92:103,0,26

Let's focus on the GT and PL fields:

0/1, 103,0,26

GT is given as 0/1, i.e. heterozygous; however, the PL field reports 1/1 as the most probable genotype [value 26 = ~25% chance that this is the correct one]. The other genotypes from the PL field: 0/1 [0 probability], 0/0 [value 103, also a very small probability]. In this case, however, the reported GT of 0/1 comes from the AD and DP fields, which show 1,3:4, meaning 4 reads span this position, of which 3 support the ALT allele and 1 supports the REF allele. So one has to be careful when calling GT: the most probable genotype is encoded in the PL field (even though it is not always given in the VCF).

ADD COMMENT • link 10.8 years ago by stolarek.ir ▴ 700

16

That's not a correct interpretation of the PL field. PL values are phred-scaled likelihood scores, normalized such that the most likely genotype will have a score of 0. So the approximate likelihoods are 10^(-PL/10). In this case, for PL values of 103,0,26 the likelihoods would be

0/0: 10^(-10.3) approximately 5.0E-11

0/1: 10^0 approximately 1

1/1: 10^(-2.6) approximately 0.0025

So the heterozygous case is the most likely and is indicated as such by the PL values.
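
The conversion above can be checked with a few lines of Python. This is only a minimal sketch: the function name `pl_to_likelihoods` is made up for illustration, and the genotype order 0/0, 0/1, 1/1 assumes a biallelic site.

```python
def pl_to_likelihoods(pl_values):
    """Convert Phred-scaled PL values to relative (unnormalized) likelihoods."""
    return [10 ** (-pl / 10) for pl in pl_values]


# Assumed genotype order for a biallelic site: 0/0, 0/1, 1/1
genotypes = ["0/0", "0/1", "1/1"]
pl = [103, 0, 26]  # PL values from the record discussed above

likelihoods = pl_to_likelihoods(pl)

# The genotype with PL = 0 has the highest relative likelihood
best = genotypes[likelihoods.index(max(likelihoods))]

for gt, score, lk in zip(genotypes, pl, likelihoods):
    print(f"{gt}: PL={score} relative likelihood ~ {lk:.2g}")
print("most likely genotype:", best)
```

Since PL is normalized so that the best genotype scores 0, picking the smallest PL and picking the largest converted likelihood give the same answer.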

ADD REPLY • link 10.8 years ago by Bpow ▴ 280

0

Hmm, thanks for this. Just yesterday I was reading a GATK page that had a mistake on it (it really confused me).

ADD REPLY • link 10.8 years ago by stolarek.ir ▴ 700

0

Worth noting, though, that the most LIKELY genotype is not always the called one: the call is made after the genotype PROBABILITY is calculated. Note, for instance, the first line here:

GT:PL:DP:DV:GP:GQ 0/1:26,3,0:1:1:23,1,5:5

GT:PL:DP:DV:GP:GQ 0/1:27,0,35:4:2:23,0,44:23

GT:PL:DP:DV:GP:GQ 0/1:27,3,0:1:1:27,2,3:3

GT:PL:DP:DV:GP:GQ 0/1:28,3,0:1:1:26,1,4:4

The PL field (26,3,0) suggests 1/1 as the most likely genotype. The GP field (23,1,5) shows 0/1 is the most probable: so 0/1 is called. Such cases seem to almost always occur with really low read depths and genotype qualities.
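
To compare the two fields across all four lines, here is a rough Python sketch. It assumes that both PL and GP are Phred-scaled in this file (so the lowest value marks the preferred genotype) and that the site is biallelic; the helper names are made up for illustration.

```python
GENOTYPES = ["0/0", "0/1", "1/1"]  # assumed order for a biallelic site
FORMAT = "GT:PL:DP:DV:GP:GQ"

# Sample columns copied from the four lines above
samples = [
    "0/1:26,3,0:1:1:23,1,5:5",
    "0/1:27,0,35:4:2:23,0,44:23",
    "0/1:27,3,0:1:1:27,2,3:3",
    "0/1:28,3,0:1:1:26,1,4:4",
]


def parse_sample(fmt, sample):
    """Pair each FORMAT key with its value from the sample column."""
    return dict(zip(fmt.split(":"), sample.split(":")))


def best_genotype(phred_scores):
    """Return the genotype with the lowest Phred-scaled score."""
    return GENOTYPES[phred_scores.index(min(phred_scores))]


results = []
for sample in samples:
    fields = parse_sample(FORMAT, sample)
    pl = [int(x) for x in fields["PL"].split(",")]
    gp = [int(x) for x in fields["GP"].split(",")]
    results.append((fields["GT"], best_genotype(pl), best_genotype(gp)))

for gt, by_pl, by_gp in results:
    print(f"GT={gt} best by PL={by_pl} best by GP={by_gp}")
```

Under those assumptions, the called GT follows GP in every line, while PL on its own points to 1/1 in the three lines that have a read depth of 1.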

ADD REPLY • link 5.2 years ago by mtrw85 ▴ 10

0

So do you know why this is happening?

ADD REPLY • link 4.1 years ago by etay.rot ▴ 30

score 1 · Answer 2 · 2013-07-11

1

10.8 years ago

stolarek.ir ▴ 700

Here is an explanation in plain language:

http://www.researchgate.net/publication/51141498_Genotype_and_SNP_calling_from_next-generation_sequencing_data/file/504635154be72509e0.pdf

ADD COMMENT • link 10.8 years ago by stolarek.ir ▴ 700

1

Here is an updated link that works:

https://www.researchgate.net/publication/51141498_Genotype_and_SNP_calling_from_next-generation_sequencing_data_Nat_Rev_Genet

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3593722/

ADD REPLY • link 8.2 years ago by ariel ▴ 250

score 0 · Answer 3 · 2013-07-11

Definitely read the link that stolarek.ir posted. As a very short and oversimplified answer: most genotype callers use some sort of probabilistic model to determine whether a position matches the reference assembly or carries one or more variant alleles. They also typically have different models for SNPs versus indels. Most of these models are Bayesian. In plain language, the genotype likelihood is the probability of the observed data (the nucleotides at that position from the aligned reads that pass some filter(s)) given a specific genotype. A Bayesian caller combines these likelihoods with prior probabilities to get a posterior probability for each genotype, and the genotype with the highest probability is picked as the called genotype.
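
As a toy illustration of how a Bayesian caller might turn likelihoods into a call, the sketch below applies Bayes' rule to the PL values 26,3,0 quoted earlier in this thread. The prior is purely hypothetical (Hardy-Weinberg proportions with an assumed alternate allele frequency of 0.1); real callers use their own priors and models, so treat this as a sketch rather than any particular tool's method.

```python
GENOTYPES = ["0/0", "0/1", "1/1"]  # assumed order for a biallelic site


def posteriors(likelihoods, priors):
    """Bayes' rule: posterior is proportional to likelihood times prior."""
    joint = [lk * pr for lk, pr in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]


# Relative likelihoods P(data | genotype) recovered from Phred-scaled PL
pl = [26, 3, 0]
likelihoods = [10 ** (-x / 10) for x in pl]

# Hypothetical prior: Hardy-Weinberg with an assumed alt allele frequency
alt_freq = 0.1
ref_freq = 1 - alt_freq
priors = [ref_freq ** 2, 2 * ref_freq * alt_freq, alt_freq ** 2]

post = posteriors(likelihoods, priors)

ml_call = GENOTYPES[likelihoods.index(max(likelihoods))]  # likelihood only
map_call = GENOTYPES[post.index(max(post))]               # likelihood x prior

print("maximum-likelihood genotype:", ml_call)
print("maximum-posterior genotype:", map_call)
```

With this particular prior, the likelihoods alone favour 1/1 but the posterior favours 0/1, because the data are weak and the prior makes a homozygous alternate genotype much rarer than a heterozygous one.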