Question

Genotype Likelihood Calculation In Samtools Is Wrong?

7

Entering edit mode

12.6 years ago

Marvin ▴ 890

I recently tried to apply a simplified version of the equations from Chapter 3.1.3 of http://www.broadinstitute.org/gsa/wiki/images/1/10/Samtools.pdf (modelling sequencing errors/ practical calculation), but the equations don't make sense to me:

Either we're calculating the likelihood of a sequence of observations, in that case P(D|1) should simply be (2^-n), or we're aggregating the data before into k reference alleles and (n-k) alternate alleles, in which case P(D|0) and P(D|2) should also contain the term (n over k). GATK seems to agree with my view: http://www.broadinstitute.org/gsa/wiki/index.php/File:UGgenotypeLikelhoods.jpg.

Intuitively, the samtools method seems to make sense, too. But GATK and samtools can give wildly different results (not just from the different error model, this equation alone can make a difference of many orders of magnitude!), so they cannot both be right. My gut feeling is that samtools would give lots of false positive heterozygous calls. And when I applied the equations to estimate the fraction of heterozygous sites in a sample, I got a gross overstimate.

Has anyone else noticed this? What are you using to call heterozygotes and why? Or am I simply confused?

To clarify further: P(D|0) and P(D|2) express the probability of seeing a specific sequence of n alleles of two kinds, but P(D|1) expresses the probability of seeing a specific set of k alleles of one kind and (n-k) alleles of another kind. Either works, but not in combination.

If something like this is the reasoning behind the equations, they are wrong. And if not, what is the reasoning, because I can't figure it out.

samtools genotyping • 13k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 12.6 years ago by Marvin ▴ 890

2

Entering edit mode

The error modeling is omitted due to the page limit, so this question is not answered in that paper.

ADD REPLY • link 12.6 years ago by lh3 33k

1

Entering edit mode

If you have not seen it, this paper by Heng Li gives additional/updated details on the samtools model http://bioinformatics.oxfordjournals.org/content/early/2011/09/08/bioinformatics.btr509.short

ADD REPLY • link 12.6 years ago by Jts ★ 1.4k

0

Entering edit mode

If you have not seen it, you might be interested in this paper by Heng Li with more/updated details on the samtools model: Genotype Likelihood Calculation In Samtools Is Wrong?

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 12.6 years ago by Jts ★ 1.4k

0

Entering edit mode

As I misunderstood your question... The paper jts refers to gives the likelihood computation in mutli-ploidy cases. That formula is reduced to the GATK one for diploid samples. Nonetheless, that formula is not used in samtools.

ADD REPLY • link 12.6 years ago by lh3 33k

0

Entering edit mode

Hi Marvin,

Could you please send me the articles in this link and thisone? I can't open the website right now. My email address is devinliao0918@gmail.com. Thanks

By the way, did you ever try to implement the formula to calculate the genotype likelihood by yourself?

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by devinliao0918 ▴ 40

score 7 · Answer 1 · 2011-09-27

7

Entering edit mode

12.6 years ago

lh3 33k

The fundamental assumption behind the GATK model is that errors are independent, which is frequently not true in practice. Samtools tries to model the error dependency. If you change all f_k to 1, the samtools model is not far from the GATK model. On real data, the two models differ little given shallow coverage. They do differ a lot given extremely high coverage (like hundreds folds of coverage), but for single-sample SNP calling, we do not need very high phred-scaled genotype likelihoods. The difference has minor noticeable effect on SNP calling. And actually in this case, the samtools model is closer to the truth.

Also, I guess your implementation of the samtools error model is wrong. The samtools model tends to be conservative - it tends to call fewer heterozygotes than the independent model. At last, this model is originated from MAQ and has been used hundreds of times in publications. Even from this fact alone, you should not simply conclude that this model is wrong.

EDIT: Sorry, I misunderstood your question. I thought you were asking a difficult question, but it turns out that you are raising an easier one.

Okay, let me do the deduction. Alpha contains the binomial coefficient. The product of betas is a partial sum of alpha. Thus P(D|0) and P(D|2) have the coefficient. The GATK model does not apply the sum, but in practice, this leads to little difference.

ADD COMMENT • link 12.6 years ago by lh3 33k

1

Entering edit mode

Thanks for the long answer, but it answers a different question. I'm not really concerned with the error model, but with the calculation of P(D|1) specifically. Samtools has (n over k) in that equation, GATK doesn't.

ADD REPLY • link 12.6 years ago by Marvin ▴ 890

0

Entering edit mode

Now I got my point across, it seems. But P(D|0) and P(D|2) don't have a binomial coefficient! They should. (Or is it different in the code than in the documentation?)

ADD REPLY • link 12.6 years ago by Marvin ▴ 890

0

Entering edit mode

They include. Do a little deduction you will find.

ADD REPLY • link 12.6 years ago by lh3 33k

0

Entering edit mode

Okay, that makes sense now. I can't really "see" it (yet), but I can believe it's there. That also means I misunderstood the meaning of gamma and beta.

ADD REPLY • link 12.6 years ago by Marvin ▴ 890