Question: Estimating Probability Of Differing Allele Frequencies From Pooled Samples
6.0 years ago by
Ketil3.9k wrote:

Given genotype information (SNP allele frequencies) from different pooled populations, how can I calculate the probability (or significance, or some confidence value) of the populations having different allele frequencies? I'm not terribly interested (and it's easy to do anyway) in estimating the actual allele frequencies, just the probability of it differing.

E.g. if I have alleles A and C, with observed counts of 9/3 and 6/1, I can estimate the minor allele frequency as 0.25 (3/12) and 0.14 (1/7), respectively, but is the difference significant, and what is the p-value?

Or more generally: given two sets of samples from binomial distributions with success probabilities p1 and p2, how can I calculate the probability of p1=p2?

It seems this would be a rather fundamental task, but I haven't found any good source on how to do this, and my statistics-fu seems to have rusted...

snp statistics • 5.3k views
ADD COMMENTlink modified 6.0 years ago by matted6.9k • written 6.0 years ago by Ketil3.9k
6.0 years ago by
Boston, United States
matted6.9k wrote:

Since allele frequencies and pooling mean different things in different contexts, I'll ask you to clarify which world you're in. Here are the choices as I see them:

  1. Allele frequencies from confident genotypes on individuals (e.g. SNP arrays or deep sequencing with confident SNP calls)
  2. Allele frequencies from sequence reads on individuals (e.g. N barcodes total for a population of N individuals)
  3. Allele frequencies from reads pooled by population (e.g. 2 barcodes total for 2 populations of N individuals)

The answers differ based on your world:

  1. This is really simple (e.g. the basic test for GWAS). The most common thing to do is Fisher's exact test or a chi-squared test. You could also compute a likelihood ratio or do a full Bayesian analysis.
  2. This is a bit harder because you should account for the uncertainty from the finite number of sequencing reads. bcftools has a nice tool to handle this automatically: contrast calling between groups of samples. See the bcftools docs and option -1. You can do permutation testing with -U. I think the details are in one of Heng's recent Bioinformatics papers.
  3. This is the hardest of the three. You could try bcftools pair-calling mode, which calculates the likelihood that the genotypes are different in two samples (-T pair). It isn't meant exactly for this though, so it's not clear how much to trust the results. You could try out freebayes, which I think says it handles these cases (I haven't used it though). I personally would construct a likelihood ratio test on the number of reads of each allele, comparing the hypothesis that both pools share one allele frequency against the hypothesis that they have two different allele frequencies.
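
To make options 1 and 3 concrete, here is a minimal sketch in plain Python (standard library only). It is not taken from bcftools or freebayes. The function names are made up for illustration, and the likelihood ratio test assumes every read is an independent binomial draw, which is optimistic for pooled data when coverage exceeds the number of haplotypes in the pool. The p-value relies on the usual asymptotic chi-squared approximation with 1 degree of freedom, which is rough for small counts.

```python
from math import comb, erfc, log, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


def _xlogy(x, y):
    # Convention 0 * log(0) = 0, so boundary frequencies do not blow up.
    return 0.0 if x == 0 else x * log(y)


def binom_loglik(k, n, p):
    """Binomial log-likelihood without the constant binomial coefficient."""
    return _xlogy(k, p) + _xlogy(n - k, 1 - p)


def allele_freq_lrt(k1, n1, k2, n2):
    """Likelihood ratio test of 'one shared frequency' vs 'two frequencies'.

    k1, k2: read counts of one allele; n1, n2: total reads per pool.
    ASSUMPTION: reads are independent binomial draws.
    Returns (G statistic, approximate p-value).
    """
    p0 = (k1 + k2) / (n1 + n2)          # MLE under the null (shared p)
    ll0 = binom_loglik(k1, n1, p0) + binom_loglik(k2, n2, p0)
    ll1 = binom_loglik(k1, n1, k1 / n1) + binom_loglik(k2, n2, k2 / n2)
    g = max(0.0, 2.0 * (ll1 - ll0))     # clamp tiny negative rounding error
    # Survival function of chi-squared with 1 degree of freedom.
    return g, erfc(sqrt(g / 2.0))


if __name__ == "__main__":
    # Counts from the question: pool 1 has 9 A / 3 C, pool 2 has 6 A / 1 C.
    print("Fisher p:", fisher_exact_two_sided(9, 3, 6, 1))
    print("LRT (G, p):", allele_freq_lrt(3, 12, 1, 7))
```

With counts as small as those in the question, neither test will come anywhere near significance. The likelihood ratio version mainly becomes useful at higher read depth, and it is easy to extend with an error model if needed.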
ADD COMMENTlink written 6.0 years ago by matted6.9k

Yes, I have #3, that is, moderate to deep sequencing of pools of individuals from separate populations. I've used bcftools, but it seems to be geared towards having two genotypes, and the documentation is a bit impenetrable. I can try -T pair, but I think it's better if I just write my own solution for this - that way it will be comprehensible to me.

Akamai seems to be struggling, so Boston College and github are slow as molasses right now, but I'll check out FreeBayes when they get back in order. From a cursory glance, it looks like just another SNP caller, with a zillion knobs and dials that need manual adjustment, so I'm not sure it will be any help.

When I have a few good variant candidates that are likely to be different, I can go back and do proper individual based genotyping, as in your #1.

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by Ketil3.9k

Yeah, the freebayes option I remembered is:

"To run on pooled data, set the --pooled flag (which turns off the prior component derived from the probability of a specific distribution of heterozygotes and homozygotes given the allele frequency), and set --ploidy to the number of total copies of the genome in each pooled sample."

But like I said, I've never used it, so I can't vouch for it firsthand. I'd be curious to know if you try it out and it works.

ADD REPLYlink written 6.0 years ago by matted6.9k

With regard to option 3, could you explain how you would construct the likelihood ratio test (or provide a source with more information)?

ADD REPLYlink written 18 months ago by linnaean0
6.0 years ago by
Ketil3.9k wrote:

Okay, since nobody else did the obligatory WP lookup: this shows how to estimate a confidence interval around the estimated binomial p. Pick your poison (or should that be Poisson?).

One way to do this would then be to calculate the confidence intervals for p1 and p2, and if they do not overlap, consider them to come from different underlying distributions. This is a conservative test, meaning that you can have overlapping confidence intervals even when the distributions are significantly different. See for details.
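
As a rough illustration of the overlap check, here is a sketch using the Wilson score interval. That is just one of the several binomial interval constructions available, chosen here as an assumption rather than the one necessarily used above. The function names are hypothetical, and z=1.96 corresponds to an approximate 95% interval.

```python
from math import sqrt


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n.

    ASSUMPTION: z=1.96 gives an approximate 95% interval.
    """
    p_hat = k / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def intervals_overlap(a, b):
    """True if the closed intervals a=(lo, hi) and b=(lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Minor allele counts from the question: 3 of 12 reads and 1 of 7 reads.
    ci1 = wilson_interval(3, 12)
    ci2 = wilson_interval(1, 7)
    print(ci1, ci2, "overlap:", intervals_overlap(ci1, ci2))
```

Non-overlapping intervals suggest a difference, but overlapping ones do not rule it out, which is the conservatism mentioned above.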

ADD COMMENTlink written 6.0 years ago by Ketil3.9k

good one: "Pick your Poisson!" I am going to use that line from now on!

ADD REPLYlink written 6.0 years ago by Istvan Albert ♦♦ 77k

Well, I'm off for the weekend, but I brought with me an mpileup file, and some docs, and will see if I can't write a tool to calculate this on the plane... no promises, though. :-)

ADD REPLYlink written 6.0 years ago by Ketil3.9k

Okay, done! I've stumbled into some problems with

  1. higher coverage than the number of haplotypes leads to overestimating confidence
  2. collapsed repeats confound the results (although there is still evidence of variance)
  3. not sure how to deal with indels

And probably some more.

ADD REPLYlink written 6.0 years ago by Ketil3.9k