Question: Estimating Probability Of Differing Allele Frequencies From Pooled Samples
Ketil (Germany) wrote, 6.2 years ago:

Given genotype information (SNP allele frequencies) from different pooled populations, how can I calculate the probability (or significance, or some confidence value) of the populations having different allele frequencies? I'm not terribly interested (and it's easy to do anyway) in estimating the actual allele frequencies, just the probability of them differing.

E.g. if I have alleles A and C, with observed counts of 9/3 and 6/1, I can estimate the minor allele frequencies as 0.25 (3 of 12) and 0.14 (1 of 7), respectively, but is the difference significant, and what is the p-value?

Or more generally: given two sets of samples from binomial distributions with success probabilities p1 and p2, how can I calculate the probability of p1=p2?

It seems this would be a rather fundamental task, but I haven't found any good source on how to do this, and my statistics-fu seems to have rusted...

Tags: snp statistics

matted (Boston, United States) wrote, 6.2 years ago:

Since allele frequencies and pooling mean different things in different contexts, I'll ask you to clarify which world you're in. Here are the choices as I see them:

1. Allele frequencies from confident genotypes on individuals (e.g. SNP arrays or deep sequencing with confident SNP calls)
2. Allele frequencies from sequence reads on individuals (e.g. N barcodes total for a population of N individuals)
3. Allele frequencies from reads pooled by population (e.g. 2 barcodes total for 2 populations of N individuals)

1. This is really simple (e.g. the basic test for GWAS). The most common thing to do is Fisher's exact test or a chi-squared test. You could also compute a likelihood ratio or do a full Bayesian analysis.
2. This is a bit harder because you should account for the uncertainty from the finite number of sequencing reads. `bcftools` has a nice tool to handle this automatically: contrast calling between groups of samples. See the bcftools docs and option `-1`. You can do permutation testing with `-U`. I think the details are in one of Heng's recent Bioinformatics papers.
3. This is the hardest of the three. You could try `bcftools` pair-calling mode, which calculates the likelihood that the genotypes are different in two samples (`-T pair`). It isn't meant exactly for this though, so it's not clear how much to trust the results. You could try out freebayes, which I think says it handles these cases (I haven't used it though). I personally would construct a likelihood ratio test accounting for the number of reads of each allele, testing the hypothesis of having the same allele frequency versus having two different allele frequencies.
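
For case 1, a minimal sketch of a two-sided Fisher's exact test in plain Python might look like the following. The function name `fisher_exact_two_sided` is made up for illustration, and the example table reads the question's 9/3 and 6/1 as A/C allele counts in the two populations, which is an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are populations, columns are allele counts (an assumed layout).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Population 1: 9 A, 3 C; population 2: 6 A, 1 C (assumed reading of the question).
p = fisher_exact_two_sided(9, 3, 6, 1)
print(f"p = {p:.3f}")
```

With counts this small the observed table is the most likely one under the null, so the test gives no evidence of a difference. Note that this treats every count as an independently genotyped chromosome, which fits case 1 but not pooled reads.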

Yes, I have #3, that is, moderate to deep sequencing of pools of individuals from separate populations. I've used bcftools, but it seems to be geared towards having two genotypes, and the documentation is a bit impenetrable. I can try -T pair, but I think it's better if I just write my own solution for this - that way it will be comprehensible to me.

Akamai seems to be struggling, so Boston College and github are slow as molasses right now, but I'll check out FreeBayes when they get back in order. From a cursory glance, it looks like just another SNP caller, with a zillion knobs and dials that need manual adjustment - I'm not sure it will be any help.

When I have a few good variant candidates that are likely to be different, I can go back and do proper individual based genotyping, as in your #1.

Yeah, the freebayes option I remembered is:

"To run on pooled data, set the --pooled flag (which turns off the prior component derived from the probability of a specific distribution of heterozygotes and homozygotes given the allele frequency), and set --ploidy to the number of total copies of the genome in each pooled sample."

But like I said, I've never used it, so I can't vouch for it firsthand. I'd be curious to know if you try it out and it works.

Regarding option 3, could you explain how you would construct the likelihood ratio test (or provide a source with more information)?
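
One possible construction, offered as a sketch and not necessarily the one meant above: model the minor-allele read count in each pool as binomial, compare the maximised likelihood with a single shared frequency against the maximised likelihood with two separate frequencies, and refer twice the log-likelihood difference to a chi-squared distribution with one degree of freedom. The names `binom_loglik` and `lrt_equal_frequency` are hypothetical, and treating each read as an independent draw is an assumption that ignores the sampling of individuals into the pool.

```python
from math import erfc, log, sqrt


def binom_loglik(k, n, p):
    """Binomial log-likelihood of k successes in n trials.

    The binomial coefficient is dropped because it cancels in the ratio.
    """
    ll = 0.0
    if k > 0:
        ll += k * log(p)
    if n - k > 0:
        ll += (n - k) * log(1 - p)
    return ll


def lrt_equal_frequency(k1, n1, k2, n2):
    """Likelihood ratio test of H0: p1 == p2 against H1: p1 != p2.

    k1, k2 are minor-allele read counts; n1, n2 are total read counts.
    Returns (test statistic, approximate p-value).
    """
    p1, p2 = k1 / n1, k2 / n2           # separate maximum-likelihood estimates
    p0 = (k1 + k2) / (n1 + n2)          # shared estimate under the null
    ll_alt = binom_loglik(k1, n1, p1) + binom_loglik(k2, n2, p2)
    ll_null = binom_loglik(k1, n1, p0) + binom_loglik(k2, n2, p0)
    stat = max(0.0, 2.0 * (ll_alt - ll_null))
    # Survival function of chi-squared with 1 df: P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, erfc(sqrt(stat / 2.0))


# 3 of 12 reads versus 1 of 7 reads carry the minor allele (assumed example counts).
stat, p_value = lrt_equal_frequency(3, 12, 1, 7)
print(f"G = {stat:.3f}, p = {p_value:.3f}")
```

The chi-squared approximation is rough at counts this low, so for small tables an exact or permutation-based p-value is probably the safer choice.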

Ketil (Germany) wrote, 6.2 years ago:

Okay, since nobody else did the obligatory Wikipedia lookup, here it is. This page shows how to estimate a confidence interval around the estimated binomial p: http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval Pick your poison (or should that be Poisson?).

One way to do this would then be to calculate the confidence intervals for p1 and p2, and if they do not overlap, consider them to come from different underlying distributions. This is a conservative test, meaning that you can have overlapping confidence intervals even when the distributions are significantly different. See http://www.cscu.cornell.edu/news/statnews/stnews73.pdf for details.
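
As a rough sketch of that overlap check, here is one of the intervals from the Wikipedia page (the Wilson score interval) in plain Python. The function names, the choice of the Wilson interval, and z = 1.96 for an approximate 95% interval are all assumptions made for illustration.

```python
from math import sqrt


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion (k successes in n trials)."""
    p_hat = k / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def intervals_overlap(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# 3 of 12 versus 1 of 7 minor-allele observations (assumed example counts).
ci1 = wilson_interval(3, 12)
ci2 = wilson_interval(1, 7)
print(ci1, ci2, "overlap" if intervals_overlap(ci1, ci2) else "no overlap")
```

Because the check is conservative, non-overlap is a strong signal, while overlap alone does not show that the frequencies are equal.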

good one: "Pick your Poisson!" I am going to use that line from now on!

Well, I'm off for the weekend, but I brought with me an mpileup file, and some docs, and will see if I can't write a tool to calculate this on the plane... no promises, though. :-)

Okay, done! I've stumbled into some problems with

1. coverage higher than the number of haplotypes leads to overestimated confidence
2. collapsed repeats confound the results (although there is still evidence of variance)
3. not sure how to deal with indels

And probably some more.
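
On the first problem, one way to see the effect is a two-stage sampling model: the pool first samples C haplotypes from the population, then R reads are drawn from the pool. Assuming every individual contributes equally, reads are drawn with replacement, and there are no sequencing errors (all assumptions), the variance of the observed frequency is p(1-p)(R + C - 1)/(RC), so the reads carry about as much information as RC/(R + C - 1) independent draws. The function name below is made up for this sketch.

```python
def effective_sample_size(reads, haplotypes):
    """Approximate number of independent draws behind a pooled allele frequency.

    Assumes binomial sampling of `haplotypes` from the population followed by
    binomial sampling of `reads` from the pool, with equal contribution per
    individual and no sequencing error.
    """
    return reads * haplotypes / (reads + haplotypes - 1)


# 200x coverage of a pool of 20 diploid individuals (40 haplotypes).
print(effective_sample_size(200, 40))
```

However deep the coverage, the effective size never exceeds the number of haplotypes in the pool, so plugging raw read counts into a test overstates confidence. One hedged option is to use the effective size in place of the read depth when testing.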