Question

How to calculate p-value from codeml likelihood ratio test results for Benjamini-Hochberg correction

1

Entering edit mode

8.1 years ago

memory_donk ▴ 360

Hi Biostars

I'm doing positive selection analysis on a relatively large set of orthologous genes using PAML's implementation of the branch-site model. I've got the results of my likelihood ratio test (LRT) for each gene, and according to this (1) and other papers, the branch-site model follows a 1:1 mixture of chi^2 and point mass 0 distribution, so the critical values for 5% and 1% are 2.71 and 5.41 respectively.

My problem is that I need to perform a Benjamini-Hochberg correction, and for that I need the specific p-value for each gene, not just to decide whether it passes either critical value.

How would I go about calculating the actual p-value from each LRT result and the number of degrees of freedom? I can write a script in python/perl (maybe R if necessary) if I know how to actually calculate the p-values. Any help would be greatly appreciated!

(1) Statistical properties of the branch-site test of positive selection http://www.ncbi.nlm.nih.gov/pubmed/21087944

statistics PAML codeml LRT • 5.9k views

ADD COMMENT • link 8.1 years ago by memory_donk ▴ 360

score 5 · Answer 1 · 2016-03-31

I apologise for positing this and then answering my own question, I hope I haven't broken a site rule (if so please feel free to delete this post, though I'll leave it up for now in case anyone else runs into a similar problem).

This is a classic case of RTFM, because I haven't read the whole (huge) PAML manual and missed a small blurb where they describe how to calculate p-value for this distribution.

You calculate p value with this distribution by calculating the p value using chi^2 (1 degree of freedom) then divide by 2. Yup. Divide by 2. So if you had an LRT result of 2.71 you'll get ~.10, then divide by 2 to get 0.05 (5%). This distribution actually winds up being less stringent than a normal chi^2. They actually recommend in the manual that you just use chi^2 (1 degree of freedom) rather than the mixture, however the literature that actually uses the branch-site model seems to disagree (maybe for obvious reasons).

Again I apologise for answering my own question, but I had struggled for quite a while trying to answer this with google. Hopefully someone learns from the mistake.