Question: Statistical test for finding driver mutations.
banerjeeshayantan170 wrote:

One approach to identifying driver mutations, i.e. mutations that drive cancer progression, is to look for recurrent mutations. The BMR (background mutation rate) gives the probability of finding a mutation by chance (Pg) at a specific location in the genome. Usually the observed mutation frequency is below the BMR, so those mutations are not that important. Only if the observed frequency is greater than the BMR can we conclude that they are driver mutations. Now suppose an experiment is run on a sample of, say, 25 patients, and the frequency of mutations found there (P) is greater than that expected by chance, i.e. P > Pg. How do I build the hypothesis test to confirm or deny my findings? (Population size = N.)

One way of doing so is the following
Null hypothesis or H0 is P < Pg

Alternative hypothesis or H1 is P > Pg

Let X be a binomial random variable, X ~ B(N, Pg), since there are N independent trials, each with probability of success Pg.
Then P(X >= 25) gives the p-value, which can be calculated from the binomial table. If the value is less than a significance level (say, 0.05), we can conclude that the recurrent mutations observed are not due to chance and are driver mutations; otherwise we cannot conclude that.
Am I correct in stating all of this?

modified 3.5 years ago by Collin850 • written 3.5 years ago by banerjeeshayantan170
Collin850 wrote:

I'm curious, are you trying to implement a statistical test yourself on actual data? It's better to use an already established method (e.g. 20/20+, MutSigCV, OncodriveFML). If you accurately calculate the background mutation rate, you should see roughly half of the genes on either side of the BMR, because most genes are passenger genes for cancer. However, if you have only 25 samples, then based on typical mutation rates for cancer most genes won't even have a somatic mutation (ballpark of ~100 somatic mutations per sample, though cancers such as melanoma will have many more).

You have a few problems with your setup. First, the null hypothesis would be P = Pg. The second, larger issue is how you estimate Pg and the assumption you make when using a binomial model. The binomial model assumes a constant background mutation rate, but the background rate of mutations varies at several levels: between different patients' cancers, between different locations in the genome (and consequently different genes), with the length of the gene, and between different nucleotide contexts (e.g. C->T at a CpG site). All of these factors have led to problems in statistical tests based on a binomial model (PMID: 27911828, 23770567). If you do not model these factors, you would be substantially better off using a beta-binomial model, which accounts for over-dispersion (https://en.wikipedia.org/wiki/Overdispersion). Third, in your example setup N=25 does not mean you would calculate the p-value as P(X>=25), but rather as P(X>=x), where x is the number of patients who have a (typically non-silent) mutation in the gene. Fourth, given that you are testing many genes, you would not use a nominal significance level of p=0.05, but rather something like the False Discovery Rate to control errors (https://en.wikipedia.org/wiki/False_discovery_rate).
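
To make the difference between the two models concrete, here is a minimal sketch of the one-sided tail probability P(X>=x) under a binomial and under a beta-binomial with the same mean. All numbers (Pg = 0.02, x = 3, overdispersion rho = 0.1) are made-up assumptions for illustration, and the helper names are hypothetical; this is not how 20/20+, MutSigCV or OncodriveFML work internally.

```python
from math import comb, exp, lgamma, log


def binom_upper_tail(x, n, p):
    """Return P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


def log_beta(a, b):
    """Log of the Beta function, computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_upper_tail(x, n, mean, rho):
    """Return P(X >= x) for a beta-binomial with the given mean.

    rho in (0, 1) is the overdispersion (intra-class correlation);
    as rho approaches 0 the model approaches the plain binomial.
    """
    a = mean * (1 - rho) / rho
    b = (1 - mean) * (1 - rho) / rho
    total = 0.0
    for k in range(x, n + 1):
        log_pmf = log(comb(n, k)) + log_beta(k + a, n - k + b) - log_beta(a, b)
        total += exp(log_pmf)
    return total


# Hypothetical numbers, chosen only for illustration.
n_patients = 25   # cohort size from the question
pg = 0.02         # assumed background probability that a patient carries a mutation in the gene
x_observed = 3    # assumed number of patients with a non-silent mutation
rho = 0.1         # assumed overdispersion

p_binom = binom_upper_tail(x_observed, n_patients, pg)
p_betabinom = betabinom_upper_tail(x_observed, n_patients, pg, rho)
print(f"binomial      P(X >= {x_observed}) = {p_binom:.4g}")
print(f"beta-binomial P(X >= {x_observed}) = {p_betabinom:.4g}")
```

With these assumed inputs the beta-binomial tail is larger than the binomial tail for the same count, which is the sense in which a plain binomial overstates significance when the background rate is over-dispersed.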

I would also like to point out that detecting driver genes based on a background mutation rate is not the only way to identify them statistically. I don't have space to go into depth here, but alternative approaches have generally performed better because of the above-mentioned problem of multiple levels of variability in the background mutation rate (PMID: 27911828). Some of these exploit clustering of mutations or high "functional impact" mutations.

Lastly, even if you have a good statistical test, it usually relies on good-quality somatic mutation calls. If germline mutations or mutation-calling artifacts contaminate your somatic mutation data, statistical tests could erroneously reject the null hypothesis.

Thank you so much for your detailed and insightful answer. Thanks for pointing out the references to alternative methods. This really helped!

I just have a few questions in mind.

First, if my H0 is P = Pg, then my H1 is P ≠ Pg. That means that under H1 the frequency can be either higher or lower than the BMR. But recurrent mutations are drivers iff their frequency of occurrence > BMR. Am I correct?

The binomial RV X in my example is the number of patients having a non-silent mutation in the gene, right?

As you said, it should be P(X>=x), meaning the probability of observing at least as many patients with a non-silent mutation as were seen in the sample of 25 patients. So x here satisfies 0 <= x <= 25, right?

If your H1 is P ≠ Pg, then you are assuming a two-sided statistical test. You can still have an H1 of P > Pg, which is a one-sided statistical test. A one-sided test is generally used because you are trying to identify cancer drivers. Cancer driver mutations would be advantageous for a cancer cell, so those cells would clonally expand and consequently be detected more often in next-generation sequencing. This means it only makes sense to use a one-sided test in this scenario; otherwise you would lose statistical power. That's not to say that genes with P < Pg would not be interesting, as they could represent genes that are essential in cancer (and thus mutated at lower than expected frequency due to negative selection). However, unless you have a very large number of samples, detecting cancer-essential genes would likely be substantially underpowered.
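
A rough way to see why the lower tail is underpowered: under a binomial model the smallest possible lower-tail p-value is P(X = 0) = (1 - Pg)^N, reached when no patient carries a mutation. The sketch below uses an assumed Pg = 0.02 (a made-up value) and hypothetical helper names, and it ignores multiple-testing correction, which would only make things worse.

```python
from math import ceil, log


def min_lower_tail_pvalue(n, p):
    """Smallest achievable P(X <= x) under Binomial(n, p), i.e. P(X = 0)."""
    return (1 - p) ** n


def samples_needed_for_depletion(p, alpha=0.05):
    """Smallest n for which observing zero mutations gives P(X = 0) <= alpha."""
    return ceil(log(alpha) / log(1 - p))


pg = 0.02  # assumed background probability per patient, for illustration only
print(f"N = 25: best possible lower-tail p-value = {min_lower_tail_pvalue(25, pg):.3f}")
print(f"samples needed to reach p <= 0.05: {samples_needed_for_depletion(pg)}")
```

Under these assumptions, even zero mutated patients out of 25 cannot give a significant depletion signal, whereas a handful of mutated patients can already be significant in the upper tail.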

Yes, x could be between 0 and 25 based on your problem setup. Generally people estimate the background mutation rate using silent mutations, under the assumption that silent mutations can't be driver mutations (thus reflecting the null hypothesis). They then identify significant genes based on non-silent mutations using the estimate of the BMR from silent mutations.
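
As a toy sketch of that workflow: estimate a per-site background rate from silent mutations, convert it to a per-gene probability Pg, test the non-silent counts with a one-sided binomial tail, and adjust across genes with the Benjamini-Hochberg procedure to control the False Discovery Rate. Every number and gene name below is made up, and the single pooled silent rate is a deliberate simplification that ignores the per-patient, per-gene and per-context variability discussed above, so treat it as an illustration rather than a usable method.

```python
from math import comb


def binom_upper_tail(x, n, p):
    """Return P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical cohort-wide silent mutation summary (assumed values).
n_patients = 25
total_silent_mutations = 500
total_silent_sites = 10_000_000
silent_rate = total_silent_mutations / (n_patients * total_silent_sites)

# Hypothetical genes: (number of non-silent sites, patients with >= 1 non-silent mutation).
genes = {
    "GENE_A": (3000, 6),
    "GENE_B": (3000, 1),
    "GENE_C": (1500, 0),
}

names = list(genes)
pvalues = []
for name in names:
    nonsilent_sites, patients_mutated = genes[name]
    # Probability that a patient has at least one non-silent mutation by chance.
    pg = 1 - (1 - silent_rate) ** nonsilent_sites
    pvalues.append(binom_upper_tail(patients_mutated, n_patients, pg))

qvalues = dict(zip(names, benjamini_hochberg(pvalues)))
for name, p in zip(names, pvalues):
    print(f"{name}: p = {p:.3g}, q = {qvalues[name]:.3g}")
```

Genes would then be called significant based on the adjusted q-value rather than the nominal p-value.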