I'm curious: are you trying to implement a statistical test yourself on actual data? It's better to use an established method (e.g. 20/20+, MutSigCV, OncodriveFML). If you calculate the background mutation rate (BMR) accurately, you should see roughly half of the genes on either side of the BMR, because most genes are passenger genes in cancer. However, if you have only 25 samples, then based on typical mutation rates for cancer most genes won't have even a single somatic mutation (ballpark of ~100 somatic mutations per sample, though cancers such as melanoma will have many more).
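As a rough back-of-the-envelope check of that last point, here is a minimal sketch. It assumes about 20,000 protein-coding genes and that every mutation lands in exactly one gene, uniformly at random. Both are my simplifying assumptions, not something from your data, and real mutation rates are far from uniform.

```python
import math

n_samples = 25              # cohort size from the question
mutations_per_sample = 100  # ballpark figure quoted above
n_genes = 20_000            # assumption: rough count of protein-coding genes

# Expected number of somatic mutations per gene across the whole cohort,
# under the (unrealistic) assumption that mutations hit genes uniformly.
expected_per_gene = n_samples * mutations_per_sample / n_genes

# Poisson approximation: probability that a given gene has zero mutations.
frac_unmutated = math.exp(-expected_per_gene)

print(f"expected mutations per gene: {expected_per_gene:.3f}")
print(f"approx. fraction of genes with no mutation: {frac_unmutated:.2f}")
```

Even under these generous assumptions, the large majority of genes end up with no mutation at all in a cohort of 25.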
You have a few problems with your setup. First, the null hypothesis would be P=Pg. The second, larger issue is how you estimate Pg and the assumption you make by using a binomial model. The binomial model assumes a constant background mutation rate. In reality, the background mutation rate varies at several levels: between different patients' cancers, between locations in the genome (and consequently between genes), with the length of the gene, and between nucleotide contexts (e.g. C->T at a CpG site). All of these factors have led to problems in statistical tests based on a binomial model (PMID: 27911828, 23770567). If you do not model these factors, you would be substantially better off using a beta-binomial model, which accounts for over-dispersion (https://en.wikipedia.org/wiki/Overdispersion ). Also, in your example setup N=25 does not mean you would calculate the p-value as P(X>=25), but rather as P(X>=x), where x is the number of patients who have a (typically non-silent) mutation in the gene. Third, given that you are testing many genes, you would not use a nominal statistical significance level of p=0.05, but rather use something like the False Discovery Rate to control errors (https://en.wikipedia.org/wiki/False_discovery_rate ).
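To make these three points concrete, here is a minimal standard-library sketch of the upper-tail p-value P(X>=x) under a binomial and a beta-binomial null, followed by a Benjamini-Hochberg adjustment. The gene names, the counts, Pg=0.02 and the beta-binomial parameters a and b are all made-up illustration values (chosen so that a/(a+b) equals Pg). This is not how 20/20+, MutSigCV or OncodriveFML work, and it still ignores gene length and nucleotide context.

```python
import math

def binom_sf(x, n, p):
    """Upper tail P(X >= x) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(x, n + 1))

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def betabinom_sf(x, n, a, b):
    """Upper tail P(X >= x) for X ~ BetaBinomial(n, a, b)."""
    total = 0.0
    for k in range(x, n + 1):
        log_pmf = (math.log(math.comb(n, k))
                   + log_beta(k + a, n - k + b) - log_beta(a, b))
        total += math.exp(log_pmf)
    return total

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals

# Hypothetical inputs, for illustration only.
n_patients = 25
pg = 0.02          # assumed per-patient background probability for a gene
a, b = 0.2, 9.8    # assumed over-dispersion parameters, mean a/(a+b) = pg
mutated_counts = {"GENE_A": 0, "GENE_B": 1, "GENE_C": 3, "GENE_D": 6}

genes = list(mutated_counts)
p_binom = [binom_sf(mutated_counts[g], n_patients, pg) for g in genes]
p_bb = [betabinom_sf(mutated_counts[g], n_patients, a, b) for g in genes]
q_bb = benjamini_hochberg(p_bb)

for g, pb, pbb, q in zip(genes, p_binom, p_bb, q_bb):
    print(f"{g}: binomial p={pb:.3g}  beta-binomial p={pbb:.3g}  BH q={q:.3g}")
```

With the same mean rate, the beta-binomial puts more probability in the upper tail, so a gene mutated in a handful of patients looks less surprising than it does under the plain binomial. In a real analysis the adjustment would run over every tested gene, not four.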
I would also like to point out that detecting driver genes from a background mutation rate is not the only way to identify them statistically. I don't have space to go into depth here, but alternative approaches have generally performed better because of the above-mentioned variability in the background mutation rate at multiple levels (PMID: 27911828). Some of these exploit clustering of mutations or high "functional impact" mutations.
Lastly, even a good statistical test usually relies on good-quality somatic mutation calls. If germline mutations or mutation-calling artifacts contaminate your somatic mutation data, they can lead statistical tests to erroneously reject the null hypothesis.
modified 2.1 years ago
2.1 years ago by
Collin • 650