Question

Criteria for excluding genes from Fisher tests (presence of mutation vs sample group). Minimum number of samples mutated?

1

7.1 years ago

correlationmatrix ▴ 20

I would like to test, for a number of different genes, whether a mutation in that gene is significantly associated with a particular group of samples. For each gene I will perform a Fisher test comparing the number of samples in group A with any mutation in gene 1 against the number of samples in group B that have a mutation in the same gene, and repeat this for X genes. Each group consists of 11 samples.

However, some of the genes are mutated in very few samples in total, say 1 or 2. In those cases I could never get a significant p-value, regardless of how the mutations were distributed across the two groups. Is it then a good idea to discard these genes from the test in order to reduce the burden of the false discovery rate correction I will need to perform? Or can that be considered "fishing" for significance?

What is a sensible cutoff for the number of mutated samples to require in that case? Using an online Fisher test, I note that one can only get a significant p-value when at least 5 samples are mutated (in the most optimistic scenario, where all mutations belong to one group). Would it then be wise to use a minimum of 5 mutated samples as the criterion for including a gene in the testing? (I'm asking because it is very easy to find excuses when something looks borderline significant after FDR correction...)
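The 5-sample observation can be checked directly. Below is a minimal sketch, assuming two groups of 11 samples and a two-sided Fisher test at an unadjusted threshold of 0.05; the function name `fisher_two_sided` and the variable names are only illustrative. For each total number of mutated samples, it computes the smallest p-value achievable, which occurs when all mutations fall in one group.

```python
from math import comb

def fisher_two_sided(mut_a, mut_b, n_a, n_b):
    """Two-sided Fisher exact p-value for mutated counts in two groups."""
    k = mut_a + mut_b                      # total mutated samples (fixed margin)
    denom = comb(n_a + n_b, k)

    def prob(x):
        # Hypergeometric probability of x mutated samples in group A
        return comb(n_a, x) * comb(n_b, k - x) / denom

    p_obs = prob(mut_a)
    lo, hi = max(0, k - n_b), min(n_a, k)
    # Sum the probabilities of all tables at most as likely as the observed one
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)

GROUP_SIZE = 11  # samples per group, as stated in the question

# Best case for each total count: every mutated sample is in group A
min_p = {k: fisher_two_sided(k, 0, GROUP_SIZE, GROUP_SIZE)
         for k in range(1, GROUP_SIZE + 1)}

smallest_testable = min(k for k, p in min_p.items() if p < 0.05)

for k, p in min_p.items():
    print(f"{k:2d} mutated samples: minimum achievable p = {p:.4f}")
print("smallest total that can reach p < 0.05:", smallest_testable)
```

With these group sizes, 4 mutated samples give a minimum p-value of about 0.09 and 5 give about 0.035, so genes with fewer than 5 mutated samples cannot reach p < 0.05 even before any FDR correction.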

Fisher mutation samples • 1.6k views

updated 7.1 years ago by theobroma22 ★ 1.2k • written 7.1 years ago by correlationmatrix ▴ 20

0

The statistical principle you are looking for is "independent filtering". If you google "independent filtering gwas" you will get some ideas.

The key element is that your filtering should be performed on a metric that is independent of the test statistic under the null hypothesis.
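As a rough illustration of that idea, here is a sketch that filters on the total number of mutated samples per gene (which ignores how the mutations split between the groups) before applying a Benjamini-Hochberg correction. The gene names and counts are hypothetical toy values, the threshold of 5 is the cutoff proposed in the question, and the helper names are made up for this example.

```python
from math import comb

GROUP_SIZE = 11   # samples per group, as in the question
MIN_MUTATED = 5   # filter threshold on the total count (cutoff proposed in the question)

def fisher_two_sided(mut_a, mut_b, n_a=GROUP_SIZE, n_b=GROUP_SIZE):
    """Two-sided Fisher exact p-value for mutated counts in two groups."""
    k = mut_a + mut_b
    denom = comb(n_a + n_b, k)
    probs = [comb(n_a, x) * comb(n_b, k - x) / denom
             for x in range(max(0, k - n_b), min(n_a, k) + 1)]
    p_obs = comb(n_a, mut_a) * comb(n_b, mut_b) / denom
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-7)))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):           # walk from largest to smallest p
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical toy data: gene -> (mutated in group A, mutated in group B)
counts = {"geneA": (7, 0), "geneB": (1, 0), "geneC": (1, 1),
          "geneD": (3, 3), "geneE": (0, 2)}

def adjusted_pvalues(counts, min_mutated=0):
    # The filter uses only the total count, never the A/B split
    kept = {g: c for g, c in counts.items() if sum(c) >= min_mutated}
    pvals = [fisher_two_sided(a, b) for a, b in kept.values()]
    return dict(zip(kept, benjamini_hochberg(pvals)))

unfiltered = adjusted_pvalues(counts)
filtered = adjusted_pvalues(counts, MIN_MUTATED)
print("unfiltered:", unfiltered)
print("filtered:  ", filtered)
```

The filter never looks at which group the mutations fall in, only at how many samples are mutated overall, and the threshold would need to be fixed before looking at any p-values. In this toy example the adjusted p-value for the one strongly skewed gene is smaller after filtering, because fewer tests enter the correction.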

7.1 years ago by WouterDeCoster 47k

score 0 · Answer 1 · 2017-03-24

Never discard! You first have to establish whether you're using methods that control for outliers, and what influence those outliers have on the outcome. You hit the nail on the head regarding the statistics, so now you can biologically validate the first few true negatives in the group. I guess you should also validate the one and only true positive. This will tell you whether your data model shows any 'falseness' under that statistic. The most commonly used (and arbitrary) cutoff for a p-value is 5%. This also depends on the data dimensions, though: what's the size of your data matrix? FYI: Fisher's test is quite robust! You could also run an ANOVA on your data, right? Then you get stars. :)