What is the statistical test for comparing the frequencies of variables in multiple groups?
2.5 years ago

I have data that contains the occurrences of genes in different lineages:

Lineage Gene    Gene_function
1   x   regulatory proteins
1   p   cell wall
1   y   conserved hypotheticals
1   x   respiration
1   z   respiration
2   w   cell wall
2   a   cell wall
2   y   regulatory proteins
3   b   respiration
3   x   conserved hypotheticals
3   a   regulatory proteins
3   b   regulatory proteins
3   z   conserved hypotheticals
3   a   respiration


How do I test whether the number of, say, "cell wall" genes differs significantly between the lineages? (I'm thinking of something equivalent to a classic ANOVA, followed by a Tukey test to identify which specific lineages differ.) N.B. each lineage has a different number of rows.

Then repeat this for each of the types of genes.

Is there a simple and quick way to do this in R?

statistics group comparison • 701 views
2.5 years ago
Hugo ▴ 360

I think that you need to use Fisher's exact test of independence (http://www.biostathandbook.com/fishers.html). You would create a contingency matrix with lineages in rows (1, 2, 3, and so on) and gene functions in columns (regulatory proteins, cell wall, etc.). The post hoc test then consists of all possible pairwise comparisons between gene functions, and the resulting p-values need to be corrected for multiple comparisons.
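A minimal sketch of this approach in R, using the example rows from the question. The object names (genes, tab, fn_pairs) and the choice of the Holm correction are assumptions made for illustration; any other method supported by p.adjust() could be substituted.

```r
# Example data copied from the question: one row per gene occurrence.
genes <- data.frame(
  Lineage = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  Gene_function = c(
    "regulatory proteins", "cell wall", "conserved hypotheticals",
    "respiration", "respiration",
    "cell wall", "cell wall", "regulatory proteins",
    "respiration", "conserved hypotheticals", "regulatory proteins",
    "regulatory proteins", "conserved hypotheticals", "respiration"
  )
)

# Contingency table: lineages in rows, gene functions in columns.
tab <- table(genes$Lineage, genes$Gene_function)

# Overall test of independence between lineage and gene function.
overall <- fisher.test(tab)

# Post hoc: one test per pair of gene functions (pairs of columns).
fn_pairs <- combn(colnames(tab), 2, simplify = FALSE)
p_raw <- sapply(fn_pairs, function(pair) {
  sub <- tab[, pair, drop = FALSE]
  # Drop lineages that contain neither of the two functions.
  sub <- sub[rowSums(sub) > 0, , drop = FALSE]
  fisher.test(sub)$p.value
})
names(p_raw) <- sapply(fn_pairs, paste, collapse = " vs ")

# Correct the pairwise p-values for multiple comparisons (Holm, as an assumption).
p_adj <- p.adjust(p_raw, method = "holm")

print(overall$p.value)
print(p_adj)
```

To compare lineages instead of gene functions, the same loop can be run over pairs of rows (rownames(tab)) rather than pairs of columns.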


Thanks. I tried this and ran into a problem, but I think I found a solution. I put the data in a table using table(). When I run fisher.test() on it I get the error "FEXACT error 6. LDKEY=617 is too small for this problem....", which seems to be related to memory. After subsetting the table, the largest matrix it works with is 2x3 (or 3x2). On another forum I found that chisq.test(<table>, simulate.p.value = T) is 'equivalent' to Fisher's exact test, and then I found that fisher.test() also takes this argument. Indeed, the two produce similar results. However, I'm not sure what this parameter means! Thanks
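For reference, the options mentioned here can be sketched as follows. The counts in tab are tallied from the example rows in the question, and the seed, the workspace value and B = 10000 are arbitrary choices for illustration. On a table this small the exact test runs without error, so this only shows the calls, not the FEXACT failure itself.

```r
# Lineage x gene function counts, tallied from the example rows in the question.
tab <- matrix(
  c(1, 1, 1, 2,
    2, 0, 1, 0,
    0, 2, 2, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Lineage = c("1", "2", "3"),
    Gene_function = c("cell wall", "conserved hypotheticals",
                      "regulatory proteins", "respiration")
  )
)

# Option 1: exact test with a larger workspace (the default is 200000).
exact_default <- fisher.test(tab)
exact_big <- fisher.test(tab, workspace = 2e6)

# Option 2: Monte Carlo p-values; set a seed so the result is reproducible.
set.seed(1)
fisher_sim <- fisher.test(tab, simulate.p.value = TRUE, B = 10000)
chisq_sim <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)

print(c(exact = exact_big$p.value,
        fisher_sim = fisher_sim$p.value,
        chisq_sim = chisq_sim$p.value))
```

Both simulated tests draw random tables with the same row and column totals as the observed table, but they rank those tables by different statistics, so the p-values are usually close rather than identical.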


According to the documentation, fisher.test() only applies the simulate.p.value parameter to tables larger than 2 by 2 ("a logical indicating whether to compute p-values by Monte Carlo simulation, in larger than 2 by 2 tables"). I am not a statistician, but I assume that this parameter speeds up the p-value calculation for such tables by estimating it from simulated tables instead of doing an exact (and more accurate) calculation:

In the r x c case with r > 2 or c > 2, internal tables can get too large for the exact test in which case an error is signalled. Apart from increasing ‘workspace’ sufficiently, which then may lead to very long running times, using ‘simulate.p.value = TRUE’ may then often be sufficient and hence advisable. Simulation is done conditional on the row and column marginals, and works only if the marginals are strictly positive. (A C translation of the algorithm of Patefield (1981) is used.)
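To make the idea of simulation "conditional on the row and column marginals" concrete, here is a rough sketch that does by hand something similar to what simulate.p.value = TRUE does. It is not the internal implementation of fisher.test(): the helper stat(), the tolerance 1e-7, the seed and B are assumptions for illustration. It uses r2dtable() from the stats package, which generates random tables with fixed margins.

```r
# Lineage x gene function counts, tallied from the example rows in the question.
tab <- matrix(
  c(1, 1, 1, 2,
    2, 0, 1, 0,
    0, 2, 2, 2),
  nrow = 3, byrow = TRUE
)

set.seed(42)
B <- 20000  # number of simulated tables

# With fixed margins, a table's probability is proportional to 1 / prod(x!),
# so -sum(log(x!)) ranks tables by probability (smaller = less probable).
stat <- function(x) -sum(lfactorial(x))
obs_stat <- stat(tab)

# Draw B random tables with the same row and column totals as the observed one.
sim_tabs <- r2dtable(B, rowSums(tab), colSums(tab))
sim_stat <- vapply(sim_tabs, stat, numeric(1))

# Monte Carlo p-value: share of simulated tables that are at most as probable
# as the observed table (the small tolerance guards against rounding in ties).
p_mc <- (1 + sum(sim_stat <= obs_stat + 1e-7)) / (B + 1)

# Exact p-value for comparison; feasible here because the table is tiny.
p_exact <- fisher.test(tab)$p.value

print(c(monte_carlo = p_mc, exact = p_exact))
```

With a larger B the simulated value gets closer to the exact one, at the cost of running time. For tables where the exact test fails, only the simulated value is available.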