Question

Which statistical test to compare expression of subset of genes from two cell types

6.9 years ago

florian.noack ▴ 20

Hi everyone, I am a bit puzzled about which statistical test I should use for my data. I have a subset of genes (selected by the methylation status of their promoters) and expression data (FPKM) for two lineage-related cell types (FACS-sorted from one animal). The question is whether expression differs between cell type A and cell type B when the promoter is annotated as methylated in A according to my dataset. Right now I am using a Wilcoxon signed-rank test with continuity correction (Wilcoxon because the data are not normally distributed, and paired because I analyse the same set of genes in cell type A and cell type B). However, I notice there seems to be a bias when testing bigger subsets of genes versus smaller ones. Bigger subsets (say 500 genes) seem to always come out significant, although the differences (at least by eye) don't look "big". Smaller subsets (around 50 genes), on the other hand, are not significant, although the boxplot looks much (!!!) more convincing than for the big subset.

Is the Wilcoxon signed-rank test the right choice, or is there another test that also takes into account how many genes are tested?

Thanks a lot, Flo

RNA-Seq statistics • 2.3k views

score 0 · Answer 1 · 2017-06-21

6.9 years ago

Jean-Karim Heriche 27k

I think you're encountering one of the problems with p-values and null-hypothesis testing. As your sample size grows, you're able to detect smaller and smaller differences. Ultimately, with real data, the null hypothesis of 'statistic A equals statistic B' is never exactly true, so with a big enough sample you'll always find them significantly different. Basically, you should not interpret statistical significance as an indication of biological relevance. Your p-value may be 1e-12, but the change you detect can be absolutely irrelevant to the biological process you study. For more on why p-values are a problem, see for example articles by David Colquhoun (e.g. here) and this article, this statement from the American Statistical Association and comments on this statement here, here, here or here, and there's even a Nature news article.
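The sample-size effect described here can be reproduced with a small sketch. It is an illustration only, under loud assumptions: the two gene sets are synthetic per-gene log2 differences built from normal quantiles so that both have the same median shift (0.2) and differ only in size (50 vs 500 genes), and `wilcoxon_signed_rank` is a hand-written normal-approximation version with continuity correction that omits the tie correction a statistics package would apply.

```python
import math
from statistics import NormalDist, median


def wilcoxon_signed_rank(diffs):
    """Two-sided p-value, normal approximation with continuity correction."""
    d = [x for x in diffs if x != 0]  # zero differences carry no sign
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # give tied |d| values their average rank
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = max(0.0, abs(w_plus - mean) - 0.5) / sd
    return 2 * (1 - NormalDist().cdf(z))


def synthetic_differences(n, shift):
    """Hypothetical paired log2 differences: normal quantiles plus a shift."""
    nd = NormalDist()
    return [shift + nd.inv_cdf((i + 0.5) / n) for i in range(n)]


# Same effect size in both sets; only the number of genes differs.
small = synthetic_differences(50, 0.2)
big = synthetic_differences(500, 0.2)
p_small = wilcoxon_signed_rank(small)
p_big = wilcoxon_signed_rank(big)
print(f"n=50:  median diff={median(small):.2f}, p={p_small:.3g}")
print(f"n=500: median diff={median(big):.2f}, p={p_big:.3g}")
```

Under these assumptions the larger set gives a far smaller p-value even though the median difference is identical, which is why the p-value alone says little about the size of the change.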

score 0 · Answer 2 · 2017-06-21

6.9 years ago

florian.noack ▴ 20

Thanks for the fast response. I have to say the data I get from the above-mentioned approach make biological sense. Genes where I expect a change (because they are methylated, for example) show a change, whereas other gene sets (for example 500 random genes) show nothing. The problem comes when I subgroup the regions, for example taking only genes where methylation is gained at predicted enhancers (which narrows down my gene set from 500 to, let's say, 50). The boxplot looks impressive and the changes are still ridiculously significant. Since we test with ten times fewer genes and still reach the same ridiculously low p-value, I would guess that methylation changes at enhancers cause a much larger and more robust change than if I take all regions with higher methylation (which makes biological sense). But how can I put this in numbers? Is there no way?

Please use the "add comment" button to reply to an answer. This keeps the discussion organized and avoids having comments appear as answer to a question.

Reply • 6.9 years ago by Jean-Karim Heriche 27k

I am confused. I understood the original question to say that tests with a large number of genes gave statistically significant differences, that tests with smaller numbers of genes were not significant, and that this was your problem. Now it seems that the tests are significant in both cases, so what is the problem?

You can't compare p-values to get an indication of the strength of the effect, if that's what you're getting at. If you want to quantify how much bigger the change in a subset is compared to the set it is derived from, just compute the ratio, for example: the change in subset Z is 3x the change in the whole set.
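One way to put such a ratio in numbers is sketched below. Everything in it is assumed for illustration: the FPKM values and gene names are made up, the "change" is taken to be the median per-gene log2 fold change (B over A), and the pseudocount of 1 is only there to avoid taking the log of zero. Other summaries of the change would work the same way.

```python
import math
from statistics import median

PSEUDOCOUNT = 1.0  # assumption: avoids log2(0) for genes with FPKM 0

# Made-up FPKM values per gene: (cell type A, cell type B)
fpkm = {
    "g1": (7, 3), "g2": (3, 3), "g3": (15, 7), "g4": (7, 0),
    "g5": (15, 1), "g6": (1, 1), "g7": (3, 1),
}
subset = ["g1", "g4", "g5"]  # hypothetical subset, e.g. enhancer-associated genes


def median_log2_change(genes):
    """Median over genes of log2((B + pseudocount) / (A + pseudocount))."""
    return median(
        math.log2((fpkm[g][1] + PSEUDOCOUNT) / (fpkm[g][0] + PSEUDOCOUNT))
        for g in genes
    )


whole_change = median_log2_change(fpkm)     # all genes in the set
subset_change = median_log2_change(subset)  # only the subset
ratio = subset_change / whole_change

print(f"whole set: {whole_change:+.2f} log2 units")
print(f"subset:    {subset_change:+.2f} log2 units")
print(f"subset change is {ratio:.1f}x the whole-set change")
```

With these toy numbers the median change is -1 log2 unit for the whole set and -3 for the subset, so the ratio is 3. Unlike a p-value, this number does not shrink or grow just because the gene set gets larger.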

Reply • 6.9 years ago by Jean-Karim Heriche 27k