sample size, normality assumption, t.test/anova for DEG analysis
6.8 years ago
TriS ★ 4.5k

I wanted to share two posts and ask for your take on them... and also write out some thoughts for myself to put ideas together :)

I came across these two posts on Cross Validated and thought they could be relevant for this forum too:

here are my thoughts: there are lots of R packages that can be used to find differentially expressed genes. one of my favorites is limma, which computes moderated t-statistics, "borrowing" information across all the genes to improve the variance estimate for each single gene (correct me if I'm wrong). however, in my microarray analysis classes I was taught to use a plain t.test for DEGs when we had "enough" samples.

the word "enough" always confused me: 3, 5, 10, 100, 10^89?! I didn't know that the t.test was originally developed to analyze samples as small as 4, and although more elegant ways of evaluating real differences between the means of two samples have been developed, the t.test is still widely used. so... what's your take on using a t.test for DEGs with N > 20-30? would you completely discard it? would you still use it for big sample sizes?
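to make the "plain t.test per gene" approach concrete, here's a minimal base-R sketch on simulated data (no limma, no moderation); the gene count, group sizes, and effect size below are all made up for illustration:

```r
# Toy per-gene two-sample t-test on a simulated gene x sample matrix.
set.seed(1)
n_genes <- 1000
n_per_group <- 20
group <- factor(rep(c("normal", "tumor"), each = n_per_group))

# Simulated log2-scale expression; spike a shift into the first 50 genes
expr <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes)
expr[1:50, group == "tumor"] <- expr[1:50, group == "tumor"] + 2

# One Welch t-test per gene
pvals <- apply(expr, 1, function(x) t.test(x ~ group)$p.value)

# Adjust for multiple testing before thresholding
padj <- p.adjust(pvals, method = "BH")
sum(padj < 0.05)
```

with 20 samples per group and a 2-SD shift, essentially all 50 spiked genes come out significant after BH adjustment; with 3 samples per group, far fewer would.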

now let's say we have more than two conditions. here another option is ANOVA, which, by definition, analyzes the variance across groups to test for differences in their means. a good and detailed description is in this file from The Jackson Laboratory. now, if we have just a "few" (same as "enough", a very irritating word) samples, then estimating the variance can be tough, and that's where tools like limma come in handy: they use values from other genes to "shrink" the per-gene variance. but then, let's say we have N > 30: would you still consider using plain ANOVA?
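for the multi-condition case, here's a minimal base-R sketch of a one-way ANOVA for a single gene across three conditions; all the numbers (condition names, group sizes, means) are invented for illustration:

```r
# One-way ANOVA across three simulated conditions for one gene.
set.seed(2)
condition <- factor(rep(c("ctrl", "trtA", "trtB"), each = 10))
expr <- c(rnorm(10, mean = 5),   # ctrl
          rnorm(10, mean = 5),   # trtA: same as ctrl
          rnorm(10, mean = 7))   # trtB: shifted up

fit <- aov(expr ~ condition)
summary(fit)  # F-test for any difference among the three group means
p <- summary(fit)[[1]][["Pr(>F)"]][1]
```

the F-test only tells you that some group differs; you'd follow up with contrasts or post-hoc tests (e.g. TukeyHSD) to find which one.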

lastly, normality. I often underestimated the importance of normally distributed data, then started reading about the central limit theorem and parametric vs. non-parametric testing. quick recap: parametric testing = t.test/ANOVA; non-parametric testing = Wilcoxon rank-sum test (a.k.a. Mann-Whitney test). I found it nicely mapped in a table here. so, my question is: when you analyze your data, how much weight do you put on their distribution? i.e. do you run a Shapiro-Wilk test to check normality, or just go with how the distribution looks?
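both options are one-liners in base R; here's a sketch on simulated skewed data (the distributions and sample sizes are made up) showing shapiro.test next to the parametric and non-parametric comparisons:

```r
# Normality check plus parametric vs. non-parametric comparison
# on deliberately skewed (exponential) simulated data.
set.seed(3)
x <- rexp(50, rate = 1)      # clearly non-normal (right-skewed)
y <- rexp(50, rate = 0.5)    # same shape, roughly double the mean

p_shapiro <- shapiro.test(x)$p.value   # low p-value: reject normality
p_t       <- t.test(x, y)$p.value      # parametric
p_wilcox  <- wilcox.test(x, y)$p.value # Wilcoxon rank-sum (Mann-Whitney)
c(p_shapiro, p_t, p_wilcox)
```

one caveat with the "run shapiro.test first" workflow: at large N it rejects for trivial departures from normality, and at small N it has little power, so eyeballing a QQ-plot is often just as informative.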

lots of this goes back to power, but let's assume we're handed a set of data to analyze and are not designing the experiment from the get-go... or if we are, we have limited $... oh wait... I forgot that in research money is never a problem (..add sarcastic grin... lol) :)

as a small test, I ran limma and a t.test on a set of 28 normal and 32 tumor samples from some CEL files we had in the lab. the list of DEGs with the same thresholds (p.value < 0.05 and log2FC > 1.5) is exactly the same, but the p-values, as expected, are lower with limma.

alright, I think that's it for now... thanks for reading; I've been thinking about these things for a while, so I'm curious to see what others think.

I haven't mentioned approaches like SAM or resampling because otherwise this would become too long of a post, but feel free to share your ideas about them too.

thanks :)

statistics data analysis t.test anova parametric
6.8 years ago

You're never going to lose power versus a straight t-test by using limma. Since limma can handle a few hundred samples without much issue, I don't see any reason not to just use it by default. BTW, the biggest gain in power with larger sample sizes comes from switching to a non-parametric test. I've seen recommendations of 50-60 samples per group for that, though I don't know how reliable those numbers are.

Regarding ANOVA with enough samples, that's really just the same thing. After all, a t-test is an ANOVA, which is in turn just a linear model, which is what limma fits. You're not going to gain anything other than a little bit of time by using a straight ANOVA.
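That equivalence is easy to verify in base R: on two-group data, a pooled-variance t.test, aov, and lm all produce the same p-value (the data below are simulated just for the check):

```r
# t-test == one-way ANOVA == linear model, for two groups.
set.seed(4)
group <- factor(rep(c("A", "B"), each = 15))
y <- rnorm(30) + ifelse(group == "B", 1, 0)  # 1-unit shift in group B

p_t   <- t.test(y ~ group, var.equal = TRUE)$p.value  # pooled-variance t-test
p_aov <- summary(aov(y ~ group))[[1]][["Pr(>F)"]][1]  # one-way ANOVA
p_lm  <- anova(lm(y ~ group))[["Pr(>F)"]][1]          # linear model F-test

all.equal(p_t, p_aov)   # TRUE
all.equal(p_aov, p_lm)  # TRUE
```

Note `var.equal = TRUE`: the identity holds for the pooled-variance t-test (t² equals the ANOVA F), not for the Welch version R uses by default.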

Regarding the distribution, linear models are fairly robust to violations of normality. That said, if you push things too far you're still going to get unreliable results.