Why can't we use a t-test for differential expression (negative binomial distribution assumed)?
3
5
3.2 years ago
CY ▴ 640

According to the central limit theorem, a t-test can be used on non-normally distributed samples. Besides, RNA-Seq counts are better fit by a negative binomial distribution, which doesn't differ significantly from a normal distribution. So why can't we just use a t-test for DE estimation?

RNA-Seq • 4.9k views
0

My understanding from a presentation I saw (I am not a statistician) is that you could use a t-test IF you have a large number of samples (think tens or more). I recall the n being something like 20.

0

By "if you have a large number of samples", do you mean that gene expression follows a non-normal distribution (although a similar one), so only with a large number of samples would the expression statistics converge to a normal distribution (by the central limit theorem)? Am I understanding it correctly?

2

Counts themselves will never approach a normal distribution, since they're integer and bounded at 0. They can be transformed to be "close enough", though, which is part of what voom() does.
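
To make the "close enough" transformation concrete, here is a minimal sketch of a log2 counts-per-million transform with small offsets, which is the kind of value voom works on before it estimates precision weights. This is plain Python, not limma code; the function name `log2_cpm`, the counts, and the library size are all made up for illustration.

```python
import math

def log2_cpm(counts, lib_size):
    """Log2 counts-per-million with small offsets so zero counts stay finite.

    Hypothetical helper for illustration; not part of any package.
    """
    return [math.log2((c + 0.5) / (lib_size + 1.0) * 1e6) for c in counts]

# Assumed example: five genes in one sample with 2 million mapped reads
counts = [0, 3, 40, 500, 12000]
lib_size = 2_000_000

values = log2_cpm(counts, lib_size)
for c, v in zip(counts, values):
    print(f"count={c:>6}  log2-CPM={v:7.3f}")
```

The transformed values are continuous and can be negative, so they are no longer integers bounded at 0. The transform alone does not fix the mean-variance relationship, which is why the weights matter as well.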

0

Even though counts don't approach a normal distribution, the central limit theorem still allows me to use a t-test on normalized counts if we have a sufficient sample size (although most likely we don't), right?

2

Have a look at the studies by Gierlinski et al., particularly this, this, and this

0

Out of interest, is this purely academic interest, or do you have data that do not behave as expected with standard tools and are now trying to tweak parameters?

0

It is purely academic interest. I want to get a rough picture of how DE is done.

6
3.2 years ago

You can use a t-test; that's what limma is doing (though only after passing counts through voom, and then with some empirical Bayes methods). The reason no one uses a simple t-test for RNA-seq is the same reason no one did for array data: you rarely have enough samples to estimate the variance accurately without pooling information across genes (see the original limma paper for a nice discussion of this).

2

Just came across this, which is a succinct, approachable summary of the limma paper

4
3.2 years ago

I have a feeling you're actually after the answer to a slightly different question, which could be along the lines of: "Why do we need a statistical package at all to process RNA-seq count data?"

As genomax and Devon have correctly pointed out, a t-test can be used in the realm of DE analysis, but you should absolutely never apply it to the raw counts, no matter how many samples you have, because raw counts are never absolute measures of expression for a specific gene within a given sample. The actual number of reads per gene depends on the efficiency of the library prep (including RNA extraction and cDNA synthesis) and on the amount of contamination from non-coding transcripts (e.g. rRNA, tRNA). And, of course, the sequencing depth, i.e. the number of reads per sample, also strongly influences the final value. All of these issues need to be taken into account before any statistical test, and this is where the packages have contributed a lot, too, in addition to establishing ways of estimating variances from as few as 2-3 replicates per condition.
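
A toy illustration of the sequencing-depth point, assuming two hypothetical samples with identical RNA composition where one was simply sequenced twice as deep. Scaling by the total count is the crudest possible normalization and is used here only as a sketch; the DE packages use more robust scaling factors.

```python
import math

def counts_per_million(counts):
    """Scale raw counts by the sample's total so depth cancels out."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

# Assumed example: same composition, sample_b sequenced twice as deep
sample_a = [100, 300, 600]
sample_b = [200, 600, 1200]

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)

for raw_a, raw_b, a, b in zip(sample_a, sample_b, cpm_a, cpm_b):
    print(f"raw: {raw_a:>5} vs {raw_b:>5}   CPM: {a:9.1f} vs {b:9.1f}")
```

On the raw counts every gene looks two-fold "up" in the second sample; after scaling, the difference disappears, because it was depth and not expression.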

0

If I first normalize the raw counts the way most DE tools do, taking account of library size, heteroscedasticity, etc., can I use a t-test on these normalized counts if the sample size is sufficient?

0

Yes, but why bother? You'll have lower power and more questions from reviewers if you go that route.

0

I just want to get a rough picture of how DE is tested in general.

0

Check out the slides I linked in Devon's reply, they should give you a good overview.

And yes, very roughly, your approach would be similar, but the crucial difference is, as others have pointed out, that the stats packages are fairly sophisticated in estimating the variance. Even limma doesn't use a simple t-test; it calculates a "moderated t-statistic", which tries to account for the typical lack of replicates.
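
For a rough feel of what "moderated" means, here is a sketch of the variance shrinkage idea: each gene's variance is pulled toward a prior variance shared across genes before the t-statistic is formed. In limma the prior variance and its degrees of freedom are estimated from all genes by empirical Bayes; in this sketch `s0_sq` and `d0` are simply made-up constants, and the function names are hypothetical.

```python
import math

def ordinary_t(diff, s2, n1, n2):
    """Two-group t-statistic from a mean difference and a pooled variance."""
    return diff / math.sqrt(s2 * (1 / n1 + 1 / n2))

def moderated_t(diff, s2, df, n1, n2, s0_sq, d0):
    """Same statistic, but with the gene's variance shrunk toward a prior."""
    s2_post = (d0 * s0_sq + df * s2) / (d0 + df)
    return diff / math.sqrt(s2_post * (1 / n1 + 1 / n2))

# Assumed example: 3 vs 3 samples, log2 fold change of 1 for both genes
n1 = n2 = 3
df = n1 + n2 - 2
s0_sq, d0 = 0.25, 4.0  # illustrative prior values, not estimated from data

# Gene A got a tiny variance by chance; gene B is noisy
t_a_ord = ordinary_t(1.0, 0.01, n1, n2)
t_a_mod = moderated_t(1.0, 0.01, df, n1, n2, s0_sq, d0)
t_b_ord = ordinary_t(1.0, 1.0, n1, n2)
t_b_mod = moderated_t(1.0, 1.0, df, n1, n2, s0_sq, d0)

print(f"gene A: ordinary t = {t_a_ord:.2f}, moderated t = {t_a_mod:.2f}")
print(f"gene B: ordinary t = {t_b_ord:.2f}, moderated t = {t_b_mod:.2f}")
```

With only three replicates per group, a gene can get a tiny variance by luck and an inflated ordinary t. Shrinkage tempers that, and it also rescues genes whose variance was overestimated.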

0

I have read it. It is very straightforward and helpful.

2
3.2 years ago

As a trivial example, a t-test comparing (1,2,3) to (4,5,6) returns the same answer as one comparing (10,20,30) to (40,50,60). But if those are raw counts, you can't say that the likelihood of the two groups being different is the same between the two sets, because the lower-count one is so much more prone to be way off due to sampling error.
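
The example can be checked with a few lines of standard-library Python. The pooled-variance t-statistic is written out by hand, and the helper names are hypothetical. The second part assumes pure Poisson sampling noise, which is a simplification (real RNA-seq counts are overdispersed), just to show how relative noise depends on the count level.

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Student's two-sample t-statistic with pooled variance."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

t_low = pooled_t([1, 2, 3], [4, 5, 6])
t_high = pooled_t([10, 20, 30], [40, 50, 60])
print(f"t (low counts)  = {t_low:.3f}")
print(f"t (high counts) = {t_high:.3f}")

def poisson_cv(mean_count):
    """Coefficient of variation of a Poisson count: sd / mean = 1 / sqrt(mean)."""
    return 1 / math.sqrt(mean_count)

# Relative sampling noise at the group means of the first pair (2 and 20)
print(f"relative noise at mean 2:  {poisson_cv(2):.2f}")
print(f"relative noise at mean 20: {poisson_cv(20):.2f}")
```

The t-statistic is identical for both comparisons because it is invariant to rescaling, while the expected relative sampling noise is about three times larger at the lower count level. The plain t-test has no way of knowing that.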