Question: Why can't we use a t-test for differential expression (negative binomial distribution assumed)?
1
13 months ago by
CY470
United States
CY470 wrote:

According to the central limit theorem, a t-test can be used on non-normally distributed samples. Besides, RNA-Seq data are better fit by a negative binomial distribution, which doesn't differ significantly from a normal distribution. So why can't we just use a t-test for DE estimation?

rna-seq • 964 views
modified 13 months ago by swbarnes27.7k • written 13 months ago by CY470

My understanding from a presentation I saw (I am not a statistician) is that you could use a t-test IF you have a large number of samples (think tens or more). I recall the `n` being something like 20.

By "if you have a large number of samples", do you mean that gene expression follows a non-normal distribution (although a similar one), and that only with a large number of samples would the expression statistics converge to a normal distribution (by the central limit theorem)? Am I understanding it correctly?

2

Counts themselves will never approach a normal distribution, since they're integers and bounded at 0. They can be transformed to be "close enough", though, which is part of what `voom()` does.
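To make the "close enough" transformation concrete, here is a minimal sketch of a log-CPM transform with small offsets, which is how I understand the first step of `voom` to work. The function name `log_cpm` and the library size are made up for illustration, and the sketch deliberately leaves out the precision weights that `voom` also estimates from the mean-variance trend.

```python
from math import log2

def log_cpm(count, library_size):
    """log2 counts per million, with small offsets so a zero count stays finite."""
    return log2((count + 0.5) / (library_size + 1.0) * 1e6)

library_size = 1_000_000  # hypothetical total number of reads in one sample

# Integer counts bounded at 0 become continuous values that can go negative.
for count in (0, 1, 10, 100, 1000):
    print(count, round(log_cpm(count, library_size), 2))
```

The transformed values are no longer integers and are no longer bounded at 0, which is what makes normal-theory methods more defensible on this scale.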

Even though counts don't approach a normal distribution, the central limit theorem still allows me to use a t-test on normalized counts if we have a sufficient sample size (although most likely we don't), right?

Out of interest, is this purely academic, or do you have data that do not behave as expected with standard tools and you are now trying to tweak parameters?

It is purely academic interest. I want to get a rough picture of how DE is done.

Have a look at the studies by Gierlinski et al., particularly this, this, and this

4
13 months ago by
Friederike5.6k
United States
Friederike5.6k wrote:

I have a feeling you're actually after the answer to a slightly different question, which could be along the lines of: "Why do we need a statistical package at all to process RNA-seq count data?"

As genomax and Devon have correctly pointed out, a t-test can be used in the realm of DE analysis, but you should absolutely never apply it to the raw counts, no matter how many samples you have, because raw counts are never absolute measures of expression for a specific gene within a given sample. The actual number of reads per gene depends on the efficiency of the library prep (including RNA extraction and cDNA synthesis), on the amount of contamination from non-coding transcripts (e.g. rRNA, tRNA), and, of course, on the actual sequencing depth, i.e. the number of reads per sample. All of these issues need to be taken into account before any statistical test, and this is where the packages have contributed a lot, in addition to establishing ways of estimating variances from as few as 2-3 replicates per condition.
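As a rough illustration of the sequencing-depth point only, here is a sketch that scales counts by library size (counts per million). The gene names and counts are invented, and this is the simplest possible scaling, not what the packages actually do: edgeR uses TMM and DESeq2 uses a median-of-ratios size factor, both of which also guard against composition effects that plain CPM ignores.

```python
def counts_per_million(counts):
    """Scale one sample's gene counts by its library size (total reads)."""
    library_size = sum(counts.values())
    return {gene: 1e6 * n / library_size for gene, n in counts.items()}

# Hypothetical toy samples with identical composition;
# sample_b was simply sequenced twice as deep as sample_a.
sample_a = {"geneA": 100, "geneB": 300, "geneC": 600}
sample_b = {"geneA": 200, "geneB": 600, "geneC": 1200}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)

# Raw counts differ two-fold for every gene, the scaled values do not.
for gene in sample_a:
    print(gene, sample_a[gene], sample_b[gene], cpm_a[gene], cpm_b[gene])
```

A test on the raw counts would see a two-fold "change" in every gene here, even though nothing differs between the samples except depth.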

If I first normalize the raw counts the way most DE tools do, taking account of library size, heteroscedasticity, etc., can I use a t-test on these normalized counts if the sample size is sufficient?

Yes, but why bother? You'll have lower power and more questions from reviewers if you go that route.

I just want to get a rough picture of how DE is tested in general.

Check out the slides I linked in Devon's reply, they should give you a good overview.

And yes, very roughly, your approach would be similar, but the crucial difference is, as others have pointed out, that the stats packages are fairly sophisticated in estimating the variance. Even limma doesn't use a simple t-test; it calculates a "moderated t-statistic", which tries to account for the typical lack of replicates.
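To give a feel for what "moderated" means, here is a minimal sketch of the variance shrinkage idea: the per-gene variance is pulled toward a prior variance, weighted by degrees of freedom. In limma the prior variance and prior degrees of freedom are estimated from all genes by empirical Bayes; in this sketch `prior_var` and `prior_df` are simply assumed values, and the expression values are made up.

```python
from math import sqrt
from statistics import mean, variance

def pooled_variance(a, b):
    """Ordinary pooled variance of two groups and its degrees of freedom."""
    df = len(a) + len(b) - 2
    s2 = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / df
    return s2, df

def moderated_t(a, b, prior_var, prior_df):
    """Two-group t statistic with the gene's variance shrunk toward prior_var."""
    s2, df = pooled_variance(a, b)
    shrunk_var = (prior_df * prior_var + df * s2) / (prior_df + df)
    se = sqrt(shrunk_var * (1 / len(a) + 1 / len(b)))
    return (mean(a) - mean(b)) / se

# Hypothetical log-expression values for one gene, 3 replicates per group.
# By chance the replicates agree very closely, so the plain variance is tiny.
control = [5.00, 5.01, 5.02]
treated = [5.10, 5.11, 5.12]

# prior_df=0 switches the shrinkage off and gives the ordinary pooled t statistic.
plain = moderated_t(control, treated, prior_var=0.0, prior_df=0)
# Assumed prior; limma would estimate these two numbers from all genes.
moderated = moderated_t(control, treated, prior_var=0.05, prior_df=4)

print(plain, moderated)
```

With only three replicates, a gene can show a tiny variance by luck and an inflated ordinary t statistic. Borrowing information from the other genes tempers that, which is the point of pooling.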

I have read it. It is very straightforward and helpful.

3
13 months ago by
Devon Ryan95k
Freiburg, Germany
Devon Ryan95k wrote:

You can use a t-test; that's what limma is doing (though after passing counts through `voom` and then using some empirical Bayes methods). The reason no one uses a simple t-test for RNA-seq is the same reason no one did it for array data, namely that you rarely have sufficient sample numbers to accurately estimate variance without pooling information across genes (see the original limma paper for a nice discussion of this).

1

Just came across this, which is a succinct, approachable summary of the limma paper

2
13 months ago by
swbarnes27.7k
United States
swbarnes27.7k wrote:

As a trivial example, a t-test comparing (1,2,3) to (4,5,6) returns the same answer as one comparing (10,20,30) to (40,50,60). But if those are raw counts, you can't say that the likelihood of the two groups being different is the same between the two sets, because the lower-count one is so much more prone to being way off due to sampling error.
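The example above is easy to check with a few lines of Python. The sketch below computes Welch's t statistic by hand for both comparisons, then shows how relative sampling noise depends on the count level. The second part assumes a Poisson model for sampling error (my assumption, for illustration), under which the coefficient of variation of a count with mean `mu` is `1 / sqrt(mu)`.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

low = welch_t([1, 2, 3], [4, 5, 6])
high = welch_t([10, 20, 30], [40, 50, 60])
print(low, high)  # the t statistic does not change when all values are scaled by 10

def poisson_cv(mu):
    """Relative sampling noise (sd / mean) of a Poisson count with mean mu."""
    return 1 / sqrt(mu)

# Assumed Poisson sampling: low counts are relatively much noisier.
print(poisson_cv(2), poisson_cv(20))
```

The t-test only sees the ratio of the difference to the spread, so it cannot know that a count of 2 is far less trustworthy than a count of 20. A count-based model carries that information in its mean-variance relationship.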