Question: Why can't we use a t-test for differential expression (negative binomial distribution assumed)?
1
4 weeks ago by CY (330), United States
CY wrote:

According to the central limit theorem, a t-test can be used for non-normally distributed samples. Besides, RNA-Seq data fit a negative binomial distribution better, which doesn't differ significantly from a normal distribution. So why can't we just use a t-test for DE estimation?

rna-seq • 254 views
modified 4 weeks ago by swbarnes2 (5.6k) • written 4 weeks ago by CY (330)

My understanding from a presentation I saw (I am not a statistician) is that you could use a t-test IF you have a large number of samples (think tens or more). I recall the n being something like 20.

modified 4 weeks ago • written 4 weeks ago by genomax (67k)

By "if you have a large number of samples", do you mean that gene expression follows a non-normal distribution (although a similar one), so that only with a large number of samples would the expression statistics converge to a normal distribution (by the central limit theorem)? Am I understanding this correctly?

written 4 weeks ago by CY (330)
2

Counts themselves will never approach a normal distribution, since they're integer and bounded at 0. They can be transformed to be "close enough", though, which is part of what voom() does.

written 4 weeks ago by Devon Ryan (90k)
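
To make the "close enough" transformation concrete, here is a minimal sketch of a log2 counts-per-million transform with small offsets. The offsets (0.5 on the count, 1 on the library size) follow my understanding of the transform described for voom; voom additionally computes precision weights, which this sketch does not attempt. The counts and library size are hypothetical.

```python
import math

def log_cpm(counts, lib_size):
    """log2 counts-per-million with small offsets so that zero counts stay finite."""
    return [math.log2((c + 0.5) / (lib_size + 1.0) * 1e6) for c in counts]

raw = [0, 1, 10, 100, 1000]   # hypothetical counts for five genes in one sample
lib_size = 2_000_000          # hypothetical total number of mapped reads

transformed = log_cpm(raw, lib_size)
for c, y in zip(raw, transformed):
    print(f"count={c:5d}  log2-CPM={y:7.3f}")
```

The transformed values are continuous and no longer bounded at 0, which is what makes normal-theory methods more plausible afterwards.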

Even though counts don't approach a normal distribution, the central limit theorem still allows me to use a t-test on normalized counts if we have a sufficient sample size (although most likely we don't), right?

written 29 days ago by CY (330)

Do you have a reference for this n > 20?

written 29 days ago by CY (330)

Out of interest, is this purely academic interest, or do you have data that do not behave as expected with standard tools and you are now trying to tweak parameters?

written 29 days ago by ATpoint (16k)

It is purely academic interest. I want to get a rough picture of how DE is done.

written 28 days ago by CY (330)

Have a look at the studies by Gierlinski et al., particularly this, this, and this

written 29 days ago by Friederike (4.2k)
4
4 weeks ago by Friederike (4.2k), United States
Friederike wrote:

I have a feeling you're actually after the answer to a slightly different question, which could be along the lines of: "Why do we need a statistical package at all to process RNA-seq count data?"

As genomax and Devon have correctly pointed out, a t-test can be used in the realm of DE analysis, but you should never apply it to the raw counts, no matter how many samples you have, because raw counts are never absolute measures of expression for a specific gene within a given sample. The actual number of reads per gene depends on the efficiency of the library prep (including RNA extraction and cDNA synthesis), on the amount of contamination from non-coding transcripts (e.g. rRNA, tRNA) and, of course, on the actual sequencing depth, i.e. the number of reads per sample. All of these issues need to be taken into account before any statistical test, and this is where the packages have contributed a lot, in addition to establishing ways of estimating variances from as few as 2-3 replicates per condition.

modified 4 weeks ago • written 4 weeks ago by Friederike (4.2k)
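
The sequencing-depth point in the answer above can be illustrated with a small sketch. It assumes the simplest possible correction, scaling each sample to counts per million (CPM); packages such as DESeq2 and edgeR use more robust scaling factors (median-of-ratios and TMM, respectively), which are not reproduced here. The two samples are hypothetical.

```python
def cpm(counts):
    """Scale one sample's counts to counts per million of its own library size."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

# Hypothetical samples with identical relative expression of three genes,
# but sample_b was sequenced twice as deeply as sample_a.
sample_a = [100, 400, 500]
sample_b = [200, 800, 1000]

print("raw a:", sample_a)
print("raw b:", sample_b)
print("CPM a:", cpm(sample_a))
print("CPM b:", cpm(sample_b))
```

On the raw scale every gene looks twofold higher in the second sample; after scaling by library size the two samples agree, which is the comparison one actually cares about.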

If I first normalize the raw counts the way most DE tools do, taking into account library size, heteroscedasticity, etc., can I use a t-test on these normalized counts if the sample size is sufficient?

written 29 days ago by CY (330)

Yes, but why bother? You'll have lower power and more questions from reviewers if you go that route.

written 29 days ago by Devon Ryan (90k)

I just want to get a rough picture of how DE is tested in general.

written 29 days ago by CY (330)

Check out the slides I linked in Devon's reply, they should give you a good overview.

And yes, very roughly, your approach would be similar, but the crucial difference is, as others have pointed out, that the stats packages are fairly sophisticated in estimating the variance. Even limma doesn't use a simple t-test; it calculates a "moderated t-statistic", which tries to account for the typical lack of replicates.

written 29 days ago by Friederike (4.2k)

I have read it. It is very straightforward and helpful.

written 28 days ago by CY (330)
3
4 weeks ago by Devon Ryan (90k), Freiburg, Germany
Devon Ryan wrote:

You can use a t-test; that's what limma is doing (though after passing counts through voom and then using some empirical Bayes methods). The reason no one uses a simple t-test for RNA-seq is the same reason no one did it for array data, namely that you rarely have sufficient sample numbers to accurately estimate variance without pooling information across genes (see the original limma paper for a nice discussion of this).

written 4 weeks ago by Devon Ryan (90k)
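
A rough sketch of what "pooling information across genes" can look like: each gene's variance is shrunk toward a common prior value before the t-statistic is formed, using the weighted average (d0 * s0^2 + d * s^2) / (d0 + d). This is only an illustration of the idea. limma estimates the prior variance s0^2 and prior degrees of freedom d0 from the data by empirical Bayes; here, as loudly labelled assumptions, s0^2 is just the mean of the gene-wise variances and d0 is a made-up constant. The gene names and values are hypothetical log-expression data.

```python
import math
from statistics import mean, variance

def pooled_variance(a, b):
    """Ordinary pooled variance of two groups and its degrees of freedom."""
    da, db = len(a) - 1, len(b) - 1
    return (da * variance(a) + db * variance(b)) / (da + db), da + db

def t_from_variance(a, b, s2):
    """Two-sample t-statistic computed with a supplied variance estimate s2."""
    return (mean(b) - mean(a)) / math.sqrt(s2 * (1 / len(a) + 1 / len(b)))

# Hypothetical log-expression values, 3 replicates per condition.
genes = {
    "geneA": ([5.0, 5.1, 4.9], [6.0, 6.1, 5.9]),      # large effect, modest variance
    "geneB": ([5.0, 5.01, 4.99], [5.1, 5.11, 5.09]),  # tiny effect, tiny variance
    "geneC": ([5.0, 5.5, 4.5], [5.2, 5.7, 4.7]),      # small effect, large variance
}

D_PRIOR = 4.0  # assumed prior degrees of freedom (limma would estimate this)
pooled = {g: pooled_variance(a, b) for g, (a, b) in genes.items()}
s2_prior = mean([s2 for s2, _ in pooled.values()])  # crude stand-in for the prior variance

results = {}
for g, (a, b) in genes.items():
    s2, d = pooled[g]
    s2_moderated = (D_PRIOR * s2_prior + d * s2) / (D_PRIOR + d)
    results[g] = (t_from_variance(a, b, s2), t_from_variance(a, b, s2_moderated))
    print(f"{g}: ordinary t = {results[g][0]:6.2f}, moderated t = {results[g][1]:6.2f}")
```

In this toy data geneA and geneB get essentially the same ordinary t-statistic even though geneB's effect is ten times smaller, simply because its three replicates happen to agree very closely. After shrinkage, geneB's statistic drops much further than geneA's, which is the kind of protection against unstable small-sample variances that the answer describes.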
1

Just came across this, which is a succinct, approachable summary of the limma paper

written 4 weeks ago by Friederike (4.2k)
2
4 weeks ago by swbarnes2 (5.6k), United States
swbarnes2 wrote:

As a trivial example, a t-test comparing (1,2,3) to (4,5,6) returns the same answer as one comparing (10,20,30) to (40,50,60), because the t-statistic does not change when all values are multiplied by the same factor. But if those are raw counts, you can't say that the likelihood of the two groups being different is the same between the two sets, because the lower-count set is much more prone to being way off due to sampling error.

written 4 weeks ago by swbarnes2 (5.6k)
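
The example in the answer above can be checked with a few lines of standard-library Python. The sketch computes the equal-variance two-sample t-statistic by hand for both sets, and then, under the assumption that pure counting noise is Poisson, prints the relative noise (standard deviation divided by mean, which is 1/sqrt(mean) for a Poisson count) at the two count levels. The helper names are mine, not from any package.

```python
import math
from statistics import mean, variance

def student_t(a, b):
    """Equal-variance two-sample t-statistic (group b minus group a)."""
    da, db = len(a) - 1, len(b) - 1
    pooled = (da * variance(a) + db * variance(b)) / (da + db)
    return (mean(b) - mean(a)) / math.sqrt(pooled * (1 / len(a) + 1 / len(b)))

def poisson_cv(mu):
    """Coefficient of variation of a Poisson count with mean mu."""
    return 1 / math.sqrt(mu)

t_low = student_t([1, 2, 3], [4, 5, 6])
t_high = student_t([10, 20, 30], [40, 50, 60])
print("t for low counts: ", t_low)
print("t for high counts:", t_high)

# Relative counting noise at the mean of the first group in each set.
print("Poisson CV at mean 2: ", poisson_cv(2))
print("Poisson CV at mean 20:", poisson_cv(20))
```

The two t-statistics are identical, yet the expected relative counting noise is roughly three times larger at a mean of 2 than at a mean of 20, so the same test result deserves less trust for the low-count gene.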