Question: Why can't we use a t-test for differential expression (negative binomial distribution assumed)?
1
13 months ago by
CY470
United States
CY470 wrote:

According to the central limit theorem, a t-test can be used on non-normally distributed samples. Besides, RNA-Seq data are better fit by a negative binomial distribution, which doesn't differ significantly from a normal distribution. So why can't we just use a t-test for DE estimation?

rna-seq • 964 views
ADD COMMENTlink modified 13 months ago by swbarnes27.7k • written 13 months ago by CY470

My understanding from a presentation I saw (I am not a statistician) is that you could use a t-test IF you have a large number of samples (think tens or more). I recall the n being something like 20.

ADD REPLYlink modified 13 months ago • written 13 months ago by genomax83k

By "if you have a large number of samples", is that because gene expression follows a non-normal distribution (although a similar one), so that only with a large number of samples would the expression statistics converge to a normal distribution (by the central limit theorem)? Am I understanding it correctly?

ADD REPLYlink written 13 months ago by CY470
2

Counts themselves will never approach a normal distribution, since they're integer and bounded at 0. They can be transformed to be "close enough", though, which is part of what voom() does.
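To make the "close enough" transformation concrete, here is a minimal sketch of a log2 counts-per-million transform with the small offsets voom adds (0.5 to each count, 1 to the library size). The counts and the library size are made-up numbers, and this only illustrates the transformation step; voom additionally estimates precision weights, which are not shown.

```python
import math

def log2_cpm(counts, lib_size):
    """log2 counts-per-million, with offsets so that zero counts stay finite."""
    return [math.log2((c + 0.5) / (lib_size + 1.0) * 1e6) for c in counts]

# Hypothetical counts for five genes in one sample (illustrative numbers only).
counts = [0, 3, 40, 500, 6000]
lib_size = 2_000_000  # assumed total number of mapped reads in this sample

values = log2_cpm(counts, lib_size)
for c, v in zip(counts, values):
    print(f"count={c:5d}  log2-CPM={v:6.2f}")
```

The transformed values are continuous and no longer bounded at 0 (a zero count maps to a negative number), which is what makes normal-theory methods more reasonable on this scale.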

ADD REPLYlink written 13 months ago by Devon Ryan95k

Even though counts don't approach a normal distribution, the central limit theorem still allows me to use a t-test on normalized counts if we have a sufficient sample size (although most likely we don't), right?

ADD REPLYlink written 13 months ago by CY470

Do you have any reference about this n > 20?

ADD REPLYlink written 13 months ago by CY470

Out of interest, is this pure academic interest or do you have data that do not behave as expected with standard tools and you try to tweak parameters now?

ADD REPLYlink written 13 months ago by ATpoint34k

It is pure academic interest. Want to get a rough picture of how DE is done.

ADD REPLYlink written 13 months ago by CY470

Have a look at the studies by Gierlinski et al., particularly this, this, and this

ADD REPLYlink written 13 months ago by Friederike5.6k
4
13 months ago by
Friederike5.6k
United States
Friederike5.6k wrote:

I have a feeling you're actually after the answer to a slightly different question, which could be along the lines of: "Why do we need a statistical package at all to process RNA-seq count data?"

As genomax and Devon have correctly pointed out: a t-test can be used in the realm of DE analysis, but you should never apply it to the raw counts, no matter how many samples you have, because raw counts are never absolute measures of expression for a specific gene within a given sample. The actual number of reads per gene depends on the efficiency of the library prep (including RNA extraction and cDNA synthesis) and on the amount of contamination from non-coding transcripts (e.g. rRNA, tRNA). The sequencing depth, i.e. the number of reads per sample, also strongly influences the final value. All of these issues need to be taken into account before any statistical test, and this is where the packages have contributed a lot, in addition to establishing ways of estimating variances from as few as 2-3 replicates per condition.
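A small sketch of the sequencing-depth problem, with made-up numbers: one gene that makes up roughly the same fraction of the library in both groups, but group B was sequenced about twice as deep. The helper names `student_t` and `cpm` are only for illustration, and plain depth scaling stands in here for the more robust normalization the DE packages actually use.

```python
import math
from statistics import mean, variance

def student_t(a, b):
    """Two-sample Student t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

def cpm(count, lib_size):
    """Scale a raw count to counts per million reads."""
    return count / lib_size * 1e6

# Hypothetical data: raw counts for one gene and total reads per sample.
lib_a = [10_000_000, 11_000_000, 9_000_000]
lib_b = [20_000_000, 22_000_000, 18_000_000]  # sequenced about twice as deep
raw_a = [510, 540, 460]
raw_b = [990, 1120, 890]

cpm_a = [cpm(c, n) for c, n in zip(raw_a, lib_a)]
cpm_b = [cpm(c, n) for c, n in zip(raw_b, lib_b)]

t_raw = student_t(raw_a, raw_b)
t_cpm = student_t(cpm_a, cpm_b)
print(f"t on raw counts:   {t_raw:.2f}")
print(f"t on depth-scaled: {t_cpm:.2f}")
```

On the raw counts the t statistic is large simply because one group has more reads; after scaling by depth the apparent difference disappears.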

ADD COMMENTlink modified 13 months ago • written 13 months ago by Friederike5.6k

If I first normalize the raw counts the way most DE tools do, taking into account library size, heteroscedasticity, etc., can I use a t-test on these normalized counts if the sample size is sufficient?

ADD REPLYlink written 13 months ago by CY470

Yes, but why bother? You'll have lower power and more questions from reviewers if you go that route.

ADD REPLYlink written 13 months ago by Devon Ryan95k

I just want to get a rough picture of how DE is tested in general.

ADD REPLYlink written 13 months ago by CY470

Check out the slides I linked in Devon's reply, they should give you a good overview.

And yes, very roughly, your approach would be similar, but the crucial difference is, as others have pointed out, that the stats packages are fairly sophisticated in estimating the variance. Even limma doesn't use a simple t-test; it calculates a "moderated t-statistic", which tries to account for the typical lack of replicates.

ADD REPLYlink written 13 months ago by Friederike5.6k

I have read it. It is very straightforward and helpful.

ADD REPLYlink written 13 months ago by CY470
3
13 months ago by
Devon Ryan95k
Freiburg, Germany
Devon Ryan95k wrote:

You can use a T-test, that's what limma is doing (though after passing counts through voom and then using some empirical bayes methods). The reason no one uses a simple T-test for RNA-seq is the same reason no one did it for array data, namely that you rarely have sufficient sample numbers to accurately estimate variance without pooling information across genes (see the original limma paper for a nice discussion of this).
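As a rough illustration of what "pooling information across genes" buys you, here is a sketch of the variance shrinkage behind a moderated t-statistic: the per-gene variance is pulled toward a prior variance, weighted by degrees of freedom. In limma the prior variance and prior degrees of freedom are estimated from all genes by empirical Bayes; here `prior_var` and `prior_df` are simply assumed values, and the expression values are made up.

```python
import math
from statistics import mean, variance

def ordinary_and_moderated_t(a, b, prior_var, prior_df):
    """Return (ordinary t, moderated t) for two groups of log-expression values."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    s2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    # Shrink the gene-wise variance toward the prior variance.
    s2_post = (prior_df * prior_var + df * s2) / (prior_df + df)
    scale = math.sqrt(1 / na + 1 / nb)
    diff = mean(a) - mean(b)
    return diff / (math.sqrt(s2) * scale), diff / (math.sqrt(s2_post) * scale)

# Hypothetical gene whose three replicates happen to agree almost perfectly.
group_a = [5.00, 5.01, 5.02]
group_b = [5.10, 5.11, 5.12]

# Assumed prior; in practice this would be estimated from all genes.
t_ord, t_mod = ordinary_and_moderated_t(group_a, group_b, prior_var=0.04, prior_df=4)
print(f"ordinary t:  {t_ord:.2f}")
print(f"moderated t: {t_mod:.2f}")
```

With only three replicates per group, a gene can show a tiny variance by chance and produce a huge ordinary t statistic for a small difference; borrowing strength from the other genes tempers that.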

ADD COMMENTlink written 13 months ago by Devon Ryan95k
1

Just came across this, which is a succinct, approachable summary of the limma paper

ADD REPLYlink written 13 months ago by Friederike5.6k
2
13 months ago by
swbarnes27.7k
United States
swbarnes27.7k wrote:

As a trivial example, a t-test comparing (1,2,3) to (4,5,6) returns the same answer as one comparing (10,20,30) to (40,50,60). But if those are raw counts, you can't say that the likelihood of the two groups being different is the same between the two sets, because the lower-count one is so much more prone to being way off due to sampling error.
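The example above can be checked with a few lines of Python. The sketch computes a pooled-variance t statistic for both pairs of groups, then compares the relative counting noise at the two count levels under the assumption that sampling noise is Poisson (standard deviation equal to the square root of the mean); that Poisson assumption is mine, for illustration.

```python
import math
from statistics import mean, variance

def student_t(a, b):
    """Two-sample Student t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

def poisson_relative_noise(expected_count):
    """Coefficient of variation of a Poisson count: sd / mean = 1 / sqrt(mean)."""
    return 1 / math.sqrt(expected_count)

t_small = student_t([1, 2, 3], [4, 5, 6])
t_large = student_t([10, 20, 30], [40, 50, 60])
print(f"t for low counts:  {t_small:.3f}")
print(f"t for high counts: {t_large:.3f}")

# Relative sampling noise at the mean of the first group in each comparison.
noise_small = poisson_relative_noise(mean([1, 2, 3]))
noise_large = poisson_relative_noise(mean([10, 20, 30]))
print(f"relative noise at mean 2:  {noise_small:.2f}")
print(f"relative noise at mean 20: {noise_large:.2f}")
```

The t statistic is unchanged by multiplying every value by 10, while the relative counting noise is roughly three times larger at the lower count level, which is the information a plain t-test on raw counts ignores.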

ADD COMMENTlink written 13 months ago by swbarnes27.7k