How Many Replicates And Which Statistical Test Can Be Done For Preliminary Analysis With Normalized Rna-Seq Count Data?
3
6
Entering edit mode
8.4 years ago
Ngsnewbie ▴ 370

Before running any RNA-seq data analysis packages, I want to do statistical test for my gene expression data. I have 4 replicates for two treatment (r and s). As RNA-seq data have -ve bionomial distribution. Does it make any sense if i do gene wise paired t-test, so that i can get p-values for each gene.

        r1    r2    r2    r4    s1    s2    s3    s4
gene1    569    257    301    446    259    289    835    414
gene2    24    1    28    26    15    1    51    36
.
.


The following lines (this is not a code) is what i would like to execute :

gene1 :paired t test[(569 257 301 446) Vs (259    289 835    414)]
gene2 :paired t test [(24 1 28 26) Vs (15 1 51 36)]


This is just for getting an idea about the data values per conditions not for actual analysis. Can i do paired t-test in this way? also does the above calculation make any sense on getting p-value per gene ? Seeking suggetions in this context.

rnaseq gene-expression • 11k views
5
Entering edit mode
8.4 years ago

You might take a look at this wikipedia entry: http://en.wikipedia.org/wiki/Student's_t-test#Assumptions . See if you can assess your data to see if the assumptions are met. In particular, are the data approximately normally distributed? Are the variance and mean independent? That said, you could look at the limma voom() function followed by traditional limma analysis (t-testing). You can compare that answer to what you get with edgeR or DESeq.

4
Entering edit mode
8.4 years ago

I suggest to use edgeR or DESeq. There a based on negative binomial and a very well documented (both are R package) http://www.bioconductor.org/packages/2.11/bioc/html/edgeR.html http://www.bioconductor.org/packages/2.11/bioc/html/DESeq.html

0
Entering edit mode

Thanks Nicobxl, I can use edgeR and DESeq. Actually, While browsing through microarray data analysis, To do t-test came in my mind that will help me to understand the usage of t-test and other statistical test in gene expression analysis. The above excercise is just for learning concepts. Thus i want to know does it make any sense with context to RNA-seq data ( Count data : Bam converted to counts by HTSeq package).

4
Entering edit mode
8.4 years ago

I've recently noticed the following paper Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression published in Bioinformatics (2013).

I have not yet read it and evaluating it is on my todo list. It may be useful or not.

9
Entering edit mode

Hi,

Scotty is remarkably useful! Sorry if you know some of this, but as your name is NGSNewbie, I'm going to assume you're a newbie.

Is r1 related to s1 or are they just four random replicates? If it's four cultures and four cultures all grown on a plate then you would use an unpaired t-test. If r1 and s1 were drawn from the same sample, and r2 and s2 were drawn from the same sample, etc. but given a different treatment then that would be a paired t-test.

That data doesn't look normalized yet. Are you planning to normalize it for sequencing depth?

I didn't look at paired t-test, I looked at unpaired. I think the packages people are suggesting are more similar to an unpaired test, but I didn't look to see if they have a paired mode. I think if it is the second scenario then a paired t-test would probably be great.

This is what I know about an unpaired t-test vs the other packages:

I used a t-test when I wrote Scotty for power analysis after considerable angst because it is amenable to power analysis without Monte Carlo modeling.

To verify this was a reasonable approach I did some simulations where I compared the t-test to DESeq. I like DESeq and its big brother EdgeR but if I were to call differential expression I would use a t-test.

The reason I prefer the t-test is that it calculates the variance for each gene individually. EdgeR and DESeq in their default mode share information to calculate more accurate variances. This buys more power when there are a very low number of replicates (really 2, maybe 3). But the cost of this is that it kind of smoothes all variances towards the center so you under call low variance genes and over call high variance genes. That introduces some bias into your call set. That said, I've never been able to get a false gene enrichment or something like that out of DESeq.

However, when I simulated four replicates I got about the same power between DESeq and a t-test. Both had accurate p-values overall (i.e. the p-value predicted the false positive rate) but the t-test did it no matter how you grouped the genes (i.e. you could pull out low expression or high expression, or low variance or high variance and the false positive was still the p-value). DESeq did not but it's performance is going to vary more based on how you do the simulation, so how close the simulation is to what DESeq expects. I wasn't trying to break DESeq but it might perform better with different parameters - it wasn't the main focus of the paper and someone could do more work there.

You can also calculate variance independently on a gene by gene basis with DESeq and the EdgeR. I haven't used those modes so I don't know how accurate the p-values are. I expect they're quite accurate because they seem like smart guys.

I expect the data is really Poisson lognormal, not negative binomial as it should be fundamentally a multiplicative process (lognormal) measured with a sampled count (Poisson approximation to binomial sampling). Funnily enough, I couldn't find anyone that was happy to do 100 replicates of exactly the same thing so I could test that though.

But the assumption of a continuous normal distribution with RNASeq counting data doesn't cause too much trouble in the boundaries between mean and variance we observe in real data. That is, I didn't find many really skewed distributions when I looked at an experiment with 40 biological replicates. There are probably some bimodal genes though. I wasn't able to find too much evidence that proves the data is lognormal rather than normal.

The p-values from the t-test were accurate even at low counts (down to a mean of 2) whether I modeled the data as lognormal or normal. And anyway, the data's not really discrete because the first thing you do is normalize the samples. It's more continuous with funny binning.

Most of this is in the supplement which appeared briefly on the Bioinformatics web site. If it's not there (it's not there now) and you want to look at it email me and I'll send it.

Anyway, you could try both (or use DEB and try three http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163933/ ). If they're producing call lists that are way different it probably means you need more replicates. It looks like you have some pretty high variability among your replicates there, if that's real data.

We also have a blog post for Newbies:

Thinking about Designing RNA Seq Experiments to Measure Differential Gene Expression: The Basics

Good luck! Michele

0
Entering edit mode

Thanks Michele Busby for such a beautiful explanation. Also, the blog post is written in a lucid way that i could able to comprehend it easily. Awesome !!!

0
Entering edit mode

Thanks! That was the intention of the blog.

I remembered that I wrote this other blog about the last point, that sometimes two things that work look can give different answers with underpowered experiments. It kind of goes with the other one.

0
Entering edit mode

Here is an updated link to the blog post listed above:

Also, there is now some nice work in yeast using some highly replicated samples that one should also read:

This:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4878611/ and some other related ones from these labs.