Robustness of DESeq for differential expression analysis of RNA-seq data
3
2
Entering edit mode
10.7 years ago
Jimbo ▴ 120

I have an RNA-seq experiment with 6 control and 6 case samples. I am getting incredibly small p-values from my DESeq differential expression analysis (p-value < 1e-139 for one gene).

nbinomTest(cds[c("xxx1","xxx2"),], "Control", "Case")

    id  baseMean baseMeanA baseMeanB foldChange log2FoldChange          pval          padj    resVarA  resVarB
1 xxx1  95.15408  4.105041  186.2031  45.359624       5.503337 7.385318e-139 1.477064e-138 0.07301066 138.9710
2 xxx2 237.96751 75.001956  400.9331   5.345635       2.418361  1.181461e-84  1.181461e-84 5.34503239 266.2948


counts(cds[c("xxx1","xxx2"),])

     Control1 Control2 Control3 Control4 Control5 Control6 Case1 Case2 Case3 Case4 Case5 Case6
xxx1        2       11        1        4        7        1     7    20     1    23     5   990
xxx2      291       51       27       21      109       28    22    19    15  2298    36    28


It seems surprising that the p-values should be so small, given that the counts, barring one outlier, are very similar between control and case. I suspect the case sample with the very high value is an outlier, perhaps due to alignment problems or other reasons.

Is there a more robust method to calculate differential expression for RNA-seq data?

Many thanks.

rna deseq differential gene • 3.8k views
0
Entering edit mode

Apologies for the bad spacing of the results tables

0
Entering edit mode

Absolutely, but even using padj and running nbinomTest on all genes, the padj goes to 1e-134.

0
Entering edit mode

Ask Simon Anders, the developer of the package ;)

1
Entering edit mode
10.7 years ago
Vitis ★ 2.5k

I think you should check your replicates first. The two genes you showed both have an outlier in the case samples (surprisingly high); I think this is what is giving you problems. Also, did you use any normalization before the DE tests?

0
Entering edit mode

I made a table of counts and then ran

cds <- estimateSizeFactors(cds)
cds <- estimateVarianceFunctions(cds)

Is there a way of flagging up obvious outliers such as this? Or a way of shrinking variance similar to e.g. limma?

0
Entering edit mode

I'd do an MA plot or a simple dot plot to see the correlations among replicates. I don't know of a good way to flag outliers automatically; I used simple criteria like the correlation coefficient (r-squared) or just spot-checked them on the plots.
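
A minimal sketch of that spot-check, assuming the raw counts live in a matrix called `countTable` (the name is hypothetical) with samples in columns and the six control replicates in columns 1-6:

```r
# Spot-check replicate agreement on log-transformed counts
logCounts <- log2(countTable + 1)      # pseudocount so zeros don't break the log
pairs(logCounts[, 1:6], pch = ".")     # pairwise scatter plots of control replicates
round(cor(logCounts[, 1:6]), 2)        # Pearson correlations between replicates
```

An outlier sample like Case6 above would show up as a point far off the diagonal in its scatter plots and as a visibly lower correlation with the other replicates.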

0
Entering edit mode

Also, I'd do some filtering with an expression threshold before the DE tests. For example, if the read counts are 1~2 in all replicates and conditions, I wouldn't include that gene in my test; I'd rather consider it noise or not expressed.
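
A hedged sketch of such a filter, again assuming a hypothetical raw count matrix `countTable` built before creating the CountDataSet:

```r
# Drop genes with essentially no signal before testing.
# The threshold of 10 total reads across all samples is arbitrary; tune it for your data.
keep       <- rowSums(countTable) >= 10
countTable <- countTable[keep, ]
```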

1
Entering edit mode
10.4 years ago
Max C ▴ 10

cds <- estimateDispersions(cds, method = "pooled")

if you haven't already. It tones down cases like this, where a single outlier would otherwise inflate significance.

Works for me

0
Entering edit mode

+1: I've found DESeq to be a pretty robust tool for analysis, and I would always use this step.

You could also try the GLM function (the 4th section in the manual, linked below), but I would mostly use that for multivariate analysis (although it works with one variable).

http://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf
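
For reference, the GLM workflow from that section of the (old) DESeq vignette looks roughly like this; treat it as a sketch and check the vignette for the exact calls (`cds` is the CountDataSet from the question, with a `condition` factor):

```r
# Likelihood-ratio test via the old DESeq GLM interface
fit1 <- fitNbinomGLMs(cds, count ~ condition)   # full model
fit0 <- fitNbinomGLMs(cds, count ~ 1)           # reduced (null) model
pGLM <- nbinomGLMTest(fit1, fit0)               # per-gene p-values
padjGLM <- p.adjust(pGLM, method = "BH")        # multiple-testing correction
```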

1
Entering edit mode
7.8 years ago
Michele Busby ★ 2.2k

If I'm understanding the data correctly, gene xxx1 should not be differentially expressed.

What I think is happening is that DESeq is looking at the mean, which is artificially increased by the outlier. That part is OK, but it is also underestimating the variance, because it uses an information-sharing approach that pulls all genes toward a more "similar" variance. This boosts power when you have 2 or 3 replicates, but can decrease power when you have more.

With six replicates you can normalize your data for sequencing depth and just use a t-test. The measured variance is then the best estimate of the variance, so it gives you more statistical power without introducing systematic bias.
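
A rough sketch of that approach, assuming a hypothetical raw count matrix `countTable` (genes in rows) with controls in columns 1-6 and cases in columns 7-12; in practice the size factors should come from the full dataset, not just a couple of genes:

```r
# Depth-normalize, then run Welch's t-test per gene
normed <- t(t(countTable) / colSums(countTable)) * 1e6           # counts per million
tp     <- apply(normed, 1, function(x) t.test(x[1:6], x[7:12])$p.value)
tpadj  <- p.adjust(tp, method = "BH")                            # multiple-testing correction
```

On the raw xxx1 counts from the question (`t.test(c(2,11,1,4,7,1), c(7,20,1,23,5,990))`), the p-value is well above 0.05, in line with the ~0.32 figure below, not 1e-139.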

A t-test with 6 reps performed better than DESeq when we looked at it in simulation (in the supplement, as you would expect):

http://bioinformatics.oxfordjournals.org/content/29/5/656

If these numbers are already normalized, the t-test p-value of 0.32 is the more correct answer here.