Question

Robustness Of Deseq For Differential Expression Analysis For Rna Seq Data

2

Entering edit mode

13.2 years ago

Jimbo ▴ 120

I have an RNA-seq experiment where I have 6 control and 6 case samples. I am getting incredibly small p-value for my DEseq, diferential expression analysis (p-value < 1e-139 for one detector).

nbinomTest(cds[c("xxx1","xxx2")],"Control", "Case")

id  baseMean baseMeanA baseMeanB foldChange log2FoldChange
1 xxx1  95.15408  4.105041  186.2031  45.359624       5.503337
2 xxx2 237.96751 75.001956  400.9331   5.345635       2.418361

       pval          padj    resVarA  resVarB
1 7.385318e-139 1.477064e-138 0.07301066 138.9710
2  1.181461e-84  1.181461e-84 5.34503239 266.2948

counts(cds[c("xxx1","xxx2"),])

     Control1 Control2 Control3 Contro4 Control5 Control6 Case1
xxx1        2       11        1       4        7        1    7
xxx2      291       51       27      21      109       28   22

  Case2 Case3 Case4 Case5 Case6
xxx1   20    1   23    5  990
xxx2   19   15 2298   36   28

This seems surprising, that the p-values should be so small, given the values, barring one outlier, are very similar between control. I suspect the case with a very high value is more likely an outlier, perhaps due to alignment problems or other reasons.

Is there a more robust method to calculate de for RNA seq data?

Many thanks.

rna deseq differential gene • 4.7k views

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 13.2 years ago by Jimbo ▴ 120

0

Entering edit mode

Apologies for the bad spacing of the results tables

ADD REPLY • link 13.2 years ago by Jimbo ▴ 120

0

Entering edit mode

Just one thing: I'd rather use "padj" instead of "pval"

ADD REPLY • link 13.2 years ago by Pasta ★ 1.3k

0

Entering edit mode

Absolutely, but even with using padj and running rbinomTest on all genes, the padj goes to e-134

ADD REPLY • link 13.2 years ago by Jimbo ▴ 120

0

Entering edit mode

ask Simon Anders, the developper of the package ;)

ADD REPLY • link 12.9 years ago by Nicolas Rosewick 11k

score 1 · Answer 1 · 2011-09-16

1

Entering edit mode

13.2 years ago

Vitis ★ 2.6k

I think you may check replicates first. The two examples you showed here both have an outlier in the case samples (surprisingly high), I think this is giving you problems. Also, did you use any normalization before the DE tests?

ADD COMMENT • link 13.2 years ago by Vitis ★ 2.6k

0

Entering edit mode

I made a table of counts and then ran

estimateSizeFactors() estimateVarianceFunctions()

Is there a way of flagging up obvious outliers such as this? Or a way of shrinking variance simlar to e.g. limma

ADD REPLY • link 13.2 years ago by Jimbo ▴ 120

0

Entering edit mode

I'd do a MA plot or simple dot plot to see the correlations among replicates. I don't know any good way to flag it automatically, I used some simple criterion like correlation coefficient (r-squre) or just spot check them on the plots.

ADD REPLY • link 13.2 years ago by Vitis ★ 2.6k

0

Entering edit mode

Also, I'd do some filtering with a expression threshold before the DE tests. For example, if the read counts are 1~2 in all replicates and conditions, I wouldn't include that gene in my test, I'd rather consider them as noise or not expressed.

ADD REPLY • link 13.2 years ago by Vitis ★ 2.6k

score 1 · Answer 2 · 2011-12-22

1

Entering edit mode

12.9 years ago

Max C ▴ 10

Try this instead:

cds<-estimateDispersions(cds, method="pooled")

if you haven't already. Downgrades such cases where one outlier boosts p-values

Works for me

ADD COMMENT • link 12.9 years ago by Max C ▴ 10

0

Entering edit mode

+1: I've found DESeq to be a pretty robust tool for analysis, and I would always use this step

You could also try the GLM function (the 4th section in the manual, linked below), but I would mostly use that for multivariate analysis (although it works with one variable).

http://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf

ADD REPLY • link 10.3 years ago by Charles Warden 8.3k

Ram · Answer 3 · 2014-07-28

If I'm understanding the data correctly, gene xxx1 should not be differentially expressed.

What is happening, I think, is that DESeq is looking at the mean which is artificially increased by the outlier. This is OK but it is also underestimating the variance because it is using an information sharing approach to calculate the variance which makes all the genes have a more "similar" variance. This boosts power when you have 2 or 3 reps, but decreases power when you have more replicates.

You can normalize your data for the number of reads and just use a t-test if you have six reps. The measured variance is the best estimate of variance so it will give you more statistical power and it also does not introduce systematic bias.

A t-test performed better in simulation with 6 reps than DESeq when we looked at it:

http://bioinformatics.oxfordjournals.org/content/29/5/656

in the supplement (as you would expect).

If these numbers are normalized, the p value from a t-test of 0.32 is more correct.