RNA-Seq analysis with DESeq2. Confusing results.
2
0
Entering edit mode
7.5 years ago
Armin • 0

Hi,

I analyze RNA-Seq data from three sample groups (non-pathology (5 samples), mild pathology (7 samples), extreme pathology (5 samples)). My goal is to find differentially expressed genes between non and mild, non and extreme, and mild and extreme. Read counts were prepared using HTSeq. Then I do exactly the same steps as described in the DESeq2 manual.

> library("DESeq2")
> directory <- "my_path"
> sampleFiles <- grep("pathol",list.files(directory),value=TRUE)
> sampleCondition <- sub("(.*pathol).*","\\1",sampleFiles)
>  sampleTable <- data.frame(sampleName = sampleFiles,fileName = sampleFiles,condition = sampleCondition)
> ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,directory = directory,design= ~ condition)
> ddsHTSeq$condition <- factor(ddsHTSeq$condition,levels=c("nonpathol","extpathol"))
> ddsHTSeq$condition <- relevel(ddsHTSeq$condition, "nonpathol")
> ddsHTSeq<- DESeq(ddsHTSeq)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting generalized linear model
> res <- results(ddsHTSeq)


I perform this procedure for all three comparisons and in each separate comparison get seemingly biologically meaningful results. However, the problem is following:

Non vs Mild gives me 15 d.e. genes (padj <0.05)

Non vs Extreme - 1,249 d.e. genes

but

Mild vs. Extreme - only 16 d.e. genes!

Basic logic tells me that IF Non group differs from Extreme group by many hundreds genes AND the same Non group almost doesn't differ from Mild group THEN Mild group also has to differ from Extreme by many genes.

Is such logic wrong? If not then I don't get where could I make any mistake in the analysis.

RNA-Seq deseq2 differential-expression • 4.8k views
3
Entering edit mode
7.5 years ago

Your "logic" is completely incorrect, in fact it reminds me of one of Zeno's paradoxes.

An R example (values generated with rnorm):

a <- c(0.9964361, 1.2685663, 0.5132198, -1.6368091, 3.0478775)
b <- c(3.7843679, 3.4882500, 0.1275235, 3.2042165, 1.9144505)
c <- c(4.205894, 3.270352, 4.922832, 4.107070, 3.656648)

If one performs the pair-wise T-tests, one finds that a vs. b and b vs. c are insignificant, while a vs. c is significant. By your reasoning, that should be an impossibility, but it's certainly not.

0
Entering edit mode

I already asked this question on SeqAnswers but can I ask you here too. In which function can I use the argument minReplicatesForReplace? Both DESeq() and results() produce the following error message:

Error in results(ddsHTSeq, minReplicatesForReplace = 5) :
unused argument (minReplicatesForReplace = 5)

0
Entering edit mode

It should work with the DESeq() function. What version are you using?

0
Entering edit mode

DESeq2_1.0.19

command:

ddsHTSeq<- DESeq(ddsHTSeq, minReplicatesForReplace = 5)


Here is my sessionInfo

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C                 LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[3] Rcpp_0.11.2               lattice_0.20-29
[5] Biobase_2.20.1            GenomicRanges_1.12.5
[7] IRanges_1.18.4            BiocGenerics_0.6.0

loaded via a namespace (and not attached):
[1] annotate_1.38.0      AnnotationDbi_1.22.6 DBI_0.2-7
[4] genefilter_1.42.0    grid_3.0.1           locfit_1.5-9.1
[7] RColorBrewer_1.0-5   RSQLite_0.11.4       splines_3.0.1
[10] stats4_3.0.1         survival_2.37-7      tools_3.0.1
[13] XML_3.98-1.1         xtable_1.7-3

0
Entering edit mode

You're using the version from Bioconductor 2.12, which is fairly outdated and that option didn't exist yet back then. You should just update the libraries on your local installation.

0
Entering edit mode

Thank you, Devon!

0
Entering edit mode

1
Entering edit mode
7.5 years ago

Genes that are not significantly (whatever threshold you choose) differentially expressed aren't necessarily equally expressed. It just means your variance is too high to determine whether these genes are differentially expressed.

Let's say your mild sample replicates vary highly compare to your non and extreme sample replicates. When you perform the differential expression test between mild and non/extreme, you would not be confident in whether the difference is due to variance or real effect.

However, you are using a generalized linear model for your DE. I am not very familiar with how that works in relation to differential expression. But I think that's supposed to take into account different variances.

0
Entering edit mode

Stated somewhat differently, non-significance is not equivalence.

1
Entering edit mode

Yeah that first sentence was pretty confusing. I changed it.

0
Entering edit mode

Thank you for explanation, Devon and Damian! Also, I found another possible reason that adds to such results. In comparisons with Mild many genes were set to NA due to outliers.