I analyze RNA-Seq data from three sample groups (non-pathology (5 samples), mild pathology (7 samples), extreme pathology (5 samples)). My goal is to find differentially expressed genes between non and mild, non and extreme, and mild and extreme. Read counts were prepared using HTSeq. Then I do exactly the same steps as described in the DESeq2 manual.
> directory <- "my_path"
> sampleFiles <- grep("pathol",list.files(directory),value=TRUE)
> sampleCondition <- sub("(.*pathol).*","\\1",sampleFiles)
> sampleTable <- data.frame(sampleName = sampleFiles,fileName = sampleFiles,condition = sampleCondition)
> ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,directory = directory,design= ~ condition)
> ddsHTSeq$condition <- factor(ddsHTSeq$condition,levels=c("nonpathol","extpathol"))
> ddsHTSeq$condition <- relevel(ddsHTSeq$condition, "nonpathol")
> ddsHTSeq<- DESeq(ddsHTSeq)
estimating size factors
gene-wise dispersion estimates
final dispersion estimates
fitting generalized linear model
> res <- results(ddsHTSeq)
I perform this procedure for all three comparisons and in each separate comparison get seemingly biologically meaningful results. However, the problem is following:
Non vs Mild gives me 15 d.e. genes (padj <0.05)
Non vs Extreme - 1,249 d.e. genes
Mild vs. Extreme - only 16 d.e. genes!
Basic logic tells me that IF Non group differs from Extreme group by many hundreds genes AND the same Non group almost doesn't differ from Mild group THEN Mild group also has to differ from Extreme by many genes.
Is such logic wrong? If not then I don't get where could I make any mistake in the analysis.