Question

DESeq2 for pairwise comparison of multiple groups

3.9 years ago

thjnant ▴ 160

Hello,

I have 4 different groups (species), which I call A, B, C and D, and I want to look at differential gene expression between them.

I have 5 to 8 replicates for each group and I am using DESeq2 for the analysis.

I am seeing a result that I cannot interpret.

I first made a separate data frame for each pairwise comparison, that is, A vs B, A vs C, etc. I then created a dds object for each pairwise comparison and called the DESeq function. Using the results function, I obtained the number of significantly differentially expressed genes.

I then learnt about the contrast option. So, this time I made a single dds object using all groups A, B, C and D and ran the DESeq function. I then used the results function with contrast to get the output of each pairwise comparison, A vs B, A vs C, etc.

I get a different number of differentially expressed genes from the two approaches. Why is that the case?

Thank you!

PS: I have posted this question to the bioconductor forum: https://support.bioconductor.org/p/131229/#131235

RNA-Seq Deseq2 R • 7.0k views

Cross-posted on Bioconductor: https://support.bioconductor.org/p/131229/

thjnant, when you cross-post in future, could you mention it in your question?

3.9 years ago by Kevin Blighe 87k

So sorry for cross-posting. I mentioned it in my post on the Bioconductor forum. I will now add it to my question here too.

3.9 years ago by thjnant ▴ 160

Sure thing. Oh, it's no problem - just helps so that users do not duplicate efforts.

3.9 years ago by Kevin Blighe 87k

score 3 · Answer 1 · 2020-05-21

3.9 years ago

Asaf 10k

It's probably because, with all groups analysed together, you have better estimates of the variation of each gene. Take a look at the gene lists: they shouldn't be too different. If they are, then something is wrong.
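To illustrate the point about variance estimates, here is a simplified analogy in Python. It is not DESeq2's negative binomial model or its dispersion shrinkage; it only assumes made-up log-scale values for a single gene (the `gene` dictionary and both helper functions are hypothetical) and shows that pooling all four groups gives more residual degrees of freedom, and a different variance estimate, than using A and B alone.

```python
from statistics import mean, variance


def pooled_variance(groups):
    """Return the pooled within-group variance and its residual degrees of freedom."""
    df = sum(len(g) - 1 for g in groups)
    ss = sum((len(g) - 1) * variance(g) for g in groups)
    return ss / df, df


def t_stat(a, b, var):
    """Two-group t statistic for mean(b) - mean(a), using a supplied variance estimate."""
    se = (var * (1 / len(a) + 1 / len(b))) ** 0.5
    return (mean(b) - mean(a)) / se


# Hypothetical log-scale expression values for ONE gene (illustrative only).
gene = {
    "A": [5.1, 4.8, 5.4, 5.0, 4.7],
    "B": [5.9, 6.3, 5.6, 6.1, 5.8],
    "C": [5.2, 5.0, 5.5, 4.9, 5.3, 5.1],
    "D": [6.0, 5.7, 6.4, 5.9, 6.2, 6.1],
}

# Approach 1: only the A and B samples are used to estimate the variance.
var_pair, df_pair = pooled_variance([gene["A"], gene["B"]])

# Approach 2: all four groups contribute to the variance estimate.
var_all, df_all = pooled_variance(list(gene.values()))

# The A vs B difference in means is identical; only the variance estimate changes.
t_pair = t_stat(gene["A"], gene["B"], var_pair)
t_all = t_stat(gene["A"], gene["B"], var_all)

print(f"pair only : var={var_pair:.4f}, df={df_pair}, t={t_pair:.2f}")
print(f"all groups: var={var_all:.4f}, df={df_all}, t={t_all:.2f}")
```

Pooling does not always reduce the estimate. If groups C and D were much noisier than A and B for this gene, the pooled variance would go up and the A vs B comparison would look less significant, so the number of significant genes can move in either direction.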

Thank you for your reply. I compared the results of the two approaches. Of the top 50 most significant genes in each, 27 are common to both. I detect more significant genes when I use a dataset containing only the pair of interest:

out of 13297 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 83, 0.62%
LFC < 0 (down)     : 132, 0.99%
outliers [1]       : 119, 0.89%
low counts [2]     : 1021, 7.7%
(mean count < 5)

But when I use the whole set and use contrast to get the comparison of interest, I have:

out of 13737 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 19, 0.14%
LFC < 0 (down)     : 30, 0.22%
outliers [1]       : 86, 0.63%
low counts [2]     : 1057, 7.7%

I think the second approach might be better because, as you mentioned, it gives a better estimate of the variation of each gene.

3.9 years ago by thjnant ▴ 160

When you process a group of samples together, DESeq2 estimates several parameters, including per-gene dispersions and per-sample size factors, and these estimates depend on all samples in your dataset. The size factors are used to normalise the raw counts, and the dispersions are used when testing for differential expression.

So, if you subset your dataset and normalise the subsets independently, these key parameters will take different values. This is all that is happening. The key genes that are genuinely differentially expressed should still appear, unless you have some extreme outliers or some major batch effects.
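As a rough illustration of why the parameters change, the Python sketch below re-implements the median-of-ratios idea behind DESeq2's size factors on a tiny, made-up count table. The `size_factors` helper, the sample names and the counts are all assumptions for illustration; this is not DESeq2 code. The reference for each gene is its geometric mean across whichever samples are in the table, so the same sample gets a different size factor when the table is subset.

```python
from math import exp, log
from statistics import mean, median


def size_factors(counts):
    """Median-of-ratios size factors for a table of sample name -> per-gene counts.

    Genes with a zero count in any sample are skipped, since their
    geometric mean across samples would be zero.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Per-gene log geometric mean across the samples present in THIS table.
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(mean(log(v) for v in values))
        else:
            log_geo_means.append(None)

    # Size factor = median over genes of (count / geometric mean).
    factors = {}
    for s in samples:
        log_ratios = [
            log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = exp(median(log_ratios))
    return factors


# Hypothetical raw counts for five genes, one sample per group.
all_counts = {
    "A1": [100, 200, 50, 400, 30],
    "B1": [120, 180, 60, 800, 25],
    "C1": [300, 150, 20, 500, 90],
    "D1": [80, 600, 55, 450, 10],
}

# Subset approach: only the A and B samples are in the table.
pair_counts = {s: all_counts[s] for s in ("A1", "B1")}

sf_pair = size_factors(pair_counts)
sf_all = size_factors(all_counts)

print(f"A1 size factor, A/B subset only: {sf_pair['A1']:.3f}")
print(f"A1 size factor, all samples    : {sf_all['A1']:.3f}")
```

The raw counts for A1 are the same in both runs; only the set of samples defining the reference changes. Dispersion estimates depend on the sample set in a similar way, because they are fitted to the samples included in the model.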

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k