Question: DESeq2: When to split multiple group samples for 'more' accurate sizefactors, dispersions etc.
gravatar for kelen
8 weeks ago by
London, UK
kelen30 wrote:

Hi! I have been battling with a multifaceted problem for months now and I can't figure out how to solve/clarify it, I would be extremely grateful for any comments/suggestions/opinions/criticism. In short, I can't figure out if I should use all my data or half my data in DESeq2 for calculating the size factors and estimating dispersion etc. i.e. when are the samples 'too different' and including both of them would unfavourably skew these estimations when fitting the model and result in a bunch of TPs and FPs?

The closest I have found to something similar in the vignette is And some discussions here on biostars too, but none quite explaining to the level that would put me at ease.

Experimental setup: 4 sorted cell populations (termed as different cell states) in two replicates for 2 clones of the same PBMC cell type obtained from different individuals = 16 samples sequenced together


  1. Identify differentially expressed genes between pairwise comparisons of different cell states in clone 1.
  2. Identify differentially expressed genes between pairwise comparisons of different cell states in clone 2.
  3. Identify differentially expressed genes between pairwise comparisons of different cell states in both clones. (Either through simply taking the overlap of the previous or attack it through specifying the design matrix to account for “clone” difference - I would hope both would lead to similar conclusions...)

Problems: As these cells come from different individuals there is biological variation between them, which is of interest. However, mutually ‘changing’ genes are of interest as well. There are multiple genes that are only expressed in clone1 or in clone2, which DESeq2 won't include in normalization, but could be important for a clone-specific sample normalization and other estimates (I would think?). To try my best and visually aid what I am taking about:

#This just includes every bit of information I have
#                      condition clones replicate clone_condition
# clone1_stage_1_rep_1    stage1 clone1         1   clone1_stage1
# clone1_stage_2_rep_1    stage2 clone1         1   clone1_stage2
# clone1_stage_3_rep_1    stage3 clone1         1   clone1_stage3
# clone1_stage_4_rep_1    stage4 clone1         1   clone1_stage4
# clone1_stage_1_rep_2    stage1 clone1         2   clone1_stage1
# clone1_stage_2_rep_2    stage2 clone1         2   clone1_stage2
# clone1_stage_3_rep_2    stage3 clone1         2   clone1_stage3
# clone1_stage_4_rep_2    stage4 clone1         2   clone1_stage4
# clone2_stage_1_rep_1    stage1 clone2         1   clone2_stage1
# clone2_stage_2_rep_1    stage2 clone2         1   clone2_stage2
# clone2_stage_3_rep_1    stage3 clone2         1   clone2_stage3
# clone2_stage_4_rep_1    stage4 clone2         1   clone2_stage4
# clone2_stage_1_rep_2    stage1 clone2         2   clone2_stage1
# clone2_stage_2_rep_2    stage2 clone2         2   clone2_stage2
# clone2_stage_3_rep_2    stage3 clone2         2   clone2_stage3
# clone2_stage_4_rep_2    stage4 clone2         2   clone2_stage4

data_set <- DESeqDataSetFromMatrix(countData = DF,
                                 colData = sample_info,
                                 design = ~ clone_condition)

Side-note, would having an intercept 0 be more appropriate for the pairwise comparisons? design(dds) <- formula(~ 0 + clone_condition)

PCA plot: pca-both

It is quite clear the PC1 explains the differences between the clones themselves and PC2 explains the difference between states, but the first 2 goals are looking at intra-clonal effects anyway, so I won’t be directly comparing these clones at this stage. That said, is it then appropriate to even allow DESeq2 to gather information across both clones if I am looking at intraclonal pairwise comparisons? Or would it be more appropriate to model the clones separately for goals 1-2 and together for goal 3?

It gets even more confusing for me when I try either approach and e.g. for state2 vs state1 contrast clone1 gains low p-value gene expression differences when compared to running it alone through the pipeline, while clone2 loses low p-values when run together as opposed to running it alone from the start (shown below).

Sure, some of this is the effect of independent filtering and multiple-testing correction, I won’t know if these are TP or FP, but manually checking some it seemed e.g. that genes with 0-counts in clone2, but showing expression (as well as 'expected dynamics' in expression differences) in clone1, were non-significant when ran together, but significant in clone1 when ran separately (not shown here). Indeed, I am not talking of huge LFC’s (and not trying to massage relevant p-values), but those genes were biologically interesting and potentially something to look into further down the line through separate means, which might otherwise be missed with a general p-value filter and I don't think it is feasible to go through all of these manually….but does make me question what else could I be missing by having a non-ideal approach with analysis setup.

> clone1_stage2_vs_stage1 <- results(dds, contrast=c("clone_condition", "clone1_stage2", "clone1_stage1"))
> clone2_stage2_vs_stage1 <- results(dds, contrast=c("clone_condition", "clone2_stage2", "clone2_stage1"))
> clone1_stage2_vs_stage1_sep <- results(clone1_dds, contrast=c("clone_condition", "clone1_stage2", "clone1_stage1"))
> clone2_stage2_vs_stage1_sep <- results(clone2_dds, contrast=c("clone_condition", "clone2_stage2", "clone2_stage1"))

### These first two are for when both clones are used in data normalization by DESeq2

> summary(clone1_stage2_vs_stage1)

out of 24656 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 4318, 18%
LFC < 0 (down)     : 3786, 15%
outliers [1]       : 0, 0%
low counts [2]     : 4781, 19%
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> summary(clone2_stage2_vs_stage1)

out of 24656 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 5243, 21%
LFC < 0 (down)     : 4562, 19%
outliers [1]       : 0, 0%
low counts [2]     : 4781, 19%
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

## These second two are when clones are split from the start and DESeq2 does not 'share' information across clones

> summary(clone1_stage2_vs_stage1_sep)

out of 23616 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 3735, 16%
LFC < 0 (down)     : 3086, 13%
outliers [1]       : 0, 0%
low counts [2]     : 4568, 19%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> summary(clone2_stage2_vs_stage1_sep)

out of 24228 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 5811, 24%
LFC < 0 (down)     : 4793, 20%
outliers [1]       : 0, 0%
low counts [2]     : 3755, 15%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

I don't know enough to think of any useful plots or metrics to help me decide, I did plot the dispersion plots, but I am not sure if I am overestimating the dispersion when running them together resulting in less TPs?




I keep arguing with myself over this, which is leading nowhere, please help.

ADD COMMENTlink modified 7 weeks ago • written 8 weeks ago by kelen30

I imagine most of your problems would be resolved if you modified the regression formula to include the clone population ~ condition + clone. With this formula, you can use contrasts to ask questions such as, what are the differences in conditions while controlling for differences in clone population. Also, what are the differences in clone populations while controlling for experimental condition. If you think that clonal population can drastically change the effect in the conditions, you can add an interaction term in your formula, such as ~ condition:clone, which asks whether the effect of gene expression changes based on clonal population.

ADD REPLYlink written 8 weeks ago by rpolicastro1.7k

Thanks for the suggestion! I agree, there are potentially interesting results I can reach through messing around with the design matrix (not too fluent in GLMs, so I have been trying to keep it intuitive with simpler designs), especially the first one ~ condition + clone which could help with Goal nr 3. However, if I am getting the dispersion estimates and size factors (more) wrong due to the clones being "too different", then I can't see how a design matrix will save me if I want to look at within-clone differences?

My main concern is whether I am making DESeq2 more inaccurate when allowing it to gather information across these two clones since they have fairly different expression profiles to start with (PC1) and (maybe) it would actually perform much better if I would build both clones separately (it can still share information of gene expression across states and replicates, but only within a single clone) and not introduce some unknown inaccuracies due to the clones not being very comparable - but I don't know how would one even start exploring this...

I do like the second suggestion as well, to look at what makes the clones different unrelated to their which case I would not have any other choice but to model them together from the start. But this would be secondary to the current goals, so I would not focus on that at the moment.

ADD REPLYlink modified 7 weeks ago • written 8 weeks ago by kelen30

Just wanted to bump this as I am having the same problem. In my experiment, samples were taken from patients and cultured, then exposed to an immune stimulating compound. I've been asked to make sure the cell lines match the original patient sample and then to look at the effect of the added compound. As with OP, I am torn on whether I should be normalizing with the patient samples included and go from there, or whether I should be making two datasets, one with patients + cell lines and one with cell lines only.

So far I have gone with the second approach, but as with OP I obviously get slightly different answers between the two analyses in terms of sample correlation and PCA. To further complicate it, the patient samples were sequenced in another lab originally, so I wanted to do a batch correction as well. I've used limma for that but the resulting counts are very un-intuitive as some values become negative.

ADD REPLYlink modified 4 days ago • written 4 days ago by DeadlyFall0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1926 users visited in the last hour