Question: DESeq2: When to split multiple group samples for 'more' accurate sizefactors, dispersions etc.
gravatar for kelen
4 months ago by
London, UK
kelen80 wrote:

UPDATE: I cross-posted this on Bioconductor as well ( ). See Michael Love's answer below.

Hi! I have been battling with a multifaceted problem for months now and I can't figure out how to solve/clarify it, I would be extremely grateful for any comments/suggestions/opinions/criticism. In short, I can't figure out if I should use all my data or half my data in DESeq2 for calculating the size factors and estimating dispersion etc. i.e. when are the samples 'too different' and including both of them would unfavourably skew these estimations when fitting the model and result in a bunch of TPs and FPs?

The closest I have found to something similar in the vignette is And some discussions here on biostars too, but none quite explaining to the level that would put me at ease.

Experimental setup: 4 sorted cell populations (termed as different cell states) in two replicates for 2 clones of the same PBMC cell type obtained from different individuals = 16 samples sequenced together


  1. Identify differentially expressed genes between pairwise comparisons of different cell states in clone 1.
  2. Identify differentially expressed genes between pairwise comparisons of different cell states in clone 2.
  3. Identify differentially expressed genes between pairwise comparisons of different cell states in both clones. (Either through simply taking the overlap of the previous or attack it through specifying the design matrix to account for “clone” difference - I would hope both would lead to similar conclusions...)

Problems: As these cells come from different individuals there is biological variation between them, which is of interest. However, mutually ‘changing’ genes are of interest as well. There are multiple genes that are only expressed in clone1 or in clone2, which DESeq2 won't include in normalization, but could be important for a clone-specific sample normalization and other estimates (I would think?). To try my best and visually aid what I am taking about:

#This just includes every bit of information I have
#                      condition clones replicate clone_condition
# clone1_stage_1_rep_1    stage1 clone1         1   clone1_stage1
# clone1_stage_2_rep_1    stage2 clone1         1   clone1_stage2
# clone1_stage_3_rep_1    stage3 clone1         1   clone1_stage3
# clone1_stage_4_rep_1    stage4 clone1         1   clone1_stage4
# clone1_stage_1_rep_2    stage1 clone1         2   clone1_stage1
# clone1_stage_2_rep_2    stage2 clone1         2   clone1_stage2
# clone1_stage_3_rep_2    stage3 clone1         2   clone1_stage3
# clone1_stage_4_rep_2    stage4 clone1         2   clone1_stage4
# clone2_stage_1_rep_1    stage1 clone2         1   clone2_stage1
# clone2_stage_2_rep_1    stage2 clone2         1   clone2_stage2
# clone2_stage_3_rep_1    stage3 clone2         1   clone2_stage3
# clone2_stage_4_rep_1    stage4 clone2         1   clone2_stage4
# clone2_stage_1_rep_2    stage1 clone2         2   clone2_stage1
# clone2_stage_2_rep_2    stage2 clone2         2   clone2_stage2
# clone2_stage_3_rep_2    stage3 clone2         2   clone2_stage3
# clone2_stage_4_rep_2    stage4 clone2         2   clone2_stage4

data_set <- DESeqDataSetFromMatrix(countData = DF,
                                 colData = sample_info,
                                 design = ~ clone_condition)

Side-note, would having an intercept 0 be more appropriate for the pairwise comparisons? design(dds) <- formula(~ 0 + clone_condition)

PCA plot: pca-both

It is quite clear the PC1 explains the differences between the clones themselves and PC2 explains the difference between states, but the first 2 goals are looking at intra-clonal effects anyway, so I won’t be directly comparing these clones at this stage. That said, is it then appropriate to even allow DESeq2 to gather information across both clones if I am looking at intraclonal pairwise comparisons? Or would it be more appropriate to model the clones separately for goals 1-2 and together for goal 3?

It gets even more confusing for me when I try either approach and e.g. for state2 vs state1 contrast clone1 gains low p-value gene expression differences when compared to running it alone through the pipeline, while clone2 loses low p-values when run together as opposed to running it alone from the start (shown below).

Sure, some of this is the effect of independent filtering and multiple-testing correction, I won’t know if these are TP or FP, but manually checking some it seemed e.g. that genes with 0-counts in clone2, but showing expression (as well as 'expected dynamics' in expression differences) in clone1, were non-significant when ran together, but significant in clone1 when ran separately (not shown here). Indeed, I am not talking of huge LFC’s (and not trying to massage relevant p-values), but those genes were biologically interesting and potentially something to look into further down the line through separate means, which might otherwise be missed with a general p-value filter and I don't think it is feasible to go through all of these manually….but does make me question what else could I be missing by having a non-ideal approach with analysis setup.

> clone1_stage2_vs_stage1 <- results(dds, contrast=c("clone_condition", "clone1_stage2", "clone1_stage1"))
> clone2_stage2_vs_stage1 <- results(dds, contrast=c("clone_condition", "clone2_stage2", "clone2_stage1"))
> clone1_stage2_vs_stage1_sep <- results(clone1_dds, contrast=c("clone_condition", "clone1_stage2", "clone1_stage1"))
> clone2_stage2_vs_stage1_sep <- results(clone2_dds, contrast=c("clone_condition", "clone2_stage2", "clone2_stage1"))

### These first two are for when both clones are used in data normalization by DESeq2

> summary(clone1_stage2_vs_stage1)

out of 24656 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 4318, 18%
LFC < 0 (down)     : 3786, 15%
outliers [1]       : 0, 0%
low counts [2]     : 4781, 19%
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> summary(clone2_stage2_vs_stage1)

out of 24656 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 5243, 21%
LFC < 0 (down)     : 4562, 19%
outliers [1]       : 0, 0%
low counts [2]     : 4781, 19%
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

## These second two are when clones are split from the start and DESeq2 does not 'share' information across clones

> summary(clone1_stage2_vs_stage1_sep)

out of 23616 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 3735, 16%
LFC < 0 (down)     : 3086, 13%
outliers [1]       : 0, 0%
low counts [2]     : 4568, 19%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> summary(clone2_stage2_vs_stage1_sep)

out of 24228 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 5811, 24%
LFC < 0 (down)     : 4793, 20%
outliers [1]       : 0, 0%
low counts [2]     : 3755, 15%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

I don't know enough to think of any useful plots or metrics to help me decide, I did plot the dispersion plots, but I am not sure if I am overestimating the dispersion when running them together resulting in less TPs?




I keep arguing with myself over this, which is leading nowhere, please help.

ADD COMMENTlink modified 23 days ago • written 4 months ago by kelen80

I imagine most of your problems would be resolved if you modified the regression formula to include the clone population ~ condition + clone. With this formula, you can use contrasts to ask questions such as, what are the differences in conditions while controlling for differences in clone population. Also, what are the differences in clone populations while controlling for experimental condition. If you think that clonal population can drastically change the effect in the conditions, you can add an interaction term in your formula, such as ~ condition:clone, which asks whether the effect of gene expression changes based on clonal population.

ADD REPLYlink written 4 months ago by rpolicastro2.3k

Thanks for the suggestion! I agree, there are potentially interesting results I can reach through messing around with the design matrix (not too fluent in GLMs, so I have been trying to keep it intuitive with simpler designs), especially the first one ~ condition + clone which could help with Goal nr 3. However, if I am getting the dispersion estimates and size factors (more) wrong due to the clones being "too different", then I can't see how a design matrix will save me if I want to look at within-clone differences?

My main concern is whether I am making DESeq2 more inaccurate when allowing it to gather information across these two clones since they have fairly different expression profiles to start with (PC1) and (maybe) it would actually perform much better if I would build both clones separately (it can still share information of gene expression across states and replicates, but only within a single clone) and not introduce some unknown inaccuracies due to the clones not being very comparable - but I don't know how would one even start exploring this...

I do like the second suggestion as well, to look at what makes the clones different unrelated to their which case I would not have any other choice but to model them together from the start. But this would be secondary to the current goals, so I would not focus on that at the moment.

ADD REPLYlink modified 4 months ago • written 4 months ago by kelen80

Just wanted to bump this as I am having the same problem. In my experiment, samples were taken from patients and cultured, then exposed to an immune stimulating compound. I've been asked to make sure the cell lines match the original patient sample and then to look at the effect of the added compound. As with OP, I am torn on whether I should be normalizing with the patient samples included and go from there, or whether I should be making two datasets, one with patients + cell lines and one with cell lines only.

So far I have gone with the second approach, but as with OP I obviously get slightly different answers between the two analyses in terms of sample correlation and PCA. To further complicate it, the patient samples were sequenced in another lab originally, so I wanted to do a batch correction as well. I've used limma for that but the resulting counts are very un-intuitive as some values become negative.

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by DeadlyFall0

Unfortunately there has not been much discussion here on the matter. I can tell you that I decided to treat the clones as separate experiments and do comparisons later down the line. Mostly because I have a feeling that the clones are different enough (PCA plot, trying every imaginable approach + design and eyeballing the results - didn't lead to any heureka moments, but frustrated me enough to make a choice) that joining them will just do something that is potentially incorrect (a relatively unfounded hunch) . As we are interested in most aspects of the clones (intra and inter) and I only have two clones, it seems feasible. This would possibly be easier if the study design was set before the experiments, but often you just get the data and figure it out from there - bioinformagic.

That said, I think your approach will depend on what the main questions are, if you are looking at the effects of the compound irrelevant of the patient (accounting for patient effect in the design) or wish to compare how each patient responds separately. I am a little confused by what you define as patient samples and cell line samples (you say the cells are cultured cells from patients, do you mean treatment and control cells?). A study design would be helpful and a little more explanation on how many replicates and what the comparisons will be - I strongly encourage you to post this as a separate question and you might get more responses.

I wish I could help you more, but its a strange situation where I am clearly lacking some basic knowledge of something that I don't even know enough to google about and educate myself. The curse of having a background both in wet/dry-lab - Jack of all trades, master of none.

ADD REPLYlink written 7 weeks ago by kelen80

Have you considered posting this on bioconductor? More statisticians (including Mike Love, author of DESeq2) participate there regularly. If you do post it there then refer to that cross-post link here.

ADD REPLYlink written 6 weeks ago by genomax92k

Thanks, that might be the way to go. I didn't want to be a serial-spammer with this fairly born-out-of-overthinking type of question.

ADD REPLYlink written 4 weeks ago by kelen80

You have given biostar users a fair chance of responding so it would not be a problem to cross post. Like I said, if you do please edit the original post and provide a link for the cross-post. Ideally if you get an answer, come back and post it here to give closure to this thread.

ADD REPLYlink written 4 weeks ago by genomax92k
gravatar for kelen
23 days ago by
London, UK
kelen80 wrote:

Michael Love on Bioconductor ( ):

"The final decision is up to you, and I don't think there is necessarily a "right" answer, as it depends on assumptions.

But from your above, I think you can justify analyzing the clones separately for goals 1 and 2. The asymptotic dispersion is very different for the two clones (clone 2 having lower over-dispersion), so then putting them together will find a middle value for the dispersion for each clone, which will increase power for one experiment and decrease it for the other (in rough terms)."

ADD COMMENTlink modified 16 days ago • written 23 days ago by kelen80
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1072 users visited in the last hour