UPDATE: I cross-posted this on Bioconductor as well ( https://support.bioconductor.org/p/p132527/ ). See Michael Love's answer below.
Hi! I have been battling with a multifaceted problem for months now and I can't figure out how to solve/clarify it, I would be extremely grateful for any comments/suggestions/opinions/criticism. In short, I can't figure out if I should use all my data or half my data in DESeq2 for calculating the size factors and estimating dispersion etc. i.e. when are the samples 'too different' and including both of them would unfavourably skew these estimations when fitting the model and result in a bunch of TPs and FPs?
The closest I have found to something similar in the vignette is http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#if-i-have-multiple-groups-should-i-run-all-together-or-split-into-pairs-of-groups. And some discussions here on biostars too, but none quite explaining to the level that would put me at ease.
Experimental setup: 4 sorted cell populations (termed as different cell states) in two replicates for 2 clones of the same PBMC cell type obtained from different individuals = 16 samples sequenced together
- Identify differentially expressed genes between pairwise comparisons of different cell states in clone 1.
- Identify differentially expressed genes between pairwise comparisons of different cell states in clone 2.
- Identify differentially expressed genes between pairwise comparisons of different cell states in both clones. (Either through simply taking the overlap of the previous or attack it through specifying the design matrix to account for “clone” difference - I would hope both would lead to similar conclusions...)
Problems: As these cells come from different individuals there is biological variation between them, which is of interest. However, mutually ‘changing’ genes are of interest as well. There are multiple genes that are only expressed in clone1 or in clone2, which DESeq2 won't include in normalization, but could be important for a clone-specific sample normalization and other estimates (I would think?). To try my best and visually aid what I am taking about:
#This just includes every bit of information I have sample_info # condition clones replicate clone_condition # clone1_stage_1_rep_1 stage1 clone1 1 clone1_stage1 # clone1_stage_2_rep_1 stage2 clone1 1 clone1_stage2 # clone1_stage_3_rep_1 stage3 clone1 1 clone1_stage3 # clone1_stage_4_rep_1 stage4 clone1 1 clone1_stage4 # clone1_stage_1_rep_2 stage1 clone1 2 clone1_stage1 # clone1_stage_2_rep_2 stage2 clone1 2 clone1_stage2 # clone1_stage_3_rep_2 stage3 clone1 2 clone1_stage3 # clone1_stage_4_rep_2 stage4 clone1 2 clone1_stage4 # clone2_stage_1_rep_1 stage1 clone2 1 clone2_stage1 # clone2_stage_2_rep_1 stage2 clone2 1 clone2_stage2 # clone2_stage_3_rep_1 stage3 clone2 1 clone2_stage3 # clone2_stage_4_rep_1 stage4 clone2 1 clone2_stage4 # clone2_stage_1_rep_2 stage1 clone2 2 clone2_stage1 # clone2_stage_2_rep_2 stage2 clone2 2 clone2_stage2 # clone2_stage_3_rep_2 stage3 clone2 2 clone2_stage3 # clone2_stage_4_rep_2 stage4 clone2 2 clone2_stage4 data_set <- DESeqDataSetFromMatrix(countData = DF, colData = sample_info, design = ~ clone_condition)
Side-note, would having an intercept 0 be more appropriate for the pairwise comparisons? design(dds) <- formula(~ 0 + clone_condition)
It is quite clear the PC1 explains the differences between the clones themselves and PC2 explains the difference between states, but the first 2 goals are looking at intra-clonal effects anyway, so I won’t be directly comparing these clones at this stage. That said, is it then appropriate to even allow DESeq2 to gather information across both clones if I am looking at intraclonal pairwise comparisons? Or would it be more appropriate to model the clones separately for goals 1-2 and together for goal 3?
It gets even more confusing for me when I try either approach and e.g. for state2 vs state1 contrast clone1 gains low p-value gene expression differences when compared to running it alone through the pipeline, while clone2 loses low p-values when run together as opposed to running it alone from the start (shown below).
Sure, some of this is the effect of independent filtering and multiple-testing correction, I won’t know if these are TP or FP, but manually checking some it seemed e.g. that genes with 0-counts in clone2, but showing expression (as well as 'expected dynamics' in expression differences) in clone1, were non-significant when ran together, but significant in clone1 when ran separately (not shown here). Indeed, I am not talking of huge LFC’s (and not trying to massage relevant p-values), but those genes were biologically interesting and potentially something to look into further down the line through separate means, which might otherwise be missed with a general p-value filter and I don't think it is feasible to go through all of these manually….but does make me question what else could I be missing by having a non-ideal approach with analysis setup.
> clone1_stage2_vs_stage1 <- results(dds, contrast=c("clone_condition", "clone1_stage2", "clone1_stage1")) > clone2_stage2_vs_stage1 <- results(dds, contrast=c("clone_condition", "clone2_stage2", "clone2_stage1")) > clone1_stage2_vs_stage1_sep <- results(clone1_dds, contrast=c("clone_condition", "clone1_stage2", "clone1_stage1")) > clone2_stage2_vs_stage1_sep <- results(clone2_dds, contrast=c("clone_condition", "clone2_stage2", "clone2_stage1")) ### These first two are for when both clones are used in data normalization by DESeq2 > summary(clone1_stage2_vs_stage1) out of 24656 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 4318, 18% LFC < 0 (down) : 3786, 15% outliers  : 0, 0% low counts  : 4781, 19% (mean count < 3)  see 'cooksCutoff' argument of ?results  see 'independentFiltering' argument of ?results > summary(clone2_stage2_vs_stage1) out of 24656 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 5243, 21% LFC < 0 (down) : 4562, 19% outliers  : 0, 0% low counts  : 4781, 19% (mean count < 3)  see 'cooksCutoff' argument of ?results  see 'independentFiltering' argument of ?results ## These second two are when clones are split from the start and DESeq2 does not 'share' information across clones > summary(clone1_stage2_vs_stage1_sep) out of 23616 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 3735, 16% LFC < 0 (down) : 3086, 13% outliers  : 0, 0% low counts  : 4568, 19% (mean count < 2)  see 'cooksCutoff' argument of ?results  see 'independentFiltering' argument of ?results > summary(clone2_stage2_vs_stage1_sep) out of 24228 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 5811, 24% LFC < 0 (down) : 4793, 20% outliers  : 0, 0% low counts  : 3755, 15% (mean count < 2)  see 'cooksCutoff' argument of ?results  see 'independentFiltering' argument of ?results
I don't know enough to think of any useful plots or metrics to help me decide, I did plot the dispersion plots, but I am not sure if I am overestimating the dispersion when running them together resulting in less TPs?
I keep arguing with myself over this, which is leading nowhere, please help.