Hi Nishant,
Thanks for the detailed post—this is a thoughtful approach to handling collinearity in your experimental design, and it's clear you've put a lot of effort into it. I'll walk through your method step-by-step, flag what's solid, and suggest a couple of tweaks based on common practices in DESeq2 workflows for multifactor RNA-seq with correlated covariates. I'm not a statistician either, but I've wrestled with similar setups in algal/plant RNA-seq datasets, so hopefully this helps.
Quick Recap of Your Setup (to Confirm I Get It)
- Samples: 80 total (5 reps x 16 treatment combos? Assuming the "16 treatments" breaks down across Temp (3 levels) and carbonate chemistry manipulations).
- Covariates: Temp (factor, 3 levels) + multiple collinear carbonate chem vars (e.g., pH, DIC, derived CO2).
- DESeq2 Design: ~ Temp + PC1 + PC2 + Temp:PC1 + Temp:PC2 (smart to include interactions if you suspect Temp modulates chem effects).
- PCA: On centered/scaled chem vars, retaining PC1+PC2 (~95% var explained—nice threshold).
- Contrasts: Wald tests for PC1 and PC2 (full model), plus LRT vs reduced (~ Temp + PC2 + Temp:PC2 for PC1 sig?).
- Interpretation: Back-transform to original vars via loadings: composite_LFC = L1 log2FC_PC1 + L2 log2FC_PC2, with propagated SE and BH-adjusted z-scores.
- Filtering: FDR < 0.05 from Wald/LRT before composite calc, intersecting sig genes.
Overall: Yes, this is a "correct" (and defensible) thing to do. It's a standard way to orthogonalize collinear predictors before fitting, avoiding singularity in the GLM. The linear combination for LFCs is a direct consequence of PCA (since PCs are linear combos of originals, and vice versa via loadings). Your SE propagation assumes independence between PC contrasts (true by PCA construction) and homoscedasticity (DESeq2's dispersion handles this reasonably). The z-score p-adj is a clean way to test the composite effect.
That said, here are some caveats and refinements to make it even more robust:
Strengths
- PCA for Collinearity: Spot-on. Carbonate chem params are highly collinear (e.g., pH ~ -log[H+], DIC includes CO2 terms), so raw inclusion would inflate VIFs and dilute power. PC1/PC2 capture the main axes (e.g., acidification vs. hypercapnia), and 95% var is plenty—I've used 90% as a cutoff in similar ocean acidification studies.
- Interactions: Including Temp:PC* allows for condition-specific effects (e.g., high Temp amplifying CO2 response), which is biologically motivated for marine algae.
- LRT + Wald Intersection: Good for filtering to genes with any PC effect and specific PC1/PC2 contributions. LRT tests the full model utility; Wald pins down which term drives it.
Potential Issues & Fixes
Prefiltering Before Composite Calc:
- Issue: Filtering on FDR < 0.05 before computing composites could bias your final gene list toward effects detectable in the PC space but weak in the original var (e.g., if a var loads heavily on a non-sig PC). It might also inflate false positives if the BH is applied post-filter.
- Fix: Run the composite calcs on all genes first, then filter on the padj_zscore < 0.05. This keeps things unbiased. If compute time is an issue (80 samples isn't huge), it's still fast in R.
- Bonus: Use DESeq2's lfcShrink on the PC contrasts before combining—shrinks noisy LFCs toward 0, stabilizing your composites.
Interpretation of Composites:
SE Propagation:
- Issue: Your formula assumes no covariance between PC1/PC2 LFC estimates, which holds for PCs but not if your design has imbalances (e.g., uneven reps across Temp). In DESeq2, lfcSE is from the GLM covariance matrix, so for precision:
- Fix: If you want full rigor, extract the full covariance between PC1/PC2 contrasts via coef and mcols(dds)$se—but it's overkill here. Your sqrt sum-of-squares is a conservative approximation (ignores positive cov, so SE might be slightly underestimated).
Model Diagnostics:
- Run plotPCA on vst/rlog counts colored by Temp and PC1/PC2—check if batches separate cleanly.
- VIF check post-PCA: car::vif(model_matrix) should be <5.
- If reps are low (n=5), consider blind=FALSE in normalization to leverage design.
R Code Snippet for Your Workflow (Tweaked)
Here's a polished version of your composite calc, wrapped in a function for reuse. Assumes dds is your DESeqDataSet, loadings_matrix has rows as vars/cols as PCs, and you've run resultsNames(dds) to ID contrasts like "PC1", "PC2".
library(DESeq2)
# Function for composite LFC/SE per variable
get_composite_lfc <- function(dds, var_name, loadings_mat, contrast_prefix = "PC") {
L1 <- loadings_mat[loadings_mat$variable == var_name, "PC1"]
L2 <- loadings_mat[loadings_mat$variable == var_name, "PC2"]
var_sd <- attr(scale(carb_chem_data[, var_name]), "scaled:scale") # if you scaled
# Get results (use lfcShrink for stability; type="ashr" or "normal")
res_pc1 <- lfcShrink(dds, coef = paste0(contrast_prefix, "1"), type = "ashr")
res_pc2 <- lfcShrink(dds, coef = paste0(contrast_prefix, "2"), type = "ashr")
# Composites (all genes)
composite_LFC <- res_pc1$log2FoldChange * L1 + res_pc2$log2FoldChange * L2
composite_LFC_scaled <- composite_LFC * var_sd # optional unscale
# SE (your formula is fine; add cov term if paranoid)
composite_SE <- sqrt( (res_pc1$lfcSE * L1)^2 + (res_pc2$lfcSE * L2)^2 )
# Z-test pvals
z_scores <- composite_LFC / composite_SE
raw_p <- 2 * pnorm(-abs(z_scores))
adj_p <- p.adjust(raw_p, method = "BH")
# Tibble for output
tibble(
gene = rownames(res_pc1),
composite_LFC = composite_LFC,
composite_LFC_scaled = composite_LFC_scaled, # if needed
composite_SE = composite_SE,
padj_zscore = adj_p
) %>%
filter(padj_zscore < 0.05) # or intersect with your Wald/LRT list
}
# Usage for CO2
co2_results <- get_composite_lfc(dds, "CO2", loadings_matrix)
head(co2_results)
Kevin
Aren't PC1 and PC2 going to correspond to your treatments?
This is not a standard analysis, why do you think you need something so bespoke?
Hi swbarnes2 The PCA I am referring to here is between my independent variables (carbonate chemistry variables, which are usually highly colinear). PC1 and PC2 in that way are orthogonal axes as a function of pH, alkalinity, DIC, CO2, HCO3, and CO3.
You are right, this is not standard analysis. The easy way out would be to choose two of the least colinear variables from the above, bang in an interactive term and call it a day. However, geochemically, we know these variables behave differently, and also biologically (in theory, due to transporters, carbonic anhydrase, etc.) so why shouldn't I try to pick apart their individual effects?
Hope my reasoning makes sense.
Therefore, I thought that the best DESeq2 design model would be to do a PCA on my carbonate chemistry variables, (where PC1 + PC2 explained ~ 95% of variance). So, I used ~ Temp + PC1+ PC2 + Temp:PC1 + Temp:PC2 as my model design, who gave you this concept?
and what kind of feature values you are using? for deseq2? raw counts or normalized?
Hey, thanks for the question; Before running DESeq, I filter my raw counts using EdgeR's filterByExpr function (group = treatment). I got the concept from an ecology paper (https://www.jstor.org/stable/48678126) which used PCR for understanding the effect of environmental variables (rainfall, temperature, etc.) on tree phenotypic data (growth rate). I think GWAS studies also use PCs commonly. I extended this to RNAseq but I haven't found papers that use PCs specifically. for RNAseq. Maybe I haven't looked hard enough.
Hope that helps.