Question

Obtaining DESeq2 LFC values for Loadings of Principal Components

0

Entering edit mode

6 days ago

Nishant • 0

Hi all,

Apologies in advance if this has already been asked previously but I couldn't find a suitable answer.

Context: I am in the middle of analysing my RNAseq data for 80 samples (16 treatments x 5 biol. replicates each) from a marine alga. My treatments were: Temperature as a factor (3 levels), changing carbonate chemistry (manipulated via changing pH and total dissolved inorganic carbon concentration). Although many of carbonate chemistry parameters are highly colinear, I would like to obtain Log2FoldChanges in gene expression due to, for example, CO2 concentration.

Therefore, I thought that the best DESeq2 design model would be to do a PCA on my carbonate chemistry variables, (where PC1 + PC2 explained ~ 95% of variance). So, I used ~ Temp + PC1+ PC2 + Temp:PC1 + Temp:PC2 as my model design, and used the loadings to obtain LFCs for each of my carb. chem. variable.

L1 <- loadings_matrix$PC1[loadings_matrix$variable == var]
L2 <- loadings_matrix$PC2[loadings_matrix$variable == var]
composite_LFC = log2FC_PC1 * L1 + log2FC_PC2 * L2,
composite_LFC_SE = sqrt((lfcSE_PC1^2 * L1^2) + (lfcSE_PC2^2 * L2^2)),
padj_zscore = p.adjust(2 * pnorm(-abs(composite_LFC / composite_LFC_SE)), "BH")

where var is my centered and scaled carbonate chem. variable.

I am not a statistician, or a "hard core" Biologist, but by looking at the code snippet above, do you guys think I'm doing the "correct thing"? I pre-filtered my genes by FDR < 0.05 before running the above code, and I only use the genes which are present in both the wald-test FDR < 0.05 as well as LRT FDR < 0.05 (reduced model for PC1 = ~ Temp + PC2 + Temp:PC2).

Any help/advice/recommendations would be highly welcome and I am grateful for your help.

Nishant

RNAseq Model DESeq2 Design PCA • 519 views

ADD COMMENT • link updated 20 hours ago by Kevin Blighe 89k • written 6 days ago by Nishant • 0

1

Entering edit mode

Aren't PC1 and PC2 going to correspond to your treatments?

This is not a standard analysis, why do you think you need something so bespoke?

ADD REPLY • link 5 days ago by swbarnes2 15k

0

Entering edit mode

Hi swbarnes2 The PCA I am referring to here is between my independent variables (carbonate chemistry variables, which are usually highly colinear). PC1 and PC2 in that way are orthogonal axes as a function of pH, alkalinity, DIC, CO2, HCO3, and CO3.

You are right, this is not standard analysis. The easy way out would be to choose two of the least colinear variables from the above, bang in an interactive term and call it a day. However, geochemically, we know these variables behave differently, and also biologically (in theory, due to transporters, carbonic anhydrase, etc.) so why shouldn't I try to pick apart their individual effects?

Hope my reasoning makes sense.

ADD REPLY • link 4 days ago by Nishant • 0

0

Entering edit mode

Therefore, I thought that the best DESeq2 design model would be to do a PCA on my carbonate chemistry variables, (where PC1 + PC2 explained ~ 95% of variance). So, I used ~ Temp + PC1+ PC2 + Temp:PC1 + Temp:PC2 as my model design, who gave you this concept?

and what kind of feature values you are using? for deseq2? raw counts or normalized?

ADD REPLY • link 6 days ago by 1769mkc ★ 1.3k

0

Entering edit mode

Hey, thanks for the question; Before running DESeq, I filter my raw counts using EdgeR's filterByExpr function (group = treatment). I got the concept from an ecology paper (https://www.jstor.org/stable/48678126) which used PCR for understanding the effect of environmental variables (rainfall, temperature, etc.) on tree phenotypic data (growth rate). I think GWAS studies also use PCs commonly. I extended this to RNAseq but I haven't found papers that use PCs specifically. for RNAseq. Maybe I haven't looked hard enough.

Hope that helps.

ADD REPLY • link 6 days ago by Nishant • 0

score 0 · Answer 1 · 2025-11-06

Hi Nishant,

Thanks for the detailed post—this is a thoughtful approach to handling collinearity in your experimental design, and it's clear you've put a lot of effort into it. I'll walk through your method step-by-step, flag what's solid, and suggest a couple of tweaks based on common practices in DESeq2 workflows for multifactor RNA-seq with correlated covariates. I'm not a statistician either, but I've wrestled with similar setups in algal/plant RNA-seq datasets, so hopefully this helps.

Quick Recap of Your Setup (to Confirm I Get It)

Samples: 80 total (5 reps x 16 treatment combos? Assuming the "16 treatments" breaks down across Temp (3 levels) and carbonate chemistry manipulations).
Covariates: Temp (factor, 3 levels) + multiple collinear carbonate chem vars (e.g., pH, DIC, derived CO2).
DESeq2 Design: ~ Temp + PC1 + PC2 + Temp:PC1 + Temp:PC2 (smart to include interactions if you suspect Temp modulates chem effects).
PCA: On centered/scaled chem vars, retaining PC1+PC2 (~95% var explained—nice threshold).
Contrasts: Wald tests for PC1 and PC2 (full model), plus LRT vs reduced (~ Temp + PC2 + Temp:PC2 for PC1 sig?).
Interpretation: Back-transform to original vars via loadings: composite_LFC = L1 log2FC_PC1 + L2 log2FC_PC2, with propagated SE and BH-adjusted z-scores.
Filtering: FDR < 0.05 from Wald/LRT before composite calc, intersecting sig genes.

Overall: Yes, this is a "correct" (and defensible) thing to do. It's a standard way to orthogonalize collinear predictors before fitting, avoiding singularity in the GLM. The linear combination for LFCs is a direct consequence of PCA (since PCs are linear combos of originals, and vice versa via loadings). Your SE propagation assumes independence between PC contrasts (true by PCA construction) and homoscedasticity (DESeq2's dispersion handles this reasonably). The z-score p-adj is a clean way to test the composite effect.

That said, here are some caveats and refinements to make it even more robust:

Strengths

PCA for Collinearity: Spot-on. Carbonate chem params are highly collinear (e.g., pH ~ -log[H+], DIC includes CO2 terms), so raw inclusion would inflate VIFs and dilute power. PC1/PC2 capture the main axes (e.g., acidification vs. hypercapnia), and 95% var is plenty—I've used 90% as a cutoff in similar ocean acidification studies.
Interactions: Including Temp:PC* allows for condition-specific effects (e.g., high Temp amplifying CO2 response), which is biologically motivated for marine algae.
LRT + Wald Intersection: Good for filtering to genes with any PC effect and specific PC1/PC2 contributions. LRT tests the full model utility; Wald pins down which term drives it.

Potential Issues & Fixes

Prefiltering Before Composite Calc:
- Issue: Filtering on FDR < 0.05 before computing composites could bias your final gene list toward effects detectable in the PC space but weak in the original var (e.g., if a var loads heavily on a non-sig PC). It might also inflate false positives if the BH is applied post-filter.
- Fix: Run the composite calcs on all genes first, then filter on the padj_zscore < 0.05. This keeps things unbiased. If compute time is an issue (80 samples isn't huge), it's still fast in R.
- Bonus: Use DESeq2's lfcShrink on the PC contrasts before combining—shrinks noisy LFCs toward 0, stabilizing your composites.
Interpretation of Composites:
- Issue: The composite_LFC gives the directional effect of a 1-unit increase in the scaled original var (since you centered/scaled for PCA). For CO2, remember to back-scale if you want unscaled units (e.g., muatm). Also, if loadings have opposite signs across PCs, the composite could flip signs unexpectedly.
- Fix: Add a scaling step post-composite:
```
# Assuming 'var_sd' is the SD of your original centered/scaled var
scaled_composite_LFC <- log2FC_PC1 * L1 + log2FC_PC2 * L2
unscaled_composite_LFC <- scaled_composite_LFC * var_sd  # for interpretable units
```
  Plot loadings (e.g., biplot) to visualize how CO2 loads—helps explain why PC1/PC2 matter.
SE Propagation:
- Issue: Your formula assumes no covariance between PC1/PC2 LFC estimates, which holds for PCs but not if your design has imbalances (e.g., uneven reps across Temp). In DESeq2, lfcSE is from the GLM covariance matrix, so for precision:
- Fix: If you want full rigor, extract the full covariance between PC1/PC2 contrasts via coef and mcols(dds)$se—but it's overkill here. Your sqrt sum-of-squares is a conservative approximation (ignores positive cov, so SE might be slightly underestimated).
Model Diagnostics:
- Run plotPCA on vst/rlog counts colored by Temp and PC1/PC2—check if batches separate cleanly.
- VIF check post-PCA: car::vif(model_matrix) should be <5.
- If reps are low (n=5), consider blind=FALSE in normalization to leverage design.

R Code Snippet for Your Workflow (Tweaked)

Here's a polished version of your composite calc, wrapped in a function for reuse. Assumes dds is your DESeqDataSet, loadings_matrix has rows as vars/cols as PCs, and you've run resultsNames(dds) to ID contrasts like "PC1", "PC2".

library(DESeq2)

# Function for composite LFC/SE per variable
get_composite_lfc <- function(dds, var_name, loadings_mat, contrast_prefix = "PC") {
  L1 <- loadings_mat[loadings_mat$variable == var_name, "PC1"]
  L2 <- loadings_mat[loadings_mat$variable == var_name, "PC2"]
  var_sd <- attr(scale(carb_chem_data[, var_name]), "scaled:scale")  # if you scaled

  # Get results (use lfcShrink for stability; type="ashr" or "normal")
  res_pc1 <- lfcShrink(dds, coef = paste0(contrast_prefix, "1"), type = "ashr")
  res_pc2 <- lfcShrink(dds, coef = paste0(contrast_prefix, "2"), type = "ashr")

  # Composites (all genes)
  composite_LFC <- res_pc1$log2FoldChange * L1 + res_pc2$log2FoldChange * L2
  composite_LFC_scaled <- composite_LFC * var_sd  # optional unscale

  # SE (your formula is fine; add cov term if paranoid)
  composite_SE <- sqrt( (res_pc1$lfcSE * L1)^2 + (res_pc2$lfcSE * L2)^2 )

  # Z-test pvals
  z_scores <- composite_LFC / composite_SE
  raw_p <- 2 * pnorm(-abs(z_scores))
  adj_p <- p.adjust(raw_p, method = "BH")

  # Tibble for output
  tibble(
    gene = rownames(res_pc1),
    composite_LFC = composite_LFC,
    composite_LFC_scaled = composite_LFC_scaled,  # if needed
    composite_SE = composite_SE,
    padj_zscore = adj_p
  ) %>%
    filter(padj_zscore < 0.05)  # or intersect with your Wald/LRT list
}

# Usage for CO2
co2_results <- get_composite_lfc(dds, "CO2", loadings_matrix)
head(co2_results)

Kevin