Hi all,
Apologies in advance if this has already been asked previously but I couldn't find a suitable answer.
Context: I am in the middle of analysing my RNAseq data for 80 samples (16 treatments x 5 biol. replicates each) from a marine alga. My treatments were: Temperature as a factor (3 levels), changing carbonate chemistry (manipulated via changing pH and total dissolved inorganic carbon concentration). Although many of carbonate chemistry parameters are highly colinear, I would like to obtain Log2FoldChanges in gene expression due to, for example, CO2 concentration.
Therefore, I thought that the best DESeq2 design model would be to do a PCA on my carbonate chemistry variables, (where PC1 + PC2 explained ~ 95% of variance). So, I used ~ Temp + PC1+ PC2 + Temp:PC1 + Temp:PC2 as my model design, and used the loadings to obtain LFCs for each of my carb. chem. variable.
L1 <- loadings_matrix$PC1[loadings_matrix$variable == var]
L2 <- loadings_matrix$PC2[loadings_matrix$variable == var]
composite_LFC = log2FC_PC1 * L1 + log2FC_PC2 * L2,
composite_LFC_SE = sqrt((lfcSE_PC1^2 * L1^2) + (lfcSE_PC2^2 * L2^2)),
padj_zscore = p.adjust(2 * pnorm(-abs(composite_LFC / composite_LFC_SE)), "BH")
where var is my centered and scaled carbonate chem. variable.
I am not a statistician, or a "hard core" Biologist, but by looking at the code snippet above, do you guys think I'm doing the "correct thing"? I pre-filtered my genes by FDR < 0.05 before running the above code, and I only use the genes which are present in both the wald-test FDR < 0.05 as well as LRT FDR < 0.05 (reduced model for PC1 = ~ Temp + PC2 + Temp:PC2).
Any help/advice/recommendations would be highly welcome and I am grateful for your help.
Nishant
Therefore, I thought that the best DESeq2 design model would be to do a PCA on my carbonate chemistry variables, (where PC1 + PC2 explained ~ 95% of variance). So, I used ~ Temp + PC1+ PC2 + Temp:PC1 + Temp:PC2 as my model design, who gave you this concept?
and what kind of feature values you are using? for deseq2? raw counts or normalized?
Hey, thanks for the question; Before running DESeq, I filter my raw counts using EdgeR's filterByExpr function (group = treatment). I got the concept from an ecology paper (https://www.jstor.org/stable/48678126) which used PCR for understanding the effect of environmental variables (rainfall, temperature, etc.) on tree phenotypic data (growth rate). I think GWAS studies also use PCs commonly. I extended this to RNAseq but I haven't found papers that use PCs specifically. for RNAseq. Maybe I haven't looked hard enough.
Hope that helps.
Aren't PC1 and PC2 going to correspond to your treatments?
This is not a standard analysis, why do you think you need something so bespoke?
Hi swbarnes2 The PCA I am referring to here is between my independent variables (carbonate chemistry variables, which are usually highly colinear). PC1 and PC2 in that way are orthogonal axes as a function of pH, alkalinity, DIC, CO2, HCO3, and CO3.
You are right, this is not standard analysis. The easy way out would be to choose two of the least colinear variables from the above, bang in an interactive term and call it a day. However, geochemically, we know these variables behave differently, and also biologically (in theory, due to transporters, carbonic anhydrase, etc.) so why shouldn't I try to pick apart their individual effects?
Hope my reasoning makes sense.