Hello,
I am interested in performing a differential gene expression analysis on RNA-seq data (~300 samples) from ICGC with the DESeq2 package. I want to compare two patient groups defined by "high" and "low" expression of a gene and then run a preranked GSEA.
After creating a DESeqDataSet object with only the condition of interest in the design (design = ~ gene_expression_level) and applying a variance stabilizing transformation with vst(dds, blind = FALSE), I checked for batch effects using the plotPCA function. In the PCA plot colored by my condition of interest, I observed two groups separated along a diagonal (top left plot). I thought these groups might be explained by another variable, and found that the split corresponds to sex (top right plot). However, neither PC1 nor PC2 on its own separates the two groups well; a diagonal line would describe the separation better.
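For reference, the transformation and PCA step described above can be sketched as follows. This is a minimal sketch on simulated counts from makeExampleDESeqDataSet(), not the ICGC data; the columns gene_expression_level and sex and their values are placeholders standing in for the real sample annotation.

```r
library(DESeq2)

set.seed(1)
# Simulated counts stand in for the real data (assumption: 2000 genes, 12 samples).
dds <- makeExampleDESeqDataSet(n = 2000, m = 12)

# Placeholder annotation: relabel the simulated condition and add a balanced sex column.
dds$gene_expression_level <- factor(
  ifelse(dds$condition == "A", "low", "high"),
  levels = c("low", "high")
)
dds$sex <- factor(rep(c("F", "M"), times = 6))
design(dds) <- ~ gene_expression_level

# Variance stabilizing transformation that takes the design into account.
vsd <- vst(dds, blind = FALSE)

# returnData = TRUE gives the PC coordinates instead of a plot,
# so the same points can be colored by any annotation column.
pca <- plotPCA(vsd, intgroup = c("gene_expression_level", "sex"), returnData = TRUE)

# Percentage of variance explained by PC1 and PC2.
percent_var <- round(100 * attr(pca, "percentVar"))

head(pca)
percent_var
```

Looking at the variance percentages helps judge how much weight to give the two axes. A separation along a diagonal just means the grouping variable contributes to both PC1 and PC2.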
Then I checked other clinical variables and observed clearly defined subgroups inside each sex group (bottom left and bottom right plots). In this context, I have two questions:
1-How should these PCA plots be interpreted, given the diagonal pattern?
2-Should I include sex and the other two clinical variables in my design (design = ~ sex + clin1 + clin2 + gene_expression_level)? I think probably not, because "high" and "low" expression patients seem to be balanced across the 3 variables, but I am not sure and would appreciate an opinion.
Thank you so much
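The two candidate designs from question 2 can be fit side by side to see how much the covariates change the result. The following is a rough sketch on simulated data, assuming a single extra covariate (sex) with made-up values; the helper names fit_design and sig are hypothetical, and the real clinical variables would be added to the formula in the same way.

```r
library(DESeq2)

set.seed(1)
# Simulated counts with true differences between groups (betaSD = 1); not real data.
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 1)
dds$gene_expression_level <- factor(
  ifelse(dds$condition == "A", "low", "high"),
  levels = c("low", "high")
)
dds$sex <- factor(rep(c("F", "M"), times = 6))  # placeholder covariate

# Cross-tabulate to check whether the groups are balanced across the covariate.
table(dds$gene_expression_level, dds$sex)

# Hypothetical helper: fit one design and extract the high vs low comparison.
fit_design <- function(dds, design_formula) {
  design(dds) <- design_formula
  dds <- DESeq(dds, quiet = TRUE)
  results(dds, contrast = c("gene_expression_level", "high", "low"), alpha = 0.05)
}

res_single <- fit_design(dds, ~ gene_expression_level)
res_full   <- fit_design(dds, ~ sex + gene_expression_level)

# Hypothetical helper: gene IDs with adjusted p-value below 0.05.
sig <- function(res) rownames(res)[which(res$padj < 0.05)]

n_single <- length(sig(res_single))
n_full   <- length(sig(res_full))
n_shared <- length(intersect(sig(res_single), sig(res_full)))
c(single = n_single, full = n_full, shared = n_shared)
```

The variable of interest is placed last in the formula, matching the designs written above. Even with balanced groups, covariates that explain a lot of variance can change the dispersion estimates and therefore the list of significant genes.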
Thanks for your answer Jared! After trying both options and filtering DEGs by abs(log2FC) > 1.5 and p-value < 0.05, DESeq2 reported 423 DEGs with the full design (design = ~ sex + CLL + MUT + gene_expression_level) and 406 with the single design (design = ~ gene_expression_level). The two approaches shared 245 DEGs. Maybe the differences in results could be explained by the role of the other variables.
I'd apply the log fold change threshold in the results call using lfcThreshold rather than post-hoc filtering, as it changes the threshold during testing rather than making your p-values meaningless. 1.5 is a pretty high threshold on a log2 scale; I'd lower it to 1 or 0.58 (roughly a 1.5-fold change) if you do end up using lfcThreshold.
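To make the difference concrete, here is a small sketch contrasting post-hoc filtering with testing against the threshold directly. It runs on simulated data from makeExampleDESeqDataSet() with its default condition factor (levels A and B), so the counts are illustrative only; the threshold of 1 is one of the values suggested above.

```r
library(DESeq2)

set.seed(1)
# Simulated data with true differences between conditions A and B; not real data.
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 1)
dds <- DESeq(dds, quiet = TRUE)

# Post-hoc approach: test against LFC = 0, then filter on the estimated LFC.
res_posthoc <- results(dds, contrast = c("condition", "B", "A"), alpha = 0.05)
n_posthoc <- sum(res_posthoc$padj < 0.05 &
                 abs(res_posthoc$log2FoldChange) > 1, na.rm = TRUE)

# Threshold approach: the null hypothesis itself is |LFC| <= 1,
# so the p-values refer to the effect size you actually care about.
res_lfc <- results(dds, contrast = c("condition", "B", "A"),
                   lfcThreshold = 1, alpha = 0.05)
n_lfc <- sum(res_lfc$padj < 0.05, na.rm = TRUE)

c(posthoc = n_posthoc, lfcThreshold = n_lfc)
summary(res_lfc)
```

With lfcThreshold, a gene whose estimated fold change is just over the cutoff but noisy will usually not be called significant, whereas post-hoc filtering would keep it.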