I would like to perform a differential gene expression analysis on RNA-seq data (~300 samples) from ICGC using the DESeq2 package. I want to compare two patient groups defined by "high" and "low" expression of a gene and then run a GSEA preranked analysis.
After creating a DESeqDataSet object with only the condition of interest in the design (
design = ~ gene_expression_level) and applying a variance stabilizing transformation with
vst(dds, blind = FALSE), I checked for batch effects using the
plotPCA function. I first generated a PCA plot colored only by my condition of interest and observed two groups separated along a diagonal (top left plot). I thought that these groups could be explained by another variable, and found that the split corresponds to sex (top right plot). However, neither PC1 nor PC2 alone seems to separate these two groups well; a diagonal line would describe the separation better.
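A minimal sketch of the steps described above, using simulated counts from makeExampleDESeqDataSet in place of the ICGC data. The sample size, the group assignments, and the nsub value are assumptions made only so the example runs on its own:

```r
library(DESeq2)

set.seed(1)

# Simulated stand-in for the real count matrix (2000 genes, 24 samples).
dds <- makeExampleDESeqDataSet(n = 2000, m = 24)

# Hypothetical sample annotations; the real ones come from the clinical table.
dds$gene_expression_level <- factor(rep(c("low", "high"), each = 12),
                                    levels = c("low", "high"))
dds$sex <- factor(rep(c("F", "M"), times = 12))

# Design containing only the condition of interest.
design(dds) <- ~ gene_expression_level

# Variance stabilizing transformation that takes the design into account.
# nsub is lowered here only because the simulated data set is small.
vsd <- vst(dds, blind = FALSE, nsub = 500)

# PCA coordinates annotated with both variables, for custom plotting.
pca_data <- plotPCA(vsd,
                    intgroup = c("gene_expression_level", "sex"),
                    returnData = TRUE)
head(pca_data)

# Same PCA colored by a single variable, one plot per variable.
plotPCA(vsd, intgroup = "gene_expression_level")
plotPCA(vsd, intgroup = "sex")
```

With returnData = TRUE, plotPCA returns a data frame instead of a plot, which makes it easier to color the same PC1/PC2 coordinates by each clinical variable in turn.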
Then I checked other clinical variables and observed clearly defined subgroups within each sex group (bottom left and bottom right plots). In this context, I have two questions:
1-How should these PCA plots be interpreted, given the diagonal pattern?
2-Should I include sex and the other two clinical variables in my design (
design = ~ sex + clin1 + clin2 + gene_expression_level)? I think probably not, because "high" and "low" expression patients seem to be balanced across the three variables, but I am not sure and would appreciate an opinion.
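To make the second question concrete, here is a rough sketch of the alternative design, again on simulated counts. The levels of sex, clin1 and clin2 are invented placeholders, and ranking genes by the Wald statistic is just one common choice for GSEA preranked input, not something I have settled on:

```r
library(DESeq2)

set.seed(1)

# Simulated stand-in for the real data (1000 genes, 24 samples).
dds <- makeExampleDESeqDataSet(n = 1000, m = 24)

# Hypothetical annotations, chosen so that no variable is confounded with another.
dds$gene_expression_level <- factor(rep(c("low", "high"), each = 12),
                                    levels = c("low", "high"))
dds$sex   <- factor(rep(c("F", "M"), times = 12))
dds$clin1 <- factor(rep(c("a", "a", "b", "b"), times = 6))
dds$clin2 <- factor(rep(rep(c("x", "y"), each = 4), times = 3))

# Check how high/low patients are distributed across each covariate.
table(dds$gene_expression_level, dds$sex)
table(dds$gene_expression_level, dds$clin1)
table(dds$gene_expression_level, dds$clin2)

# Covariates first, condition of interest last.
design(dds) <- ~ sex + clin1 + clin2 + gene_expression_level
dds <- DESeq(dds, quiet = TRUE)

# High vs low, adjusted for the covariates in the design.
res <- results(dds, contrast = c("gene_expression_level", "high", "low"))

# Ranked gene list for GSEA preranked (here: Wald statistic, NAs dropped).
ranked <- res$stat
names(ranked) <- rownames(res)
ranked <- sort(ranked[!is.na(ranked)], decreasing = TRUE)
head(ranked)
```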
Thank you so much