Exploring batch effects in RNA-seq data with plotPCA
1
0
Entering edit mode
5 weeks ago

Hello,

I am interested in performing a differential gene expression analysis using RNA-seq data (~300 samples) from ICGC and DESeq2 package. I want to compare two patient groups based on "high" and "low" expression of a gene and, then, to develop a GSEA preranked.

After creating a DESeqDataset object specifying only the condition of interest in the design (design = ~ gene_expression_level) and applying a variance stabilizing transformation with vts(dds, blind = FALSE), I evaluated the presence of batch effects using plotPCA function. For this, I generated a PCA plot using only my condition of interest and I observed these two groups defined by diagnonal (top left plot). I thougth that these groups could be explained by another variable and I observed that this division was produced by sex (top right plot). However, I think that PC1 and PC2 do not explain well the differences between these two groups and that a diagonal line could account better this situation.

Then, I checked other clinical variables and I observed clearly defined subroups inside each sex group (low left and low right plots). In this context, I have two questions:

1-How should be these PCA plots interpreted taking into account the diagonal pattern?

2-Should I include sex and the other two clinical variables in my design (design = ~ sex + clin1 + clin2 + gene_expression_level? I think that probably not because it seems that "high" and "low" expression patients are balanced between the 3 variables, but I am not sure and I would appreciate an opinion.

Thank you so much

batch RNA-seq DESeq2 plotPCA PCA • 273 views
0
Entering edit mode
5 weeks ago

Well, PC1 is clearly CLL-subtype/mutation driven, whereas PC2 is sex-driven. Given that your top left plot is relatively equally well-mixed between all of these groups, I'd say you're likely fine just sticking to your variable of interest in the design, though adding sex (or even the other variables as well) will likely not change the results hugely.

It takes ~15 minutes to just try all options and see how they compare.

0
Entering edit mode

Thanks for your answer Jared! After trying both options and filtering DEGs by abs(log2FC) > 1.5 and p-value < 0.05, DESeq reported 423 DEGs whith the full design (design = ~ sex + CLL + MUT + gene_expression_level) and 406 with the single design (design = ~gene_expression_level). Both approaches shared 245 DEGs. Maybe differences in results could be explained by the role of other variables.

0
Entering edit mode

I'd apply the log fold change threshold in the results call using lfcThreshold rather than post-hoc filtering, as it changes the threshold during testing rather than making your p-values meaningless. 1.5 is a pretty high threshold on a log2 scale, I'd lower it to 1 or 0.58 if you do end up using lfcThreshold.