I have a heterogeneous mix of data ranging in quality, read depth, etc. (the libraries were prepared by many groups and now I'm trying to analyze them all together). After an initial DESeq2 run, I computed the principal components, focused on PCs 1-10, and ran a Pearson correlation between those PCs and the numeric variables in my overall meta dataframe. I found that the following metrics correlated highly with the top 10 PCs: RNA.Batch, ExpressionProfilingEfficiency, UniqueRateofMapped, ReadLength, DV200, and AvgSplitsperRead. I set these as covariates in my design formula:
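library(DESeq2)
# Design terms are joined with "+"; the formula takes a single leading "~"
dds <- DESeqDataSetFromMatrix(countData = data,
                              colData = meta,
                              design = ~ RNA.Batch +
                                ExpressionProfilingEfficiency +
                                UniqueRateofMapped +
                                ReadLength +
                                DV200 +
                                AvgSplitsperRead,
                              tidy = TRUE)  # first column of countData holds the gene IDs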
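For anyone wanting to reproduce the screening step described above, here is a minimal sketch of correlating numeric metadata with the top PCs. It uses simulated stand-in data (`expr` and `meta_num` are made up for illustration, not my real counts or metadata), so only the `prcomp()` and `cor()` calls are meant to carry over.

```r
# Simulated stand-in data: 200 genes x 12 samples (NOT the real dataset)
set.seed(1)
n_genes <- 200
n_samples <- 12
meta_num <- data.frame(
  ReadLength = rep(c(50, 75, 100), each = 4),
  DV200      = runif(n_samples, 30, 90)
)

# Expression matrix in which ReadLength shifts every gene by a gene-specific amount
expr <- matrix(rnorm(n_genes * n_samples), n_genes, n_samples)
expr <- expr + outer(rnorm(n_genes, sd = 2), as.numeric(scale(meta_num$ReadLength)))

# PCA with samples as rows
pca <- prcomp(t(expr))
n_pc <- min(10, ncol(pca$x))

# Pearson correlation of each numeric metadata column with each of the top PCs
pc_cor <- cor(meta_num, pca$x[, seq_len(n_pc)], method = "pearson")
round(pc_cor[, 1:3], 2)
```

Rows of `pc_cor` are metadata columns and columns are PCs, so a large absolute value in a row flags a metric that tracks that PC.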
How can I compare the PCA before and after setting these covariates? I'm sorry for the very basic question, but as I understand it, setting these covariates will help to control for variability caused by these metrics, given that I have lots of samples prepared by a variety of groups. Any guidance on this would be greatly appreciated, as I'm trying to ensure I can trust the data going into differential expression analysis. Thank you so much!
EDIT: I should add, I tried to re-run a PCA after setting the covariates, but I'm seeing the same values in my PCs (no change compared to before adding these covariates). Here is the code I used to generate a PC dataframe:
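# blind=TRUE: the transformation ignores the design formula
rld <- vst(dds, blind = TRUE)
rld_mat <- assay(rld)
# Keep the 500 most variable genes
rv <- rowVars(rld_mat)
ntop <- 500
select_var <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
# PCA with samples as rows
pca <- prcomp(t(rld_mat[select_var, ]))
summary(pca)
# Attach PC scores to the sample metadata (assumes meta rows match the sample order in dds)
df <- cbind(meta, pca$x)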
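One thing worth noting about the code above: the PCA is computed from the transformed counts only, and `vst(..., blind=TRUE)` does not look at the design, so identical PCs before and after changing the design are expected. A common way to see what the PCA looks like once covariates are accounted for is to regress them out of the transformed matrix first. The sketch below does this with base R on simulated data; `residualize()` and `pc1_share()` are hypothetical helpers written for illustration, and `mat` and `covs` stand in for the transformed matrix and the covariate columns of the metadata. `limma::removeBatchEffect()` performs a similar adjustment and has a `design` argument to protect a condition of interest, which this simple sketch does not do.

```r
# Hypothetical helper: regress covariates out of a genes x samples matrix
residualize <- function(mat, covariates) {
  X <- model.matrix(~ ., data = covariates)     # intercept + covariates
  fit <- lm.fit(X, t(mat))                      # one linear model per gene
  adjusted <- t(fit$residuals) + rowMeans(mat)  # keep each gene's mean level
  dimnames(adjusted) <- dimnames(mat)
  adjusted
}

# Hypothetical helper: fraction of variance explained by PC1
pc1_share <- function(m) {
  p <- prcomp(t(m))
  (p$sdev^2 / sum(p$sdev^2))[1]
}

# Simulated stand-in data (NOT the real dataset): DV200 drives most of the variance
set.seed(2)
n_genes <- 300
n_samples <- 16
covs <- data.frame(
  DV200     = runif(n_samples, 30, 90),
  RNA.Batch = factor(rep(c("a", "b"), each = 8))
)
mat <- matrix(rnorm(n_genes * n_samples), n_genes, n_samples)
mat <- mat + outer(rnorm(n_genes, sd = 2), as.numeric(scale(covs$DV200)))

# Compare PC1's share of variance before and after adjustment
mat_adj <- residualize(mat, covs)
before <- pc1_share(mat)
after  <- pc1_share(mat_adj)
c(before = before, after = after)
```

The adjusted matrix is meant for visualization and QC only. For the differential expression test itself, the raw counts stay in `dds` and the covariates stay in the design formula.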
Thanks again for your help, and my apologies again for asking these basic questions (I'm very new to this and don't have a formal education in bioinformatics, but plan to enter a graduate program in this in the next year). With that, I guess I'm trying to find a thorough tutorial on how to troubleshoot when dealing with lower quality, very heterogeneous data. Even when I follow the instructions on the link provided, I'm unable to reduce the weight of PC1. I'll calculate the ICC values for my categorical metrics and see if perhaps those are correlating with the top PCs, but I'm stuck on these troubleshooting/cleaning steps at the moment. Anyway, thanks again, I'm hoping to get a better understanding of this with more practice and reading through what's available.
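For the ICC check on categorical metrics mentioned above, here is a rough sketch of what I have in mind, assuming "ICC" means the one-way intraclass correlation of a PC's scores grouped by the categorical variable. `icc_oneway()` is a hypothetical helper, and `batch`, `pc_like`, and `noise` are simulated placeholders rather than my real metadata or PCs.

```r
# Hypothetical helper: one-way ICC(1) of a numeric vector (e.g. PC scores)
# grouped by a categorical variable
icc_oneway <- function(score, group) {
  group <- droplevels(factor(group))
  tab <- anova(lm(score ~ group))
  msb <- tab["group", "Mean Sq"]      # between-group mean square
  msw <- tab["Residuals", "Mean Sq"]  # within-group mean square
  n_i <- as.numeric(table(group))
  k <- length(n_i)
  # effective group size, which also handles unbalanced groups
  n0 <- (sum(n_i) - sum(n_i^2) / sum(n_i)) / (k - 1)
  (msb - msw) / (msb + (n0 - 1) * msw)
}

# Simulated stand-in data (NOT the real dataset)
set.seed(3)
batch <- factor(rep(c("A", "B", "C"), each = 6))
shift <- c(A = -3, B = 0, C = 3)
pc_like <- rnorm(18) + shift[as.character(batch)]  # scores that track batch
noise <- rnorm(18)                                 # scores unrelated to batch

icc_batch <- icc_oneway(pc_like, batch)
icc_noise <- icc_oneway(noise, batch)
c(batch_driven = icc_batch, unrelated = icc_noise)
```

A value near 1 would suggest the PC is largely separating the groups, while a value near or below 0 would suggest little relationship.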