I need to normalize gene expression (RNA-seq) data for two downstream analyses:
- unsupervised deconvolution using ICA
- pathway quantification using decoupleR. I want to see which pathways are important in my cohort and whether some pathways are specific to subgroups defined by clinical features (histology, stage, ...).
For the unsupervised deconvolution, I should not have to worry about the batch effect, since ICA should capture the batch in one component. For the pathway quantification, however, I do have to account for it.
My dataset has a strong batch effect: 2/3 of the samples were captured using the Kapa kit, and the ones that failed were captured using the Takara kit.
I used DESeq2 in several studies some years ago for DEA and data visualization, but I never had to remove a batch effect. I would appreciate advice on how to get an expression matrix of normalized counts corrected for this batch effect. I already discussed the input needed for decoupleR here. It seems possible to use normalized, log-normalized or vst counts.
My first idea was to get corrected log normalized counts but it does not seem possible. If I am correct, it is impossible to simply put the batch in the design of the model to do that, since the normalized counts are not affected by the model.
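To illustrate what I mean, here is a minimal sketch on simulated data. The dataset comes from makeExampleDESeqDataSet, and the column names batch and histology are placeholders I made up for the example, not my real data. As far as I understand, the size factors are estimated without using the design, so the log normalized counts are the same whichever design I provide:

```r
library(DESeq2)

set.seed(1)
# Simulated counts; 'batch' and 'histology' are hypothetical stand-ins for my real columns
dds <- makeExampleDESeqDataSet(n = 2000, m = 12)
dds$batch <- factor(rep(c("Kapa", "Kapa", "Takara"), times = 4))
dds$histology <- factor(rep(c("A", "B"), each = 6))

# Same counts, two different designs
dds_batch <- dds
design(dds_batch) <- ~ batch
dds_full <- dds
design(dds_full) <- ~ batch + histology

# Size factors come from the median-of-ratios method, which does not use the design
dds_batch <- estimateSizeFactors(dds_batch)
dds_full <- estimateSizeFactors(dds_full)

# Log normalized counts under each design
norm_batch <- log2(counts(dds_batch, normalized = TRUE) + 1)
norm_full <- log2(counts(dds_full, normalized = TRUE) + 1)

# Compare the two matrices
same_norm <- isTRUE(all.equal(norm_batch, norm_full))
print(same_norm)
```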
Then, I saw the following solution: after normalization and dispersion estimation, compute the variance-stabilized data (vsd) and apply limma::removeBatchEffect to it. I read that it should be better to use vst(dds, blind = FALSE). With blind=FALSE, the sample information provided in the design formula is used.
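For concreteness, this is roughly the workflow I have in mind, as a minimal sketch on simulated data. The dataset (makeExampleDESeqDataSet), the column names batch and histology, and the choice of histology as the only variable to protect are assumptions for the example; nsub = 500 is only there because the simulated dataset is small:

```r
library(DESeq2)
library(limma)

set.seed(1)
# Simulated counts; 'batch' and 'histology' are hypothetical stand-ins for my real columns
dds <- makeExampleDESeqDataSet(n = 2000, m = 12)
dds$batch <- factor(rep(c("Kapa", "Kapa", "Takara"), times = 4))
dds$histology <- factor(rep(c("A", "B"), each = 6))
design(dds) <- ~ batch + histology

# blind = FALSE: the design is taken into account when estimating dispersions
# for the variance-stabilizing transformation
vsd <- vst(dds, blind = FALSE, nsub = 500)
mat <- assay(vsd)

# Variables to preserve go into 'design' (without the batch term);
# the batch itself is passed through the 'batch' argument
keep <- model.matrix(~ histology, data = as.data.frame(colData(vsd)))
mat_corrected <- limma::removeBatchEffect(mat, batch = vsd$batch, design = keep)
```

mat_corrected (genes in rows, samples in columns) would then be the matrix I pass to decoupleR, if this approach is valid.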
At first I wanted to provide only the batch information in the design formula. Nevertheless, to explain the counts, it seems better to provide the main clinical information (gender, histology, grade, age, ...) as well as the batch.
My question is the following: if I include all these variables, could I introduce a bias or an artificial signal? Is this the best strategy if I want to perform an unsupervised pathway analysis afterwards?
I am not sure my concern is very clear, but the idea is that I want to provide input data to decoupleR on which there is no batch effect and no effect added artificially when computing vst.
If this is the way to go, could you recommend how many variables should be used in the model? I have at least 6 of them, but from the unsupervised deconvolution I see that some have no effect. Should I use only the 2 to 4 with the strongest effect (plus the batch)?
Thanks in advance for your feedback