I've been trying to analyze our RNA-seq samples against samples obtained from external data sources. For example, comparing breast cancer samples from one study with those from TCGA.
Even if the samples are biologically equivalent, there are many differences between them because of "batch" effects (different kits, different populations, different machines, etc.).
What I'm doing is:
1. Create a DESeq2 object in R
2. Obtain FPM values - fpm(deseqobject)
3. Remove noise genes (low reads, pseudogenes, etc.)
4. Compare clustering algorithms and visualize everything with a PCA
Of course, when doing this, all samples from one study cluster together, while the samples from TCGA form another cluster. I don't want to do differential expression analysis; I need the FPM counts of all samples.
Is there a way to remove all "batch" effects in DESeq2 for this purpose (using batch as a covariate, maybe)?
Or should I remove batch effect by using limma or ComBat?
I know there are some answers here and there, but most people want to remove batch effects in order to do differential expression, so I thought asking would be best.
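For the clustering and PCA use case described above, one commonly used option is limma's removeBatchEffect applied to log-scale expression values (it is meant for visualization and clustering, not as a preprocessing step before linear modelling). The following is only a minimal sketch under stated assumptions: the counts are simulated, and the names `counts`, `coldata`, `inhouse` and `tcga` are placeholders rather than anything from the original analysis.

```r
library(DESeq2)
library(limma)

set.seed(1)
n_genes <- 500
n_per_batch <- 6
n_samples <- 2 * n_per_batch

# Simulated counts (genes x samples), standing in for real data
counts <- matrix(
  rnbinom(n_genes * n_samples, mu = 200, size = 5),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("s", seq_len(n_samples)))
)

# Artificial study-specific shift on the first 100 genes of the second batch
tcga_cols <- (n_per_batch + 1):n_samples
counts[1:100, tcga_cols] <- counts[1:100, tcga_cols] * 3L

coldata <- data.frame(
  batch = factor(rep(c("inhouse", "tcga"), each = n_per_batch)),
  row.names = colnames(counts)
)

dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                              design = ~ batch)
dds <- estimateSizeFactors(dds)

# FPM on the log2 scale; removeBatchEffect expects log-expression values
log_fpm <- log2(fpm(dds) + 1)

# Fit a per-gene batch term and subtract it
corrected <- removeBatchEffect(log_fpm, batch = dds$batch)

# PCA before and after correction, coloured by batch
pca_before <- prcomp(t(log_fpm))
pca_after <- prcomp(t(corrected))
plot(pca_before$x[, 1:2], col = as.integer(dds$batch), pch = 19,
     main = "Before correction")
plot(pca_after$x[, 1:2], col = as.integer(dds$batch), pch = 19,
     main = "After correction")
```

The corrected matrix is on the log2(FPM + 1) scale, so it suits PCA and clustering rather than anything that expects raw counts. If a known biological grouping exists within both batches, it can be protected by passing a model matrix through the `design` argument.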
I've been researching this, because it seems like integrating data across studies is very important. But, of course, the "batch" effect is a big problem. Many papers are addressing this, and I've found some nice tools such as ComBat-seq (an adaptation of ComBat for RNA-seq count data), which can be used for more extreme batch differences between samples (https://academic.oup.com/nargab/article/2/3/lqaa078/5909519).
Of course there is no "magic" math, but real researchers are trying different approaches to solve this type of problem, even though you might not like this type of analysis.
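As a rough illustration of how ComBat-seq is called, here is a sketch using `ComBat_seq` from the sva package on simulated counts. The data, the batch labels and the size of the artificial shift are all assumptions made for the example; with real data, `counts` would be the raw count matrix and `batch` the study of origin.

```r
library(sva)

set.seed(1)
n_genes <- 300

# Simulated raw counts (genes x samples); placeholder for real data
counts <- matrix(
  rnbinom(n_genes * 12, mu = 200, size = 5),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("s", 1:12))
)

# Artificial batch shift on the first 50 genes of the second study
counts[1:50, 7:12] <- counts[1:50, 7:12] * 3L

batch <- rep(c("inhouse", "tcga"), each = 6)

# group = NULL: no biological covariate is protected during adjustment
adjusted <- ComBat_seq(counts, batch = batch, group = NULL)
```

The result is again a count matrix with the same dimensions, so it could in principle be passed to DESeq2 and fpm() like the original counts. Whether the adjustment is appropriate still depends on the study design, since with `group = NULL` every systematic difference between the two studies is treated as batch.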
It's not about what I like. If batch effect is confounded with experimental differences, there is no algorithm that will separate them.
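A small base-R sketch of what confounding means here, using a made-up sample layout (the factor names `study` and `subtype` are hypothetical). When every sample of one subtype comes from one study, the batch column and the biology column of the design matrix are identical, so the two effects cannot be estimated separately.

```r
# Hypothetical layout: each study contains exactly one subtype
study <- factor(rep(c("inhouse", "tcga"), each = 4))
subtype <- factor(rep(c("A", "B"), each = 4))

confounded <- model.matrix(~ study + subtype)
qr(confounded)$rank  # lower than the number of columns
ncol(confounded)

# Crossed layout: both subtypes are present in both studies
subtype_crossed <- factor(rep(c("A", "B"), times = 4))

crossed <- model.matrix(~ study + subtype_crossed)
qr(crossed)$rank == ncol(crossed)  # full rank, effects are separable
```

In the confounded layout, any method that removes the study effect removes the subtype difference along with it.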