Hi,
TL;DR: What is the best metric for an expression matrix with RNAseq data obtained in multiple batches?
Background
I'm working on a project where we're compiling RNAseq data on a bunch of human samples. The data has been accumulating for the past few years and will continue to come in. The data is from different sequencing platforms (HiSeq/NovaSeq) and has other differences (such as the fact that some of them were multiplexed while others were not, essentially giving me anywhere between 2-16 PE FASTQ files per sample). The raw data is also multi-species (host+graft) and gets separated before being quantified with STAR+RSEM.
Objective
My question here is: How do I create one matrix that allows cross-sample comparison of gene expression?
Challenges
First off, FPKM and RPKM are off the table because they don't allow for comparison across samples. TPM is an option as it does allow some sort of comparison across samples, although there exists some debate. Quantile/Upper Quartile normalization is another option.
Then of course is the issue of batch effects. An option to deal with that is sva::ComBat()
, which "removes" batch effects and gives me a batch-effect-subtracted matrix. However, it does seem to affect my data weirdly, producing negative values for count data, which doesn't make any sense. Because there is no DE question, using DESeq2 doesn't make a lot of sense to me (unless DESeq2 can somehow export a TPM or count matrix with batch effect corrected)
Discussion
What does the community think is the best way to go about doing this? What is the metric to use (TPM / counts) and what sort of normalization (Quantile Normalization / UQ Normalization / something like rlog - however that would apply)? And how do I account for the batch effect? An ideal approach would allow me to add a new set of samples without altering existing samples' values, but that might be asking for too much.
Afterthought
I'd settle on UQ-normalized counts "because TCGA does it", but that doesn't seem a satisfactory reason to adopt that approach, especially given the fact that UQ-normalization does not really help with batch effect (why are TCGA even using it?)
I haven't read this in a while, but the article by Gilad & Mizrahi-Man discussing batch effects of the modENCODE datasets may contain some relevant approaches