A somewhat simple question: at what point in an RNA-seq analysis is it most advantageous to perform quantile normalization for later between-sample comparisons? Would it be at the raw count level, since the documentation on quantile normalization appears to require raw data? Does quantile normalization expect data distributed as a negative binomial? Does it matter if it's simply looking at some log-based ranking ratio? Alternatively, some TCGA approaches quantile normalize TPM values, though I thought this had some unwanted effect on the scaling of the TPM values? Also, which rows should be removed prior to normalization? E.g., rows with all zeros wouldn't affect the TPM calculation, but they should have more or less effect on quantile normalization as the number of such rows increases, right?
At the moment, I'm debating between two approaches:
raw counts -> quantile normalize -> TPM -> filter out low-TPM rows -> log2(TPM + 1) -> ComBat batch correction
raw counts -> TPM -> filter out low-TPM rows -> quantile normalize -> log2(quantile-normalized TPM + 1) -> ComBat batch correction
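For reference, the quantile normalization step that appears in both pipelines can be sketched in a few lines. This is only a minimal Python/NumPy illustration, not the implementation used by limma or by any TCGA pipeline; the function name quantile_normalize and the toy matrix are made up for the example, and ties are broken by order of appearance rather than averaged.

```python
import numpy as np

def quantile_normalize(mat):
    """Quantile normalize a genes x samples matrix (hypothetical helper).

    Each value is replaced by the across-sample mean of the values that
    share its within-sample rank. Ties are broken by row order (no averaging).
    """
    mat = np.asarray(mat, dtype=float)
    # Row indices that would sort each column (sample) independently.
    order = np.argsort(mat, axis=0, kind="stable")
    sorted_vals = np.take_along_axis(mat, order, axis=0)
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = sorted_vals.mean(axis=1)
    out = np.empty_like(mat)
    # Write the reference value for rank k back to the gene that held rank k.
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out

# Toy matrix: 4 genes (rows) x 3 samples (columns).
toy = np.array([[5.0, 4.0, 3.0],
                [2.0, 1.0, 4.0],
                [3.0, 4.0, 6.0],
                [4.0, 2.0, 8.0]])
normalized = quantile_normalize(toy)
print(normalized)
```

The procedure only uses within-sample ranks and the across-sample mean at each rank, so after normalization every column holds the same set of values, whatever the input scale was.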
Any suggestions would be appreciated. M
I'm unsure why you are seemingly trying to re-invent the wheel here... (?). Limma/Voom will employ quantile normalisation on your raw counts, if you choose to use that.
I would not be confident of your data after passing it through batch correction, logging, and then quantile normalisation and TPM.
Why not keep this simple and feed your raw counts into Limma/Voom, edgeR, or DESeq2? I also recommend avoiding the use of ComBat at all costs. If you use DESeq2, include batch as a covariate in your design model.
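To illustrate what including batch as a covariate means, here is a rough Python/NumPy sketch. It is not DESeq2 (which fits a negative binomial GLM to raw counts); it swaps in an ordinary least-squares fit on log-scale values for a single gene, purely to show the idea. The function name fit_with_batch and the simulated values are assumptions made for the example.

```python
import numpy as np

def fit_with_batch(log_expr, condition, batch):
    """Least-squares fit of  log_expr ~ 1 + batch + condition  for one gene.

    Hypothetical helper: condition and batch are 0/1 indicators per sample.
    The condition coefficient is the effect estimated with batch held fixed.
    """
    y = np.asarray(log_expr, dtype=float)
    design = np.column_stack([
        np.ones_like(y),                    # intercept
        np.asarray(batch, dtype=float),     # batch covariate
        np.asarray(condition, dtype=float)  # effect of interest
    ])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return {"intercept": coef[0], "batch": coef[1], "condition": coef[2]}

# Simulated, noise-free gene: batch shifts expression by 2, condition by 1.
condition = np.array([0, 0, 1, 1, 0, 0, 1, 1])
batch = np.array([0, 0, 0, 0, 1, 1, 1, 1])
log_expr = 5.0 + 2.0 * batch + 1.0 * condition

fit = fit_with_batch(log_expr, condition, batch)
print(fit)
```

The point of modelling batch this way is that the data themselves are left untouched; the batch shift is absorbed by its own coefficient instead of being subtracted from the expression values beforehand.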
You may want to take a look at this publication to understand the normalisation methods: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis.
That's what I ended up doing. My main issue was with between sample normalizations of TPM, say if you don't have easy access to raw count data. Presumably the way I worded my post didn't get this across as well as it should have.
I see, yes, the TCGA raw data is controlled / level 1 access (BAM files); I presume you were alluding to that. I was able to download raw HTSeq RNA-seq counts recently, though (Level 3 / open access). I believe that TPM is part of the RSEM RNA-seqv2 data.
Unfortunately it's not TCGA data. It's a proprietary dataset. Hypothetically, say I wasn't able to get raw counts, how would you suggest doing between-subject normalization for TPM? TCGA does upper quartile normalization on its FPKM and TPM data, but that just didn't seem to do enough for my data set after looking at the samplewise boxplots. Or maybe I'm just being too nitpicky. I've seen others suggest simply using offset log2 TPM? Maybe it's not as powerful as count-based approaches, but it seems to pick up the major players in the dataset.
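For concreteness, the two TPM-level options mentioned here (upper quartile scaling and offset log2) could look roughly like the following in Python/NumPy. This is a generic sketch and not TCGA's exact procedure; the function names, the choice to compute the quartile over non-zero values only, and the rescaling to the mean upper quartile are all assumptions made for the example.

```python
import numpy as np

def upper_quartile_normalize(expr):
    """Scale each sample (column) so its upper quartile matches the others.

    Hypothetical helper: the 75th percentile is taken over non-zero values
    only, and every column is assumed to contain at least one non-zero value.
    """
    expr = np.asarray(expr, dtype=float)
    uq = np.array([np.percentile(col[col > 0], 75) for col in expr.T])
    # Divide by each sample's upper quartile, then rescale to a common level.
    return expr / uq * uq.mean()

def log2_offset(expr, offset=1.0):
    """Offset log2 transform, i.e. log2(x + offset)."""
    return np.log2(np.asarray(expr, dtype=float) + offset)

# Toy TPM-like matrix: 5 genes (rows) x 2 samples (columns).
toy_tpm = np.array([[0.0, 10.0],
                    [4.0, 20.0],
                    [8.0, 40.0],
                    [12.0, 80.0],
                    [16.0, 0.0]])
scaled = upper_quartile_normalize(toy_tpm)
logged = log2_offset(scaled)
print(logged)
```

Upper quartile scaling only aligns a single point of each sample's distribution, which may be why the samplewise boxplots can still look uneven afterwards.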
As far as TPM is concerned, it is not ideal, at least according to these:
There are also other Biostars threads, but where TPM is concerned, there never appears to be a conclusive answer:
Also, if you want to start/end your day badly, take a look at the response of the senior Bioconductor member here: https://support.bioconductor.org/p/98820/
Did you get this sorted out?
Sorry for the delay. Yes, I ended up getting things to work. It's just an interesting question regarding how to use TPM/FPKM properly. I'm not sure if there are some benchmarking studies that look at this against the SEQC standards. If there aren't, it would probably make a good paper. I mean what's the point of using TPM/FPKM if they aren't the most reliable indicators of between sample differences in expression?
Indeed, I'm now following up on threads and posting this:
An update (6th October 2018):
You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:
Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis
Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units
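One way to see the scaling problem with FPKM is a small worked example, sketched here in Python/NumPy. The function names, gene lengths, and counts are invented for illustration, and the formulas are the standard textbook definitions rather than the output of any particular tool.

```python
import numpy as np

def fpkm(counts, lengths_bp):
    """Fragments per kilobase per million mapped fragments (genes x samples)."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float)[:, None] / 1e3
    per_million = counts.sum(axis=0) / 1e6  # library size in millions
    return counts / lengths_kb / per_million

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    counts = np.asarray(counts, dtype=float)
    rate = counts / np.asarray(lengths_bp, dtype=float)[:, None]
    return rate / rate.sum(axis=0) * 1e6

# Two toy samples with the same library size (60 fragments) but a
# different spread of expression across three genes.
counts = np.array([[10.0, 60.0],
                   [20.0, 0.0],
                   [30.0, 0.0]])
lengths = [1000, 2000, 3000]

print("FPKM column sums:", fpkm(counts, lengths).sum(axis=0))
print("TPM column sums: ", tpm(counts, lengths).sum(axis=0))
```

In this toy case the FPKM totals differ between the two samples even though the library sizes are identical, whereas TPM sums to one million in every sample by construction. A fixed per-sample total still does not, on its own, make TPM a between-sample normalization.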