Somewhat simple question: at what point in an RNA-seq analysis is it most advantageous to perform quantile normalization for downstream between-sample comparisons? Should it be done at the raw count level, since the documentation on quantile normalization appears to require raw data? Does quantile normalization expect the data to be negative binomially distributed, or does the input distribution not matter since the method only operates on ranks? Alternatively, some TCGA approaches quantile normalize TPM values, though I thought this had some unwanted effect on the scaling of the TPMs. Also, which rows should be removed prior to normalization? For example, rows of all zeros don't affect the TPM calculation, but their effect on quantile normalization should grow as their number increases, right?
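To make my zero-row concern concrete, here's a rough numpy sketch of my understanding of quantile normalization (the `quantile_normalize` helper and the toy counts are just mine, not from any package):

```python
import numpy as np

def quantile_normalize(mat: np.ndarray) -> np.ndarray:
    """Quantile-normalize the columns (samples) of a genes x samples matrix."""
    order = np.argsort(mat, axis=0, kind="stable")  # per-sample sort order
    ranks = np.argsort(order, axis=0)               # rank of each gene within its sample
    # Reference distribution: mean across samples at each rank.
    mean_by_rank = np.sort(mat, axis=0).mean(axis=1)
    # Ties (e.g. all-zero rows) are broken arbitrarily in this sketch;
    # real implementations such as limma's normalizeQuantiles average over ties.
    return mean_by_rank[ranks]

# Toy example: the all-zero row occupies the lowest rank in every sample,
# so adding more such rows drags the reference distribution toward zero.
counts = np.array([[  0,  0,   0],
                   [  5,  3,   8],
                   [ 10, 12,   6],
                   [100, 80, 120]], dtype=float)
print(quantile_normalize(counts))
```

As the sketch shows, only the ranks enter the calculation, which is what makes me question whether the distributional assumptions matter at all.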
At the moment, I'm debating between two approaches (rough code sketch after the list):
Approach 1: raw counts -> quantile normalize -> TPM -> filter out low-TPM rows -> log2(TPM + 1) -> ComBat batch correction
Approach 2: raw counts -> TPM -> filter out low-TPM rows -> quantile normalize -> log2(quantile-normalized TPM + 1) -> ComBat batch correction
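In code, the two orderings would look roughly like this (reusing the `quantile_normalize` helper from the sketch above; the random negative-binomial counts, gene lengths, and the 1-TPM filter threshold are all placeholders):

```python
import numpy as np
rng = np.random.default_rng(0)

def tpm(counts: np.ndarray, lengths_kb: np.ndarray) -> np.ndarray:
    """Raw counts (genes x samples) plus gene lengths in kb -> TPM."""
    rate = counts / lengths_kb[:, None]   # length-normalized counts
    return rate / rate.sum(axis=0) * 1e6  # rescale each sample to sum to 1e6

# Placeholder data just to make the pipeline runnable.
counts = rng.negative_binomial(5, 0.1, size=(1000, 6)).astype(float)
lengths_kb = rng.uniform(0.5, 10.0, size=1000)

# Approach 1: quantile normalize the raw counts, then convert to TPM.
tpm1 = tpm(quantile_normalize(counts), lengths_kb)
keep1 = tpm1.max(axis=1) >= 1.0           # placeholder low-TPM filter
expr1 = np.log2(tpm1[keep1] + 1)

# Approach 2: TPM first, filter low-TPM rows, then quantile normalize.
tpm2 = tpm(counts, lengths_kb)
keep2 = tpm2.max(axis=1) >= 1.0
expr2 = np.log2(quantile_normalize(tpm2[keep2]) + 1)

# Either way, ComBat batch correction would come last (e.g. sva::ComBat
# in R, or a Python port; not shown here).
```

Note that in Approach 1 the per-sample TPM rescaling happens after quantile normalization, so the samples no longer end up with strictly identical distributions, whereas in Approach 2 the final log2 values do.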
Any suggestions would be appreciated. M