The topic of RNA-Seq data normalisation, and which unit to use (FPKM, TPM, CPM, etc.) in an analysis, has attracted tens of questions on here (e.g. 1, 2, 3, 4). The confusion mainly arises from the fact that in RNA-Seq we are dealing with relative measures of expression, and what exactly the measure is relative to has major implications for downstream analysis and conclusions. My questions/concerns focus mainly on A (e.g. control) vs B (e.g. treatment) comparisons (with X replicates), which are common experimental designs. I'd appreciate any comments from the community, thanks : )
1) PCA/t-SNE plots are common tools for visualising the data, usually to see whether the "A"s and the "B"s predominantly cluster together. PCA is typically performed on the raw counts or on FPKM (log2(FPKM + 1)). But are these suitable units for such an analysis? If we are using PCA/t-SNE to indirectly compare the "A"s and the "B"s, surely the values need to be comparable across samples, in which case some form of between-sample normalisation needs to be performed first. If that's the case, then such an analysis can't be called unsupervised anymore.
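To make the comparability concern in (1) concrete, here is a minimal sketch using made-up counts (the samples and numbers are hypothetical, not real data). Two samples with identical composition but different sequencing depth look far apart on log2(count + 1), and the gap disappears after simple library-size scaling to CPM. CPM only corrects for depth; it does not address composition bias.

```python
import math

def cpm(counts):
    """Scale one sample's raw counts to counts per million mapped reads."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]

def log2p1(values):
    """log2(x + 1) transform, as commonly applied before PCA."""
    return [math.log2(v + 1.0) for v in values]

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy data: same relative composition, sample_b sequenced 3x deeper.
sample_a = [10, 200, 50, 740]
sample_b = [30, 600, 150, 2220]

# Distance on log-transformed raw counts is driven purely by depth here.
raw_dist = euclidean(log2p1(sample_a), log2p1(sample_b))

# After scaling each sample by its own library size the profiles coincide.
cpm_dist = euclidean(log2p1(cpm(sample_a)), log2p1(cpm(sample_b)))

print(f"raw: {raw_dist:.3f}  cpm: {cpm_dist:.3f}")
```

Since PCA is driven by variance between samples, a depth difference like this one could dominate the leading components if the raw counts were used directly.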
2) Following on from (1): which normalisation and unit should one use for predictive modelling? If I want to build a classifier to discriminate between A and B, the values must be comparable across samples (a fundamental assumption). My understanding is that units such as FPKM are not comparable across samples and would therefore be unsuitable for this kind of analysis. I guess in this case one can see predictive modelling as a re-phrasing of the differential expression analysis problem: finding signatures that discriminate between A and B.
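As an illustration of why I think FPKM is awkward across samples, here is a small sketch with hypothetical counts and gene lengths. It uses the standard definitions: FPKM = count * 1e9 / (total fragments * length in bp), and TPM = the length-normalised rate rescaled so that each sample sums to one million. The FPKM totals differ between samples, whereas the TPM totals are fixed. A fixed total does not by itself remove composition effects, so this is not a claim that TPM solves the problem.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: per-base rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    denom = sum(rates)
    return [r * 1e6 / denom for r in rates]

# Hypothetical toy data: three genes, two samples.
lengths_bp = [1000, 2000, 500]
counts_a = [100, 200, 50]
counts_b = [100, 200, 500]   # third gene strongly up in sample B

fpkm_a, fpkm_b = fpkm(counts_a, lengths_bp), fpkm(counts_b, lengths_bp)
tpm_a, tpm_b = tpm(counts_a, lengths_bp), tpm(counts_b, lengths_bp)

# FPKM totals depend on the sample; TPM totals are 1e6 by construction.
print("FPKM sums:", round(sum(fpkm_a)), round(sum(fpkm_b)))
print("TPM sums: ", round(sum(tpm_a)), round(sum(tpm_b)))
```

Note that genes 1 and 2 have identical raw counts in both samples, yet their FPKM and TPM values drop in sample B simply because gene 3 takes a larger share of the reads.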
3) Following on from (1) + (2): which between-sample normalisation (BSN) methods are the most appropriate in such a case, e.g. TMM? Any others?
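For reference, one alternative to TMM that I am aware of is the median-of-ratios approach used by DESeq/DESeq2. The sketch below is my own simplified pure-Python version on hypothetical counts, not the library implementation, and it is not TMM (which is implemented in edgeR). Each sample's size factor is the median, over genes, of the ratio between its count and the gene's geometric mean across samples.

```python
import math
from statistics import median

def size_factors(count_matrix):
    """Median-of-ratios size factors.

    count_matrix: list of genes, each a list of raw counts (one per sample).
    """
    n_samples = len(count_matrix[0])
    # Use only genes with non-zero counts in every sample (log(0) is undefined).
    usable = [row for row in count_matrix if all(c > 0 for c in row)]
    # Log of each gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical toy data (genes x samples): sample 2 is sequenced 2x deeper,
# and the fourth gene is strongly up-regulated in sample 2.
counts = [
    [100, 200],
    [50, 100],
    [30, 60],
    [10, 2000],
    [0, 5],      # dropped: zero count in sample 1
]

sf = size_factors(counts)
normalised = [[c / f for c, f in zip(row, sf)] for row in counts]
print([round(f, 4) for f in sf])
```

Because the median is used, the single strongly changed gene does not shift the factors: they still differ by the 2x depth ratio, which total-count scaling would overestimate in this example.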
Thanks in advance for any comments/pointers to the literature!