Question: RNA-Seq data: The "right" normalisation and unit for the job
2.5 years ago by
United Kingdom
Sirio30 wrote:

The topic of RNA-Seq data normalisation, and which unit to use (FPKM, TPM, CPM, etc.) in an analysis, has attracted tens of questions on here (e.g. 1, 2, 3, 4). The confusion mainly arises from the fact that in RNA-Seq we are dealing with relative measures of expression, and what this "relative" refers to has major implications for downstream analysis and conclusions. My questions/concerns focus mainly on A (e.g. control) vs B (e.g. treatment) comparisons with X replicates, which are common experimental designs. I'd appreciate any comments from the community, thanks : )

1) PCA/t-SNE plots are common tools to visualise the data. This is usually done to see whether the "A"s and "B"s predominantly cluster together. PCA is typically performed on the raw counts or on FPKM (log2(FPKM + 1)). But are these suitable units for such analysis? If we are using PCA/t-SNE to indirectly compare the "A"s and the "B"s, surely the samples need to be comparable, in which case some form of between-sample normalisation needs to be performed first. If that's the case, then such analysis can't be called unsupervised anymore.
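
To make the choice of scale in (1) concrete, here is a minimal sketch (standard-library Python, with made-up counts) of how much each gene contributes to the total variance on the raw scale versus the log2(x + 1) scale. The gene names and numbers are hypothetical, and the share of summed per-gene variance is only a rough proxy for what the first principal components will pick up. It says nothing about between-sample normalisation.

```python
import math
from statistics import pvariance

# Hypothetical counts: keys are genes, values are 4 samples (A, A, B, B).
counts = {
    "high_gene": [50000, 70000, 52000, 68000],  # highly expressed, unrelated to A vs B
    "low_gene": [10, 12, 80, 95],               # lowly expressed, differs between A and B
}

def variance_share(table):
    """Fraction of the summed per-gene variance contributed by each gene."""
    variances = {gene: pvariance(values) for gene, values in table.items()}
    total = sum(variances.values())
    return {gene: v / total for gene, v in variances.items()}

# Same table after a log2(x + 1) transform.
logged = {gene: [math.log2(v + 1) for v in values] for gene, values in counts.items()}

raw_share = variance_share(counts)   # dominated by high_gene
log_share = variance_share(logged)   # dominated by low_gene
```

With these numbers, almost all of the raw-scale variance comes from the highly expressed gene, while on the log scale most of it comes from the gene that actually differs between A and B.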

2) Following on from (1): What normalisation + unit should one use for predictive modelling? If I want to build a classifier to discriminate between A and B, these values should be comparable (fundamental assumption). My understanding is that units such as FPKM are not comparable across samples and therefore would be unsuitable for this kind of analysis. I guess in this case one can see predictive modelling as a re-phrasing of the differential expression analysis problem; finding signatures that discriminate between A and B.

3) Following on from (1) + (2): Which between-sample normalisation (BSN) methods are the most appropriate for such cases, e.g. TMM? Any others?

Thanks in advance for any comments/pointers to the literature!

normalisation rna-seq units • 1.5k views
Modified 2.5 years ago by Devon Ryan90k • written 2.5 years ago by Sirio30
2.5 years ago by
Devon Ryan90k
Freiburg, Germany
Devon Ryan90k wrote:
  1. PCA (or t-SNE) on raw counts is usually a waste of time. The high-signal genes will drive the first couple of PCs and likely mask what the rest of the transcriptome is doing. Some sort of log transform (e.g., rlog in DESeq2) is going to be more useful.
  2. Making a classifier that uses exact FPKMs doesn't make much sense. Classifying according to where signals are relative to each other would make more sense.
  3. The normal methods are TMM and RLE (what DESeq2 does). The results are close enough to each other that it doesn't matter which one you use.
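
As an illustration of point 3, the following is a simplified sketch of the median-of-ratios idea behind RLE size factors, in standard-library Python. It is not DESeq2's implementation. The function name `size_factors` and the toy counts are assumptions made for this example, and genes with a zero count in any sample are simply skipped.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of per-sample raw counts.
    """
    n_samples = len(counts[0])
    # Keep genes with non-zero counts in every sample so the log is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of each gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical data: sample 2 is sequenced twice as deep as sample 1,
# and the last gene is strongly up-regulated in sample 2.
counts = [
    [100, 200],
    [50, 100],
    [10, 20],
    [30, 60],
    [5, 500],
]
factors = size_factors(counts)
normalised = [[c / f for c, f in zip(row, factors)] for row in counts]
```

Because the median is taken over genes, the single strongly changed gene does not distort the factors: the second sample's factor comes out at twice the first, and the unchanged genes end up with equal normalised values.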
Written 2.5 years ago by Devon Ryan90k

Using ranks (across conditions) would be an alternative to log, if you have many samples.

Reply written 2.5 years ago by unksci150

Funny, I happened to be thinking about that this morning. My only concern with ranks is that they push apart similarly expressed genes, but this might be an overblown concern on my part. I think using some sort of "frozen" normalization (i.e., something akin to frozen RMA for RNAseq...I presume someone has already come up with this) combined with rlog or some similar transform would yield nice results that let you do predictions.
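
A small sketch of both sides of the rank idea, using hypothetical expression values and a helper `ranks` written for this example (ties are not handled): ranks within a sample do not change under a per-sample scaling factor, but they also separate 100 from 101 by as much as 101 from 1000.

```python
def ranks(values):
    """Rank values from 1 (lowest) upwards; ties are not handled in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

sample = [100, 101, 1000, 5]         # hypothetical values for four genes
scaled = [v * 3.7 for v in sample]   # same sample under a different scaling factor

sample_ranks = ranks(sample)   # [2, 3, 4, 1]
scaled_ranks = ranks(scaled)   # identical to sample_ranks
```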

Reply modified 2.5 years ago • written 2.5 years ago by Devon Ryan90k

Depending on the overall strategy, you could also set some minimal threshold (e.g. on the absolute difference in molecules, the fold change, the log fold change, or some statistical test). Similarly, one way to avoid focusing on spurious differences is to use biological and technical replicates.

Reply written 2.5 years ago by unksci150

Thanks for the concise and clear response!

Coming from a maths/statistics background, I'm used to working with data that are directly comparable because the "measuring device" is consistent. Although the sequencer is typically the same, the various biases that are introduced (my understanding is that these mostly have to do with library preparation) make A vs B comparisons less than straightforward.

What I'd like to confirm and convince myself of is that any unit derived from a within-sample normalisation (e.g. FPKM/TPM/CPM) is not suitable for differential expression analysis, and hence that computing fold changes as:

log2 fold change = log2(FPKM_A / FPKM_B)

is incorrect. See this paper (page 4).

Reply written 2.5 years ago by Sirio30

I would discriminate between incorrect and unreliable. These sorts of metrics can lead to correct results, but whether they will depends on whether the biases are the same (or close enough) across samples. The classic example of this is different amounts of rRNA signal across samples leading to FPKM/TPM/etc. metrics that aren't comparable. There are ways around this, but naive comparisons are problematic.
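
The rRNA example can be sketched with made-up numbers. Below, two hypothetical samples have identical counts for two genes but different amounts of rRNA signal. The helper `cpm` and all values are assumptions for illustration only.

```python
import math

def cpm(counts):
    """Counts per million: each count divided by the sample total, times 1e6."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}

# Hypothetical samples: identical gene counts, different rRNA carry-over.
sample_a = {"rRNA": 100_000, "gene1": 500, "gene2": 1_500}
sample_b = {"rRNA": 900_000, "gene1": 500, "gene2": 1_500}

cpm_a, cpm_b = cpm(sample_a), cpm(sample_b)

# Naive between-sample comparison: a large apparent change for gene1.
log2_fc = math.log2(cpm_a["gene1"] / cpm_b["gene1"])

# Within-sample comparison: the gene1/gene2 ratio is the same in both samples.
ratio_a = cpm_a["gene1"] / cpm_a["gene2"]
ratio_b = cpm_b["gene1"] / cpm_b["gene2"]
```

Here the naive log2 fold change for gene1 is above 3 even though its count is unchanged, while the relationship between the two genes within each sample is preserved.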

Reply written 2.5 years ago by Devon Ryan90k

Sure, I guess I'm feeling uneasy because I don't fully understand the extent of these confounding factors (e.g. you mentioned rRNA abundance). I think that there are also multiple "biological/chemical" confounding factors that we cannot measure and/or are unaware of. In view of this, surely the most sensible thing to do is to use TMM/RLE etc., as this is our current best estimate of a comparable metric. So when I see a Support Vector Machine, Random Forest etc. applied to FPKM data, surely I should frown : )

Reply written 2.5 years ago by Sirio30

I frown whenever I see "FPKM", for whatever that's worth :P

Reply written 2.5 years ago by Devon Ryan90k

I'd go so far as to argue that there is no specific reason why log2 itself should always be the only comparison, e.g.:

a) changes in absolute numbers of molecules can also be meaningful (and, for instance, more suitable for many single-cell gene expression analyses)

b) there is a threshold at which pooled expression levels cross the boundary below one molecule per cell (in these cases log2 transformations would partially mask the distinction between cells expressing genes and gradual tuning, which can have different molecular reasons)

c) a stronger log change is easier to achieve if the number of molecules is low (and genes with different expression levels can enrich for different biological functions...)

d) that said, there is also nothing particularly wrong with ratios (given caution about their implicit assumptions on biology)
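
Points (a) and (c) can be illustrated with a tiny worked example. The molecule counts below are hypothetical, and `compare` is a helper written for this sketch.

```python
import math

def compare(before, after):
    """Return (absolute difference, log2 fold change) for two molecule counts."""
    return after - before, math.log2(after / before)

low = compare(2, 8)          # hypothetical lowly expressed gene: (6, 2.0)
high = compare(1000, 1500)   # hypothetical highly expressed gene: 500 molecules, log2 FC below 1
```

The lowly expressed gene shows the larger log2 change with only six extra molecules, while the highly expressed gene gains 500 molecules for a log2 change of less than 1.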

Reply written 2.5 years ago by unksci150