Question

RNA-Seq data: The "right" normalisation and unit for the job

3

7.5 years ago

Sirio ▴ 30

The topic of RNA-Seq data normalisation, and which unit (FPKM, TPM, CPM, etc.) one should use in an analysis, has attracted tens of questions on here (e.g. 1, 2, 3, 4, etc.). The confusion mainly arises from the fact that in RNA-Seq we are dealing with relative measures of expression, and what this "relative" refers to has major implications for downstream analysis and conclusions. My questions/concerns focus mainly on A (e.g. control) vs B (e.g. treatment) comparisons (with X replicates), which are common experimental designs. I'd appreciate any comments from the community, thanks : )

1) PCA/t-SNE plots are common tools to visualise the data. This is usually done to see whether the "A"s and "B"s predominantly cluster together. PCA is typically performed on the raw counts or on FPKM (log2(FPKM + 1)). But are these suitable units for such analysis? If we are using PCA/t-SNE to indirectly compare the "A"s and the "B"s, surely these need to be comparable, in which case some form of between-sample normalisation needs to be performed first. If that's the case, then such analysis can't be called unsupervised anymore.

2) Following on from (1): What normalisation + unit should one use for predictive modelling? If I want to build a classifier to discriminate between A and B, these values should be comparable (fundamental assumption). My understanding is that units such as FPKM are not comparable across samples and therefore would be unsuitable for this kind of analysis. I guess in this case one can see predictive modelling as a re-phrasing of the differential expression analysis problem; finding signatures that discriminate between A and B.

3) Following on from (1) + (2): Which between-sample normalisation (BSN) methods are the most appropriate for such cases, e.g. TMM? Any others?

Thanks in advance for any comments/pointers to the literature!

RNA-Seq units normalisation • 3.9k views

ADD COMMENT • link updated 7.5 years ago by Devon Ryan 104k • written 7.5 years ago by Sirio ▴ 30

score 5 · Answer 1 · 2016-11-01

5

7.5 years ago

Devon Ryan 104k

1) PCA (or t-SNE) on raw counts is usually a waste of time. The high-signal genes will drive the first couple of PCs and likely mask what the rest of the transcriptome is doing. Some sort of log transform (e.g., rlog in DESeq2) is going to be more useful.
2) Making a classifier that uses exact FPKMs doesn't make much sense. Classifying according to where signals are relative to each other would make more sense.
3) The normal methods are TMM and RLE (what DESeq2 does). The results are close enough to each other that it doesn't matter which one you use.
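
A minimal sketch of the idea in Python, assuming a genes x samples matrix of raw counts. The size factors follow the median-of-ratios (RLE-style) recipe, and a plain log2(x + 1) stands in for a variance-stabilising transform; it is not rlog, and the function names and toy numbers are invented for illustration.

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios (RLE-style) size factors; counts is genes x samples."""
    counts = np.asarray(counts, dtype=float)
    expressed = (counts > 0).all(axis=1)            # keep genes with no zero counts
    log_counts = np.log(counts[expressed])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)  # per-gene reference
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

def log_pca(counts, n_components=2):
    """PCA scores (samples x components) of log2(normalised count + 1)."""
    norm = np.asarray(counts, dtype=float) / size_factors(counts)
    logged = np.log2(norm + 1.0).T                  # samples x genes
    centred = logged - logged.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy data (invented): samples 3 and 4 are sequenced about twice as deep,
# and the last gene is genuinely higher in them.
toy = np.array([[10, 12, 20, 24],
                [100, 90, 200, 180],
                [50, 55, 100, 110],
                [5, 6, 40, 48]])
print(size_factors(toy))
print(log_pca(toy))
```

Dividing by the size factors before the log keeps sequencing depth from dominating the first PC, and the log keeps a handful of very highly expressed genes from doing the same.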

ADD COMMENT • link 7.5 years ago by Devon Ryan 104k

1

Using ranks (across conditions) would be an alternative to log, if you have many samples.
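
One possible reading of this, sketched in Python: replace each gene's values by their rank across samples. The function name and the numbers are made up, and ties are broken by sample order rather than averaged.

```python
import numpy as np

def rank_across_samples(expr):
    """Replace each gene's values (rows) by their rank across samples (columns).

    Ties are broken by column order (stable sort), not averaged.
    """
    expr = np.asarray(expr, dtype=float)
    order = expr.argsort(axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(expr.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, expr.shape[1] + 1)
    return ranks

# Invented values for two genes in three samples
expr = [[100, 5000, 101],
        [3, 1, 2]]
print(rank_across_samples(expr))
```

Note that 100 and 101 end up exactly as far apart in rank as 101 and 5000, so the size of a difference is discarded along with the scale.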

ADD REPLY • link 7.5 years ago by unksci ▴ 180

0

Funny, I happened to be thinking about that this morning. My only concern with ranks is that they push apart similarly expressed genes, but this might be an overblown concern on my part. I think using some sort of "frozen" normalization (i.e., something akin to frozen RMA for RNAseq...I presume someone has already come up with this) combined with rlog or some similar transform would yield nice results that let you do predictions.
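
A rough sketch of what a "frozen" normalisation could look like, purely as an illustration: the per-gene reference is learned once from a training set and then reused unchanged for each new sample. This is an assumption about the idea, not frozen RMA or any published RNA-seq method, and `fit_reference` / `apply_frozen` are hypothetical names.

```python
import numpy as np

def fit_reference(train_counts):
    """Per-gene mean log count over training samples (genes x samples).

    Genes with a zero count in any training sample get NaN and are skipped later.
    """
    train = np.asarray(train_counts, dtype=float)
    with np.errstate(divide="ignore"):
        logs = np.log(train)
    ref = logs.mean(axis=1)
    ref[~np.isfinite(ref)] = np.nan
    return ref

def apply_frozen(new_counts, ref):
    """Normalise one new sample against the frozen reference, then log2(x + 1)."""
    new = np.asarray(new_counts, dtype=float)
    usable = np.isfinite(ref) & (new > 0)
    size_factor = np.exp(np.median(np.log(new[usable]) - ref[usable]))
    return np.log2(new / size_factor + 1.0), size_factor

# Invented training set (3 genes x 2 samples) and two "new" samples
# that differ only in sequencing depth.
ref = fit_reference([[10, 20], [100, 200], [50, 100]])
values_a, sf_a = apply_frozen([20, 200, 100], ref)
values_b, sf_b = apply_frozen([40, 400, 200], ref)
print(sf_a, sf_b)
print(values_a, values_b)
```

Because the reference never changes, a sample gets the same transformed values no matter which other samples it is processed with, which is the property a predictor needs.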

ADD REPLY • link 7.5 years ago by Devon Ryan 104k

1

Depending on the overall strategy you could also set some minimal threshold (e.g. on the absolute difference in molecules, the fold change, the log-fold change, or some statistical test). Similarly, one way to avoid focusing on spurious differences is to use biological and technical replicates.

ADD REPLY • link 7.5 years ago by unksci ▴ 180

0

Thanks for the concise and clear response!

Coming from a maths/statistics background, I'm used to working with data that are directly comparable because the "measuring device" is consistent. Although the sequencer is typically the same, the various biases that are introduced (my understanding is that this mostly has to do with library preparation) make A vs B comparisons far from straightforward.

What I'd like to confirm and convince myself of is that any unit that derives from a within-sample normalisation, e.g. FPKM/TPM/CPM, is not suitable for differential expression analysis, and hence that computing (log) fold changes as:

Fold change = log2(FPKM_A / FPKM_B)

is incorrect. See this paper (page 4).

ADD REPLY • link 7.5 years ago by Sirio ▴ 30

0

I would discriminate between incorrect and unreliable. These sorts of metrics can lead to correct results, but whether they will depends on whether the biases are the same (or close enough) across samples. The classic example of this is different amounts of rRNA signal across samples leading to FPKM/TPM/etc. metrics that aren't comparable. There are ways around this, but naive comparisons are problematic.
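
A toy illustration of the rRNA example in Python, with invented numbers: four mRNAs have identical abundance in A and B, while the rRNA fraction is 9x higher in B. Both samples are "sequenced" to the same depth with no noise, so any difference in CPM is purely compositional.

```python
import numpy as np

depth = 1_000_000
# Invented abundances (columns: A, B); the last row is rRNA
abundance = np.array([[100, 100],
                      [200, 200],
                      [300, 300],
                      [400, 400],
                      [1000, 9000]], dtype=float)
counts = abundance / abundance.sum(axis=0) * depth    # expected counts

# Within-sample unit: counts per million
cpm = counts / counts.sum(axis=0) * 1e6
naive_lfc = np.log2(cpm[:, 0] / cpm[:, 1])

# Between-sample normalisation: median-of-ratios size factors
log_counts = np.log(counts)
log_ratios = log_counts - log_counts.mean(axis=1, keepdims=True)
size_factors = np.exp(np.median(log_ratios, axis=0))
normalised = counts / size_factors
bsn_lfc = np.log2(normalised[:, 0] / normalised[:, 1])

print(np.round(naive_lfc, 2))   # unchanged mRNAs appear ~2.32 log2 units higher in A
print(np.round(bsn_lfc, 2))     # unchanged mRNAs come out at 0
```

The correction only works here because most genes are unchanged, which is the assumption behind both TMM and RLE. If the biases were the same in both samples, the naive CPM ratio would have given the right answer too.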

ADD REPLY • link 7.5 years ago by Devon Ryan 104k

0

Sure, I guess I'm feeling uneasy because I don't fully understand the extent of these confounding factors (e.g. you mentioned rRNA abundance). I think that there are also multiple "biological/chemical" confounding factors that we cannot measure and/or are unaware of. In view of this, surely the most sensible thing to do is to use TMM/RLE etc., as this is our current best estimate of a comparable metric. So when I see a Support Vector Machine, Random Forest etc. applied to FPKM data, surely I should frown : )

ADD REPLY • link 7.5 years ago by Sirio ▴ 30

0

I frown whenever I see "FPKM", for whatever that's worth :P

ADD REPLY • link 7.5 years ago by Devon Ryan 104k

1

I'd go so far as to argue that there is no specific reason why log2 itself should always be the only comparison, e.g.:

a) changes in absolute numbers of molecules can also be meaningful (and, for instance, more suitable for many single-cell gene expression analyses);

b) there is a threshold at which pooled expression levels cross the boundary below one molecule per cell (in these cases log2 transformations would partially mask the distinction between cells expressing genes and gradual tuning, which can have different molecular reasons);

c) a stronger log change is easier to reach if the number of molecules is low (and genes with different expression levels can be enriched for different biological functions...);

d) that said, there is also nothing particularly wrong with ratios (given caution about their implicit assumptions on biology).
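
A tiny worked example of points a) and c), with made-up molecule counts: the lowly expressed gene shows the larger log2 change, while the highly expressed gene shows the larger absolute change.

```python
import math

def log2_fold_change(before, after):
    """log2 ratio of two (positive) molecule counts."""
    return math.log2(after / before)

# Hypothetical molecule counts per cell: (before, after)
pairs = {"low": (1, 4), "high": (1000, 1300)}
for name, (before, after) in pairs.items():
    print(name, after - before, round(log2_fold_change(before, after), 2))
# low 3 2.0
# high 300 0.38
```

Which of the two changes matters more depends on the biology, which is why neither scale is the only sensible one.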

ADD REPLY • link 7.5 years ago by unksci ▴ 180