Question

CPM read count normalization: what does it mean between replicates of same group and within same replicate?

5.6 years ago

salamandra ▴ 550

Hi,

1- The table 'Normalization method' here says that CPM (counts per million) can be used for gene count comparisons between replicates of the same sample group.

1.1 Does this mean that, e.g., we can compare a gene in one 'control' sample with the same gene in another 'control' sample, but cannot compare a gene in a 'control' sample with the same gene in a 'treatment' sample?

1.2 If so, when looking at a heatmap of CPM values, can we not, e.g., identify genes that seem to have higher expression in 'treatment' samples than in 'control' samples? Do we need to use a different normalization method?

2- The same table says that CPM cannot be used for within-sample comparisons.

2.1 Does this mean we cannot compare different genes within the same sample?

2.2 What if, when looking at a CPM heatmap, one gene seems to vary more between 'control' and 'treatment' than another? Can we draw this conclusion if the heatmap plots CPM values?

RNA-Seq Read count normalization • 17k views

updated 5.6 years ago by ATpoint 84k • written 5.6 years ago by salamandra ▴ 550

score 8 · Accepted Answer · 2018-12-30

As recommended in this presentation, I would not use per-million methods for anything as there are better methods now. Check this video to get an idea why per-million based methods are not optimal and this one on how the normalization in e.g. DESeq2 works.
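To make the point about per-million methods concrete, here is a minimal sketch in plain Python. It is not DESeq2's actual code, only a simplified version of the median-of-ratios idea, and the counts are invented toy numbers. Genes A-D are unchanged between the two samples, gene E is strongly induced in the treatment, and the treatment library is sequenced twice as deep.

```python
import math
from statistics import median

def cpm(counts):
    """Counts per million for one sample: count / library size * 1e6."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

def median_of_ratios_size_factors(samples):
    """Simplified median-of-ratios size factors (one factor per sample).

    samples: list of per-sample count lists, all in the same gene order.
    Genes with a zero count in any sample are skipped.
    """
    n_genes = len(samples[0])
    # Log geometric mean of each gene across samples (the pseudo-reference).
    log_geo_means = []
    for g in range(n_genes):
        values = [s[g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)
    # Size factor = median ratio of a sample's counts to the pseudo-reference.
    factors = []
    for s in samples:
        log_ratios = [math.log(s[g]) - log_geo_means[g]
                      for g in range(n_genes) if log_geo_means[g] is not None]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy counts (invented): genes A, B, C, D, E.
control = [100, 200, 300, 400, 1000]
treatment = [200, 400, 600, 800, 18000]  # 2x depth, gene E induced 9-fold

cpm_control = cpm(control)
cpm_treatment = cpm(treatment)

factors = median_of_ratios_size_factors([control, treatment])
norm_control = [c / factors[0] for c in control]
norm_treatment = [c / factors[1] for c in treatment]

print("CPM gene A (control, treatment):", cpm_control[0], cpm_treatment[0])
print("Size factors:", factors)
print("Normalized gene A (control, treatment):", norm_control[0], norm_treatment[0])
```

In this toy case CPM makes the unchanged gene A look roughly 5-fold lower in the treatment, simply because gene E soaks up most of the reads. The median-of-ratios factors only pick up the 2x depth difference, so gene A comes out equal in both samples.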

Towards your questions:

1 - You can use it, but it is not recommended for DE (differential expression) analysis, so it is better not to use it at all.

1.1 - Simply normalize the entire dataset (all samples together) with edgeR or DESeq2 and do comparisons with these values.

1.2 - Do not use CPM values for a heatmap; use logged/normalized counts, like those produced by the vst or rlog functions in DESeq2. Using non-log counts will bias the heatmap towards highly expressed genes. The video series I linked above also has a video about logs in case you care.
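A quick sketch of why the log scale matters for a heatmap. Note that `log2(x + 1)` is used here only as a simple stand-in; it is not what vst or rlog do internally, and the numbers are invented. The highly expressed gene changes 2-fold and the lowly expressed gene changes 4-fold.

```python
import math

def log2_plus_one(values, pseudocount=1.0):
    """Simple log2 transform with a pseudocount to avoid log2(0)."""
    return [math.log2(v + pseudocount) for v in values]

# Invented normalized values per gene: [control, treatment].
high_gene = [10000.0, 20000.0]  # 2-fold change, highly expressed
low_gene = [10.0, 40.0]         # 4-fold change, lowly expressed

# Differences on the raw scale: dominated by the highly expressed gene.
raw_diff_high = high_gene[1] - high_gene[0]
raw_diff_low = low_gene[1] - low_gene[0]

# Differences on the log2 scale: these are (approximately) log fold changes.
log_high = log2_plus_one(high_gene)
log_low = log2_plus_one(low_gene)
log_diff_high = log_high[1] - log_high[0]
log_diff_low = log_low[1] - log_low[0]

print("raw differences (high, low):", raw_diff_high, raw_diff_low)
print("log2 differences (high, low):", log_diff_high, log_diff_low)
```

On the raw scale the color range of a heatmap would be spent almost entirely on the highly expressed gene, even though the lowly expressed one has the larger fold change. On the log scale the larger fold change is also the larger difference.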

2 - True, because CPM does not normalize for gene length, so at the same expression level longer genes inherently get more counts than shorter genes.

2.1 - one probably could, but not without adjusting for gene length (use the search function on this, there are plenty of posts on that matter already out there).

2.2 - it might give you an idea but you should use appropriate statistics to infer differentially expressed genes.
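To illustrate point 2 on gene length, here is a small sketch with invented numbers. It assumes two genes with the same transcript abundance, where the long gene is 5x longer and therefore collects 5x the reads. Dividing CPM by gene length in kb gives an RPKM-style value; the gene names and lengths are hypothetical.

```python
def cpm(counts):
    """Counts per million for one sample, given a gene -> count mapping."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}

# Hypothetical gene lengths in kilobases.
lengths_kb = {"short_gene": 1.0, "long_gene": 5.0, "other_gene": 2.0}

# Invented counts for one sample: short_gene and long_gene have the same
# transcript abundance, but long_gene collects 5x the reads.
counts = {"short_gene": 100, "long_gene": 500, "other_gene": 400}

sample_cpm = cpm(counts)

# RPKM-style value: CPM divided by gene length in kb.
cpm_per_kb = {gene: sample_cpm[gene] / lengths_kb[gene] for gene in counts}

print("CPM:", sample_cpm)
print("CPM per kb:", cpm_per_kb)
```

Within the sample, CPM suggests the long gene is 5x more expressed, while the length-adjusted values are equal. When the same gene is compared across samples the length is identical on both sides and cancels out, which is why length matters for within-sample comparisons but not for between-sample ones.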