Question

How to normalize gene expression with CPM

4.2 years ago

concetta ▴ 10

Hi,

I am analyzing RNA-seq data produced with 3' mRNA-Seq.

I calculated gene counts using htseq-count, and I need to normalize them before performing a Kaplan-Meier analysis. I would like to use CPM normalization, because I cannot also normalize the data by gene length.

I have a question about the CPM normalization method. Since the formula is CPM = (counts on the feature / library size) × 1,000,000, it normalizes each count by the library size.

I was wondering whether the library size should be:

the number of raw sequence reads produced by the sequencing run; or
the number of uniquely mapped reads assigned to features, i.e. the sum of the counts of all features reported by htseq-count.

Thank you!

Concetta

RNA-Seq gene expression • 6.4k views

updated 4.2 years ago by swbarnes2 14k • written 4.2 years ago by concetta ▴ 10

score 1 · Answer 1 · 2020-02-27

Don't do this by hand. Feed your raw count matrix into a specialized tool such as edgeR and use its normalization methods. Below is example code, adapted almost verbatim from the edgeR help page shown by ?calcNormFactors. Be sure to remove the rows that contain unmapped or unassigned reads from your count matrix.

library(edgeR)

## example count matrix: 200 genes (rows) x 5 samples (columns)
y <- matrix( rpois(1000, lambda=5), nrow=200 )

## as DGEList
dge <- DGEList(counts=y)

## calculate norm. factors
dge <- calcNormFactors(dge)

## get normalized counts (CPM, scaled by lib.size * norm.factors)
normalized.counts <- cpm(dge)
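
As a minimal sketch of the clean-up step mentioned above, assuming the count matrix was assembled from htseq-count output and still carries its special counter rows (their names start with `__`, for example `__no_feature` and `__not_aligned`), those rows could be dropped in base R before building the DGEList. The matrix `raw`, its gene names and its values are made up for illustration.

```r
## hypothetical count matrix: 3 genes plus two htseq-count special counters
raw <- matrix(
  c( 10,  20,
     30,  40,
      5,  15,
    100, 200,
     50,  60),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("geneA", "geneB", "geneC", "__no_feature", "__not_aligned"),
    c("sample1", "sample2")
  )
)

## flag rows whose name starts with "__" and keep everything else
is.special <- grepl("^__", rownames(raw))
counts <- raw[!is.special, , drop = FALSE]

print(counts)
```

The filtered `counts` matrix is what would then go into `DGEList(counts=...)`, so that the library size reflects only reads assigned to genes.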

score 0 · Answer 2 · 2020-02-27

4.2 years ago

swbarnes2 14k

As you can see, edgeR's cpm function takes the count matrix as input. If you don't have a row for unmapped or unassigned reads (and you generally won't), those reads are not counted at all, so each count is divided by the total number of reads assigned to a feature.

CPM is simple enough that you could compute it yourself (as opposed to TPM or RPKM, where you really should let software handle the ambiguities).
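
A minimal base-R sketch of that by-hand calculation, assuming the library size is the per-sample sum of counts assigned to features. The gene names and values are invented for illustration.

```r
## hypothetical counts: rows are genes, columns are samples
counts <- matrix(
  c(10,  20,
    30,  60,
    60, 120),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)

## library size = sum of counts assigned to features, per sample
lib.size <- colSums(counts)

## divide each column by its library size, then scale to one million
cpm.manual <- t(t(counts) / lib.size) * 1e6

print(cpm.manual)
```

Each column of `cpm.manual` sums to one million by construction. This plain version applies no normalisation factors, so it should correspond to edgeR's cpm() only when all norm.factors are 1.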

Be aware, though, that edgeR's cpm() uses the normalisation factors from calcNormFactors to correct for library composition, and is therefore more sophisticated than correcting only for differences in total read count.
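
To make the difference concrete, here is a base-R sketch of how the scaling changes once normalisation factors are involved. As far as I understand it, cpm() on a DGEList divides by the effective library size, lib.size * norm.factors. The counts and the factor values below are invented for illustration and are not output of calcNormFactors.

```r
## hypothetical counts and made-up normalisation factors
counts <- matrix(
  c(10,  20,
    30,  60,
    60, 120),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)
lib.size     <- colSums(counts)
norm.factors <- c(sample1 = 0.8, sample2 = 1.25)  # invented values

## effective library size used for the scaling
eff.lib.size <- lib.size * norm.factors

## CPM from total counts only, and CPM from the effective library size
cpm.plain <- t(t(counts) / lib.size) * 1e6
cpm.norm  <- t(t(counts) / eff.lib.size) * 1e6

print(cpm.plain["geneA", ])
print(cpm.norm["geneA", ])
```

With plain scaling geneA gets the same CPM in both samples, while the factor-adjusted values differ, which is the composition correction at work.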

4.2 years ago by ATpoint 82k

Thank you! I have another question. When normalizing counts, do you suggest also including reads that were not assigned to any feature, such as unmapped and/or multimapping reads?