Question: How to normalize gene expression with CPM
4 weeks ago by
concetta0 wrote:


I am analyzing RNA-seq data produced with 3' mRNA-Seq.

I calculated gene counts using htseq-count and need to normalize them before performing a Kaplan-Meier analysis. I would like to use CPM normalization, since I can't also normalize by gene length.

I have a question about the CPM normalization method. The formula is CPM = (counts on the feature / library size) × 1,000,000, so it normalizes each count by the library size.

I was wondering if the library size should be:

  1. the number of raw sequence reads produced by the sequencing;
  2. the number of uniquely mapped reads assigned to features, i.e. the sum of the counts of all features given by htseq-count.

Thank you!


rna-seq expression gene • 114 views
4 weeks ago by
ATpoint31k wrote:

Don't do this by hand. Feed your raw count matrix into a specialized tool such as edgeR to make use of its normalization methods. Below is example code, largely copied from the edgeR help page shown by ?calcNormFactors. Be sure to remove any rows that contain unmapped or unassigned reads from your count matrix.


library(edgeR)

## example count matrix: 200 genes (rows) x 5 samples (columns)
y <- matrix( rpois(1000, lambda=5), nrow=200 )

## as DGEList
dge <- DGEList(counts=y)

## calculate norm. factors (TMM by default)
dge <- calcNormFactors(dge)

## get normalized counts (CPM, using the norm. factors)
normalized.counts <- cpm(dge)
4 weeks ago by
swbarnes27.5k wrote:

As you can see, edgeR's cpm function takes the read count matrix as input. If you don't have a row for unmapped or unassigned reads (and you generally won't), those reads are not counted at all, so each count is divided by the number of reads assigned to a feature.

CPM is simple enough that you could compute it yourself (as opposed to TPM or RPKM, where you really should let software handle the ambiguities).
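
To illustrate the "do it yourself" case, here is a minimal base-R sketch of plain CPM, where the library size is the per-sample sum of counts assigned to features. The toy matrix and the names counts, lib.size and cpm.manual are made up for illustration; no normalization factors are applied here.

```r
## Toy count matrix: 4 genes (rows) x 3 samples (columns); values are made up
counts <- matrix(
  c( 10,  20,  30,
      0,   5,  15,
     90,  75,  55,
    100, 100, 300),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

## Library size = total counts assigned to features, per sample
lib.size <- colSums(counts)

## CPM = count / library size * 1e6, applied column by column
## (t() is used so that each sample is divided by its own library size)
cpm.manual <- t(t(counts) / lib.size) * 1e6
```

Each column of cpm.manual sums to one million, which is a quick sanity check that the division was applied per sample.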


Be aware though that edgeR's cpm(), when applied to a DGEList, uses the normalisation factors from calcNormFactors (TMM by default) to correct for library composition, and is therefore more sophisticated than correcting only for differences in total read count.
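
As a rough sketch of what that means numerically: edgeR multiplies each library size by its normalisation factor to get an "effective library size", and the CPM is computed against that. The base-R example below mimics this with made-up factors; the values in norm.factors are hypothetical and are not the output of calcNormFactors.

```r
## Toy counts: 4 genes x 2 samples (filled column by column); values are made up
counts <- matrix(c(10, 0, 90, 100,
                   20, 5, 75, 100),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("sample1", "sample2")))

## Raw library sizes (both samples sum to 200 here)
lib.size <- colSums(counts)

## Hypothetical normalisation factors, chosen by hand for illustration only
norm.factors <- c(sample1 = 0.8, sample2 = 1.25)

## Effective library size = raw library size * normalisation factor
eff.lib.size <- lib.size * norm.factors

## CPM computed against the effective library size
cpm.adjusted <- t(t(counts) / eff.lib.size) * 1e6
```

With a factor below 1 the effective library size shrinks and the CPM values go up, and vice versa, which is how composition differences between samples are compensated.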

Reply written 4 weeks ago by ATpoint31k

Thank you! I have another question. When normalizing counts, do you suggest also including reads that are not assigned to any feature, such as unmapped and/or multimapping reads?

Reply written 28 days ago by concetta0

No. Exclude them.
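
In practice, htseq-count appends special counters whose names start with two underscores (such as __no_feature and __not_aligned) to its output. A minimal sketch of dropping them before normalization, assuming they ended up as rows of the merged count matrix; the matrix and its values are made up for illustration.

```r
## Toy matrix as it might look after merging htseq-count outputs
## (gene names and all values are made up)
counts <- matrix(
  c(120,  80,
     30,  45,
     15,  10,
      4,   6,
      2,   1,
     50,  60,
     25,  20),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("geneA", "geneB", "__no_feature", "__ambiguous",
      "__too_low_aQual", "__not_aligned", "__alignment_not_unique"),
    c("sample1", "sample2")
  )
)

## Keep only real features: drop every row whose name starts with "__"
keep <- !grepl("^__", rownames(counts))
counts.genes <- counts[keep, , drop = FALSE]
```

The filtered counts.genes matrix is what would then be passed on to DGEList() or to a manual CPM calculation.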

Reply written 28 days ago by ATpoint31k