How to normalize gene expression with CPM
4.1 years ago
concetta ▴ 10

Hi,

I am analyzing RNA-seq data produced with 3' mRNA-Seq.

I calculated gene counts using htseq-count, and I need to normalize them before performing a Kaplan-Meier analysis. I would like to use CPM normalization, since I can't also normalize the data by gene length.

I have a question about the CPM normalization method. The formula is CPM = (counts on the feature / library size) × 1,000,000, so it normalizes the counts by the library size.

I was wondering if the library size should be:

  1. the number of raw sequence reads produced by the sequencing;
  2. the number of uniquely mapped reads assigned to features, i.e. the sum of the counts of all features given by htseq-count.

Thank you!

Concetta

RNA-Seq gene expression • 6.4k views
4.1 years ago
ATpoint 81k

Don't do this by hand. Feed your raw count matrix into a specialized tool such as edgeR and use its normalization methods. Below is example code, pretty much copied from the edgeR help page you get by typing ?calcNormFactors. Be sure to remove any rows that contain unmapped or unassigned reads from your count matrix first (htseq-count reports these as special rows such as __no_feature).

library(edgeR)

## example count matrix: 200 genes (rows) x 5 samples (columns)
y <- matrix( rpois(1000, lambda=5), nrow=200 )

## as DGEList
dge <- DGEList(counts=y)

## calculate norm. factors
dge <- calcNormFactors(dge)

## get normalized counts
normalized.counts <- cpm(dge)
4.1 years ago

As you can see, edgeR's cpm function takes in the count matrix. If you don't have a row for unmapped or unassigned reads (and you generally won't), those reads are not counted at all, so each count is divided by the total number of reads assigned to a feature.

CPM is simple enough that you could do it yourself (as opposed to TPM or RPKM, where you really should let software handle the ambiguities).
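For illustration, a minimal base-R sketch of the by-hand calculation. The count matrix and its values are made up, and it assumes rows for unmapped or unassigned reads have already been dropped:

```r
# Hypothetical toy count matrix: 3 genes (rows) x 2 samples (columns)
counts <- matrix(c(10, 30, 60,
                   20, 20, 160),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("s1", "s2")))

# Library size = sum of the counts assigned to features, per sample
lib_size <- colSums(counts)

# CPM = counts / library size * 1e6, applied column by column
cpm_manual <- sweep(counts, 2, lib_size, "/") * 1e6
```

Each column of cpm_manual sums to one million, which is a quick sanity check on the result.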


Be aware though that edgeR's cpm function, when given a DGEList, uses the normalisation factors from calcNormFactors to correct for library composition. It is therefore more sophisticated than a plain CPM, which corrects only for differences in total read count.
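To make the difference concrete, here is a rough base-R sketch of how the factors enter the calculation: the counts are divided by an effective library size, i.e. the library size multiplied by the sample's normalisation factor. The counts and the norm_factors values are invented placeholders; in a real analysis the factors would be read from dge$samples$norm.factors after running calcNormFactors.

```r
# Hypothetical toy count matrix: 3 genes (rows) x 2 samples (columns)
counts <- matrix(c(10, 30, 60,
                   20, 20, 160),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("s1", "s2")))

lib_size <- colSums(counts)

# Placeholder factors (NOT computed by edgeR here); edgeR scales its
# factors so that they multiply to 1 across samples, as these do
norm_factors <- c(s1 = 0.8, s2 = 1.25)

# Effective library size = library size * normalisation factor
eff_lib_size <- lib_size * norm_factors

# Plain CPM versus CPM on effective library sizes
cpm_plain <- t(t(counts) / lib_size) * 1e6
cpm_norm  <- t(t(counts) / eff_lib_size) * 1e6
```

With a factor below 1 the normalised CPMs of that sample go up relative to the plain ones, and with a factor above 1 they go down.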


Thank you! I have another question. When normalizing the counts, do you suggest also including reads that are not assigned to any feature, such as unmapped and/or multimapping reads?


No. Exclude them.
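With htseq-count output this mostly means dropping the special counter rows, whose names start with a double underscore (__no_feature, __ambiguous, __not_aligned, __alignment_not_unique and so on), before library sizes are computed. A small sketch with a made-up matrix:

```r
# Hypothetical matrix as it might look after merging htseq-count outputs:
# two genes plus two of htseq-count's special counter rows
counts <- matrix(c(10, 30, 5, 7,
                   20, 40, 3, 9),
                 nrow = 4,
                 dimnames = list(c("geneA", "geneB",
                                   "__no_feature", "__not_aligned"),
                                 c("s1", "s2")))

# Special counters are the rows whose names start with "__"
is_special <- grepl("^__", rownames(counts))

# Keep only real features; drop = FALSE keeps the result a matrix
counts_genes <- counts[!is_special, , drop = FALSE]

# Library sizes now reflect only reads assigned to features
lib_size <- colSums(counts_genes)
```

The filtered counts_genes matrix is what would then go into DGEList.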
