Question

RNA-seq normalization: How to use TMM and rpkm() in EdgeR

8

Entering edit mode

10.0 years ago

AW ▴ 350

Hi,

I have some RNA-seq samples that I want to normalize and then output RPKM expression, but I am unsure how to do this.

This is my pipeline so far:

Normalise raw read counts with TMM in edgeR

expr <- DGEList(counts=data, group=conditions)
expr <- calcNormFactors(expr)

output:

$samples
                        group lib.size norm.factors
Sample1     F 19770521    1.0462660
Sample2     F 17970679    0.8794805
Sample3     F 19184265    1.0573665

QUESTION: How do I get normalized raw read counts from this? Do I multiply the read counts by the norm.factors?

QUESTION: Ultimately, I want to end up with RPKM values for each gene in each sample. I know I can use the rpkm() function below in edgeR

expr_norm <- rpkm(expr, log=FALSE,gene.length=vector)

but is expr the output from calcNormFactors or something else?

Thanks for your help!

A

normalization RNA-Seq RPKM TMM EdgeR • 27k views

ADD COMMENT • link updated 2.6 years ago by Ram 43k • written 10.0 years ago by AW ▴ 350

Ram · Accepted Answer · 2014-04-29

4

Entering edit mode

10.0 years ago

Devon Ryan 104k

The library size normalized counts are made by dividing the counts by the normalization factor (you'll note that the larger libraries have larger normalization factors, so if you multiplied things you'd just inflate the difference in sequencing depth).

For the rpkms, just do rpkm(expr, gene.length=vector), since it can take your DGEList, (this is explained in help(rpkm)).

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by Devon Ryan 104k

0

Entering edit mode

great thanks very much!!

so just to clarify if I do rpkm(expr) then the output will be rpkm files that are both TMM normalised and also length normalised? Is this because it is reading in the norm.factors and the unnormalised read counts from DGEList?

thanks!

ADD REPLY • link 10.0 years ago by AW ▴ 350

0

Entering edit mode

Oops, I forgot the "gene.length" part. I've updated my answer accordingly. Regardless, yes, rpkm can get the normalization factors from the DGEList, so you needn't much with that yourself. If you input counts as a matrix then it'll just do library size normalization, which is less robust.

ADD REPLY • link 10.0 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks! And when you say as a matrix, you mean if I divided the counts by the normalisation factor manually and then gave this as an input to rpkm()?

ADD REPLY • link 10.0 years ago by AW ▴ 350

0

Entering edit mode

Or just fed it in as is. Feeding in the normalized counts would probably lead to weird results, since the M in RPKM is there for normalization (so you would have normalized a normalized result...which will likely muck it up slightly).

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by Devon Ryan 104k

0

Entering edit mode

Cool, thanks! So I will use the DGEList output from calcNormFactors as the input for rpkm along with gene length information!

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by AW ▴ 350

0

Entering edit mode

I also have two other questions, that Id be very grateful if you could answer.

Q1. When creating the expr <- DGEList(counts=data, group=conditions), what effect does specifying groups have one the TMM normalisation? How does TMM use this information and how would the results differ if you did specify groups versus not?

Q2. The expression data I am using was obtained from mapping reads onto denovo contigs assembled with Trinity. I then chose the most highly expressed contig from each cluster as the "best isoform" and then summed expression across all the contigs in the cluster as the expression value for that cluster. Therefore I do not have one obvious gene length to use. Should I use the longest contig from the cluster?

Thanks!

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by AW ▴ 350