Question: RNA-seq normalization: How to use TMM and rpkm() in EdgeR
gravatar for AW
6.7 years ago by
United Kingdom
AW350 wrote:


I have some RNA-seq samples that I want to normalize and then output RPKM expression, but I am unsure how to do this.

This is my pipeline so far:

1. Normalise raw read counts with TMM in edgeR

expr <- DGEList(counts=data, group=conditions)
expr <- calcNormFactors(expr)


                        group lib.size norm.factors
Sample1     F 19770521    1.0462660
Sample2     F 17970679    0.8794805
Sample3     F 19184265    1.0573665

QUESTION: How do I get normalized raw read counts from this? Do I multiply the read counts by the norm.factors?

QUESTION: Ultimately, I want to end up with RPKM values for each gene in each sample. I know I can use the rpkm() function below in edgeR

expr_norm <- rpkm(expr, log=FALSE,gene.length=vector)

but is expr the output from calcNormFactors or something else?

Thanks for your help!


ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by AW350
gravatar for Devon Ryan
6.7 years ago by
Devon Ryan98k
Freiburg, Germany
Devon Ryan98k wrote:

The library size normalized counts are made by dividing the counts by the normalization factor (you'll note that the larger libraries have larger normalization factors, so if you multiplied things you'd just inflate the difference in sequencing depth).

For the rpkms, just do rpkm(expr, gene.length=vector), since it can take your DGEList, (this is explained in help(rpkm)).

ADD COMMENTlink modified 13 months ago by _r_am32k • written 6.7 years ago by Devon Ryan98k

great thanks very much!!

so just to clarify if I do rpkm(expr) then the output will be rpkm files that are both TMM normalised and also length normalised? Is this because it is reading in the norm.factors and the unnormalised read counts from DGEList?


ADD REPLYlink written 6.7 years ago by AW350

Oops, I forgot the "gene.length" part. I've updated my answer accordingly. Regardless, yes, rpkm can get the normalization factors from the DGEList, so you needn't much with that yourself. If you input counts as a matrix then it'll just do library size normalization, which is less robust.

ADD REPLYlink written 6.7 years ago by Devon Ryan98k

Thanks! And when you say as a matrix, you mean if I divided the counts by the normalisation factor manually and then gave this as an input to rpkm()?

ADD REPLYlink written 6.7 years ago by AW350

Or just fed it in as is. Feeding in the normalized counts would probably lead to weird results, since the M in RPKM is there for normalization (so you would have normalized a normalized result...which will likely muck it up slightly).

ADD REPLYlink modified 13 months ago by _r_am32k • written 6.7 years ago by Devon Ryan98k

Cool, thanks! So I will use the DGEList output from calcNormFactors as the input for rpkm along with gene length information!

ADD REPLYlink modified 13 months ago by _r_am32k • written 6.7 years ago by AW350

I also have two other questions, that Id be very grateful if you could answer.

Q1. When creating the expr <- DGEList(counts=data, group=conditions), what effect does specifying groups have one the TMM normalisation? How does TMM use this information and how would the results differ if you did specify groups versus not?

Q2. The expression data I am using was obtained from mapping reads onto denovo contigs assembled with Trinity. I then chose the most highly expressed contig from each cluster as the "best isoform" and then summed expression across all the contigs in the cluster as the expression value for that cluster. Therefore I do not have one obvious gene length to use. Should I use the longest contig from the cluster?


ADD REPLYlink modified 13 months ago by _r_am32k • written 6.7 years ago by AW350
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2308 users visited in the last hour