Question: TMM normalisation from RSEM raw counts
gravatar for AW
5.4 years ago by
United Kingdom
AW350 wrote:


I want to normalise raw read counts obtained from RSEM, using TMM in edgeR. 

expr <- DGEList(counts=data)

expr <- calcNormFactors(expr)

I will then obtain rpkm values from this expr object using

expr_norm <- rpkm(expr, log=TRUE, gene.length=gene_length$Length)

For the gene lengths I need to specify, should I use length or effective length from the RSEM output?

Thanks very much


tmm rna-seq edger rpkm rsem • 5.1k views
ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by AW350

Yes I agree, but I just realised because expression is different between my samples, the effective lengths differ for a given gene! And I can only give expr_norm one length per gene.

I have also noticed that gene length actually differs across samples mapped to the same reference. This seems unexpected to me, as the length calculate is independent of expression?



ADD REPLYlink written 5.4 years ago by AW350

I should have mentioned the per-sample effective length issue. The rpkm() function is pretty simple, so you can just implement your own that can use per-sample lengths. It's not surprising that you get different effective lengths between samples, since the fragment distribution can vary. I don't know off-hand how RSEM calculates length, but I suspect that it's the expected length given the transcript composition, which can also vary by sample.

ADD REPLYlink written 5.4 years ago by Devon Ryan92k

Great, thanks, this is very helpful!

So you would recommend using the per-sample lengths instead of one length for all samples eg the length of the longest isoform? Hypothetically, when using rpkm() do you not need to control for the length over which reads can map ie length of longest isoform?

And finally, if I were to use rpkm() for each per-sample length what command would I use?

I currently use

expr_norm <- rpkm(expr, log=TRUE, gene.length=gene_length$Length)

How do I modify this to have different per-sample lengths?

ADD REPLYlink written 5.4 years ago by AW350

The benefit of using a per-sample length is that if you have a difference in isoform distribution between samples (e.g., Group A expresses primarily isoform 1 and Group B isoform 2, where the isoforms have different lengths) then the resulting RPKMs better represent this. This is, in fact, how cufflinks works.

As I mentioned, there's no existing command in edgeR to produce rpkm with per-sample lengths. The rpkm() command itself, however, is very simple:

rpkm <- function(x, gene.length, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25)
        y <- cpm(x=x,normalized.lib.sizes=normalized.lib.sizes,log=log,prior.count=prior.count)
        gene.length.kb <- gene.length/1000

So, just take the matrix of cpm values and divide that by the matrix of per-sample lengths (divided by 1000 if the lengths are in bp rather than kb).

ADD REPLYlink written 5.4 years ago by Devon Ryan92k

Thanks very much, that is very clear.

Just as note, Im don't think you divide gene lengths by 1000. The manual rpkm says to use bases.


vector of length nrow(x) giving gene length in bases, or the name of the column x$genes containing the gene lengths.

ADD REPLYlink written 5.4 years ago by AW350

You do need to divide by 1000 if you do it yourself. The code I posted is the rpkm() function from edgeR, so it's expecting bases since it does the division by 1000 internally.

ADD REPLYlink written 5.4 years ago by Devon Ryan92k

Great, thanks for being so clear!

I have one last question. I have used rpkm() with the commands below and then manually calculated rpkm, but I get different values, why?


expr_norm <- rpkm(expr, log=FALSE, gene.length=gene_length$Length)

manual rpkm



ADD REPLYlink written 5.4 years ago by AW350

You should be dividing by norm.factor, not multiplying by it. You could make your life easier and just use the cpm() function, since presumably you already have the expected counts as a DGElist

ADD REPLYlink written 5.4 years ago by Devon Ryan92k

Perfect, thanks very much!

However, when I sum the rpkm columns across each sample, they are very different. I was not expecting this, as I assume that the normalisation is meant to control for this. For example the minimum total rpkm for sample 1 is 790000 but for sample 4 is 1000000. Why are they so different?

ADD REPLYlink written 5.4 years ago by AW350

Summing the columns will only give the same results if you don't use a normalized library size (a bad idea) and use a single length for each gene (kind of defeats the purpose).

ADD REPLYlink written 5.4 years ago by Devon Ryan92k

Sorry, I dont think I was particularly clear. I have used TMM and rpkm() to normalise across 4 samples. I wanted to check that the normalisation has worked as previously I have had problems with this. In order to check, I summed the rpkm columns for each sample, with the assumption that the totals should be reasonably similar. But they are not (up to 25% different), and I was wondering what your thoughts on this were?

Thanks so much!!

ADD REPLYlink written 5.4 years ago by AW350

And, even before rpkm() Im not sure the TMM has worked as the lib.size*norm.factors are quite different.

lib.size norm.factors lib.size*norm.factors
37230243 1.0990646 40918442.13
37682151 1.0782278 40629942.77
42397446 0.974394 41311817
33079364 0.8660274 28647635.6
ADD REPLYlink written 5.4 years ago by AW350

I wouldn't conclude from that that things went amiss. Rather, it looks like the last library has a skew to it, likely due to rRNA or something like that.

ADD REPLYlink written 5.4 years ago by Devon Ryan92k

That makes sense. I did remove any rRNA that blasted to the nearest reference. Have you any other suggestions as how to deal with this, as I feel uncomfortable comparing across samples due to this difference in library size?

ADD REPLYlink written 5.4 years ago by AW350

Any highly expressed gene can cause an effect like this, rRNAs are just the most common of those. Just do a scatter plot of the normalized counts between a couple of the samples and ensure that points cluster around the diagonal. If they do, then things went well. That'll be vastly more informative than looking at the numbers you're looking at.

ADD REPLYlink written 5.4 years ago by Devon Ryan92k

Great thanks! I have one last question though that I am concerned about. I am interested in identifying differentially expressed genes, and if the library sizes differ then won't this impact my assessment of sex-biased genes? eg if the library size of one sample is bigger, then won't i find that genes are expressed higher in the sample than another with a lower library size?

ADD REPLYlink written 5.4 years ago by AW350

If the normalized counts cluster around the diagonal when plotted then the TMM normalization worked and the difference in library sizes is exactly what you want to avoid the situation that you're thinking of.

ADD REPLYlink written 5.4 years ago by Devon Ryan92k
gravatar for Devon Ryan
5.4 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:

Effective length would make more sense, since that's the observable length.

ADD COMMENTlink written 5.4 years ago by Devon Ryan92k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2580 users visited in the last hour