Question: Sailfish TPM normalization: does it take into account the size the library?
gravatar for Aurelie MLB
5.0 years ago by
Aurelie MLB340
United Kingdom
Aurelie MLB340 wrote:


I apologize if it is a silly question. I am trying to get my head around the Sailfish TPMs.

I can read in the paper that the "estimated abundances are corrected for systematic errors resulting from sequence composition bias and transcript length". But I cannot see anything about the size of the libraries...

Do the sailfish TPM normalize for the size of the library as well please?

Many thanks



rna-seq • 2.7k views
ADD COMMENTlink modified 5.0 years ago by Rob4.6k • written 5.0 years ago by Aurelie MLB340
gravatar for Rob
5.0 years ago by
United States
Rob4.6k wrote:

Hi Aurelie,

It's not a silly question at all. TPM (like RPKM / FPKM / RPK, etc.) is a relative abundance measure. It accounts for the relative rate at which reads are sampled from transcripts. Intuitively, it answers the question, if I sampled 1 million transcripts (randomly and in an un-biased manner) from the set of all expressed transcripts, how much of each transcript would I see (in expectation). As such, TPM is not designed to normalize for library size. If, for example, I doubled the number of reads deriving from each transcript, the TPMs would not change. This is by design, as the TPM factors out the actual number of mapped reads in the experiment to derive a rate. In this sense, the TPMs are normalized.

However, this is almost not certainly what you mean when you ask for normalization by library size. With the exception of limma / voom, every tool I know of for differential expression will use the counts (or estimated counts) of reads mapping to each transcript in each condition. Given this "raw" input, different tools will then apply different normalization techniques to explicitly account for the fact that different samples may have a different number of reads or a different number of mapped reads. Thus, to use sailfish with other tools for downstream differential expression, you can make use of the "NumReads" field (the last column in the quant.sf files). For specific tips (and a useful software package) in using Sailfish with downstream differential expression tools, I recommend you take a look at this paper.

ADD COMMENTlink modified 12 months ago by _r_am32k • written 5.0 years ago by Rob4.6k


Thanks a lot for your answer. Very helpful!

I did not know I could use the sailfish NumReads for differential expression analysis. I will have a closer look. Thanks!

The first part of your answer is actually the type of explanation I was looking for. So, (please correct me if I am wrong) if I understand well: it should be possible to compare the expression of genes across samples, shouldn't it (even if the libraries are of varying sizes)? I prefer to ask because I just stumbled accross this post where the first post is saying "2. I understand why one shouldn't compare TPM between samples, since the total expression rates, rRNA component etc. varies sample-to-sample.". Well I do not get this very well. I thought that was one of the point of normalising... Sorry I am getting confused now.


ADD REPLYlink modified 12 months ago by _r_am32k • written 5.0 years ago by Aurelie MLB340

So, the reason TPM is not directly comparable across samples is, in large part, because of how aggressively normalized it is. Consider the following (extreme and illustrative) example, you sequence a sample where only 2 transcripts, A and B, are actually present. Say they are present in equal amounts, and there are 5,000 copies of each molecule in the sample. Assuming unbiased sampling and accurate inference, you'll get back TPM estimates of ~500k for each transcript. Now, consider a second sample, where there are 3 transcripts expressed, A, B and C. Further, assume that the number of As and Bs is the same as in our first sample (5,000 copies each), but now C also has 5,000 copies expressed. This time, when we estimate TPM, we'll get a TPM of ~333K for each of A, B and C. This is because the TPM estimates will always sum to 1,000,000. TPM is always breaking up the entire expressed transcriptome into a partition of unity (and then multiplying by 1,000,000). Thus, if the total expression level changes, so will the TPM (i.e. if a transcript is expressed, in "absolute" terms, at the same level across samples, but the expression of the other assayed transcripts changes, then the TPM values of all of the transcripts can change). The normalization approaches used for e.g. DGE, attempt to account for this, by not always forcing the total "expression" to be the same. It's worth noting that it's a tough problem though, because RNA-seq can, by nature, only give you relative estimates of expression (though spike-ins can help one interpret these relative levels). However, TPM, like R/FPKM is designed to be a nice, interpretable unit of expression within a sample, not really to be compared across samples (though you will see people doing this quite often).

ADD REPLYlink modified 12 months ago by _r_am32k • written 5.0 years ago by Rob4.6k

Nice follow-up Rob. Even though I knew that TPM normalises for total reads indirectly, I could not have explained to a colleague why that means it is not possible to compare different samples.

I have a question now though :P

Spike-ins are one method to detect total abundances - could another be to do qPCR on a bit of your library (same library that goes on the sequencer, which we may even still have lying around in a freezer somewhere from old sequencing experiments) and get expression of 100 housekeeping genes from each sample. Since qPCR can be quantitative, could this information be used to adjust your TPMs to actual (quantitative) abundances?

ADD REPLYlink modified 12 months ago by _r_am32k • written 5.0 years ago by John12k

Thanks a lot for your explanation!

ADD REPLYlink written 5.0 years ago by Aurelie MLB340
gravatar for John
5.0 years ago by
John12k wrote:

Kind of - you divide by the sum of all rates, which is the same sort of transformation as dividing the reads per transcript by all reads. Except you're here using rates and not reads, of course.

\text{TPM}_i = \dfrac{X_i}{\widetilde{l}_i} \cdot \left( \dfrac{1}{\sum_j \dfrac{X_j}{\widetilde{l}_j}} \right) \cdot 10^6.

Read the excellent for all the deets :)

ADD COMMENTlink modified 12 months ago by _r_am32k • written 5.0 years ago by John12k

Thank you very much!

ADD REPLYlink written 5.0 years ago by Aurelie MLB340
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2198 users visited in the last hour