Question

Sailfish TPM normalization: does it take into account the size the library?

2

Entering edit mode

8.5 years ago

Aurelie MLB ▴ 360

Hello,

I apologize if it is a silly question. I am trying to get my head around the Sailfish TPMs.

I can read in the paper that the "estimated abundances are corrected for systematic errors resulting from sequence composition bias and transcript length". But I cannot see anything about the size of the libraries...

Do the sailfish TPM normalize for the size of the library as well please?

Many thanks

RNA-Seq • 4.9k views

ADD COMMENT • link updated 2.0 years ago by Ram 44k • written 8.5 years ago by Aurelie MLB ▴ 360

Ram · Answer 1 · 2016-02-02

5

Entering edit mode

8.5 years ago

Rob 6.8k

Hi Aurelie,

It's not a silly question at all. TPM (like RPKM / FPKM / RPK, etc.) is a relative abundance measure. It accounts for the relative rate at which reads are sampled from transcripts. Intuitively, it answers the question, if I sampled 1 million transcripts (randomly and in an un-biased manner) from the set of all expressed transcripts, how much of each transcript would I see (in expectation). As such, TPM is not designed to normalize for library size. If, for example, I doubled the number of reads deriving from each transcript, the TPMs would not change. This is by design, as the TPM factors out the actual number of mapped reads in the experiment to derive a rate. In this sense, the TPMs are normalized.

However, this is almost not certainly what you mean when you ask for normalization by library size. With the exception of limma / voom, every tool I know of for differential expression will use the counts (or estimated counts) of reads mapping to each transcript in each condition. Given this "raw" input, different tools will then apply different normalization techniques to explicitly account for the fact that different samples may have a different number of reads or a different number of mapped reads. Thus, to use sailfish with other tools for downstream differential expression, you can make use of the "NumReads" field (the last column in the quant.sf files). For specific tips (and a useful software package) in using Sailfish with downstream differential expression tools, I recommend you take a look at this paper.

ADD COMMENT • link updated 4.5 years ago by Ram 44k • written 8.5 years ago by Rob 6.8k

0

Entering edit mode

Hello,

Thanks a lot for your answer. Very helpful!

I did not know I could use the sailfish NumReads for differential expression analysis. I will have a closer look. Thanks!

The first part of your answer is actually the type of explanation I was looking for. So, (please correct me if I am wrong) if I understand well: it should be possible to compare the expression of genes across samples, shouldn't it (even if the libraries are of varying sizes)? I prefer to ask because I just stumbled accross this post where the first post is saying "2. I understand why one shouldn't compare TPM between samples, since the total expression rates, rRNA component etc. varies sample-to-sample.". Well I do not get this very well. I thought that was one of the point of normalising... Sorry I am getting confused now.

Thanks!

ADD REPLY • link updated 4.5 years ago by Ram 44k • written 8.5 years ago by Aurelie MLB ▴ 360

2

Entering edit mode

So, the reason TPM is not directly comparable across samples is, in large part, because of how aggressively normalized it is. Consider the following (extreme and illustrative) example, you sequence a sample where only 2 transcripts, A and B, are actually present. Say they are present in equal amounts, and there are 5,000 copies of each molecule in the sample. Assuming unbiased sampling and accurate inference, you'll get back TPM estimates of ~500k for each transcript. Now, consider a second sample, where there are 3 transcripts expressed, A, B and C. Further, assume that the number of As and Bs is the same as in our first sample (5,000 copies each), but now C also has 5,000 copies expressed. This time, when we estimate TPM, we'll get a TPM of ~333K for each of A, B and C. This is because the TPM estimates will always sum to 1,000,000. TPM is always breaking up the entire expressed transcriptome into a partition of unity (and then multiplying by 1,000,000). Thus, if the total expression level changes, so will the TPM (i.e. if a transcript is expressed, in "absolute" terms, at the same level across samples, but the expression of the other assayed transcripts changes, then the TPM values of all of the transcripts can change). The normalization approaches used for e.g. DGE, attempt to account for this, by not always forcing the total "expression" to be the same. It's worth noting that it's a tough problem though, because RNA-seq can, by nature, only give you relative estimates of expression (though spike-ins can help one interpret these relative levels). However, TPM, like R/FPKM is designed to be a nice, interpretable unit of expression within a sample, not really to be compared across samples (though you will see people doing this quite often).

ADD REPLY • link updated 4.5 years ago by Ram 44k • written 8.5 years ago by Rob 6.8k

0

Entering edit mode

Nice follow-up Rob. Even though I knew that TPM normalises for total reads indirectly, I could not have explained to a colleague why that means it is not possible to compare different samples.

I have a question now though :P

Spike-ins are one method to detect total abundances - could another be to do qPCR on a bit of your library (same library that goes on the sequencer, which we may even still have lying around in a freezer somewhere from old sequencing experiments) and get expression of 100 housekeeping genes from each sample. Since qPCR can be quantitative, could this information be used to adjust your TPMs to actual (quantitative) abundances?

ADD REPLY • link updated 4.5 years ago by Ram 44k • written 8.5 years ago by John 13k

0

Entering edit mode

Thanks a lot for your explanation!

ADD REPLY • link 8.5 years ago by Aurelie MLB ▴ 360

Ram · Answer 2 · 2016-02-02

1

Entering edit mode

8.5 years ago

John 13k

Kind of - you divide by the sum of all rates, which is the same sort of transformation as dividing the reads per transcript by all reads. Except you're here using rates and not reads, of course.

$\text{TPM}_i = \dfrac{X_i}{\widetilde{l}_i} \cdot \left( \dfrac{1}{\sum_j \dfrac{X_j}{\widetilde{l}_j}} \right) \cdot 10^6$ .

Read the excellent https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ for all the deets :)

ADD COMMENT • link updated 4.5 years ago by Ram 44k • written 8.5 years ago by John 13k

1

Entering edit mode

Thank you very much!

ADD REPLY • link 8.5 years ago by Aurelie MLB ▴ 360