Question

Effective length in normalisation for reads counts


8.2 years ago

Xiaokang ▴ 80

There are several methods to normalise raw read counts, such as RPKM (FPKM for paired-end data), TPM, TMM, and so on. To remove the effect of gene length, length-aware measures such as RPKM and TPM divide the raw counts by the gene length, since long genes tend to have more reads mapped to them. But I read in a paper that you should use the "effective length" instead of the gene length (Pachter, Lior. "Models for transcript quantification from RNA-Seq." arXiv preprint arXiv:1104.3889 (2011)). The "effective length" equals gene length - read length + 1, which is the number of positions at which a read can start.

But what if the gene is shorter than the read length? Then you can get a negative effective length. And if you divide the raw read counts by that negative value, you also get a negative normalised value in the RPKM or TPM matrix.

RNA-Seq • 9.4k views

updated 8.2 years ago by Rob 7.1k • written 8.2 years ago by Xiaokang ▴ 80

score 11 · Answer 1 · 2017-05-20

Methods that implement effective length correction all avoid generating negative effective lengths, though they do so in quite a few different ways, depending on the tool.

However, it may be most useful to think of the "effective length" as a property of both a transcript and a specific read, rather than of a transcript alone. Consider a transcript of length m and a read that maps to it with length n (the total distance between the leftmost and rightmost mapped bases). If n > m (i.e. the read overhangs the transcript), we can assume n = m; this is a rare case and likely an artifact of mapping or misannotation. This particular read can then start at m - n + 1 different locations. So, from the perspective of this read, the effective length of the transcript is m - n + 1.

Now, a transcript will typically have many reads mapping to it, and we can define (as Li et al. do in RSEM) the expected effective length of the transcript as the per-read effective length averaged over all reads that map to that transcript. There are other approximations of effective length that have different properties in terms of, e.g., computational convenience, but I find the notion of expected effective length the most straightforward to understand.

Moreover, in this case you can see why the quantity is never negative: any read that maps to a transcript must have at least one potential start site, though often there could have been many. I think this perspective also helps show why it makes sense to consider the effective length rather than the raw length. You can read a slightly longer explanation (with nice math typesetting) at my blog.
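To make the per-read view concrete, here is a minimal Python sketch of the arithmetic described above. The function names and the example numbers are made up for illustration; this is not how RSEM or any other tool actually implements the correction, just the m - n + 1 rule with the n = m clamp, averaged over the reads that map to one transcript.

```python
from statistics import mean


def per_read_effective_length(transcript_len: int, read_span: int) -> int:
    """Number of positions at which a read of this span could start."""
    # A read that overhangs the transcript (n > m) is treated as n = m,
    # so it still has exactly one possible start site.
    n = min(read_span, transcript_len)
    return transcript_len - n + 1


def expected_effective_length(transcript_len: int, read_spans: list[int]) -> float:
    """Average the per-read effective length over all reads mapped to the transcript."""
    return mean([per_read_effective_length(transcript_len, n) for n in read_spans])


# Hypothetical example: a 150 bp transcript with four mapped reads,
# one of which (span 200) overhangs the transcript.
m = 150
spans = [100, 120, 150, 200]

naive = m - max(spans) + 1  # the "length - read length + 1" rule, unclamped
per_read = [per_read_effective_length(m, n) for n in spans]
expected = expected_effective_length(m, spans)

print(naive)     # -49
print(per_read)  # [51, 31, 1, 1]
print(expected)  # 21
```

The unclamped formula goes negative for the overhanging read, while every per-read value is at least 1, so the average is always positive and safe to divide by.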