5.7 years ago by

Stockholm

malachig gave a very good answer, but just to address the question of how Cufflinks assigns reads to transcripts: at its core (though with many bells and whistles) it uses an EM (expectation-maximization) algorithm to compute ML (maximum-likelihood) estimates of the isoform abundances, together with the probability of each read having come from each isoform.

Imagine that you have a set of reads mapping to the gene in question, and a set of isoforms for that gene. If you knew the abundance of each isoform, you could probabilistically "distribute" each read across the isoforms that it could possibly have come from (that is, assign a probability for that read coming from each of the matching isoforms), using the abundance of the isoforms. (Of course, you don't have the abundances.) And if you knew in advance the probability of each read coming from each of the isoforms, you could calculate the resulting estimated isoform abundances. Of course, you don't know these either. How to solve this conundrum? By iteration!
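To make the two "if you knew" statements concrete, here they are in symbols for a deliberately simplified model. The notation is my own assumption, not Cufflinks' actual likelihood (which also accounts for transcript length, fragment-length distribution and bias): N is the number of reads, a_ij is 1 if read i is compatible with isoform j and 0 otherwise, theta_j is the fraction of reads that originate from isoform j, and p_ij is the probability that read i came from isoform j.

```latex
p_{ij} = \frac{a_{ij}\,\theta_j}{\sum_k a_{ik}\,\theta_k}
\qquad\text{and}\qquad
\theta_j = \frac{1}{N}\sum_{i=1}^{N} p_{ij}
```

The first formula needs the abundances to get the read probabilities, and the second needs the read probabilities to get the abundances, which is exactly the circularity described above.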

The EM algorithm in Cufflinks basically iterates a procedure where it guesses the probabilities of each read coming from each isoform (based on the abundances of the isoforms that are compatible with the read), then re-calculates the abundances with the updated read<->isoform assignment, then again re-assigns the probabilities based on the newly updated abundance estimates, and so on until convergence. (The probabilities and abundances are probably initialized either to be uniform across the reads/isoforms or randomized in some way.) The determination of whether a read is compatible with an isoform can use information from spliced and paired-end alignments.
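The loop described above can be sketched in a few lines of Python. This is a toy illustration, not Cufflinks code: the function name `em_isoform_abundances` and its parameters are made up for this example, compatibility is a plain yes/no per read, every read is assumed to be compatible with at least one isoform, and transcript lengths are ignored, so the result is the estimated fraction of reads per isoform rather than a length-normalized abundance.

```python
def em_isoform_abundances(compat, n_iter=200, tol=1e-9):
    """Toy EM for read-to-isoform assignment (hypothetical helper).

    compat[i] is the set of isoform names that read i is compatible with.
    Returns a dict mapping isoform name -> estimated fraction of reads.
    """
    isoforms = sorted({iso for c in compat for iso in c})
    # Start from uniform abundances (an assumption; other starts are possible).
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}

    for _ in range(n_iter):
        # E-step: split each read across its compatible isoforms,
        # in proportion to the current abundance estimates.
        counts = dict.fromkeys(isoforms, 0.0)
        for c in compat:
            total = sum(theta[iso] for iso in c)
            for iso in c:
                counts[iso] += theta[iso] / total

        # M-step: new abundances are the expected share of reads per isoform.
        new_theta = {iso: counts[iso] / len(compat) for iso in isoforms}

        # Stop once the estimates no longer change noticeably.
        delta = max(abs(new_theta[iso] - theta[iso]) for iso in isoforms)
        theta = new_theta
        if delta < tol:
            break
    return theta


if __name__ == "__main__":
    # 6 reads unique to isoform A, 2 unique to B, 2 ambiguous between them.
    reads = [{"A"}] * 6 + [{"B"}] * 2 + [{"A", "B"}] * 2
    print(em_isoform_abundances(reads))
```

In this made-up example the estimates settle at roughly 0.75 for A and 0.25 for B: the two ambiguous reads end up split 3:1, in line with the ratio of the uniquely assigned reads.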

I may have some of the steps in the wrong order, but I believe this is essentially how Cufflinks works (in quantification mode; it also has reference-based transcript assembly modes, of course). Cufflinks has a lot of features in addition to what I outlined, like a Monte Carlo-based calculation of confidence intervals, bias correction, etc.

For a relatively gentle mathematical description of the basic idea, see the "ancestral paper" "An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs" (not about Cufflinks, but it describes the EM framework in simple terms). For a very good, albeit more challenging, description of Cufflinks and many other isoform quantification procedures, see "Models for transcript quantification from RNA-Seq" (by Lior Pachter, one of the people behind Cufflinks).