This is a really interesting benchmarking paper that compared a bunch of RNAseq methods.
Overview of methods:
They used TopHat, RUM and STAR for alignment. Genome-guided analyses were run with and without annotations (where supported) using Cufflinks, Scripture, CEM, IsoLasso, Casper and IReckon. De novo analyses were run with Trinity, OASES, SOAPdenovo-Trans and EBARdenovo.
They created "truth" datasets in which the isoforms present were predefined: simulated idealized reads, simulated realistic reads, and a real ("spike-in") dataset of actual RNA sequencing reads for 1,062 in vitro expressed human cDNAs (from the Mammalian Gene Collection). For the simulated data they also defined true negative isoforms by removing exons from known isoforms and not including any simulated reads for those isoforms. They focus on the ability of these algorithms to correctly recapitulate the known isoforms in these datasets. To count as a true positive, a prediction must join exons into a complete isoform with the same structure as the known isoform, but it does not need to get the transcription start or stop sites right (that is an even harder problem in some ways).
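To make the scoring concrete, here is a minimal sketch of how FDR and FNR could be computed under that true-positive definition. It is not the paper's evaluation code. The function names, the coordinates and the choice to compare transcripts by their ordered introns (so that differing start and stop positions are ignored) are my own assumptions for illustration. Note that single-exon transcripts have no introns, so this simple key cannot tell them apart.

```python
from typing import Iterable, Set, Tuple

Exon = Tuple[int, int]  # (start, end); hypothetical coordinates
ChainKey = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def intron_chain(chrom: str, strand: str, exons: Iterable[Exon]) -> ChainKey:
    """Reduce a transcript to chromosome, strand and its ordered introns.

    The outer ends of the first and last exon are dropped, so two
    transcripts that differ only in start/stop position get the same key.
    """
    ordered = sorted(exons)
    introns = tuple(
        (ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)
    )
    return (chrom, strand, introns)


def fdr_fnr(truth: Set[ChainKey], predicted: Set[ChainKey]) -> Tuple[float, float]:
    """Return (FDR, FNR) for a set of predicted isoforms against the truth."""
    tp = len(truth & predicted)   # predicted and real
    fp = len(predicted - truth)   # predicted but not real
    fn = len(truth - predicted)   # real but missed
    fdr = fp / (tp + fp) if predicted else 0.0
    fnr = fn / (tp + fn) if truth else 0.0
    return fdr, fnr


# Toy gene with two true splice forms (made-up coordinates).
truth = {
    intron_chain("chr1", "+", [(100, 200), (300, 400), (500, 600)]),
    intron_chain("chr1", "+", [(100, 200), (500, 600)]),
}
predicted = {
    # Same introns as the first true form, different ends: counts as correct.
    intron_chain("chr1", "+", [(90, 200), (300, 400), (500, 650)]),
    # Truncated form that is not in the truth set: a false positive.
    intron_chain("chr1", "+", [(100, 200), (300, 400)]),
}
fdr, fnr = fdr_fnr(truth, predicted)
print(f"FDR={fdr:.2f} FNR={fnr:.2f}")
```

In this toy case one of two predictions is wrong and one of two real isoforms is missed, so both rates are 50%.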
Most algorithms performed well with perfect data and a single splice form, but they tend to falter when predicting multiple splice forms. Once there are two splice forms, all algorithms have an FDR above 10%. The application of Cufflinks to a TopHat alignment (Cufflinks + TopHat; in de novo mode?), which is common in practice, results in a 40% FDR. The de novo methods incur substantially higher error rates, with Trinity having a nearly 90% FDR on two-splice-form genes, coupled with an approximately 50% false negative rate. Curiously, Cufflinks performs better with a TopHat alignment than with a STAR alignment, even though STAR produced a more accurate alignment. On the more realistic data, the application of Cufflinks + TopHat to perform de novo identification incurs an FDR of around 30% with an FNR of around 25%. This only seems better than the idealized data because their more realistic data (in terms of reads with errors, intron noise, etc.) was also much simpler in terms of isoform complexity.
With the real in vitro data, both the FDR and FNR were generally much higher. Cufflinks + TopHat + Annotation performed the best, with an FDR of 20% and an FNR of 26%. This was an order of magnitude worse than for the ideal data, which speaks to the complications introduced by alignment errors, polymorphisms, etc. Remember that this spike-in data was almost exclusively just one isoform per gene and just ~1000 genes in total. For the genes with more than one splice form, the results were worse still. Even with known annotations and a very simple sample (with only 1000 genes and very few isoforms), 20% of the isoforms that Cufflinks predicts are incorrect and it misses 26% of the real isoforms. And Cufflinks was best-in-class!
In terms of expression estimates, with realistic simulated data, perfect alignments and annotations available, Cufflinks reports 16.7% of isoforms with an FPKM more than one logarithm off of truth. Without annotations, this number increases to 38.62%. With real (non-perfect) alignments and no annotation available it is even worse at 54.12%. Unfortunately they don't mention the performance with real alignments but annotations available (which is the common real-world situation), but we can assume that more than ~17% of isoforms are going to be at least an order of magnitude off in their estimation. The other algorithms all did as badly as or worse than Cufflinks.
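For anyone wanting to compute a comparable number on their own simulated data, a rough sketch of the "more than one logarithm off" metric follows. I am assuming a base-10 logarithm (i.e., an order of magnitude), and the function names, the FPKM values and the treatment of a zero estimate as "off" are my own illustrative choices, not taken from the paper.

```python
import math
from typing import Dict


def log10_error(true_fpkm: float, est_fpkm: float) -> float:
    """Absolute log10 ratio between an estimated and a true FPKM."""
    if true_fpkm <= 0:
        raise ValueError("true FPKM must be positive")
    if est_fpkm <= 0:
        # Assumption: a real isoform estimated at zero is infinitely far off.
        return math.inf
    return abs(math.log10(est_fpkm / true_fpkm))


def fraction_off(truth: Dict[str, float], estimates: Dict[str, float],
                 threshold: float = 1.0) -> float:
    """Fraction of true isoforms whose estimate is more than `threshold`
    log10 units from truth. Isoforms missing from `estimates` count as zero."""
    if not truth:
        raise ValueError("truth must not be empty")
    off = sum(
        1 for iso, true_fpkm in truth.items()
        if log10_error(true_fpkm, estimates.get(iso, 0.0)) > threshold
    )
    return off / len(truth)


# Made-up FPKM values for four hypothetical isoforms.
true_values = {"iso_a": 10.0, "iso_b": 5.0, "iso_c": 100.0, "iso_d": 2.0}
est_values = {"iso_a": 12.0, "iso_b": 0.4, "iso_c": 1500.0, "iso_d": 2.0}
print(fraction_off(true_values, est_values))
```

Here iso_b is underestimated 12.5-fold and iso_c is overestimated 15-fold, so two of the four isoforms are more than an order of magnitude off.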
Some choice quotes from the authors:
- "with a reference genome available with some degree of community annotation, it is hard to imagine any benefit of using a de novo approach"
- "The extreme overcalling of forms makes it unclear how to utilize the output of Scripture in a practical way"
- "These results are not encouraging"
The take-home message for me:
TopHat/Cufflinks with a reference genome and annotations is the best current option but far from perfect. Given the reality of noisy data, imperfect alignments and multiple isoforms expressed per gene (expecting most genes to express at least two isoforms), we can probably expect at least ~20% of real isoforms to be missed, ~20% of predicted isoforms to be wrong, and ~20% of expression estimates to be wrong by at least an order of magnitude. And this is when you have a reference genome and good annotations available to guide Cufflinks. De novo mode using Cufflinks (and especially some other methods) should only be considered experimental/exploratory given the very high FDRs and FNRs. We can expect to miss real, biologically important isoforms, and any novel isoforms predicted should be validated. This paper strongly emphasizes the importance of continued improvement to transcript isoform discovery and quantification methods and/or improved data quality (e.g., longer reads).
In the authors' words, "short reads fundamentally lack the information necessary to build local information into globally accurate transcripts ... Most likely a satisfactory solution will involve an evolution in the nature of the data. Or perhaps some keen insight into how to identify and effectively utilize signals in the genome that inform cellular machinery on what splice forms to generate."