Hi all,
Recently I got some results from cuffdiff pipeline (aligned with tophat, assembled with cufflinks, merged transcripts from all samples with cuffcompare, quantified with cuffquant and DE with cuffdiff). I ran two independent datasets in parallel:
- In vitro data with 4 treatments and 2 replicates each, 2 time points (8 samples)
- In vivo data with 2 treatments and 7 samples each (14 samples)
Within the outputs, I was more interested in differentially expressed novel isoforms (those labeled "j" in the merged transcript file). I used IGV to check the alignments and the assembly coordinates of those new isoforms, but turns out that 1. the nearest_ref does not seem the nearest reference at all (http://goo.gl/rPxrJd) and 2. I checked the .tracking file generated by cuffmerge to see which samples had this isoform and only one, from a different time point, had reads. And that puzzles me: how come these reads are reassigned downstream to an isoform that was in only 1 of 8 samples and worse, detected as DE?! Also, from the figure wasn't the Known isoform 1 supposed to be the nearest ref for the novel isoform, since known isoform 2 has two extra exons?
The same thing happened with the second dataset, with some novel isoforms being crazily detected as DE even though they were present in only one out of 14 samples. So am I missing something, interpreting the data the wrong way...? Should I filter the merged transcript file to exclude these weird assemblies? Or in case cufflinks/cuffdiff is really flawed, does anyone have another tool to suggest?
Many thanks!
Hi Andrew, thanks for your reply! I was thinking, though, if it would be possible to use the assembled transcriptome as the reference .gtf that RSEM uses, for example. Or even the assembly is not reliable at all...?
I've had the same thought before, but I honestly think the novel calls that Cufflinks make aren't worth it - to be fair to Cufflinks, it's a very difficult problem, and there aren't that many alternatives out there. Biologically you'll get more out your experiment by reliably identifying statistically robust results with known annotation. Remember that if you do novel discovery using cufflinks and discover another potential X thousands of transcripts, then that's another X thousands of tests you're going to have to do in a differential expression analysis, which makes it harder for features to pass significance thresholds.