Recently I got some results from cuffdiff pipeline (aligned with tophat, assembled with cufflinks, merged transcripts from all samples with cuffcompare, quantified with cuffquant and DE with cuffdiff). I ran two independent datasets in parallel:
- In vitro data with 4 treatments and 2 replicates each, 2 time points (8 samples)
- In vivo data with 2 treatments and 7 samples each (14 samples)
Within the outputs, I was more interested in differentially expressed novel isoforms (those labeled "j" in the merged transcript file). I used IGV to check the alignments and the assembly coordinates of those new isoforms, but turns out that 1. the nearest_ref does not seem the nearest reference at all (http://goo.gl/rPxrJd) and 2. I checked the .tracking file generated by cuffmerge to see which samples had this isoform and only one, from a different time point, had reads. And that puzzles me: how come these reads are reassigned downstream to an isoform that was in only 1 of 8 samples and worse, detected as DE?! Also, from the figure wasn't the Known isoform 1 supposed to be the nearest ref for the novel isoform, since known isoform 2 has two extra exons?
The same thing happened with the second dataset, with some novel isoforms being crazily detected as DE even though they were present in only one out of 14 samples. So am I missing something, interpreting the data the wrong way...? Should I filter the merged transcript file to exclude these weird assemblies? Or in case cufflinks/cuffdiff is really flawed, does anyone have another tool to suggest?