I've got an RNA-seq experiment with ~30M reads per sample and two sample types: 10 biological replicates of SampleA and 6 biological replicates of SampleB (human, hg19).
I've run the Tuxedo pipeline: TopHat -> Cufflinks -> Cuffmerge -> Cuffdiff.
Upon inspecting the list of potentially novel isoforms, I found one of interest. Its first exon contains a sequence that is a predicted protein-coding domain and fits perfectly between a start and a stop codon. However, Cufflinks/Cuffdiff reports an extra ~34 bases in front of that sequence as part of the exon, and they seem out of place.
When I looked at the coverage in every sample's BAM file, it was clear that the sequence I suspected, the one that fits perfectly, is the true sequence. So my question really is: where could these extra 34 or so bases have come from in the Cufflinks/Cuffdiff prediction?
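For reference, the kind of coverage check described above can be sketched as follows. This is a minimal sketch under assumptions, not the exact procedure used here: it assumes per-base depths have already been exported as tab-separated `chrom, position, depth` lines (for example with `samtools depth -a`), that the exon is on the plus strand so the extra bases sit at its lowest coordinates, and the function names and the default of 34 bases are purely illustrative.

```python
def parse_depth(lines):
    """Parse 'chrom<TAB>pos<TAB>depth' lines into a {position: depth} dict."""
    depths = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _chrom, pos, depth = line.split("\t")[:3]
        depths[int(pos)] = int(depth)
    return depths


def mean_depth(depths, start, end):
    """Mean depth over a 1-based inclusive interval; missing positions count as 0."""
    length = end - start + 1
    return sum(depths.get(p, 0) for p in range(start, end + 1)) / length


def prefix_vs_body(depths, exon_start, exon_end, prefix_len=34):
    """Compare mean depth of the suspicious leading bases with the rest of the exon.

    Assumes a plus-strand exon, so the prefix is the first `prefix_len` bases.
    Returns (mean depth of prefix, mean depth of remaining exon).
    """
    prefix = mean_depth(depths, exon_start, exon_start + prefix_len - 1)
    body = mean_depth(depths, exon_start + prefix_len, exon_end)
    return prefix, body
```

A prefix mean far below the body mean in every sample would match what the BAM coverage shows: the reads support the shorter exon, not the extra bases.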
Could it be something to do with the RABT assembly using faux reads?
I've tried every combination of the Cufflinks switches, and every time that novel isoform is detected, the 34 bases are prepended to the exon, which doesn't make sense.
If anyone has any insights, I'd be very grateful!