Hello, I have many questions about Cufflinks output. Here is one of them: First, I used TopHat to map my RNA-seq reads (100 bp) and obtained an accepted_hits.bam file. Then I ran Cufflinks in two ways:
- simply: cufflinks accepted_hits.bam
- with a GTF file, which is the current annotation of my genome (eucalyptus):
cufflinks -g annotGenome.gtf accepted_hits.bam
Note that I used the -g option and not the -G option.
Here is one example result:
- without reference GTF: one gene / one isoform: 12110-17714
The first part of this isoform, 12-16530, has the same intron/exon structure as the isoforms built with the reference. Then I have a last exon, 16530-17714.
- with reference GTF: one gene / two isoforms
+ transcript 1: 12024-17350 = exact transcript from the reference, full_read_support "no";
The region corresponding to the last exon of the no-reference run is now two exons:
16530-16561 and 16597-17350
This matches my reference annotation, but in my run this intron is covered by reads. No read is split in two parts across it. A few reads begin at position 16595, and I have checked that no read ends at 16561. I think this RNA does not exist in my transcriptome.
+ transcript 2: 12024-17714, full_read_support "yes"; last exon: 16530-17714, the same exon as in the no-reference version
Why does transcript 2 contain the 12024-12109 portion, which is not covered by any RNA-seq reads (whereas the reference, i.e. transcript 1, begins with this sequence)?
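To make the check I did on the transcript 1 intron (between 16561 and 16597) reproducible, here is a minimal sketch of how the read evidence could be counted from SAM text. It only parses the POS and CIGAR columns as defined in the SAM format; the function names `parse_alignment` and `junction_evidence` are my own illustrative helpers, not part of TopHat, Cufflinks or samtools, and deletions are treated as part of an aligned block for simplicity.

```python
import re

# One CIGAR element: a length followed by an operation code (SAM format).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_alignment(pos, cigar):
    """Return (blocks, introns) as lists of 1-based inclusive intervals.

    pos   -- leftmost mapping position (SAM column 4)
    cigar -- CIGAR string (SAM column 6)
    An N operation is a skipped region (intron); M, D, = and X consume
    reference bases and are kept inside the current aligned block.
    """
    blocks, introns = [], []
    ref = pos
    block_start = pos
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":
            if ref > block_start:
                blocks.append((block_start, ref - 1))
            introns.append((ref, ref + length - 1))
            ref += length
            block_start = ref
        elif op in "MD=X":
            ref += length
        # I, S, H and P do not consume reference bases.
    if ref > block_start:
        blocks.append((block_start, ref - 1))
    return blocks, introns


def junction_evidence(sam_lines, exon1_end, exon2_start):
    """Count reads that support or contradict the intron between two exons."""
    intron = (exon1_end + 1, exon2_start - 1)
    counts = {
        "spliced": 0,                # read skips exactly this intron
        "covers_intron": 0,          # aligned bases fall inside the intron
        "ends_at_exon1_end": 0,      # read stops at the annotated donor side
        "starts_at_exon2_start": 0,  # read starts at the annotated acceptor side
    }
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        pos, cigar = int(fields[3]), fields[5]
        if cigar == "*":  # unmapped or no alignment information
            continue
        blocks, introns = parse_alignment(pos, cigar)
        if not blocks:
            continue
        if intron in introns:
            counts["spliced"] += 1
        if any(s <= intron[1] and e >= intron[0] for s, e in blocks):
            counts["covers_intron"] += 1
        if blocks[-1][1] == exon1_end:
            counts["ends_at_exon1_end"] += 1
        if blocks[0][0] == exon2_start:
            counts["starts_at_exon2_start"] += 1
    return counts
```

The input would be the SAM records overlapping the region, for example the text produced by `samtools view` on the indexed accepted_hits.bam, passed as an iterable of lines to `junction_evidence(lines, 16561, 16597)`. In my case I would expect "spliced" and "ends_at_exon1_end" to be 0 and "covers_intron" to be greater than 0.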
For the two isoforms, I have FPKM values (4 for transcript 1, which does not seem to exist in my transcriptome, and 13 for transcript 2). How does Cufflinks assign those values?
In the run without the reference GTF, I get FPKM=36, which is about double the total from the run with the reference (13+4=17), although the mapping file is the same.
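To show why I expected the two totals to be similar, here is a small sketch of the textbook FPKM definition (fragments per kilobase of transcript per million mapped fragments). This is only the basic formula, not how Cufflinks computes its values: as far as I understand, Cufflinks distributes fragments among isoforms with a statistical model, so its numbers can differ. The fragment counts, transcript length and library size below are hypothetical and are not taken from my data.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical numbers, for illustration only.
total_mapped = 10_000_000
length_bp = 2000

# One isoform receiving all 200 fragments of the locus.
single = fpkm(200, length_bp, total_mapped)

# The same 200 fragments split between two isoforms of equal length:
# with this simple formula the sum is unchanged.
split_sum = fpkm(120, length_bp, total_mapped) + fpkm(80, length_bp, total_mapped)

# Ratio actually observed between my two runs (36 versus 13 + 4).
observed_ratio = 36 / (13 + 4)

print(single, split_sum, round(observed_ratio, 2))
```

With this naive formula, splitting the same fragments between isoforms of equal length keeps the total unchanged, which is why the factor of about 2 between my two runs surprises me.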
Finally, note that these transcripts are located on the forward strand of the genome, and that there is nothing in the GTF or in the Cufflinks results on the opposite strand at this location.
Many thanks for your suggestions,