Hello, I have many questions about Cufflinks output. Here is one of them. First I used TopHat to map my RNA-seq reads (100 bp) and obtain an accepted_hits.bam file. Then I ran Cufflinks in two ways:
- simply: cufflinks accepted_hits.bam
- with a GTF file, which is the current annotation of my genome (eucalyptus):
cufflinks -g annotGenome.gtf accepted_hits.bam
Note that I used the -g option and not the -G option.
One example result :
- without reference gtf :
one gene / one isoform : 12110-17714
The first part of this isoform, 12-16530, has the same intron/exon structure as the isoforms assembled with the reference. Then I have a last exon, 16530-17714.
- with reference GTF:
one gene / two isoforms
+ transcript 1: 12024-17350 = exact transcript from the reference
full_read_support "no";
The region corresponding to the last exon of the no-reference run is now split into two exons:
16530 - 16561
16597 - 17350
That matches my reference, but in my data this intron is covered by reads. No read is split across it. A few reads begin at position 16595, and I have checked that no read ends at 16561. I think this RNA doesn't exist in my transcriptome.
+ transcript 2 : 12024-17714
full_read_support "yes";
last exon: 16530-17714, the same exon as in the no-reference version
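One way to double-check that no read is split across the 16561/16597 junction is to scan the CIGAR strings of the alignments in that region. The sketch below is only an illustration, not part of TopHat or Cufflinks: it assumes the alignments have been dumped as SAM text (for example with `samtools view` on accepted_hits.bam), and the function names `introns_in_read` and `reads_supporting_intron` are made up for this example. With exons at 16530-16561 and 16597-17350, the intron itself spans 16562-16596.

```python
import re

# One CIGAR element is a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def introns_in_read(pos, cigar):
    """Return the (first_base, last_base) of every N gap in one alignment.

    pos is the 1-based leftmost mapped position (SAM column 4);
    the returned coordinates are 1-based and inclusive.
    """
    introns = []
    cursor = pos
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":  # skipped region on the reference, i.e. a splice
            introns.append((cursor, cursor + length - 1))
        if op in REF_CONSUMING:
            cursor += length
    return introns


def reads_supporting_intron(sam_lines, intron_start, intron_end):
    """Return the names of reads spliced exactly across the given intron."""
    names = []
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or fields[5] == "*":
            continue
        name, pos, cigar = fields[0], int(fields[3]), fields[5]
        if (intron_start, intron_end) in introns_in_read(pos, cigar):
            names.append(name)
    return names
```

Calling `reads_supporting_intron(sys.stdin, 16562, 16596)` on piped SAM text would list the reads that support the reference intron; an empty list is consistent with the observation that no read is split there.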
Why does transcript 2 contain the 12024-12109 portion, which is not covered by any RNA-seq reads (whereas the reference, i.e. transcript 1, begins with this sequence)?
For the two isoforms, I have FPKM values (4 for transcript 1, which doesn't seem to exist in my transcriptome, and 13 for transcript 2). How does Cufflinks assign those values?
In the run without the reference GTF, I get FPKM = 36, about double the total from the run with the reference (13 + 4 = 17), although the mapping file is the same.
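To make this kind of comparison less error-prone than reading the files by eye, the per-transcript FPKM and full_read_support values can be pulled out of the Cufflinks GTF output and summed per gene. The following is a minimal sketch under a few assumptions: the input is the transcripts.gtf written by each run, the attribute names `FPKM` and `full_read_support` appear as in the excerpts above, and the helper names `parse_transcripts` and `fpkm_per_gene` are hypothetical.

```python
import re
from collections import defaultdict

# GTF attributes look like: key "value"; key "value"; ...
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_transcripts(gtf_lines):
    """Collect one record per 'transcript' feature line of a GTF file."""
    records = []
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # skip exon lines and malformed lines
        attrs = dict(ATTR_RE.findall(fields[8]))
        records.append({
            "gene_id": attrs.get("gene_id"),
            "transcript_id": attrs.get("transcript_id"),
            "start": int(fields[3]),
            "end": int(fields[4]),
            "fpkm": float(attrs.get("FPKM", "nan")),
            "full_read_support": attrs.get("full_read_support"),
        })
    return records


def fpkm_per_gene(records):
    """Sum transcript-level FPKM values for each gene_id."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["gene_id"]] += rec["fpkm"]
    return dict(totals)
```

Running `parse_transcripts` on the output of both runs and comparing `fpkm_per_gene` for the locus gives the 17 versus 36 comparison directly, and filtering on `full_read_support == "no"` lists the transcripts that are reported without complete read coverage.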
Finally, note that these transcripts are located on the forward strand of the genome, and that there is nothing in the GTF or in the Cufflinks results on the opposite strand at this location.
Many thanks for your suggestions,
Sohnic
I have a similar problem. When I use the GTF file with Cufflinks, it detects a few hundred transcripts. When I don't use a GTF file, Cufflinks detects and assembles about 5000 transcripts. I know that we expect a few thousand transcripts from this experiment, so I don't understand this counter-intuitive behavior.
Here's a thought to consider: you can either guide (-g) or constrain (-G) how cufflinks handles transcripts. With the 'guide' option, it assumes that you will later on (when running cuffmerge) merge your novel transcripts with known transcripts. With the 'constrain' option, you skip the merge step and proceed straight to running cuffdiff. Might that explain this behaviour?