As far as I understand it, your situation is not the standard setup for DE analysis, which assumes an independently sequenced reference genome and a genome annotation with exons, introns and other regions. In your de novo case, you have a transcriptome assembly that yields a set of transcript contigs built from the reads, and you want to assess differential expression using the same reads that were used to generate the assembly. Please comment if I misunderstood your question and provide additional information. The Bow-Top-Cuff... pipeline is mainly designed for the standard case of a reference genome, but you already have the transcripts, so you don't need cufflinks. Cufflinks doesn't need a GFF file as input either: it produces one as output and can also give you the FPKM, but that would redo the assembly, which is probably not what you want.
If you have alignments in SAM or BAM format (if not, create them by aligning the reads to the contigs), you can run cuffdiff directly on your SAM/BAM files and the GFF file you are asking for. Making such a file should be straightforward; following the spec, it looks like this:
# Fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame>
Contig_1 AssemblySoftware transcript 1 <length of contig1> . . .
Contig_2 AssemblySoftware transcript 1 <length of contig2> . . .
Explanation: you make one entry for each contig, starting at 1 and ending at the last base of the contig. There is no score, strand or frame information, so those fields are set to '.'.
I think the feature field is not relevant, but I'm not 100% sure whether a particular string is required.
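As a rough illustration of the one-line-per-contig idea, here is a minimal Python sketch that reads a FASTA file of contigs and emits tab-separated lines in the layout shown above. The function names (fasta_lengths, contigs_to_gff) are made up for this example, the source and feature strings are simply the placeholders from the example lines, and I have not confirmed whether cuffdiff accepts records without a ninth attributes column, so treat this as a starting point rather than a tested recipe.

```python
import io


def fasta_lengths(handle):
    """Yield (contig_name, length) for each record in an open FASTA file."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Use only the ID before the first whitespace as the sequence name.
            header = line[1:].split()
            name, length = (header[0] if header else "unnamed"), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def contigs_to_gff(handle, source="AssemblySoftware", feature="transcript"):
    """Return one tab-separated line per contig, spanning 1..contig length.

    The source and feature values are placeholders (assumptions), and
    score, strand and frame are set to '.' because they are unknown.
    """
    rows = []
    for name, length in fasta_lengths(handle):
        fields = [name, source, feature, "1", str(length), ".", ".", "."]
        rows.append("\t".join(fields))
    return rows


# Small in-memory example; for a real file use open("contigs.fa") instead
# (the file name is hypothetical).
example = io.StringIO(">Contig_1 some description\nACGTAC\nGT\n>Contig_2\nACG\n")
for row in contigs_to_gff(example):
    print(row)
```

The columns are joined with tabs because GFF is a tab-delimited format; if the downstream tool turns out to need attributes such as a gene or transcript ID, they would go in an extra column appended to each line.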
Hope this helps.
I would also keep the de novo assembly; in this case I'd just use cuffdiff to do the DE analysis (what's not clear to me is whether cuffdiff also does the gene quantification or only cufflinks does). Anyway, I think generating the appropriate GFF file with the transcripts is the key point here...
I guess I'm just not familiar with the GFF file format, and was reluctant to create one myself.
I'm not looking to do DE analysis; I'm just looking to get an FPKM for each transcript to see which transcripts have high FPKM and why. I didn't want to use Tophat because I didn't want it to generate junctions if it thought there were some.
Can someone report whether the procedure actually worked?
Tophat just creates a junction file (.bed), doesn't it? It doesn't, for example, tell you the location of exons, CDS and UTRs.
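For orientation, FPKM is fragments per kilobase of transcript per million mapped fragments, so a naive per-contig value can be computed directly from fragment counts. The sketch below only illustrates that definition; the function name and the example numbers are invented, and cufflinks/cuffdiff apply their own statistical model, so their reported values will generally differ from this simple ratio.

```python
def naive_fpkm(fragment_count, transcript_length, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    Equivalent to count / (length / 1e3) / (total / 1e6).
    """
    if transcript_length <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    return fragment_count * 1e9 / (transcript_length * total_mapped_fragments)


# Invented example: contig name -> (fragments mapped to it, contig length in bases)
counts = {"Contig_1": (500, 1000), "Contig_2": (50, 2000)}
total = 1_000_000  # invented total number of mapped fragments in the library

for contig, (n_fragments, length) in counts.items():
    print(contig, naive_fpkm(n_fragments, length, total))
```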
I don't think that will help him in a de novo assembly.
@Michael Dondrup: You're right. I don't know why I was thinking about transcriptome mapping... Thanks for pointing this out
@Michael Dondrup: You're right about the de novo assembly, but instead of mapping the reads to his assembled transcriptome library with bowtie he could do it with tophat. It seems to me that if you want to use the Cufflinks/Cuffdiff tools it's better to map with tophat, not just with bowtie.
I just think that one should keep the assembly of the de novo data, and therefore I wouldn't use the GFF data generated by tophat. As the reference sequence is an assembled transcriptome, there should not be any splice junctions to discover anyway, as the transcriptome is already spliced. But who knows. So it would do no harm to use it for the alignments.
Indeed, and the transcripts are already assembled and ready; actually it is like 'contig = transcript'. He simply needs to make a GFF file with one line per contig. cuffdiff should be able to read this.