Question: Determining FPKM Of De Novo Assembly Of Transcriptome
7.6 years ago by
James Smith150 wrote:

I've done a de novo assembly of RNA-seq data, and now want to determine the FPKM of each transcript I've assembled. I've used Bowtie to map the reads to my assembled transcriptome library. However, I'm having difficulty creating a GTF/GFF file from my library of transcripts as input for Cufflinks.

Also, is there a better or easier way to go about calculating FPKM?

assembly fpkm cufflinks rna gff • 6.6k views
ADD COMMENTlink written 7.6 years ago by James Smith150
7.6 years ago by
Bergen, Norway
Michael Dondrup45k wrote:

As far as I understand it, your situation is not the standard setup for DE analysis, which assumes an independently sequenced reference genome and a genome annotation with exons, introns and other regions. In your case of de novo assembly, you have a transcriptome assembly yielding a set of transcript contigs based on reads, and you want to assess differential expression using the same reads that were used to generate the assembly. Please comment if I misunderstood your question, and provide additional information. The Bow-Top-Cuff... pipeline is mainly designed for the standard case of a reference genome, but you have the transcripts already, so you don't need Cufflinks. Cufflinks doesn't need a GFF file as input either: it produces one as output and can also give you the FPKM, but that would redo the assembly, which is probably not what you want.

If you have alignments in SAM or BAM format (if not, create them by aligning the reads to the contigs), you can run cuffdiff directly on your SAM/BAM files and the GFF file you are asking about. Making such a file should be straightforward; following the spec, it looks like this:

# Fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame>
Contig_1 AssemblySoftware transcript 1 <length of contig1> . . .
Contig_2 AssemblySoftware transcript 1 <length of contig2> . . .

Explanation: you make one entry for each contig, starting at 1 and ending at the last base of the contig. There is no score, strand or frame information, so those fields are left as '.'. I think the feature field is not relevant, but I'm not 100% sure whether a particular string is required.
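
A minimal sketch of generating such a file from the contig FASTA, one line per contig. The function names (`fasta_lengths`, `gtf_lines`) and the `AssemblySoftware` source label are illustrative, not part of any tool. Unlike the eight-column example above, it writes a ninth attribute column with `gene_id` and `transcript_id` and uses `exon` as the feature, on the assumption that the Cufflinks tools expect GTF-style exon records; please check this against the Cufflinks documentation for your version.

```python
import io

def fasta_lengths(handle):
    """Yield (name, length) for each record in a FASTA stream."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Use the first whitespace-delimited token of the header as the ID.
            tokens = line[1:].split()
            name, length = (tokens[0] if tokens else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length

def gtf_lines(handle, source="AssemblySoftware"):
    """Yield one tab-separated annotation line per contig, spanning 1..length."""
    for name, length in fasta_lengths(handle):
        # Assumption: each contig is treated as its own gene and transcript.
        attrs = f'gene_id "{name}"; transcript_id "{name}";'
        yield "\t".join(
            [name, source, "exon", "1", str(length), ".", ".", ".", attrs]
        )

if __name__ == "__main__":
    demo = io.StringIO(">Contig_1\nACGTACGT\n>Contig_2\nACG\n")
    for row in gtf_lines(demo):
        print(row)
```

To use it on a real assembly, open the contig FASTA file and pass the file handle to `gtf_lines` in place of the `io.StringIO` demo input.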

Hope this helps.

ADD COMMENTlink written 7.6 years ago by Michael Dondrup45k
7.6 years ago by
Marina Manrique1.3k wrote:


I would use Tophat (instead of Bowtie) before using cufflinks. Tophat calls Bowtie and generates GTF/GFF files in the appropriate format for Cufflinks.

See, for instance, this guide on how to use Cufflinks to discover new transcripts.

ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by Marina Manrique1.3k

I would also keep the de novo assembly; in this case I'd just use cuffdiff to do the DE analysis (what's not clear to me is whether cuffdiff also does the gene quantification or only Cufflinks does). Anyway, I think generating the appropriate GFF file with the transcripts is the key point here...

ADD REPLYlink written 7.6 years ago by Marina Manrique1.3k

I guess I'm just not familiar with the GFF file format, and was reluctant to create one myself.

I'm not looking to do DE analysis; I'm just looking to get an FPKM for each transcript, to see which transcripts have high FPKM and why. I didn't want to use Tophat because I didn't want it to generate junctions if it thought there were some.
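
For reference, if only a per-transcript FPKM is needed, the plain definition can be applied directly to per-contig fragment counts: FPKM = fragments on the contig × 10^9 / (contig length in bases × total mapped fragments). The sketch below assumes you already have such counts and lengths in hand; the function name `fpkm` and the example numbers are made up for illustration. This is the simple count-based formula, not the statistical model Cufflinks uses, so the values will not match Cufflinks output exactly.

```python
def fpkm(counts, lengths):
    """Return {contig: FPKM} from fragment counts and contig lengths (bases).

    counts  -- dict of contig name -> number of fragments mapped to it
    lengths -- dict of contig name -> contig length in bases
    """
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {
        name: counts[name] * 1e9 / (lengths[name] * total)
        for name in counts
    }

if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    example_counts = {"Contig_1": 500, "Contig_2": 1500}
    example_lengths = {"Contig_1": 1000, "Contig_2": 2000}
    for name, value in fpkm(example_counts, example_lengths).items():
        print(name, value)
```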

ADD REPLYlink written 7.6 years ago by James Smith150

Can someone confirm that the procedure actually worked?

ADD REPLYlink written 6.7 years ago by Fabian Bull1.3k

Tophat just creates a junction file (.bed), doesn't it? It doesn't, for example, tell you the location of exons, CDS and UTRs.

ADD REPLYlink written 7.6 years ago by Arun2.3k

I don't think that will help him in a de novo assembly.

ADD REPLYlink written 7.6 years ago by Michael Dondrup45k

@Michael Dondrup: You're right. I don't know why I was thinking about transcriptome mapping... Thanks for pointing this out

ADD REPLYlink written 7.6 years ago by Marina Manrique1.3k

@Michael Dondrup: You're right about the de novo assembly, but instead of mapping the reads to his assembled transcriptome library with Bowtie, he could do it with Tophat. It seems to me that if you want to use the Cufflinks/Cuffdiff tools, it's better to map with Tophat, not just with Bowtie.

ADD REPLYlink written 7.6 years ago by Marina Manrique1.3k

I just think that one should keep the assembly of the de novo data, and therefore I wouldn't use the GFF data generated by Tophat. As the reference sequence is an assembled transcriptome, there should not be any splice junctions to discover anyway, as the transcriptome is already spliced. But who knows. So it would do no harm to use it for the alignments.

ADD REPLYlink written 7.6 years ago by Michael Dondrup45k

Indeed, and the transcripts are already assembled and ready. It is effectively 'contig = transcript': he simply needs to make a GFF file with one line per contig. cuffdiff should be able to read this.

ADD REPLYlink written 7.6 years ago by Michael Dondrup45k