Determining Fpkm Of De Novo Assembly Of Transcriptome
2
7
11.6 years ago
James Smith ▴ 150

I've done a de novo assembly of RNA-seq data and now want to determine the FPKM of each transcript I've assembled. I've used Bowtie to map the reads to my assembled transcriptome library. However, I'm having difficulty creating a GTF/GFF file from my library of transcripts as input for Cufflinks.

Also, is there a better or easier way to calculate FPKM?

fpkm cufflinks assembly gff rna • 8.3k views
4
11.6 years ago

As far as I understand it, your situation is not the standard setup for DE analysis, which assumes an independently sequenced reference genome and a genome annotation with exons, introns and other regions. In your case of de novo assembly, you have a transcriptome assembly yielding a set of transcript contigs built from the reads, and you want to assess differential expression using the same reads that were used to generate the assembly. Please comment if I misunderstood your question, and provide additional information.

The Bow-Top-Cuff... pipeline is mainly designed for the standard case of a reference genome, but you already have the transcripts, so you don't need cufflinks. Cufflinks doesn't need a GFF file as input either: it produces one as output and can also give you the FPKM, but that would redo the assembly, which is possibly not what you want.

If you have alignments in SAM/BAM format (if not, create them by aligning the reads to the contigs), you can run cuffdiff directly on your SAM/BAM files and the GFF file you are asking about. Making such a file should be straightforward. Following the spec, it looks like this:

# Fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame>
Contig_1 AssemblySoftware transcript 1 <length of contig1> . . .
Contig_2 AssemblySoftware transcript 1 <length of contig2> . . .
...


Explanation: make one entry for each contig, starting at 1 and ending at the last base of the contig. There is no score, strand or frame information, so those fields are left as '.'. I think the feature field is not relevant, but I'm not 100% sure whether a particular string is required.
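The one-line-per-contig file described above can be generated from the assembly FASTA. Here is a minimal sketch in Python, assuming the contigs are in a plain FASTA file and that the first word of each header is the contig name. The function names `fasta_lengths` and `gff_lines` are made up for this example, and the `with_attributes` option is my own addition: as far as I know, GTF-based tools expect a ninth column with `gene_id` and `transcript_id`, so it may be needed, but I have not verified this against cuffdiff.

```python
import io


def fasta_lengths(handle):
    """Yield (contig_name, length) for each record read from a FASTA handle."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Assumption: the first whitespace-delimited word is the contig name.
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        else:
            length += len(line)
    if name is not None:
        yield name, length


def gff_lines(handle, source="AssemblySoftware", feature="transcript",
              with_attributes=False):
    """Yield one tab-separated line per contig, spanning base 1 to its last base."""
    for name, length in fasta_lengths(handle):
        fields = [name, source, feature, "1", str(length), ".", ".", "."]
        if with_attributes:
            # Optional ninth column in GTF attribute syntax (an assumption here).
            fields.append('gene_id "%s"; transcript_id "%s";' % (name, name))
        yield "\t".join(fields)


if __name__ == "__main__":
    # Tiny in-memory FASTA standing in for the real assembly file.
    demo = io.StringIO(">Contig_1 len=8\nACGTACGT\n>Contig_2\nACGT\nAC\n")
    for line in gff_lines(demo):
        print(line)
```

For a real assembly, replace the `io.StringIO` demo with `open("contigs.fa")` (a placeholder file name) and redirect the output to a file.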

Hope this helps.

0
11.6 years ago
Marina Manrique ★ 1.3k

Hi,

I would use Tophat (instead of Bowtie) before using cufflinks. Tophat calls Bowtie and generates GTF/GFF files in the appropriate format for Cufflinks.

See, for instance, this guide on how to use Cufflinks to discover new transcripts.

1

I would also keep the de novo assembly; in this case I'd just use cuffdiff to do the DE analysis (what's not clear to me is whether cuffdiff also does the gene quantification or only cufflinks does). Anyway, I think generating the appropriate GFF file with the transcripts is the key point here...

1

I guess I'm just not familiar with the GFF file format, and was reluctant to create one myself.

I'm not looking to do DE analysis; I'm just looking to get an FPKM for each transcript to see which transcripts have a high FPKM and why. I didn't want to use Tophat because I didn't want it to generate junctions if it thought there were some.
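If only a per-transcript FPKM is needed, the plain definition can be applied directly to fragment counts: FPKM = fragments mapped to the transcript × 10^9 / (transcript length in bases × total mapped fragments). Here is a rough sketch in Python, assuming you already have per-contig fragment counts from the Bowtie alignments and the contig lengths. The function name `fpkm` and the numbers are made up for illustration. This is not what Cufflinks computes internally; as far as I know it uses a statistical model instead of raw counts, so its values will differ.

```python
def fpkm(fragment_counts, lengths):
    """Return FPKM per transcript from raw fragment counts and lengths in bases.

    FPKM = count * 1e9 / (transcript_length * total_mapped_fragments)
    """
    total = sum(fragment_counts.values())
    if total == 0:
        raise ValueError("no mapped fragments")
    return {
        name: count * 1e9 / (lengths[name] * total)
        for name, count in fragment_counts.items()
    }


if __name__ == "__main__":
    # Hypothetical counts and lengths, for illustration only.
    counts = {"Contig_1": 500, "Contig_2": 1500}
    lengths = {"Contig_1": 1000, "Contig_2": 3000}
    for name, value in sorted(fpkm(counts, lengths).items()):
        print(name, value)
```

In this toy example both contigs get the same FPKM, because Contig_2 has three times the fragments but is also three times as long.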

0

Can someone confirm that the procedure actually worked?

0

Tophat creates just a junction file (.bed), doesn't it? It doesn't, for example, tell you the location of exons, CDS and UTRs.

0

I don't think that will help him in a de novo assembly.

0

@Michael Dondrup: You're right. I don't know why I was thinking about transcriptome mapping... Thanks for pointing this out.

0

@Michael Dondrup: You're right about the de novo assembly, but instead of mapping the reads to his assembled transcriptome library with bowtie, he could do it with tophat. It seems to me that if you want to use the Cufflinks/Cuffdiff tools, it's better to map with tophat, not just with bowtie.

0

I just think that one should keep the assembly of the de novo data, and therefore I wouldn't use the GFF data generated by tophat. As the reference sequence is an assembled transcriptome, there should not be any splice junctions to discover anyway, since the transcriptome is already spliced. But who knows. So it would do no harm to use tophat for the alignments.

0

Indeed, and the transcripts are already assembled and ready. It is effectively 'contig = transcript', so he simply needs to make a GFF file with one line per contig. cuffdiff should be able to read this.
