Question: cufflinks, RPKM for a gene, taking only exons
4.2 years ago by
tonja.r460 wrote:

I have (poly-A) mRNA-seq data. I want RPKM/FPKM values for each gene, and I want to give Cufflinks a GTF file containing only exon annotation (or should the transcript information also be included?) so that reads falling into intronic or intergenic regions are not counted. What risks can I run into with this approach?

rna-seq • 1.7k views
4.2 years ago by
Carlo Yague4.8k wrote:

I believe there is some confusion between read mapping (which is usually done against the whole genome) and the RPKM/FPKM calculation, which is always made on specific features such as exons.
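
To make the distinction concrete, here is a minimal sketch of the textbook RPKM definition applied to a single feature. The function name and the numbers are made up for illustration, and Cufflinks itself estimates FPKM with a statistical model over fragments, so this is only the basic formula, not what Cufflinks runs internally.

```python
def rpkm(reads_in_feature, feature_length_bp, total_mapped_reads):
    """Reads Per Kilobase of feature per Million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # 1e9 = 1e3 (per kilobase) * 1e6 (per million mapped reads)
    return reads_in_feature * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical numbers: 500 reads on 2,000 bp of exons, 10 million mapped reads.
example = rpkm(500, 2000, 10_000_000)
print(example)  # 25.0
```

Only reads assigned to the feature enter the numerator, which is why the choice of features (exons vs. whole transcript span) matters more than where the reads were mapped.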

In your case, Cufflinks takes as input reads that are already mapped, so you don't "get rid" of reads that fall into intergenic regions; they are simply ignored if they don't fall within the regions of interest. Now, to answer your question: I think it is fine to input only exons if you really want to. But why not give Cufflinks the annotated transcripts too? The more information, the better. If you are worried about de novo transcript discovery, Cufflinks can do it even if you give it annotated transcripts, so it's fine!


Firstly, if I provide only exon information, then Cufflinks will know which reads to count. But as far as I understand, RPKM is reported per isoform. So how will it assign an RPKM value per isoform if I give it only annotated exons? Will it assemble the annotated exons based on the reads?
Secondly, if I provide only transcript information, it contains only the start and end of each transcript (no information about the start and end of the exons), so Cufflinks will also count intronic regions if they exist, and I want to avoid that.
Thirdly, if I get RPKM per isoform, would it be appropriate to take the average over all isoforms and report it as RPKM per gene?

written 4.2 years ago by tonja.r460

You are right, there are pros and cons to exon vs. transcript information, which is why you can input BOTH levels to Cufflinks! Look at this annotation file example (written in GFF3 syntax, which Cufflinks accepts as well as GTF) with three levels: gene, transcript and exon.


I    PomBase    gene    31140    32345    .    -    .    Name=gene:SPAC977.18;ID=gene:SPAC977.18
I    PomBase    transcript    31140    32345    .    -    .    Name=transcript:SPAC977.18.1;ID=transcript:SPAC977.18.1;Parent=gene:SPAC977.18
I    ensembl    exon    31140    31557    .    -    .    Name=SPAC977.18.1:exon:1;Parent=transcript:SPAC977.18.1
I    ensembl    exon    31768    31913    .    -    .    Name=SPAC977.18.1:exon:2;Parent=transcript:SPAC977.18.1
I    ensembl    exon    32163    32345    .    -    .    Name=SPAC977.18.1:exon:3;Parent=transcript:SPAC977.18.1
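
As a rough illustration of why the exon lines matter, the sketch below parses the five records above and compares the transcript's genomic span with the summed length of its exons. The helper name `feature_lengths` is hypothetical, and the parsing assumes whitespace-separated columns with `key=value` attributes as in this example; it is not a general GFF3 parser.

```python
# The five records from the example above, whitespace-separated.
RECORDS = """\
I PomBase gene 31140 32345 . - . Name=gene:SPAC977.18;ID=gene:SPAC977.18
I PomBase transcript 31140 32345 . - . Name=transcript:SPAC977.18.1;ID=transcript:SPAC977.18.1;Parent=gene:SPAC977.18
I ensembl exon 31140 31557 . - . Name=SPAC977.18.1:exon:1;Parent=transcript:SPAC977.18.1
I ensembl exon 31768 31913 . - . Name=SPAC977.18.1:exon:2;Parent=transcript:SPAC977.18.1
I ensembl exon 32163 32345 . - . Name=SPAC977.18.1:exon:3;Parent=transcript:SPAC977.18.1
"""


def feature_lengths(text):
    """Return {transcript ID: (genomic span, summed exon length)} in bp."""
    spans, exonic = {}, {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(None, 8)  # nine columns; attributes stay intact
        ftype, start, end = fields[2], int(fields[3]), int(fields[4])
        attrs = dict(item.split("=", 1) for item in fields[8].split(";") if item)
        length = end - start + 1  # coordinates are 1-based and inclusive
        if ftype == "transcript":
            spans[attrs["ID"]] = length
        elif ftype == "exon":
            parent = attrs["Parent"]
            exonic[parent] = exonic.get(parent, 0) + length
    return {tid: (spans[tid], exonic.get(tid, 0)) for tid in spans}


lengths = feature_lengths(RECORDS)
print(lengths)  # {'transcript:SPAC977.18.1': (1206, 747)}
```

The transcript spans 1,206 bp on the genome, but only 747 bp of that is exonic; the remaining 459 bp are the two introns. With the exon lines present, the exon structure of the transcript is defined, so the introns do not have to be treated as part of it.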

Edit: concerning your third point, I don't know. I guess it depends on your question and data.
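
For what it's worth on the third point: FPKM is additive across the isoforms of a gene, and Cufflinks writes gene-level values to genes.fpkm_tracking alongside isoforms.fpkm_tracking, so averaging isoforms by hand should not be needed. The sketch below uses made-up isoform values (the gene and isoform names are hypothetical, not real Cufflinks output) to show how far an average can drift from the sum.

```python
from collections import defaultdict

# Hypothetical (gene, isoform, FPKM) rows, not real Cufflinks output.
isoform_fpkm = [
    ("geneA", "geneA.1", 12.0),
    ("geneA", "geneA.2", 3.0),
    ("geneB", "geneB.1", 7.5),
]


def gene_level(rows):
    """Return {gene: (sum of isoform FPKMs, mean of isoform FPKMs)}."""
    grouped = defaultdict(list)
    for gene, _isoform, fpkm in rows:
        grouped[gene].append(fpkm)
    return {g: (sum(v), sum(v) / len(v)) for g, v in grouped.items()}


summary = gene_level(isoform_fpkm)
for gene, (total, mean) in summary.items():
    print(gene, total, mean)
```

Here geneA sums to 15.0 but averages to 7.5, which would make it look as abundant as the single-isoform geneB even though it has twice the total signal.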

modified 4.2 years ago • written 4.2 years ago by Carlo Yague4.8k