Question

cufflinks, RPKM for a gene, taking only exons

0


8.6 years ago

tonja.r ▴ 600

I have (poly-A) mRNA-seq data and want RPKM/FPKM values for each gene. I plan to give Cufflinks a GTF file containing only exon annotations (or should the transcript information also be included?) so that reads falling into intronic/intergenic regions are not counted. What risks could I run into with this approach?

RNA-Seq • 2.9k views

updated 19 months ago by Ram 43k • written 8.6 years ago by tonja.r ▴ 600

Ram · Accepted Answer · 2015-09-24

2


8.6 years ago

Carlo Yague 8.7k

I believe there is some confusion between read mapping (which is usually done against the whole genome) and the RPKM/FPKM calculation, which is always made on specific features, such as exons.

In your case, Cufflinks takes as input reads that are already mapped, so you don't "get rid" of reads that fall into intergenic regions; they are simply ignored if they don't fall in the regions of interest. Now, to answer your question, I think it is fine to input only exons if you really want to. But why not give Cufflinks the annotated transcripts too? The more information, the better. If you are worried about de novo transcript discovery, Cufflinks can do it even if you give it annotated transcripts, so it's fine!
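To make the "calculated on specific features" point concrete, here is a minimal sketch of the standard RPKM formula for a single feature. The function name and the numbers are hypothetical, chosen only for illustration; Cufflinks itself estimates FPKM with a statistical model rather than this simple count-based calculation.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical numbers: 500 reads on 2,000 bp of exonic sequence,
# in a library of 20 million mapped reads.
example = rpkm(read_count=500, feature_length_bp=2000, total_mapped_reads=20_000_000)
print(example)
```

Reads that map outside the feature never enter `read_count`, which is why intergenic reads do not need to be removed beforehand.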

updated 19 months ago by Ram 43k • written 8.6 years ago by Carlo Yague 8.7k

0


Firstly, if I provide only exon information, then Cufflinks will know which reads to count. But as far as I understand, RPKM is reported per isoform. So how will it assign an RPKM value per isoform if I give it only annotated exons? Will it assemble the annotated exons based on the reads?

Secondly, if I provide only transcript information, it contains only the start and end of each transcript (no information about the start and end of the exons), so Cufflinks will also count intronic regions if they exist, and I want to avoid that.

Thirdly, if I get RPKM per isoform, would it be appropriate to take the average over all isoforms and report it as RPKM per gene?

updated 19 months ago by Ram 43k • written 8.6 years ago by tonja.r ▴ 600

0


You are right, there are pros and cons to exon vs. transcript information, which is why you can input BOTH levels to Cufflinks! Look at this annotation file example (GFF-style attributes) with three levels: gene, transcript and exon.

I    PomBase    gene    31140    32345    .    -    .    Name=gene:SPAC977.18;ID=gene:SPAC977.18
I    PomBase    transcript    31140    32345    .    -    .    Name=transcript:SPAC977.18.1;ID=transcript:SPAC977.18.1;Parent=gene:SPAC977.18
I    ensembl    exon    31140    31557    .    -    .    Name=SPAC977.18.1:exon:1;Parent=transcript:SPAC977.18.1
I    ensembl    exon    31768    31913    .    -    .    Name=SPAC977.18.1:exon:2;Parent=transcript:SPAC977.18.1
I    ensembl    exon    32163    32345    .    -    .    Name=SPAC977.18.1:exon:3;Parent=transcript:SPAC977.18.1
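As a rough illustration of how the exon and transcript levels fit together, the sketch below sums exon lengths per parent transcript from the lines above. The helper name is hypothetical, and the parser assumes whitespace-separated columns with GFF-style `key=value` attributes as shown here; a real annotation file is tab-delimited and may need a proper parser.

```python
from collections import defaultdict

ANNOTATION = """\
I PomBase gene 31140 32345 . - . Name=gene:SPAC977.18;ID=gene:SPAC977.18
I PomBase transcript 31140 32345 . - . Name=transcript:SPAC977.18.1;ID=transcript:SPAC977.18.1;Parent=gene:SPAC977.18
I ensembl exon 31140 31557 . - . Name=SPAC977.18.1:exon:1;Parent=transcript:SPAC977.18.1
I ensembl exon 31768 31913 . - . Name=SPAC977.18.1:exon:2;Parent=transcript:SPAC977.18.1
I ensembl exon 32163 32345 . - . Name=SPAC977.18.1:exon:3;Parent=transcript:SPAC977.18.1
"""


def exon_length_per_transcript(text):
    """Sum exon lengths (1-based, inclusive coordinates) per parent transcript."""
    lengths = defaultdict(int)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(None, 8)  # nine columns; attributes are the last
        if len(fields) < 9 or fields[2] != "exon":
            continue
        start, end = int(fields[3]), int(fields[4])
        attrs = dict(item.split("=", 1) for item in fields[8].split(";") if "=" in item)
        parent = attrs.get("Parent")
        if parent is not None:
            lengths[parent] += end - start + 1
    return dict(lengths)


lengths = exon_length_per_transcript(ANNOTATION)
print(lengths)
```

For this transcript the exons add up to 747 bp, whereas the transcript line alone spans 1,206 bp (31140 to 32345), so the exon records are what exclude the intronic bases from the length.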

Edit: Concerning your third point, I don't know. I guess it depends on your question and data.
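One thing that may help when deciding: the sum and the mean of isoform values answer different questions. The sketch below uses made-up isoform names and FPKM values purely for illustration. To my understanding, Cufflinks also writes gene-level values itself (in `genes.fpkm_tracking`), so it is worth checking that output before aggregating by hand.

```python
# Hypothetical per-isoform FPKM values for one gene (illustrative only).
isoform_fpkm = {"isoform_A": 30.0, "isoform_B": 10.0, "isoform_C": 0.0}

# Sum: total abundance of all transcripts from the gene.
gene_sum = sum(isoform_fpkm.values())

# Mean: depends on how many isoforms happen to be annotated, so an
# unexpressed annotated isoform (isoform_C) pulls the value down.
gene_mean = gene_sum / len(isoform_fpkm)

print(gene_sum, gene_mean)
```

Adding or removing an unexpressed isoform from the annotation changes the mean but not the sum, which is one reason the average can be hard to interpret as a per-gene value.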

updated 19 months ago by Ram 43k • written 8.6 years ago by Carlo Yague 8.7k