cufflinks, RPKM for a gene, taking only exons
8.6 years ago
tonja.r ▴ 600

I have (poly-A) mRNA-seq data. I want RPKM/FPKM values for each gene, and I want to give Cufflinks a GTF file containing only exon annotations (or should transcript information be included as well?) so that reads falling in intronic or intergenic regions are excluded. What risks could I run into with this approach?

RNA-Seq • 2.9k views
8.6 years ago

I believe there is some confusion here between read mapping, which is usually done against the whole genome, and the RPKM/FPKM calculation, which is always made on specific features such as exons.

In your case, Cufflinks takes reads that are already mapped as input, so you don't "get rid" of reads that fall into intergenic regions; they are simply ignored if they don't fall in your regions of interest. Now, to answer your question: I think it is fine to input only exons if you really want to. But why not give Cufflinks the annotated transcripts too? The more information, the better. If you are worried about de novo transcript discovery, Cufflinks can still do it even when you give it annotated transcripts, so it's fine!
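To make the "ignored, not removed" point concrete, here is a minimal Python sketch of a naive RPKM calculation restricted to exon intervals. It is not what Cufflinks does internally (Cufflinks fits a statistical model over fragments); the function names, coordinates, read positions and library size are all made up for illustration.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


def count_reads_in_exons(read_positions, exons):
    """Count reads whose 1-based position lies inside any (start, end) exon."""
    return sum(
        any(start <= pos <= end for start, end in exons)
        for pos in read_positions
    )


# Hypothetical toy data: two exons of 100 bp each on one gene.
exons = [(100, 199), (300, 399)]
# Reads at 250 (intronic) and 450 (intergenic) stay in the alignment file;
# they are simply not counted for this gene.
read_positions = [120, 150, 250, 310, 450]
total_mapped_reads = 1_000_000  # assumed library size

n_exonic = count_reads_in_exons(read_positions, exons)
exon_length = sum(end - start + 1 for start, end in exons)
value = rpkm(n_exonic, exon_length, total_mapped_reads)
print(n_exonic, exon_length, value)
```

The mapped reads themselves are untouched; only the set of features decides which ones contribute to the value.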


Firstly, if I provide only exon information, then Cufflinks will know which reads to count. But as far as I understand, RPKM is reported per isoform. So how will it assign an RPKM value per isoform if I give it only annotated exons? Will it assemble the annotated exons into isoforms based on the reads?

Secondly, if I provide only transcript information, it contains only the start and end of each transcript (no information about the start and end of the exons), so Cufflinks will also count reads in intronic regions if there are any, and I want to avoid that.

Thirdly, if I get RPKM per isoform, would it be appropriate to take the average over all isoforms and report it as the RPKM per gene?


You are right, there are pros and cons to exon-level vs. transcript-level information, which is why you can input BOTH levels to Cufflinks! Look at this annotation file example (written with GFF3-style `ID`/`Parent` attributes) with three levels: gene, transcript and exon.

I    PomBase    gene    31140    32345    .    -    .    Name=gene:SPAC977.18;ID=gene:SPAC977.18
I    PomBase    transcript    31140    32345    .    -    .    Name=transcript:SPAC977.18.1;ID=transcript:SPAC977.18.1;Parent=gene:SPAC977.18
I    ensembl    exon    31140    31557    .    -    .    Name=SPAC977.18.1:exon:1;Parent=transcript:SPAC977.18.1
I    ensembl    exon    31768    31913    .    -    .    Name=SPAC977.18.1:exon:2;Parent=transcript:SPAC977.18.1
I    ensembl    exon    32163    32345    .    -    .    Name=SPAC977.18.1:exon:3;Parent=transcript:SPAC977.18.1

Edit: Concerning your third point, I don't know. I guess it depends on your question and data.
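One way to think about the third point: FPKM is a length-normalized abundance, so the values of the isoforms of a gene add up, and as far as I know the gene-level FPKM that Cufflinks reports is that sum rather than a mean. The Python sketch below uses made-up isoform values (not real output) to show why an average is fragile: it changes with the number of isoforms that happen to be annotated.

```python
# Hypothetical isoform-level FPKM values for a single gene.
isoform_fpkm = {"isoform_A": 12.0, "isoform_B": 3.0, "isoform_C": 0.0}

# Summing gives a gene-level abundance that does not depend on how many
# isoforms are annotated.
gene_fpkm_sum = sum(isoform_fpkm.values())

# Averaging divides by the number of annotated isoforms, expressed or not.
gene_fpkm_mean = gene_fpkm_sum / len(isoform_fpkm)

# Dropping the unexpressed isoform leaves the sum alone but shifts the mean.
expressed = {name: v for name, v in isoform_fpkm.items() if v > 0}
sum_expressed = sum(expressed.values())
mean_expressed = sum_expressed / len(expressed)

print(gene_fpkm_sum, gene_fpkm_mean, sum_expressed, mean_expressed)
```

Whether a sum is the right summary still depends on the question being asked, but it is at least stable against changes in the annotation.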
