Question: How To Calculate The Rpkm For The Count Tables Of Rna Seq Data
narges180 wrote:

Hi,

I have a count table of RNA-seq data with 8 biological replicates across two conditions (4 biological replicates per condition), like below:

           V1 V2 V3 V4 V5 V6 V7 V8 V9
2 ENSG00000000003  0  0  0  0  1  0  0  0
3 ENSG00000000005  0  0  0  0  0  0  0  0
4 ENSG00000000419 10 24 19 20 19  8 14  6
5 ENSG00000000457 17 15 13 18 21 18 21 15
6 ENSG00000000460  2  3  5  2  4  6  8  2
7 ENSG00000000938 20  4 35 16 10 17 19  9

How can I calculate the RPKM value for each gene? I have the count table, but I do not have the gene lengths.

rpkm rna-seq • 14k views
modified 6.3 years ago by Ge80 • written 6.3 years ago by narges180
Damian Kao15k wrote:

You cannot calculate the RPKM if you don't have the gene's length.

written 6.3 years ago by Damian Kao15k

And how can I get the gene's length?

written 6.3 years ago by narges180

What kind of data do you have exactly? Is it just the table of counts? Do you know what the reads were mapped to?

modified 6.3 years ago • written 6.3 years ago by Damian Kao15k

I first used TopHat with the Bowtie2Index to map the reads, and then counted the reads with HTSeq using a gtf file.

written 6.3 years ago by narges180

Like Ge said below, you can use the gtf file to get the gene lengths. You can write a script to do that. If you don't have experience in scripting, you can open the .gtf file in Excel and compute each feature's length as the 5th column (end position) minus the 4th column (start position), plus 1.

It's a nice little project for you to try to learn scripting if you don't already know how.

modified 6.3 years ago • written 6.3 years ago by Damian Kao15k

Thank you. Just one more question: do the start and end positions you mentioned also include the introns? I mean, are they the start and end positions of the genes on the chromosome? Is the gene length simply the difference between these two values, or should I take other factors into account as well?

written 6.3 years ago by narges180

That depends on how your gtf file is structured. Usually only transcript structures are listed in a gtf file, but not everyone follows the rules. Can you post a few lines of your file?

written 6.3 years ago by Damian Kao15k

Generally, in a gtf file each row is one exon, CDS, or other feature. You can tell which from the feature column (the third column). For an exon, for instance, end - start + 1 is its length. If you want the length of a transcript or gene, you can get the mapping between exons and transcripts, and even genes, from the "attributes" column.

written 6.3 years ago by Ge80

Right. So you basically need to add up the lengths of the exons for each gene to get the transcript length.

written 6.3 years ago by Damian Kao15k

Many thanks to both of you. I downloaded the gtf file from the UCSC Genome Browser site; it is the latest version of hg19 and looks like this:

> chr1    unknown    exon    11874    12227    .    +    .    gene_id    DDX11L1    transcript_id    NR_046018_1    gene_name    DDX11L1    tss_id    TSS14523
> chr1    unknown    exon    12613    12721    .    +    .    gene_id    DDX11L1    transcript_id    NR_046018_1    gene_name    DDX11L1    tss_id    TSS14523
> chr1    unknown    exon    13221    14408    .    +    .    gene_id    DDX11L1    transcript_id    NR_046018_1    gene_name    DDX11L1    tss_id    TSS14523
modified 6.3 years ago • written 6.3 years ago by narges180
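
For reference, the exon-summing approach discussed above could be sketched in Python roughly as follows. This is only a sketch under some assumptions: the attribute column is in the standard GTF form (gene_id "..."; transcript_id "...";), whereas the lines pasted above show the fields separated by whitespace only; overlapping exons from different transcripts of the same gene are counted once; and each gene_id sits on a single chromosome. The function names parse_attributes and gene_lengths are made up for illustration.

```python
import re
from collections import defaultdict


def parse_attributes(field):
    """Parse a standard GTF attribute string into a dict.

    Assumes the form: gene_id "X"; transcript_id "Y"; ...
    """
    return dict(re.findall(r'(\S+)\s+"([^"]*)"', field))


def gene_lengths(gtf_lines):
    """Return {gene_id: number of exonic bases}, counting overlaps once."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Column 3 is the feature type; only exon rows are used here.
        if len(cols) < 9 or cols[2] != "exon":
            continue
        start, end = int(cols[3]), int(cols[4])
        gene_id = parse_attributes(cols[8]).get("gene_id")
        if gene_id is not None:
            exons[gene_id].append((start, end))

    lengths = {}
    for gene_id, intervals in exons.items():
        # Assumption: all exons of one gene_id lie on the same chromosome.
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:          # overlaps the current block
                cur_end = max(cur_end, end)
            else:                         # gap: close the current block
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1  # GTF coordinates are inclusive
        lengths[gene_id] = total
    return lengths


# The three exon rows from the post, rewritten with standard attributes.
attrs = 'gene_id "DDX11L1"; transcript_id "NR_046018_1";'
example = [
    "\t".join(["chr1", "unknown", "exon", "11874", "12227", ".", "+", ".", attrs]),
    "\t".join(["chr1", "unknown", "exon", "12613", "12721", ".", "+", ".", attrs]),
    "\t".join(["chr1", "unknown", "exon", "13221", "14408", ".", "+", ".", attrs]),
]
print(gene_lengths(example))  # {'DDX11L1': 1651}
```

For a real file you would pass an open file handle instead of the example list, since the function only needs an iterable of lines.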

If your aim is to find differentially expressed genes between the two groups, I would suggest using one of the edgeR/DESeq/baySeq R packages rather than calculating RPKM values. See this paper: http://www.ncbi.nlm.nih.gov/pubmed/22988256

written 6.3 years ago by Sudeep1.6k

Actually, my goal is to rank genes by their expression level, not to do DE analysis.

written 6.3 years ago by narges180

Is there any script available for calculating RPKM? I have a matrix with genes in the first column and gene_length in the second column, followed by the counts for all the samples in the other columns.

written 3.0 years ago by genie6620
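
For a table laid out as described (gene, gene length, then one count column per sample), a minimal Python sketch could look like this. It uses the usual definition RPKM = count * 1e9 / (gene length in bp * total mapped reads). As an assumption, the total mapped reads per sample is approximated by that sample's column sum; if you have the real number of mapped reads from the aligner, use that instead. The function name rpkm_table and the toy numbers are hypothetical.

```python
def rpkm_table(rows):
    """Compute RPKM per gene and sample.

    rows: list of (gene_id, gene_length_bp, [count for each sample]).
    Returns {gene_id: [RPKM for each sample]}.
    """
    n_samples = len(rows[0][2])
    # Assumption: library size is approximated by the column sum of counts.
    totals = [sum(counts[i] for _, _, counts in rows) for i in range(n_samples)]

    result = {}
    for gene_id, length, counts in rows:
        result[gene_id] = [
            (c * 1e9) / (length * t) if t > 0 else 0.0
            for c, t in zip(counts, totals)
        ]
    return result


# Toy input with made-up lengths and counts (two genes, two samples).
rows = [
    ("geneA", 1000, [10, 20]),
    ("geneB", 2000, [30, 20]),
]
values = rpkm_table(rows)

# Rank genes by expression in the first sample, highest first.
ranked = sorted(values, key=lambda g: values[g][0], reverse=True)
print(ranked)  # ['geneB', 'geneA']
```

With only two genes the values are unrealistically large, because the column sums are tiny; on a full count table the totals are in the millions.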
Ge80 wrote:

raw counts = FPKM * (length of that transcript/1000) * (# of mapped reads / 1e6) and you can do the math.

The gene length or transcript length can be extracted from the gtf file.

modified 6.3 years ago • written 6.3 years ago by Ge80
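
Rearranging the relation above for the normalized value gives the following, where C is the raw count for a gene, L its length in bp, and N the total number of mapped reads in the sample. The symbols C, L and N are shorthand introduced here, and the same form applies to RPKM with reads in place of fragments.

```latex
\[
C = \mathrm{FPKM} \times \frac{L}{10^{3}} \times \frac{N}{10^{6}}
\quad\Longrightarrow\quad
\mathrm{FPKM} = \frac{C \times 10^{9}}{L \times N}
\]
```

For instance, with hypothetical values C = 20, L = 2000 bp and N = 10,000,000, this gives 20 * 10^9 / (2000 * 10^7) = 1.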

Actually, I had the bam files of my RNA-seq samples. So I used TopHat to get the alignments and then ran HTSeq on the accepted hits file to get the count tables. Now I need to know the expression level of the genes, so I decided to calculate the RPKM. But I am not sure at which step I should have calculated it: before using HTSeq to get the table above, or now, after getting the count table from HTSeq? Can I use the easyRNASeq R package to get RPKM values from the count table above?

written 6.3 years ago by narges180

When you ran HTSeq to get the counts, it also needed a gtf file as input, right? That is where you can find the lengths of exons, genes, or transcripts (whatever you are interested in). Calculate the RPKM after you have the counts. I have never used the easyRNASeq package, but I took a quick look, and it seems that it can calculate RPKM and other normalized versions of the raw counts.

written 6.3 years ago by Ge80

Thank you, I will try it.

written 6.3 years ago by narges180