Question: Rpkm (R Code)
2
gravatar for HNK
5.7 years ago by
HNK120
Netherlands
HNK120 wrote:

Hello everyone I am trying to write a code in R to calculate RPKM. (I am totally new in R, so having problems). For example i have the following txt file with length as last column,

Geneid      Rep1       Rep2       gene Length
COMMD10      3        4        13
USP26        4        6        14
DDX17        6        7         20

and using the following formula to calculate RPKM

RPKM= 10^9 * number of mapped reads in gene
             ----------------------------------------------------
             total reads *  gene length

R code:

count <- read.delim("sample_count.txt", row.names=1, stringsAsFactors=FALSE)
head(count)

for (i in 1:2) {
  count[,i] <- (count[,i]*10^9) / sum(count[,i])   #(count[,i]*10^9) is  10^9 * number of mapped reads in gene
                                                   #sum(count[,i]) is total reads
}

I cant handle the gene length in last column of count table. Can anyone please help me with this code to get the correct answer?

rpkm • 7.4k views
ADD COMMENTlink modified 23 months ago by EagleEye6.5k • written 5.7 years ago by HNK120
5
gravatar for Devon Ryan
5.7 years ago by
Devon Ryan93k
Freiburg, Germany
Devon Ryan93k wrote:

Firstly, it's best to avoid for loops in R, particularly since you don't need them here. If the dataframe in your example were namedd:

row.names(d) <- d$Geneid
l <- d[,4] 
d <- d[,-c(1,4)] #Remove the Geneid and length columns, they get in the way
cS <- colSums(d) #Total mapped reads per sample
rpkm <- (10^6)*t(t(d/l)/cS)

I'm assuming the lengths are already in kb, so there's a 10^6 multiplier instead of 10^9.

Edit: The matrix transposition in the last step are due to cS needing to be applied to each column, rather than sequentially across the column-order-stored matrix.

ADD COMMENTlink modified 21 months ago • written 5.7 years ago by Devon Ryan93k

Thankyou so much for the reply... I made some changes in the code as it was giving some problem

row.names(d) <- d$Geneid
l <- d[,3] #accessing the length column
d <- d[,-c(3)] #Remove the Geneid and length columns, they get in the way
cS <- colSums(d) #Total mapped reads per sample 
rpkm <- (10^9)*(d/l/cS)

this do give me a result .. but i think is not correct.. 1st colSums(d) donot shows any result and gives error Error: could not find function "colsums". 2nd Are you using thje rite formula to calculate RPKM? shouldnt that be rpkm <- (10^9d)(l*cS)

Kindly do reply

ADD REPLYlink modified 5.7 years ago by Devon Ryan93k • written 5.7 years ago by HNK120
1

Also note RPKM should count the kilobases of exonic sequence, the length of the spliced transcript, NOT the distance genomically from the transcription start-site to the end. This is more tricky to calculate, but it looks like that data is given in your input table.

ADD REPLYlink written 5.7 years ago by karl.stamm3.6k

I guess you already set the rownames, so you can ignore that part. The error that you received was because you typed colsums rather than colSums, don't forget the capital S. Whether one uses 10^9 or 10^6 depends on whether the lengths in bp or kb. I noted in my original reply that I was assuming that your lengths were actually already in kb (otherwise, everything is really short), but if not then you should indeed use 10^9 instead. The difference of 1000 is just from the conversion of bp to kb.

Edit: BTW, you can't multiply l*cS, one has length 2 and the other 3, so R will have no clue how you want to handle it. Also, you want division anyway, since it's reads per kb per million mapped reads (aka, reads/length.in.kb/mapped.reads.in.millions).

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by Devon Ryan93k

Hi Devon, I also trying to calculate RPKM from miRNA raw count. I tried your R code but i am getting error on this : cS <- colSums(d) #Total mapped reads per sample

my input looks like ( I hope you saw below in this conversation, don't mind) with miRNA_length included in the last coloum. I am new to R. So, will help me.

ADD REPLYlink modified 22 months ago • written 22 months ago by k.kathirvel93210

You presumably have a column of gene names. That won't work. You can figure out the rest from there.

ADD REPLYlink written 22 months ago by Devon Ryan93k

Dear Devon,

Since you've mentioned Edit: Note that the R code above isn't exactly correct, since the division won't occur properly by column. can I just use the formula as rpkm <- (cs*10^9)/(d*l)?

Thank you.

ADD REPLYlink written 21 months ago by vinayjrao150
1

No, you really just need to transpose some of the matrices: t(t(d/l)/cS). I'll update my answer.

ADD REPLYlink written 21 months ago by Devon Ryan93k
0
gravatar for EagleEye
23 months ago by
EagleEye6.5k
Sweden
EagleEye6.5k wrote:
 P<-read.table("input_table.txt",sep="\t",header=T)

len <- ncol(p)

rownames(p) <- p[,1]

for(i in 2:ncol(p)-1)

{

d <-p[,i]

l <- p[,len] #accessing the length column

cS <- sum(as.numeric(p[,i])) #Total mapped reads per sample 

rpkm[[i]] <- (10^9)*(as.numeric(p[,i]))/(as.numeric(l)*cS)

rpkm[[1]] <- p[[1]]

}

write.table(rpkm,"output_table_rpkm.txt",sep="\t",quote=F,row.names=F)‚Äč

R script: http://bioinformaticsonline.com/snippets/view/28042/rpkm-normalization-r-script

PERL script: https://github.com/santhilalsubhash/rpkm_rnaseq_count

ADD COMMENTlink modified 23 months ago • written 23 months ago by EagleEye6.5k

Hi eagleEye,

I am impressed on your R code for RPKM calculation. Actually am working in 7 same tissue miRNA-seq data. I have raw counts for all miRNAsfrom 7 samples from featureCounts. Now my input files is looks like

miRNA_ID    sample1     sample2  sample3    sample4    sample5    sample6   sample7
miR-185        5842        5884      656         52        578       4587       544
miR-200        5487        5268      584        585         57          6       485

So what are the changes I should do in your R Code. I am new to R programming as well as data normalization.Thanks in advance.

ADD REPLYlink modified 22 months ago by Devon Ryan93k • written 22 months ago by k.kathirvel93210

you need a column in the end with lengths for corresponding RNA. That's all you need as input to above code.

ADD REPLYlink modified 22 months ago • written 22 months ago by EagleEye6.5k

Thank you EagleEye, So its not a problem about number of samples, right?

ADD REPLYlink modified 22 months ago • written 22 months ago by k.kathirvel93210

Any number of samples will work. Just make sure you are including lengths in the last column.

ADD REPLYlink modified 22 months ago • written 22 months ago by EagleEye6.5k

If I have not any length column, how can I compute RPKM, or how can I find out or generate length column.

ADD REPLYlink written 10 days ago by ali.cse.bd0

Don't compute RPKM unless you have a compelling reason to do so. If you need to generate gene lengths then get the needed information from biomart or your GTF file.

ADD REPLYlink written 10 days ago by Devon Ryan93k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1002 users visited in the last hour