Question: Rpkm (R Code)
2
HNK120 wrote:

Hello everyone I am trying to write a code in R to calculate RPKM. (I am totally new in R, so having problems). For example i have the following txt file with length as last column,

``````Geneid      Rep1       Rep2       gene Length
COMMD10      3        4        13
USP26        4        6        14
DDX17        6        7         20
``````

and using the following formula to calculate RPKM

``````RPKM= 10^9 * number of mapped reads in gene
----------------------------------------------------
``````

R code:

``````count <- read.delim("sample_count.txt", row.names=1, stringsAsFactors=FALSE)

for (i in 1:2) {
count[,i] <- (count[,i]*10^9) / sum(count[,i])   #(count[,i]*10^9) is  10^9 * number of mapped reads in gene
}
``````

I cant handle the gene length in last column of count table. Can anyone please help me with this code to get the correct answer?

rpkm • 7.4k views
modified 23 months ago by EagleEye6.5k • written 5.7 years ago by HNK120
5
Devon Ryan93k wrote:

Firstly, it's best to avoid `for` loops in R, particularly since you don't need them here. If the dataframe in your example were named`d`:

``````row.names(d) <- d\$Geneid
l <- d[,4]
d <- d[,-c(1,4)] #Remove the Geneid and length columns, they get in the way
cS <- colSums(d) #Total mapped reads per sample
rpkm <- (10^6)*t(t(d/l)/cS)
``````

I'm assuming the lengths are already in kb, so there's a `10^6` multiplier instead of `10^9`.

Edit: The matrix transposition in the last step are due to `cS` needing to be applied to each column, rather than sequentially across the column-order-stored matrix.

Thankyou so much for the reply... I made some changes in the code as it was giving some problem

``````row.names(d) <- d\$Geneid
l <- d[,3] #accessing the length column
d <- d[,-c(3)] #Remove the Geneid and length columns, they get in the way
cS <- colSums(d) #Total mapped reads per sample
rpkm <- (10^9)*(d/l/cS)
``````

this do give me a result .. but i think is not correct.. 1st colSums(d) donot shows any result and gives error Error: could not find function "colsums". 2nd Are you using thje rite formula to calculate RPKM? shouldnt that be rpkm <- (10^9d)(l*cS)

1

Also note RPKM should count the kilobases of exonic sequence, the length of the spliced transcript, NOT the distance genomically from the transcription start-site to the end. This is more tricky to calculate, but it looks like that data is given in your input table.

I guess you already set the rownames, so you can ignore that part. The error that you received was because you typed `colsums` rather than `colSums`, don't forget the capital S. Whether one uses 10^9 or 10^6 depends on whether the lengths in bp or kb. I noted in my original reply that I was assuming that your lengths were actually already in kb (otherwise, everything is really short), but if not then you should indeed use 10^9 instead. The difference of 1000 is just from the conversion of bp to kb.

Edit: BTW, you can't multiply `l*cS`, one has length 2 and the other 3, so R will have no clue how you want to handle it. Also, you want division anyway, since it's `reads per kb per million mapped reads` (aka, `reads/length.in.kb/mapped.reads.in.millions`).

Hi Devon, I also trying to calculate RPKM from miRNA raw count. I tried your R code but i am getting error on this : cS <- colSums(d) #Total mapped reads per sample

my input looks like ( I hope you saw below in this conversation, don't mind) with miRNA_length included in the last coloum. I am new to R. So, will help me.

You presumably have a column of gene names. That won't work. You can figure out the rest from there.

Dear Devon,

Since you've mentioned `Edit: Note that the R code above isn't exactly correct, since the division won't occur properly by column.` can I just use the formula as `rpkm <- (cs*10^9)/(d*l)`?

Thank you.

1

No, you really just need to transpose some of the matrices: `t(t(d/l)/cS)`. I'll update my answer.

0
EagleEye6.5k wrote:
`````` P<-read.table("input_table.txt",sep="\t",header=T)

len <- ncol(p)

rownames(p) <- p[,1]

for(i in 2:ncol(p)-1)

{

d <-p[,i]

l <- p[,len] #accessing the length column

cS <- sum(as.numeric(p[,i])) #Total mapped reads per sample

rpkm[[i]] <- (10^9)*(as.numeric(p[,i]))/(as.numeric(l)*cS)

rpkm[] <- p[]

}

write.table(rpkm,"output_table_rpkm.txt",sep="\t",quote=F,row.names=F)​
``````

Hi eagleEye,

I am impressed on your R code for RPKM calculation. Actually am working in 7 same tissue miRNA-seq data. I have raw counts for all miRNAsfrom 7 samples from featureCounts. Now my input files is looks like

``````miRNA_ID    sample1     sample2  sample3    sample4    sample5    sample6   sample7
miR-185        5842        5884      656         52        578       4587       544
miR-200        5487        5268      584        585         57          6       485
``````

So what are the changes I should do in your R Code. I am new to R programming as well as data normalization.Thanks in advance.

you need a column in the end with lengths for corresponding RNA. That's all you need as input to above code.

Thank you EagleEye, So its not a problem about number of samples, right?

Any number of samples will work. Just make sure you are including lengths in the last column.

If I have not any length column, how can I compute RPKM, or how can I find out or generate length column.