Question

Rpkm (R Code)

2

Entering edit mode

10.1 years ago

HNK ▴ 150

Hello everyone I am trying to write a code in R to calculate RPKM. (I am totally new in R, so having problems). For example i have the following txt file with length as last column,

Geneid      Rep1       Rep2       gene Length
COMMD10      3        4        13
USP26        4        6        14
DDX17        6        7         20

and using the following formula to calculate RPKM

RPKM= 10^9 * number of mapped reads in gene
             ----------------------------------------------------
             total reads *  gene length

R code:

count <- read.delim("sample_count.txt", row.names=1, stringsAsFactors=FALSE)
head(count)

for (i in 1:2) {
  count[,i] <- (count[,i]*10^9) / sum(count[,i])   #(count[,i]*10^9) is  10^9 * number of mapped reads in gene
                                                   #sum(count[,i]) is total reads
}

I cant handle the gene length in last column of count table. Can anyone please help me with this code to get the correct answer?

rpkm • 13k views

ADD COMMENT • link updated 6.3 years ago by EagleEye 7.5k • written 10.1 years ago by HNK ▴ 150

Devon Ryan · Answer 1 · 2014-03-25

6

Entering edit mode

10.1 years ago

Devon Ryan 104k

Firstly, it's best to avoid for loops in R, particularly since you don't need them here. If the dataframe in your example were namedd:

row.names(d) <- d$Geneid
l <- d[,4] 
d <- d[,-c(1,4)] #Remove the Geneid and length columns, they get in the way
cS <- colSums(d) #Total mapped reads per sample
rpkm <- (10^6)*t(t(d/l)/cS)

I'm assuming the lengths are already in kb, so there's a 10^6 multiplier instead of 10^9.

Edit: The matrix transposition in the last step are due to cS needing to be applied to each column, rather than sequentially across the column-order-stored matrix.

ADD COMMENT • link 6.1 years ago by Devon Ryan 104k

0

Entering edit mode

Thankyou so much for the reply... I made some changes in the code as it was giving some problem

row.names(d) <- d$Geneid
l <- d[,3] #accessing the length column
d <- d[,-c(3)] #Remove the Geneid and length columns, they get in the way
cS <- colSums(d) #Total mapped reads per sample 
rpkm <- (10^9)*(d/l/cS)

this do give me a result .. but i think is not correct.. 1st colSums(d) donot shows any result and gives error Error: could not find function "colsums". 2nd Are you using thje rite formula to calculate RPKM? shouldnt that be rpkm <- (10^9d)(l*cS)

Kindly do reply

ADD REPLY • link updated 10.1 years ago by Devon Ryan 104k • written 10.1 years ago by HNK ▴ 150

1

Entering edit mode

Also note RPKM should count the kilobases of exonic sequence, the length of the spliced transcript, NOT the distance genomically from the transcription start-site to the end. This is more tricky to calculate, but it looks like that data is given in your input table.

ADD REPLY • link 10.1 years ago by karl.stamm 4.1k

0

Entering edit mode

I guess you already set the rownames, so you can ignore that part. The error that you received was because you typed colsums rather than colSums, don't forget the capital S. Whether one uses 10^9 or 10^6 depends on whether the lengths in bp or kb. I noted in my original reply that I was assuming that your lengths were actually already in kb (otherwise, everything is really short), but if not then you should indeed use 10^9 instead. The difference of 1000 is just from the conversion of bp to kb.

Edit: BTW, you can't multiply l*cS, one has length 2 and the other 3, so R will have no clue how you want to handle it. Also, you want division anyway, since it's reads per kb per million mapped reads (aka, reads/length.in.kb/mapped.reads.in.millions).

ADD REPLY • link 10.1 years ago by Devon Ryan 104k

0

Entering edit mode

Hi Devon, I also trying to calculate RPKM from miRNA raw count. I tried your R code but i am getting error on this : cS <- colSums(d) #Total mapped reads per sample

my input looks like ( I hope you saw below in this conversation, don't mind) with miRNA_length included in the last coloum. I am new to R. So, will help me.

ADD REPLY • link 6.2 years ago by k.kathirvel93 ▴ 300

0

Entering edit mode

You presumably have a column of gene names. That won't work. You can figure out the rest from there.

ADD REPLY • link 6.2 years ago by Devon Ryan 104k

0

Entering edit mode

Dear Devon,

Since you've mentioned Edit: Note that the R code above isn't exactly correct, since the division won't occur properly by column. can I just use the formula as rpkm <- (cs*10^9)/(d*l)?

Thank you.

ADD REPLY • link 6.1 years ago by vinayjrao ▴ 250

1

Entering edit mode

No, you really just need to transpose some of the matrices: t(t(d/l)/cS). I'll update my answer.

ADD REPLY • link 6.1 years ago by Devon Ryan 104k

Devon Ryan · Answer 2 · 2018-01-15

0

Entering edit mode

6.3 years ago

EagleEye 7.5k

 P<-read.table("input_table.txt",sep="\t",header=T)

len <- ncol(p)

rownames(p) <- p[,1]

for(i in 2:ncol(p)-1)

{

d <-p[,i]

l <- p[,len] #accessing the length column

cS <- sum(as.numeric(p[,i])) #Total mapped reads per sample 

rpkm[[i]] <- (10^9)*(as.numeric(p[,i]))/(as.numeric(l)*cS)

rpkm[[1]] <- p[[1]]

}

write.table(rpkm,"output_table_rpkm.txt",sep="\t",quote=F,row.names=F)

R script: http://bioinformaticsonline.com/snippets/view/28042/rpkm-normalization-r-script

PERL script: https://github.com/santhilalsubhash/rpkm_rnaseq_count

ADD COMMENT • link 6.3 years ago by EagleEye 7.5k

0

Entering edit mode

Hi eagleEye,

I am impressed on your R code for RPKM calculation. Actually am working in 7 same tissue miRNA-seq data. I have raw counts for all miRNAsfrom 7 samples from featureCounts. Now my input files is looks like

miRNA_ID    sample1     sample2  sample3    sample4    sample5    sample6   sample7
miR-185        5842        5884      656         52        578       4587       544
miR-200        5487        5268      584        585         57          6       485

So what are the changes I should do in your R Code. I am new to R programming as well as data normalization.Thanks in advance.

ADD REPLY • link updated 6.2 years ago by Devon Ryan 104k • written 6.2 years ago by k.kathirvel93 ▴ 300

0

Entering edit mode

you need a column in the end with lengths for corresponding RNA. That's all you need as input to above code.

ADD REPLY • link 6.2 years ago by EagleEye 7.5k

0

Entering edit mode

Thank you EagleEye, So its not a problem about number of samples, right?

ADD REPLY • link 6.2 years ago by k.kathirvel93 ▴ 300

0

Entering edit mode

Any number of samples will work. Just make sure you are including lengths in the last column.

ADD REPLY • link 6.2 years ago by EagleEye 7.5k

0

Entering edit mode

If I have not any length column, how can I compute RPKM, or how can I find out or generate length column.

ADD REPLY • link 4.4 years ago by ali.cse.bd • 0

0

Entering edit mode

Don't compute RPKM unless you have a compelling reason to do so. If you need to generate gene lengths then get the needed information from biomart or your GTF file.

ADD REPLY • link 4.4 years ago by Devon Ryan 104k