Question: Acceptability of R dropping zeros from logged RNAseq RPKM data
ad30 (United States), 3.6 years ago, wrote:

I have several groups of RNA-seq data that I'm trying to compare with each other using ggplot2 in R. The data consist of several columns of RPKM values, each column a different group of samples, i.e. column 1: gene 1 RPKMs in normal, column 2: gene 1 RPKMs in tumor, etc.

For example, using a small excerpt of the data:

    library(ggplot2)
    library(reshape2)  # provides melt()

    # Small excerpt: one row per gene, one column per sample group
    df <- read.table(text = "G1 G1.1 G1.2 G1.3 G2 G2.1 G2.2 G2.3
        1    0    3    4    3   2    3    1
        2    NA   5    5    5   2    1    2
        2    NA   2    1    2   1    2    5", header = TRUE)

    # Long format: one row per (variable, value) pair
    dfmelt <- melt(df)

    ggplot(dfmelt, aes(variable, value, fill = variable)) +
      geom_boxplot() +
      theme(axis.text.x = element_text(angle = 90)) +
      scale_x_discrete(labels = c('C1','C2','C3','C4','C5','C6','C7','C8')) +
      scale_fill_manual(values = rep(c("red","green","blue","yellow"), 2)) +
      # 'fun' replaces 'fun.y' in ggplot2 >= 3.3.0; use fun.y on older versions
      stat_summary(fun = median, geom = "point", position = position_dodge(width = .9)) +
      scale_y_log10()

 

 

 

The problem occurs when I draw boxplots of the data in ggplot2 with a log10 y scale, which is necessary because of the data distribution. ggplot2 appears to simply drop zero values, with the messages:

    Removed x rows containing non-finite values (stat_boxplot)
    Removed x rows containing missing values (stat_summary)

From what I've read, ggplot2 takes the log of 0, gets -Inf, and drops the value. Is this a concern in RNA-seq expression analysis? If so, how do I best handle it to get the plot I want without distorting the data?
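
To see how many values a log scale would discard before plotting, a quick base R check along these lines may help. This is only a sketch: the vector `rpkm` is hypothetical toy data standing in for one column of RPKMs, not the real data set.

```r
# Toy values standing in for one column of RPKMs (hypothetical data)
rpkm <- c(1, 0, 3, 4, NA, 5, 2)

log_rpkm <- log10(rpkm)                   # log10(0) is -Inf, log10(NA) is NA

n_zero    <- sum(rpkm == 0, na.rm = TRUE) # values that become -Inf
n_na      <- sum(is.na(rpkm))             # values that were already missing
n_dropped <- sum(!is.finite(log_rpkm))    # values a log axis cannot draw

cat("zeros:", n_zero, " NAs:", n_na, " non-finite after log10:", n_dropped, "\n")
```

Counting zeros and NAs separately shows whether the dropped rows come from true zero expression or from values that were missing to begin with.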


Just add a small number (a pseudocount) to all values, e.g. 1.

written 3.6 years ago by Zhilong Jia
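
The pseudocount suggestion can be sketched in base R as follows. The values in `rpkm` are hypothetical, and the pseudocount of 1 is taken from the comment as an assumption; it is not a recommendation for every data set, since a larger pseudocount compresses differences among lowly expressed genes.

```r
# Hypothetical RPKM values, including a zero and a missing value
rpkm <- c(0, 1, 9, 99, NA)

# Assumption: pseudocount of 1, as suggested in the comment above
pseudocount <- 1

# Zeros map to log10(1) = 0 instead of -Inf; NA stays NA
log_rpkm <- log10(rpkm + pseudocount)
print(log_rpkm)
```

With ggplot2, the same transform can be applied inside the aesthetic mapping, e.g. `aes(variable, log10(value + 1))`, and plotted on an ordinary linear y axis instead of `scale_y_log10()`. The axis label should then say that the values are log10(RPKM + 1).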

JC (Mexico), 3.6 years ago, wrote:

RPKM generally produces a lot of zero values; IMO it's better to use another metric such as CPM or CPK.
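
For reference, CPM (counts per million) can be computed from raw read counts roughly as below. This is a sketch on a made-up `counts` matrix, assuming raw counts are available, which the question does not state. Note that a gene with zero reads is still zero in CPM, so a pseudocount or similar handling would still be needed before taking logs.

```r
# Hypothetical raw read counts: rows are genes, columns are samples
counts <- matrix(c(10,  0,  90,    # sample "normal"
                   50, 50, 900),   # sample "tumor"
                 nrow = 3,
                 dimnames = list(c("gene1", "gene2", "gene3"),
                                 c("normal", "tumor")))

# Divide each column by its library size, then scale to one million
lib_size <- colSums(counts)
cpm <- sweep(counts, 2, lib_size, "/") * 1e6

print(cpm)
```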

Related: 

Rpkm Calculation For Genes

http://blog.nextgenetics.net/?e=51
