Question

Acceptability of R dropping zeros from logged RNAseq RPKM data

8.7 years ago

ad ▴ 30

I have several groups of RNA-seq data that I'm trying to compare with each other using ggplot2 in R. The data consist of several columns of RPKM values, each column a different group of samples, e.g. column 1: gene 1 RPKMs in normal; column 2: gene 1 RPKMs in tumor; etc.

For example, using a small excerpt of the data:

library(ggplot2)
library(reshape2)  # melt() below is assumed to come from reshape2

df = read.table(text="G1 G1.1 G1.2 G1.3 G2 G2.1 G2.2 G2.3
     1    0   3    4    3   2    3    1
     2    NA   5    5    5   2    1    2
     2    NA   2    1    2   1    2    5", header=TRUE)

dfmelt<-melt(df)

ggplot(dfmelt, aes(variable, value, fill=variable)) +
  geom_boxplot() +
  theme(axis.text.x=element_text(angle=90))+
  scale_x_discrete(labels=c('C1','C2','C3','C4','C5','C6','C7','C8'))+
  scale_fill_manual(values=rep(c("red","green","blue","yellow"),2))+
  stat_summary(fun = median, geom = "point", position = position_dodge(width = .9))+  # fun replaces fun.y as of ggplot2 3.3.0
  scale_y_log10()

The problem occurs when I draw boxplots of the data in ggplot2 on a log10 y scale, which is necessary because of the data distribution. ggplot2 appears to simply drop the zero values, with the messages:

Removed x rows containing non-finite values (stat_boxplot)
Removed x rows containing missing values (stat_summary)

From what I've read, ggplot2 attempts to take the log of 0, gets -Inf, and drops the value. Is this a concern in RNA-seq expression analysis? If so, how do I best handle it to get the plot I want without distorting the data?
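The dropping behaviour described here can be reproduced without ggplot2. The following is a minimal base R sketch, assuming a made-up vector `rpkm` (hypothetical values, not taken from the data in this post), that shows why zeros and NAs both end up as non-finite after the log transform:

```r
# Hypothetical RPKM values for one gene across samples (illustration only)
rpkm <- c(0, 3, 4, 0, 12.5, NA)

# log10(0) is -Inf and log10(NA) is NA; neither is finite
logged <- log10(rpkm)

# Number of values a log-scaled plot would have to discard
n_nonfinite <- sum(!is.finite(logged))

# Values that remain available for the boxplot statistics
kept <- logged[is.finite(logged)]
```

With this input, the two zeros and the one NA are the entries that would be removed, so the boxplot summarises only the non-zero samples.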

RNA-Seq expression R • 2.5k views

Just add a small number to all values, e.g. 1.

8.7 years ago by Zhilong Jia ★ 2.2k
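A rough sketch of this pseudocount suggestion in base R, assuming a pseudocount of 1 and a small hypothetical data frame laid out like the first three columns of the excerpt in the question (the name `pseudocount` is introduced here for illustration only):

```r
# Small data frame mimicking the question's layout, with a zero and NAs
df <- data.frame(G1   = c(1, 2, 2),
                 G1.1 = c(0, NA, NA),
                 G1.2 = c(3, 5, 2))

pseudocount <- 1                    # assumed value, as suggested in the comment
df_log <- log10(df + pseudocount)   # zeros become log10(1) = 0 instead of -Inf

# Count finite values with and without the pseudocount
n_finite_before <- sum(is.finite(as.matrix(log10(df))))
n_finite_after  <- sum(is.finite(as.matrix(df_log)))
```

The transformed values in `df_log` could then be plotted on an ordinary linear y axis in place of `scale_y_log10()`. Note that a pseudocount of 1 shifts small RPKM values proportionally more than large ones, and NAs remain NA either way.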

Ram · Answer 1 · 2015-08-17

8.7 years ago

JC 13k

RPKM generally produces a lot of zero values; IMO it's better to use another metric such as CPM or CPK.
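For reference, CPM (counts per million) can be computed from raw counts by scaling each sample by its library size. This is a minimal base R sketch, assuming a hypothetical genes-by-samples matrix `counts` (raw counts are not part of the question's data):

```r
# Hypothetical raw read counts: rows are genes, columns are samples
counts <- matrix(c(10, 0, 90,
                   20, 30, 50),
                 nrow = 3,
                 dimnames = list(c("gene1", "gene2", "gene3"),
                                 c("normal", "tumor")))

# Library size is the total number of counted reads per sample
lib_size <- colSums(counts)

# Divide each column by its library size, then scale to one million
cpm <- sweep(counts, 2, lib_size, "/") * 1e6
```

A gene with zero reads still has a CPM of zero, so a log axis would still need a pseudocount or a similar adjustment for those values.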