I have several groups of RNA-seq data that I'm trying to compare to each other with ggplot2 in R. The data consist of several columns of RPKM values, each column being a different group of samples, e.g. column 1: gene 1 RPKMs in normal, column 2: gene 1 RPKMs in tumor, etc.
For example, using a small excerpt of the data:
library(ggplot2)
library(reshape2)  # provides melt()
df = read.table(text="G1 G1.1 G1.2 G1.3 G2 G2.1 G2.2 G2.3
1 0 3 4 3 2 3 1
2 'NA' 5 5 5 2 1 2
2 'NA' 2 1 2 1 2 5", header=TRUE)
dfmelt <- melt(df)
ggplot(dfmelt, aes(variable, value, fill=variable)) +
  geom_boxplot() +
  theme(axis.text.x=element_text(angle=90)) +
  scale_x_discrete(labels=c('C1','C2','C3','C4','C5','C6','C7','C8')) +
  scale_fill_manual(values=rep(c("red","green","blue","yellow"),2)) +
  stat_summary(fun.y = median, geom = "point", position = position_dodge(width = .9)) +
  scale_y_log10()
The problem occurs when I draw boxplots of the data in ggplot2 on a log10 y scale, which is necessary because of the data distribution. ggplot2 appears to simply drop the zero values, with the messages
Removed x rows containing non-finite values (stat_boxplot)
Removed x rows containing missing values (stat_summary)
From what I've read, ggplot2 takes the log of 0, gets -Inf, and drops those rows. Is this a concern in RNA-seq expression analysis? If so, how do I best handle it to get the plot I want without distorting the data?
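To illustrate what I mean by the -Inf behaviour, here is a minimal base-R check. The `rpkm` vector is made up for illustration and is not taken from my data; I am assuming this is the same transformation that the log10 scale applies before the boxplot statistics are computed.

```r
# Hypothetical RPKM values, including a zero and a missing value
rpkm <- c(0, 1, 10, NA)

# log10 of zero is -Inf; NA stays NA
logged <- log10(rpkm)
print(logged)             # -Inf    0    1   NA

# Neither -Inf nor NA counts as finite, which matches the
# "non-finite values" wording in the message
print(is.finite(logged))  # FALSE  TRUE  TRUE FALSE
```

So both the zeros and the NA entries end up as non-finite after the transformation, which would explain why they are reported as removed.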