Multiple Histograms In One Plot
9.4 years ago
Assa Yeroslaviz ★ 1.7k

Hi there,

I would like to combine several histograms into one plot, but keep the conditional coloring I am using in the single histograms.

This is how I create the single histograms:

tmp = hist(temp$insertSize, breaks=100, plot=FALSE)
hist(temp$insertSize, breaks=100, col=ifelse(tmp$breaks<=600, "blue", "red"), labels=TRUE, main=as.name(i))

This is what it looks like:

[image]

I would like to plot two data sets into one plot, but somehow keep the conditional coloring I am using in the one above. Is there something similar using barplot in R?

Thanks for the help,
Assa

r • 67k views

I prefer to use density plots, try this:

# Make dummy data
dat <- rnorm(1000)
extra_dat <- rnorm(1000)
# Plot
plot(density(dat), col="blue")
lines(density(extra_dat), col="red")

This is by far not what I want to show. I don't need the distribution, but the actual numbers.

The title of the question mentions histogram, so I assumed you wanted the distribution.

Note how taking the actual numbers from a histogram can be misleading. These numbers depend heavily on the size of the histogram bins, and if the bins are too large you risk merging two or more different distributions.

Passing add=T in the next call to hist() will add the second histogram to the same plotting area. But so far this does not look like a good method of displaying the distribution. I'd consider either removing the 0 size inserts, using kernel density estimates, or transforming your data (or some combination of the three).

No, I can't, as the 0's are important for the results (it's a long story) :-)

You can still make a useful plot though, e.g. with a broken y-axis that jumps from ~50 to ~2100, unless the only point of the plot is to emphasise that there are a lot of 0s.

OK, I can do that. But what about combining the histograms together?

add=T in your subsequent call to hist()

With add=T it creates a stacked barplot. I would like to have the bars next to each other for each of the groups/data sets.

No, it overplots a second histogram on the same axes (the bars aren't stacked, just plotted on top of each other). It seems what you're really asking for is just a simple barplot (not a histogram) with beside=TRUE.

Important or not, this plot in this shape doesn't say much. I suggest cutting the y-axis with ylim=c(0,100) and adding a text box to show the number of zero values:

[image]

Well, to be honest, it does! The idea behind it is to show that some data sets have very few hits in the bigger bins (15000 onwards). Other data sets show a lot more hits on the right-hand side. This is why I would like to plot them together, but keep the colors (or use completely different colors).

As it stands this is an R programming question. Please explain the relevance to a bioinformatics research problem.

9.4 years ago
Woa ★ 2.9k

You can use ggplot2 to make a histogram of this kind. The data file should have two columns: the values to be binned (say 'dat') and a category such as "A" or "B" that determines how many histograms are drawn (say 'catg').

library("ggplot2")
my.df <- read.table("data_category.txt", header=TRUE)
# geom_histogram bins the continuous 'dat' column; position="dodge" puts the categories side by side
ggplot(my.df, aes(x=dat, color=catg, fill=catg)) + geom_histogram(position="dodge")

9.4 years ago

I would like to elaborate a bit on Woa's answer. Let's imagine you have the following dataset:

> set.seed(2)
> d = data.frame("B1"=rnorm(100), "B2"=rnorm(100), "B3"=rnorm(100), "B4"=rnorm(100), "B5"=rnorm(100), "B6"=rnorm(100), "B7"=rnorm(100), "B8"=rnorm(100))
> d$id = row.names(d)
> head(d)
           B1         B2         B3         B4         B5         B6         B7          B8 id
1 -0.89691455  1.0744594  0.2979836 -0.3181198 -0.2140756 -0.4597894 -1.1150718  1.23874433  1
2  0.18484918  0.2605978 -1.0195522 -0.3154903 -2.7218162  0.6179261 -0.1142184  0.23189621  2
3  1.58784533 -0.3142720  2.8708974  0.8843223 -1.0142618 -0.7204224 -0.8946214 -0.31443788  3
4 -1.13037567 -0.7496301  0.2187100 -1.8854213 -0.8291451 -0.5835119 -0.6540889  1.49970370  4
5 -0.08025176 -0.8621983 -0.9665543  0.7321793  0.8577089  0.2163245  1.1787163  0.06957437  5
6  0.13242028  2.0480403  0.3838382  0.7905447 -0.2385101  1.2449912  0.9515165  1.33403372  6

To plot a histogram of a column using ggplot, you can use the qplot function:

> qplot(B1, data=d, geom='histogram')

To plot multiple histograms, you can add a geom_histogram for each property:

> qplot(B1, data=d, geom='histogram', fill=I('green')) + geom_histogram(aes(B2), data=d, fill='red')

Since it would be impractical to add a new geom_histogram for each column, you can melt the dataframe, transforming it to a long format:

> library(reshape2)
> d.long = melt(d, id.vars='id')
> head(d.long)
  id variable       value
1  1       B1 -0.89691455
2  2       B1  0.18484918
3  3       B1  1.58784533
4  4       B1 -1.13037567
5  5       B1 -0.08025176
6  6       B1  0.13242028

Note how the long format is structured. All the values are stored in the "value" column. The "variable" column keeps track of the original columns. Each data point is also identified by a unique id.
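As a rough illustration of the same reshaping without any add-on package, here is a sketch using base R's stack(). The two-column data frame and the renaming to id/variable/value are assumptions chosen to mirror the melted output above; they are not part of the original answer.

```r
set.seed(2)
# Small stand-in for the wide data frame (two columns instead of eight)
d <- data.frame(B1 = rnorm(100), B2 = rnorm(100))
d$id <- row.names(d)

# stack() concatenates the selected columns into 'values' and records
# the source column name in the factor 'ind'
stacked <- stack(d[, c("B1", "B2")])

# Rebuild the id / variable / value layout; the ids repeat once per stacked column
d.long <- data.frame(
  id       = rep(d$id, times = 2),
  variable = stacked$ind,
  value    = stacked$values
)

head(d.long)
```

Each original row now appears once per measured column, which is why the long table has 200 rows for 100 ids.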

Transforming your dataset to a long format is an essential step for plotting multiple distributions together. Many R packages and functions, such as ggplot2 and anova, assume that your data is in the long format. Now that you have a dataset in the long format, you can plot all the histograms in a single statement:

> qplot(value, fill=variable, data=d.long, geom='histogram')
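In recent ggplot2 releases (3.4.0 onward) qplot() is marked as deprecated, so it may help to see roughly the same plot written with ggplot(). This is only a sketch: the three-column stand-in data, the object names p_stacked and p_overlaid, and the choice of 30 bins are assumptions for illustration.

```r
library(ggplot2)

set.seed(2)
# Small stand-in for the long-format data frame (three columns instead of eight)
wide <- data.frame(B1 = rnorm(100), B2 = rnorm(100), B3 = rnorm(100))
stacked <- stack(wide)
d.long <- data.frame(variable = stacked$ind, value = stacked$values)

# Roughly the qplot() call above: one stacked histogram,
# filled according to the original column name
p_stacked <- ggplot(d.long, aes(x = value, fill = variable)) +
  geom_histogram(bins = 30)

# Overlaid instead of stacked: position = "identity" draws every group
# from zero, and alpha keeps the overlapping bars visible
p_overlaid <- ggplot(d.long, aes(x = value, fill = variable)) +
  geom_histogram(bins = 30, position = "identity", alpha = 0.4)

print(p_stacked)
print(p_overlaid)
```

The default position for geom_histogram is stacking, so the bar heights in the first plot are cumulative across columns; the overlaid version is easier to read when you want to compare the shapes.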

If you look in the documentation for geom_histogram, you will see that there are many ways to arrange the histograms. For example, you can use position='dodge' to put all the values separately:

> qplot(value, fill=variable, data=d.long, position='dodge')

In my opinion, if there are too many columns, it is better to use the density geom instead of the histogram, using a degree of transparency:

> qplot(value, fill=variable, data=d.long, geom='density')

If there are too many columns, one alternative is to plot some histograms on the negative y axis:

> qplot(value, fill=variable, data=subset(d.long, variable %in% c("B1", "B2", "B3", "B4")), position='dodge', geom='density', alpha=I(0.2)) + geom_density(aes(y=-..density..), data=subset(d.long, variable %in% c("B5", "B6", "B7", "B8")))

# histogram version:
> qplot(value, fill=variable, data=subset(d.long, variable %in% c("B1", "B2", "B3", "B4")), position='dodge', geom='histogram', alpha=I(0.2)) + geom_histogram(aes(y=-..count..), position='dodge', data=subset(d.long, variable %in% c("B5", "B6", "B7", "B8")))

Finally, another approach is to use faceting to plot each property in a different panel:

> qplot(value, fill=variable, facets=~variable, data=d.long)


I think this adds to the confusion: there's no such thing as a dodged/beside histogram; it's a bar chart.


VERY good comment. Thanks for sharing!


Your post was pretty useful. Thank you very much! Just to add a little info: to make the plot look transparent, I used the alpha argument:

qplot(value, fill=variable, alpha=I(.5), data=d.long, geom='density')
9.4 years ago
Woa ★ 2.9k

Alternatively you can use R's transparent color scheme:

p1 <- hist(rnorm(500,4))                     # centered at 4
p2 <- hist(rnorm(500,6))                     # centered at 6
plot( p1, col=rgb(0,0,1,1/4), xlim=c(0,10))  # first histogram
plot( p2, col=rgb(1,0,0,1/4), xlim=c(0,10), add=T)  # second
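The same idea of computing the hist() objects first can also give the side-by-side bars asked for in the question, along the lines of the beside=TRUE comment. This is a minimal sketch under stated assumptions: the simulated vectors a and b stand in for two insert-size data sets, the bin width of 100 and the colors are arbitrary, and the 600 cutoff is taken from the question.

```r
set.seed(1)
# Hypothetical insert sizes for two data sets (stand-ins for temp$insertSize)
a <- c(rexp(500, rate = 1 / 300), runif(50, 600, 2000))
b <- c(rexp(500, rate = 1 / 800), runif(50, 600, 2000))

# Shared breaks so that the bins of both data sets line up
upper  <- ceiling(max(a, b) / 100) * 100
breaks <- seq(0, upper, by = 100)
ha <- hist(a, breaks = breaks, plot = FALSE)
hb <- hist(b, breaks = breaks, plot = FALSE)

# One row per data set, one column per bin (labelled by the left bin edge)
counts <- rbind(A = ha$counts, B = hb$counts)
colnames(counts) <- head(breaks, -1)

# Conditional coloring per bin, with a lighter shade for the second data set
low  <- tail(breaks, -1) <= 600
cols <- rbind(ifelse(low, "blue", "red"),
              ifelse(low, "lightblue", "pink"))

# With beside = TRUE the bars are drawn bin by bin (A, B, A, B, ...),
# which matches the column-major order of as.vector(cols)
barplot(counts, beside = TRUE, col = as.vector(cols), las = 2,
        xlab = "insert size (left bin edge)", ylab = "count")
legend("topright", legend = c("A <= 600", "A > 600", "B <= 600", "B > 600"),
       fill = c("blue", "red", "lightblue", "pink"))
```

Because both histograms use the same breaks, the counts are directly comparable bin by bin, which overplotting with add=T does not guarantee.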

You can play a little with different color schemes and transparency (alpha), which you probably already know:

rgb(red=188, green=143, blue=143, alpha=90, maxColorValue=255)  # Rosy Brown
[1] "#BC8F8F5A"
rgb(red=199, green=21, blue=133, alpha=90, maxColorValue=255)  # Medium Violet Red
[1] "#C715855A"
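If you would rather start from a named color than from RGB values, adjustcolor() from grDevices can add the transparency for you. A small sketch, assuming the built-in color names "rosybrown" and "mediumvioletred" and the same alpha as above (90 out of 255); the variable names are just for illustration.

```r
# Named colors with an alpha of 90/255 applied
rosy   <- adjustcolor("rosybrown", alpha.f = 90 / 255)
violet <- adjustcolor("mediumvioletred", alpha.f = 90 / 255)

set.seed(1)
# Compute the histograms without drawing them yet
p1 <- hist(rnorm(500, 4), plot = FALSE)
p2 <- hist(rnorm(500, 6), plot = FALSE)

# Draw the first histogram, then overlay the second on the same axes
plot(p1, col = rosy, xlim = c(0, 10), main = "Two overlaid histograms")
plot(p2, col = violet, xlim = c(0, 10), add = TRUE)
```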