Visualization Of Cnv With Genomic Coordinates In R Ggplot2 Or Ggbio
4
10
Entering edit mode
7.9 years ago

Hi all,

I came through a heatmap of CNVs from Illumina Genome Studio which has four samples (see attachment). On Y axis they have genomic coordinates and on X axis the samples. Red denotes amplification and blue denotes deletion. I was wondering how can we make similar heatmap in R for CNV data or expression data? I know a little basis of ggplot and ggbio but I don't know how to make heatmap with genomic coordinates on Y axis. There are fix length of each chromosomes defined say chr-1 is 250 million base-pairs (units), chr-2 is 240 M, chr-3 is 200 M etc. Now there are segments of copy number changes with start position and end position for which heatmap is to be made. Now for chr-1 of size proportional to 250M, we need red block at location proportional to base pair 23432-to-25925 and another red block at 34564-to-44572 etc... similarly for each chromosome.

CNV data is in segments:

chr start end copy-number-T1 copy-number-T2

chr1 23432 25925 4 3

chr1 34564 44572 5 5

chr1 78463 85634 3 4

chr2 1375364 1378364 1 2

chr2 1463723 1469367 4 6

chr2 1678573 1683642 2 5

etc....

r visualization copynumber • 11k views
0
Entering edit mode

if you could make a picture visible and provide an example of your data, hacking R code for you will be much easier

0
Entering edit mode

Sorry my image hosting site link broke. Updated the question with new image link. Thanks.

0
Entering edit mode

Thank you. I answered below. In my example each of 23 chromosomes is divided into 20 segments which are then colored by a random value. I plot these using scale_fill_identity (discrete) so labels are getting crowded and I don't know how to make them nice. If you can use continuous scale, then things will be easier.

5
Entering edit mode
7.9 years ago
Pavel Senin ★ 1.9k

I can help with something like this, but I don't know how to prettify the Y axis labels - all chromosomes in my example are of length 20, so labels are getting overcrowded and I disabled them.

-- edit: fixed labels

require(ggplot2)
require(reshape)

data <- data.frame(matrix(rnorm(2*20*4)-0.5, nrow=2*20, byrow=F))
names(data) <- paste("T",1:4,sep="")
data$chromosome=factor(1:(2*20),1:(2*20), labels=paste(rep(paste("chr",1:2,sep=""),each=20),1:20,sep=".")) data$labels=matrix(rbind(paste("chr",1:2,sep=""),
matrix(rep("",19*2),nrow=19)),byrow=T,nrow=20*2)

dm <- melt(data,id.var=c("chromosome","labels"))

dm$fill <- rep("white",2*20*4) dm$fill[dm$value < -0.5] <- "red" dm$fill[dm$value > 0.5] <- "blue" p <- ggplot(dm, aes(variable, chromosome, fill=fill)) + geom_tile() p + scale_fill_identity(expand=c(0,0),guide = "legend", labels=c("deletion","amplification","none")) + scale_y_discrete(breaks=dm$chromosome,labels=dm$labels) + ggtitle("Example for 2 chromosomes")  -- edit2, with your data, just changed two values to negatives: require(ggplot2) require(reshape) # four samples, here genome is linear, 2M long, each bin is 1000bp # matrix rows = bins, matrix columns = samples data <- matrix(rep(rep(0,2000000/1000),4),ncol=4) # the intervals from your sample intervals=data.frame(chromosome=c("chr1","chr1","chr1","chr2","chr2","chr2"), start=c(23432,34564,78463,1375364,1463723,1678573), end=c(25925,44572,85634,1378364,1469367,1683642), t1=c(4,5,3,-1,4,2), t2=c(3,5,4,2,-6,5)) # transfer the values from data frame into the matrix, probably can code a single function mark_t1 <- function(a,b,c){ data[ (a/1000) : (b/1000), 1] <<- c} apply(intervals[,c('start','end','t1')], 1 , function(x) mark_t1(x[1],x[2],x[3])) mark_t2 <- function(a,b,c){ data[ (a/1000) : (b/1000), 2] <<- c} apply(intervals[,c('start','end','t2')], 1 , function(x) mark_t2(x[1],x[2],x[3])) # converting the matrix into a data frame data <- as.data.frame(data) names(data) <- paste("T",1:4,sep="") # add bins column data$bin=seq(1,2000000,by=1000)
dm <- melt(data,id.var=c("bin"))

# color the bins according to their values
dm$fill <- "white" dm$fill[dm$value < -0.5] <- "red" dm$fill[dm$value > 0.5] <- "blue" # plot as red-white-blue rectangles, by using bins and continuous scale, it's easy to # place Chromosome or even Gene markers # p <- ggplot(dm, aes(variable, bin, fill=fill)) + geom_tile() p + scale_fill_identity(expand=c(0,0),guide = "legend", labels=c("amplification","deletion","none")) + scale_y_continuous(breaks=c(0,1400000,2000000),labels=c("Chr 1","Chr 2","Chr 3")) + ggtitle("Example for 2 chromosomes") + theme_bw() + theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank(), panel.border = element_blank(),panel.background = element_blank(), axis.title.x= element_blank(),axis.title.y = element_blank())  or just like a heatmap plot: #plot as white-to-blue heatmap p2 <- ggplot(dm, aes(variable, bin, fill=value)) + geom_tile() p2 + scale_y_continuous(breaks=c(0,1400000,2000000),labels=c("Chr 1","Chr 2","Chr 3")) + scale_fill_gradient(low = "white", high = "steelblue") + theme_bw() + ggtitle("Example for 2 chromosomes") + theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank(), panel.border = element_blank(),panel.background = element_blank(), axis.title.x= element_blank(),axis.title.y = element_blank())  ADD COMMENT 0 Entering edit mode Thanks you very much Senin. That is awesome. I am able to use this code to generate plot from simulated data. According to your code, there is a constant increment on Y-axis but in real data, there is a start position and end position given in data set. I have updated question with more data. ADD REPLY 0 Entering edit mode code might not be optimal, but have no more time, happy new year! ADD REPLY 0 Entering edit mode thank you very much Senin. Wish you happy new year and enjoy the time.... Will try out with the code and post my working code. ADD REPLY 0 Entering edit mode Have a look into geom_segment in stead of geom_tile to make the y-axis fit with start and end positions of your segments ADD REPLY 1 Entering edit mode 7.9 years ago PoGibas 5.0k Why not circos? Of course, if you want "classical" heatmap you can stick with your idea... Otherwise, CNV looks really nice in circos: Moreover, tutorials are very detailed and they have CNV recipe special for you. ADD COMMENT 0 Entering edit mode Circos is good and I actually have one per sample but it's not fitting with the requirement of visualization for this particular figure. By the way thanks for the CNV recipe link. ADD REPLY 0 Entering edit mode looks interesting and novel, i really like the rearrangements part. but how it'll look like for 10 samples? ADD REPLY 0 Entering edit mode Here is example how it looks for 2 samples (you just add an additional circle) http://genomebiology.com/2010/11/8/R82/figure/F1?highres=y ADD REPLY 0 Entering edit mode It looks good for 2-3 sample. But as you increase # of samples, the inner circle will become so small that it's almost difficult to compare with other samples. That is why I made individual circos for each sample. ADD REPLY 0 Entering edit mode 7.9 years ago Hi Senin, your code gave the exact visualization required. But when I tried with the full data set , chromosome sorting is unusual. chr1 is followed by chr10, chr11 etc (plot1). I got to learn that factor level can be explicitly specified so I tried that(plot2) but NAs are automatically added for each empty field in lable which I think are coming in the axis label. cnv.exp <- read.delim("output.txt", header=T) cnv.exp$sample1 = as.numeric(cnv.exp$sample1) cnv.exp$label = as.factor(cnv.exp$label) dm = melt(cnv.exp, id.var=c("chromosome","label")) dm$fill = rep("white",25055)

dm$fill[dm$value > 2] <- "red"

dm$fill[dm$value < 2] <- "blue"

p <- ggplot(dm, aes(variable, chromosome, fill=fill)) + geom_tile()

p + scale_fill_identity(expand=c(0,0),guide = "legend", label=c("deletion","amplification","none")) + scale_y_discrete(breaks=dm$chromosome,labels=dm$label) + ggtitle("plot1")

chromosome sample1 label

1 chr.1.1.761584 2 chr1

2 chr.1.761585.762902 3

3 chr.1.762903.803449 2

4 chr.1.803450.812182 3

5 chr.1.812183.861119 2

6 chr.1.861120.879961 3

str(cnv.exp)

'data.frame': 25055 obs. of 3 variables:

$chromosome: Factor w/ 25045 levels "chr.1.1.761584",..: 1 2443 2448 2474 2478 2513 2527 2526 2540 2547 ...$ sample1 : num 2 3 2 3 2 3 2 3 2 3 ...

$label : Factor w/ 23 levels "","chr1","chr10",..: 2 1 1 1 1 1 1 1 1 1 ... cnv.exp <- read.delim("output.txt", header=T) cnv.exp$sample1 = as.numeric(cnv.exp$sample1) cnv.exp$label = factor(cnv.exp$label, levels = c("chr.1","chr.2","chr.3", "chr.4","chr.5","chr.6", "chr7","chr8","chr9", "chr10","chr11","chr12", "chr13","chr14","chr15", "chr16","chr17","chr18", "chr19","chr20", "chr21","chr22"), labels = c("chr1","chr2","chr3","chr4","chr5", "chr6","chr7","chr8","chr9","chr10", "chr11","chr12","chr13","chr14","chr15", "chr16","chr17","chr18","chr19","chr20", "chr21","chr22")) dm = melt(cnv.exp, id.var=c("chromosome","label")) dm$fill = rep("white",25055)

dm$fill[dm$value > 2] <- "red"

dm$fill[dm$value < 2] <- "blue"

p <- ggplot(dm, aes(variable, chromosome, fill=fill)) + geom_tile()

p + scale_fill_identity(expand=c(0,0),guide = "legend", label=c("deletion","amplification","none")) + scale_y_discrete(breaks=dm$chromosome,labels=dm$label) + ggtitle("plot2")

chromosome sample1 label

1 chr.1.1.761584 2 <na>

2 chr.1.761585.762902 3 <na>

3 chr.1.762903.803449 2 <na>

4 chr.1.803450.812182 3 <na>

5 chr.1.812183.861119 2 <na>

6 chr.1.861120.879961 3 <na>

str(cnv.exp)

'data.frame': 25055 obs. of 3 variables:

$chromosome: Factor w/ 25045 levels "chr.1.1.761584",..: 1 2443 2448 2474 2478 2513 2527 2526 2540 2547 ...$ sample1 : num 2 3 2 3 2 3 2 3 2 3 ...

$label : Factor w/ 22 levels "chr1","chr2",..: NA NA NA NA NA NA NA NA NA NA ... with NA: ADD COMMENT 0 Entering edit mode Hi there: (1) I am not getting why is there so many labels on the Y axis, moreover, I don't know which scale discrete or continuous you are using. Nevertheless, for both scales, you can specify two vectors - labels and breaks which will make those pretty. (2) Since labels are not sorted right, I would assume that your scale is discrete, - there are two options available, one is to switch to continuous, second is to fix the factor - probably it doesn't work for you ADD REPLY 0 Entering edit mode I have updated the post with exact code and data structure. I got to learn that levels can be explicitly specified for factors but when I an doing so, NAs are added for empty fields which I think are coming up in the labels. ADD REPLY 0 Entering edit mode Hi there! Hope you had an awesome weekend. So, I would guess that there is a bit of misunderstanding between us. You see, when you call scale_y_discrete(breaks=dm$chromosome,labels=dm\$label) ggplot will populate Y scale labels with all the stuff you have in your dataframe -- bunch of labels where some were set, and some were not -- but not with what you want. i.e. problem is in the way data structures are populated and in the plotting instructions. If you will look on my example below edit2 - which I think is better than the first one - you'll find that the whole genome is imagined as a continuous canvas of an arbitrary width with length from 0 to the total length of sum of all chromosomes; then, each interval where some phenomenon occurs is imagined as a "tile" - with start, stop, and color; the plotting instruction places all colored tiles onto blank canvas; the labels along the canvas are then populated manually, not in automatic fashion - moreover, you can customize them the way you want. Let me know if you'd need more information.

0
Entering edit mode
2.9 years ago
Shixiang ▴ 80

copynumber package maybe a good option, please see http://127.0.0.1:11794/library/copynumber/doc/copynumber.pdf