Question: Visualization Of Cnv With Genomic Coordinates In R Ggplot2 Or Ggbio
9
gravatar for pirates.of.the.genome
3.9 years ago by
pirates.of.the.genome90 wrote:

Hi all,

I came through a heatmap of CNVs from Illumina Genome Studio which has four samples (see attachment). On Y axis they have genomic coordinates and on X axis the samples. Red denotes amplification and blue denotes deletion. I was wondering how can we make similar heatmap in R for CNV data or expression data? I know a little basis of ggplot and ggbio but I don't know how to make heatmap with genomic coordinates on Y axis. There are fix length of each chromosomes defined say chr-1 is 250 million base-pairs (units), chr-2 is 240 M, chr-3 is 200 M etc. Now there are segments of copy number changes with start position and end position for which heatmap is to be made. Now for chr-1 of size proportional to 250M, we need red block at location proportional to base pair 23432-to-25925 and another red block at 34564-to-44572 etc... similarly for each chromosome.

CNV data is in segments:

chr start end copy-number-T1 copy-number-T2

chr1 23432 25925 4 3

chr1 34564 44572 5 5

chr1 78463 85634 3 4

chr2 1375364 1378364 1 2

chr2 1463723 1469367 4 6

chr2 1678573 1683642 2 5

etc....

Thanks in advance.

CNV heatmap

R copynumber visualization • 5.6k views
ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by pirates.of.the.genome90

if you could make a picture visible and provide an example of your data, hacking R code for you will be much easier

ADD REPLYlink written 3.9 years ago by Pavel Senin1.8k

Sorry my image hosting site link broke. Updated the question with new image link. Thanks.

ADD REPLYlink written 3.9 years ago by pirates.of.the.genome90

Thank you. I answered below. In my example each of 23 chromosomes is divided into 20 segments which are then colored by a random value. I plot these using scale_fill_identity (discrete) so labels are getting crowded and I don't know how to make them nice. If you can use continuous scale, then things will be easier.

ADD REPLYlink written 3.9 years ago by Pavel Senin1.8k
4
gravatar for Pavel Senin
3.9 years ago by
Pavel Senin1.8k
Los Alamos, NM
Pavel Senin1.8k wrote:

I can help with something like this, but I don't know how to prettify the Y axis labels - all chromosomes in my example are of length 20, so labels are getting overcrowded and I disabled them.

-- edit: fixed labels

enter image description here

require(ggplot2)
require(reshape)

data <- data.frame(matrix(rnorm(2*20*4)-0.5, nrow=2*20, byrow=F))
names(data) <- paste("T",1:4,sep="")
data$chromosome=factor(1:(2*20),1:(2*20),
  labels=paste(rep(paste("chr",1:2,sep=""),each=20),1:20,sep="."))
data$labels=matrix(rbind(paste("chr",1:2,sep=""),
  matrix(rep("",19*2),nrow=19)),byrow=T,nrow=20*2)

dm <- melt(data,id.var=c("chromosome","labels"))

dm$fill <- rep("white",2*20*4)
dm$fill[dm$value < -0.5] <- "red"
dm$fill[dm$value > 0.5] <- "blue"

p <- ggplot(dm, aes(variable, chromosome, fill=fill)) + geom_tile()
p + scale_fill_identity(expand=c(0,0),guide = "legend",
                        labels=c("deletion","amplification","none")) +
  scale_y_discrete(breaks=dm$chromosome,labels=dm$labels) +
  ggtitle("Example for 2 chromosomes")

-- edit2, with your data, just changed two values to negatives:

require(ggplot2)
require(reshape)

# four samples, here genome is linear, 2M long, each bin is 1000bp
# matrix rows = bins, matrix columns = samples
data <- matrix(rep(rep(0,2000000/1000),4),ncol=4)

# the intervals from your sample
intervals=data.frame(chromosome=c("chr1","chr1","chr1","chr2","chr2","chr2"),
                    start=c(23432,34564,78463,1375364,1463723,1678573),
                    end=c(25925,44572,85634,1378364,1469367,1683642),
                    t1=c(4,5,3,-1,4,2),
                    t2=c(3,5,4,2,-6,5))

# transfer the values from data frame into the matrix, probably can code a single function
mark_t1 <- function(a,b,c){ data[ (a/1000) : (b/1000), 1] <<- c}
apply(intervals[,c('start','end','t1')], 1 , function(x) mark_t1(x[1],x[2],x[3]))

mark_t2 <- function(a,b,c){ data[ (a/1000) : (b/1000), 2] <<- c}
apply(intervals[,c('start','end','t2')], 1 , function(x) mark_t2(x[1],x[2],x[3]))

# converting the matrix into a data frame
data <- as.data.frame(data)
names(data) <- paste("T",1:4,sep="")

# add bins column
data$bin=seq(1,2000000,by=1000)
dm <- melt(data,id.var=c("bin"))

# color the bins according to their values
dm$fill <- "white"
dm$fill[dm$value < -0.5] <- "red"
dm$fill[dm$value > 0.5] <- "blue"

# plot as red-white-blue rectangles, by using bins and continuous scale, it's easy to
# place Chromosome or even Gene markers
#
p <- ggplot(dm, aes(variable, bin, fill=fill)) + geom_tile()
p + scale_fill_identity(expand=c(0,0),guide = "legend",
                        labels=c("amplification","deletion","none")) +
  scale_y_continuous(breaks=c(0,1400000,2000000),labels=c("Chr 1","Chr 2","Chr 3")) +
  ggtitle("Example for 2 chromosomes") +
  theme_bw() +
  theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank(),
       panel.border = element_blank(),panel.background = element_blank(),
        axis.title.x= element_blank(),axis.title.y = element_blank())

enter image description here

or just like a heatmap plot:

#plot as white-to-blue heatmap
p2 <- ggplot(dm, aes(variable, bin, fill=value)) + geom_tile()
p2 + scale_y_continuous(breaks=c(0,1400000,2000000),labels=c("Chr 1","Chr 2","Chr 3")) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  theme_bw() + ggtitle("Example for 2 chromosomes") +
  theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank(),
        panel.border = element_blank(),panel.background = element_blank(),
        axis.title.x= element_blank(),axis.title.y = element_blank())

enter image description here

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Pavel Senin1.8k

Thanks you very much Senin. That is awesome. I am able to use this code to generate plot from simulated data. According to your code, there is a constant increment on Y-axis but in real data, there is a start position and end position given in data set. I have updated question with more data.

ADD REPLYlink written 3.9 years ago by pirates.of.the.genome90

code might not be optimal, but have no more time, happy new year!

ADD REPLYlink written 3.9 years ago by Pavel Senin1.8k

thank you very much Senin. Wish you happy new year and enjoy the time.... Will try out with the code and post my working code.

ADD REPLYlink written 3.9 years ago by pirates.of.the.genome90

Have a look into geom_segment in stead of geom_tile to make the y-axis fit with start and end positions of your segments

ADD REPLYlink written 3.9 years ago by Irsan6.4k
1
gravatar for PoGibas
3.9 years ago by
PoGibas4.7k
Vilnius
PoGibas4.7k wrote:

Why not circos? Of course, if you want "classical" heatmap you can stick with your idea...
Otherwise, CNV looks really nice in circos:

enter image description here

Moreover, tutorials are very detailed and they have CNV recipe special for you.

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by PoGibas4.7k

Circos is good and I actually have one per sample but it's not fitting with the requirement of visualization for this particular figure. By the way thanks for the CNV recipe link.

ADD REPLYlink written 3.9 years ago by pirates.of.the.genome90

looks interesting and novel, i really like the rearrangements part. but how it'll look like for 10 samples?

ADD REPLYlink written 3.9 years ago by Pavel Senin1.8k

Here is example how it looks for 2 samples (you just add an additional circle) http://genomebiology.com/2010/11/8/R82/figure/F1?highres=y

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by PoGibas4.7k

It looks good for 2-3 sample. But as you increase # of samples, the inner circle will become so small that it's almost difficult to compare with other samples. That is why I made individual circos for each sample.

ADD REPLYlink written 3.9 years ago by pirates.of.the.genome90
0
gravatar for pirates.of.the.genome
3.9 years ago by
pirates.of.the.genome90 wrote:

Hi Senin, your code gave the exact visualization required. But when I tried with the full data set , chromosome sorting is unusual. chr1 is followed by chr10, chr11 etc (plot1). I got to learn that factor level can be explicitly specified so I tried that(plot2) but NAs are automatically added for each empty field in lable which I think are coming in the axis label.

cnv.exp <- read.delim("output.txt", header=T)

cnv.exp$sample1 = as.numeric(cnv.exp$sample1)

cnv.exp$label = as.factor(cnv.exp$label)

dm = melt(cnv.exp, id.var=c("chromosome","label"))

dm$fill = rep("white",25055)

dm$fill[dm$value > 2] <- "red"

dm$fill[dm$value < 2] <- "blue"

p <- ggplot(dm, aes(variable, chromosome, fill=fill)) + geom_tile()

p + scale_fill_identity(expand=c(0,0),guide = "legend", label=c("deletion","amplification","none")) + scale_y_discrete(breaks=dm$chromosome,labels=dm$label) + ggtitle("plot1")

head(cnv.exp)

chromosome sample1 label

1 chr.1.1.761584 2 chr1

2 chr.1.761585.762902 3

3 chr.1.762903.803449 2

4 chr.1.803450.812182 3

5 chr.1.812183.861119 2

6 chr.1.861120.879961 3

str(cnv.exp)

'data.frame': 25055 obs. of 3 variables:

$ chromosome: Factor w/ 25045 levels "chr.1.1.761584",..: 1 2443 2448 2474 2478 2513 2527 2526 2540 2547 ...

$ sample1 : num 2 3 2 3 2 3 2 3 2 3 ...

$ label : Factor w/ 23 levels "","chr1","chr10",..: 2 1 1 1 1 1 1 1 1 1 ...

without NA

cnv.exp <- read.delim("output.txt", header=T)

cnv.exp$sample1 = as.numeric(cnv.exp$sample1)

cnv.exp$label = factor(cnv.exp$label, levels = c("chr.1","chr.2","chr.3", "chr.4","chr.5","chr.6", "chr7","chr8","chr9", "chr10","chr11","chr12", "chr13","chr14","chr15", "chr16","chr17","chr18", "chr19","chr20", "chr21","chr22"), labels = c("chr1","chr2","chr3","chr4","chr5", "chr6","chr7","chr8","chr9","chr10", "chr11","chr12","chr13","chr14","chr15", "chr16","chr17","chr18","chr19","chr20", "chr21","chr22"))

dm = melt(cnv.exp, id.var=c("chromosome","label"))

dm$fill = rep("white",25055)

dm$fill[dm$value > 2] <- "red"

dm$fill[dm$value < 2] <- "blue"

p <- ggplot(dm, aes(variable, chromosome, fill=fill)) + geom_tile()

p + scale_fill_identity(expand=c(0,0),guide = "legend", label=c("deletion","amplification","none")) + scale_y_discrete(breaks=dm$chromosome,labels=dm$label) + ggtitle("plot2")

head(cnv.exp)

chromosome sample1 label

1 chr.1.1.761584 2 <na>

2 chr.1.761585.762902 3 <na>

3 chr.1.762903.803449 2 <na>

4 chr.1.803450.812182 3 <na>

5 chr.1.812183.861119 2 <na>

6 chr.1.861120.879961 3 <na>

str(cnv.exp)

'data.frame': 25055 obs. of 3 variables:

$ chromosome: Factor w/ 25045 levels "chr.1.1.761584",..: 1 2443 2448 2474 2478 2513 2527 2526 2540 2547 ...

$ sample1 : num 2 3 2 3 2 3 2 3 2 3 ...

$ label : Factor w/ 22 levels "chr1","chr2",..: NA NA NA NA NA NA NA NA NA NA ...

with NA:

enter image description here

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by pirates.of.the.genome90

Hi there: (1) I am not getting why is there so many labels on the Y axis, moreover, I don't know which scale discrete or continuous you are using. Nevertheless, for both scales, you can specify two vectors - labels and breaks which will make those pretty. (2) Since labels are not sorted right, I would assume that your scale is discrete, - there are two options available, one is to switch to continuous, second is to fix the factor - probably it doesn't work for you

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Pavel Senin1.8k

I have updated the post with exact code and data structure. I got to learn that levels can be explicitly specified for factors but when I an doing so, NAs are added for empty fields which I think are coming up in the labels.

ADD REPLYlink written 3.9 years ago by pirates.of.the.genome90

Hi there! Hope you had an awesome weekend. So, I would guess that there is a bit of misunderstanding between us. You see, when you call scale_y_discrete(breaks=dm$chromosome,labels=dm$label) ggplot will populate Y scale labels with all the stuff you have in your dataframe -- bunch of labels where some were set, and some were not -- but not with what you want. i.e. problem is in the way data structures are populated and in the plotting instructions. If you will look on my example below edit2 - which I think is better than the first one - you'll find that the whole genome is imagined as a continuous canvas of an arbitrary width with length from 0 to the total length of sum of all chromosomes; then, each interval where some phenomenon occurs is imagined as a "tile" - with start, stop, and color; the plotting instruction places all colored tiles onto blank canvas; the labels along the canvas are then populated manually, not in automatic fashion - moreover, you can customize them the way you want. Let me know if you'd need more information.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Pavel Senin1.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 944 users visited in the last hour