Note: This post was an answer to another Tutorial doing the same with ggplot. Moved here as suggested by others.
I had to create whole genome plots with data on them more than than a few times, and it's not easy to get it right! So I finally decided to create a package to make my life easier.
The package is called karyoploteR. It's in Bioconductor so you can find it in the package landing page where you'll find the vignette and other documentation. If you want more extensive documentation and a few more complex examples you can go to karyoploteR's examples and tutorial page. It uses R base graphics instead of ggplot so the way of building the plots is a bit different, but it shouldn't be a problem.
Since most of the code is encapsulated in the package functions you can skip most of the code and data you needed but I think it's almost as customizable as plots created from scratch.
To reproduce your example we'll only need the CNVkit data (in this case converted to a GenomicRanges with
sample_cns <- toGRanges( data.frame( chromosome = c("chr5", "chr19", "chr5", "chr9", "chr1", "chr11", "chr4", "chr17", "chr11", "chr1", "chr17", "chr5", "chr17", "chr17", "chr19", "chr1", "chr1", "chr22", "chr22", "chr17", "chr10", "chr11", "chr19", "chr2", "chr5"), start = c(131893016L, 4183350L, 112043414L, 15465517L, 27022894L, 108098351L, 13571634L, 41197694L, 108180886L, 6166339L, 48262862L, 112162804L, 78326739L, 11501815L, 2164183L, 97544531L, 97564044L, 41489008L, 41572250L, 37844086L, 123239094L, 118307227L, 36208920L, 140990754L, 56111400L), end = c(131978056L, 4224502L, 112179823L, 15465578L, 27107247L, 108236235L, 13629211L, 41276113L, 108236235L, 6240083L, 48278874L, 112179823L, 78367298L, 11872844L, 2229791L, 98386478L, 97771853L, 41574960L, 41574960L, 37884297L, 123353331L, 118392887L, 36229458L, 142888298L, 56189507L), cn = c(1L, 1L, 1L, 7L, 1L, 1L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 1L, 1L, 4L, 1L, 1L), gene = c("AC116366.7", "ANKRD24", "APC", "SNAPC3", "ARID1A", "ATM", "BOD1L1", "BRCA1", "C11orf65", "CHD5", "COL1A1", "CTC-554D6.1", "CTD-2047H16.4", "DNAH9", "DOT1L", "DPYD", "DPYD-AS1", "EP300", "EP300-AS1", "ERBB2", "FGFR2", "KMT2A", "KMT2B", "LRP1B", "MAP3K1"), stringsAsFactors=FALSE)) seqlevels(sample_cns) <- paste0("chr", c(1:22, "X", "Y")) sample_cns <- sort(sample_cns)
We don't need chromosome lengths or centromere positions, since it's included in the package for human (and a bunch of other organisms), but you can use it to plot custom genomes if you need it, for example for non-model organisms of with unfinished genomes.
To create the plot, we first create an empty plot with
plotKaryotypeand after that we just add data and non-data graphical elements such as base numbers, etc... In this case, we would need only four lines of code (the last one a bit long to show some of the customization options).
kp <- plotKaryotype() kpAddBaseNumbers(kp) cn_col <- ifelse(sample_cns$cn>2, "red", "blue") kpPlotMarkers(kp, data=sample_cns, labels=sample_cns$gene, label.color = cn_col, text.orientation = "horizontal", r1=0.7, cex=0.8, marker.parts = c(0.2, 0.7, 0.1))
And would get something like this
As you can see the cytobands are automatically added (this can be deactivated too) and we don't need to worry about basic chromosome data (length, names, etc...).
Once we have this we can change if in many different ways. For example, we can decide to plot the gains above the ideograms and the losses below them just changing a single parameter (
plot.type=2 to get two plotting areas) and adding a second call to
kp <- plotKaryotype(plot.type=2) kpAddBaseNumbers(kp) gains <- sample_cns[sample_cns$cn>2] losses <- sample_cns[sample_cns$cn<2] kpPlotMarkers(kp, data=gains, labels=gains$gene, label.color = "red", text.orientation = "horizontal", r1=0.8, cex=0.8, marker.parts = c(0.2, 0.7, 0.1), data.panel=1) kpPlotMarkers(kp, data=losses, labels=losses$gene, label.color = "blue", text.orientation = "horizontal", r1=0.8, cex=0.8, marker.parts = c(0.2, 0.7, 0.1), data.panel=2)
There are many other options and customizations possible, and it's actually capable of plotting almost anything on any genome.