Question: Calculate Gene Density Per Kb And Plot Density Over Position For All Scaffolds Of A Draft Genome Using R
1
6.8 years ago by
Saima10
United States
Saima10 wrote:

I am trying to calculate gene density per 1kb for all scaffolds of a draft genome with IRanges package in R. Based on related posts in biostar and other forums, I have come up with this so far:

test_scaffolds.bed :
scaffold_1 0 2500
scaffold_2 0 6001
test_genes.bed:
scaffold_1 5 900 geneA . -
scaffold_1 950 1850 geneB . -
scaffold_1 1100 2600 geneC . + scaffold_2 200 3200 geneD . +

``````library(IRanges)
library(rtracklayer)
genes<import("test_genes.bed",asRangedData=FALSE)
s<-import("test_scaffolds.bed",asRangedData=FALSE)#bed file with scaffold names, start and end cooordinates
seqlengths(s)<-width(s)
bins <- IRangesList(lapply(seqlengths(s),
function(seqlen)
IRanges(breakInChunks(seqlen,1000))))
``````

I am not sure how to get no. of each genes per each bin after this, I tried `countOverlaps(bins,genes)`, which gives me

scaffold_1 scaffold_2 2 2

which is not what I wanted. I will really appreciate any suggestions about improving the code to get the desired plot, and also about the best way to calculate the gene density in this case. Thanks!

gene R • 9.7k views
modified 4.8 years ago by archie90 • written 6.8 years ago by Saima10
11
6.8 years ago by
Irsan7.1k
Amsterdam
Irsan7.1k wrote:

Make your life easy and learn to use the popular ggplot2 R-package. You only need a flat text file with column1 = chromosome name and column2 = start position of the gene. Then do this:

``````# check if ggplot2 is installed, if so, load it,
# if not, install and load it
if("ggplot2" %in% rownames(installed.packages())){
library(ggplot2)
} else {
install.packages("ggplot2")
library(ggplot2)
}

# import a text file with gene positions
# columns should be: chr, position (no end or gene name required)

# make sure the chromosomes are ordered in the way you want
# them to appear in the plot
genes\$chr <- with(genes, factor(chr, levels=paste("chr",c(1:22,"X","Y"),sep=""), ordered=TRUE))

# make a density plot of genes over the provided chromosomes (or scaffolds ...)
plottedGenes <- ggplot(genes) + geom_histogram(aes(x=pos),binwidth=1000000) + facet_wrap(~chr,ncol=2) + ggtitle("RefSeq genes density over human genome 19") + xlab("Genomic position (bins 1 Mb)") + ylab("Number of genes")

# save it to an image
png("genes.png",width=1000,height=1500)
print(plottedGenes)
dev.off()
``````

See resulting image here (can anyone tell me why the attach image button in the rich text editor does not work for me? What can I do? Paste Images In Questions And Answers)

That's a pretty cool approach, thanks Irsan. Since, you have the gene density plotted, how do you manage to get (extract) the plotted data into a dataframe? I mean, I would like to save the values for each bin.

2

The ggplot()-function performs a statistical transformation (in this case binning) on the provided data-frame internally but does not save the transformed data. When you still want to get it out you can do this

``````allPlottingData <- ggplot_build(plottedGenes)
df <- allPlottingData[["data"]][[1]]
``````

But I am not sure if this is what you are looking for. If you have problems with binning genomic data you can also put your data in BED-format and use a combination of `bedtools makewindows` and `bedtools intersect` to calculate the amount of genes on certain genomic intervals

Thanks for your help, that's exactly what I was looking for ! This library is definitely a must !