Calculate Gene Density Per Kb And Plot Density Over Position For All Scaffolds Of A Draft Genome Using R
1
1
Entering edit mode
11.0 years ago
Saima ▴ 10

I am trying to calculate gene density per 1kb for all scaffolds of a draft genome with IRanges package in R. Based on related posts in biostar and other forums, I have come up with this so far:

test_scaffolds.bed :
scaffold_1 0 2500
scaffold_2 0 6001
test_genes.bed:
scaffold_1 5 900 geneA . -
scaffold_1 950 1850 geneB . -
scaffold_1 1100 2600 geneC . + scaffold_2 200 3200 geneD . +

library(IRanges)
library(rtracklayer)
genes<import("test_genes.bed",asRangedData=FALSE)
s<-import("test_scaffolds.bed",asRangedData=FALSE)#bed file with scaffold names, start and end cooordinates 
seqlengths(s)<-width(s)
bins <- IRangesList(lapply(seqlengths(s),
                           function(seqlen)
                             IRanges(breakInChunks(seqlen,1000))))

I am not sure how to get no. of each genes per each bin after this, I tried countOverlaps(bins,genes), which gives me

scaffold_1 scaffold_2 2 2

which is not what I wanted. I will really appreciate any suggestions about improving the code to get the desired plot, and also about the best way to calculate the gene density in this case. Thanks!

r gene • 14k views
ADD COMMENT
23
Entering edit mode
11.0 years ago
Irsan ★ 7.8k

Make your life easy and learn to use the popular ggplot2 R-package. You only need a flat text file with column1 = chromosome name and column2 = start position of the gene. Then do this:

# check if ggplot2 is installed, if so, load it, 
# if not, install and load it
if("ggplot2" %in% rownames(installed.packages())){
    library(ggplot2)
} else {
    install.packages("ggplot2")
    library(ggplot2)
}

# import a text file with gene positions
# columns should be: chr, position (no end or gene name required)
genes <- read.table("genes.txt",sep="\t",header=T)

# make sure the chromosomes are ordered in the way you want
# them to appear in the plot
genes$chr <- with(genes, factor(chr, levels=paste("chr",c(1:22,"X","Y"),sep=""), ordered=TRUE))

# make a density plot of genes over the provided chromosomes (or scaffolds ...)
plottedGenes <- ggplot(genes) + geom_histogram(aes(x=pos),binwidth=1000000) + facet_wrap(~chr,ncol=2) + ggtitle("RefSeq genes density over human genome 19") + xlab("Genomic position (bins 1 Mb)") + ylab("Number of genes")

# save it to an image
png("genes.png",width=1000,height=1500)
print(plottedGenes)
dev.off()

See resulting image here (can anyone tell me why the attach image button in the rich text editor does not work for me? What can I do? Paste Images In Questions And Answers)

ADD COMMENT
0
Entering edit mode

That's a pretty cool approach, thanks Irsan. Since, you have the gene density plotted, how do you manage to get (extract) the plotted data into a dataframe? I mean, I would like to save the values for each bin.

ADD REPLY
3
Entering edit mode

The ggplot()-function performs a statistical transformation (in this case binning) on the provided data-frame internally but does not save the transformed data. When you still want to get it out you can do this

allPlottingData <- ggplot_build(plottedGenes)
df <- allPlottingData[["data"]][[1]]

But I am not sure if this is what you are looking for. If you have problems with binning genomic data you can also put your data in BED-format and use a combination of bedtools makewindows and bedtools intersect to calculate the amount of genes on certain genomic intervals

ADD REPLY
0
Entering edit mode

Thanks for your help, that's exactly what I was looking for ! This library is definitely a must !

ADD REPLY
0
Entering edit mode

Thanks a lot for the ggplot solution, it is much better and easier than what I was attempting, and its great that I can extract the plotted data!

ADD REPLY

Login before adding your answer.

Traffic: 2533 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6