I am trying to calculate gene density per 1kb for all scaffolds of a draft genome with IRanges package in R. Based on related posts in biostar and other forums, I have come up with this so far:
test_scaffolds.bed :
scaffold_1 0 2500
scaffold_2 0 6001
test_genes.bed:
scaffold_1 5 900 geneA . -
scaffold_1 950 1850 geneB . -
scaffold_1 1100 2600 geneC . +
scaffold_2 200 3200 geneD . +
library(IRanges)
library(rtracklayer)
genes<import("test_genes.bed",asRangedData=FALSE)
s<-import("test_scaffolds.bed",asRangedData=FALSE)#bed file with scaffold names, start and end cooordinates
seqlengths(s)<-width(s)
bins <- IRangesList(lapply(seqlengths(s),
function(seqlen)
IRanges(breakInChunks(seqlen,1000))))
I am not sure how to get no. of each genes per each bin after this, I tried countOverlaps(bins,genes)
, which gives me
scaffold_1 scaffold_2 2 2
which is not what I wanted. I will really appreciate any suggestions about improving the code to get the desired plot, and also about the best way to calculate the gene density in this case. Thanks!
That's a pretty cool approach, thanks Irsan. Since, you have the gene density plotted, how do you manage to get (extract) the plotted data into a dataframe? I mean, I would like to save the values for each bin.
The ggplot()-function performs a statistical transformation (in this case binning) on the provided data-frame internally but does not save the transformed data. When you still want to get it out you can do this
But I am not sure if this is what you are looking for. If you have problems with binning genomic data you can also put your data in BED-format and use a combination of
bedtools makewindows
andbedtools intersect
to calculate the amount of genes on certain genomic intervalsThanks for your help, that's exactly what I was looking for ! This library is definitely a must !
Thanks a lot for the ggplot solution, it is much better and easier than what I was attempting, and its great that I can extract the plotted data!