I have the UCSC RefSeq track with txstart sites for all the genes. I want to find the distance between consecutive pairs of genes in order to determine which consecutive genes lie within a particular range of each other. How can I do this using Python or R ?
You could probably do this relatively easily in R with GenomicRanges. Read in the RefSeq track, convert the transcripts to GRanges and then apply a function that computes the distance between a range and the output from nearest().
Edit: Here's an example:
library("GenomicRanges") refGene <- read.delim("~/Downloads/refGene.txt", header=T) gr <- GRanges(seqnames=Rle(refGene$chrom), ranges=IRanges(start=refGene$txStart, end=refGene$txEnd, names=refGene$name), strand=Rle(strand(refGene$strand))) neighbors <- nearest(gr) #This can return NA REMOVE <- whichis.na(neighbors)) neighbors <- neighbors[-REMOVE] neighbor <- gr[neighbors] gr <- gr[-REMOVE] distances <- distance(gr, neighbor)
I just tried this on my laptop and it seems to work fine. If this isn't exactly what you want, you should be able to easily modify it.