Handling Ranged Data In Python
2
3
Entering edit mode
12.3 years ago
Abhi ★ 1.6k

Hi Guys

I am wondering if there is something that I could use to handle ranged data in python similar to GenomicRanges/IRanges in R ?

A specific example:

My input data is gene coordinates and chromosomes. For each gene I would like to know the end coordinate of the gene just before it to calculate the intergenic distance.

Thanks! -Abhi

python biopython r • 4.7k views
4
Entering edit mode

There are several python packages and other alternatives mentioned here: http://biostar.stackexchange.com/questions/2245/what-is-the-quickest-algorithm-for-range-overlap

0
Entering edit mode

I remember that package was called 'pygr', but I haven't used it myself.

0
Entering edit mode

I think it would be better if you paste a little portion of your input file on your question.

0
Entering edit mode

Hi Abhi. Can you post your input file?

4
Entering edit mode
12.2 years ago
Weronika ▴ 300

Michael Dondrup's comment is probably the best solution if you want to do anything complex with ranges, but for the problem stated (intergenic distance calculation), you really only need a few lines of python code:

# I don't know what your data looks like, so I'll start with a list of
#  name,chromosome,start_pos,end_pos tuples
gene_positions = [('geneA', 'chr1', 100,200), ('geneB', 'chr1', 300,400),
('geneC', 'chr1', 401, 450), ('geneD', 'chr2', 100,200)]
# sorting genes by the chromosome and start_position fields so they're in order
gene_positions.sort(key = lambda x: (x[1],x[2]))

intergenic_distances = []

# using zip to get a list of pairs of adjacent genes
for gene1,gene2 in zip(gene_positions, gene_positions[1:]):
(gene1_name,gene1_chr,gene1_start,gene1_end) = gene1
(gene2_name,gene2_chr,gene2_start,gene2_end) = gene2
# only take gene pairs on the same chromosome
if gene1_chr == gene2_chr:
# add the distance and both gene names to intergenic_distance
#  (may need to adjust the distance math depending
#   if the positions are end-inclusive)
intergenic_distances.append((gene2_start - gene1_end,
gene1_name, gene2_name))


I don't know what you want to do with the data afterward, but you get a list of (intergenic_distance, gene1_name, gene2_name) tuples, like this:

for distance,gene1,gene2 in intergenic_distances:
print "Distance between %s and %s is %s"%(gene1,gene2,distance)
# For the example data above, this prints the following:
#  Distance between geneA and geneB is 100
#  Distance between geneB and geneC is 1


Of course, you probably already knew that.

1
Entering edit mode
6.1 years ago
endrebak ▴ 960

You now can: https://github.com/endrebak/pyranges

It is still in pre-alpha, but will be updated by me as I use it all the time.

Edit: in beta now. To my knowledge it is faster and more widely tested than any other alternative.