Question

Python: Finding Gene Closest To A Given Location

2

Entering edit mode

11.6 years ago

bbio ▴ 90

What is the preferred way to, given locations in a genome, find the first genes that are upstream or downstream of these locations?

I have been looking into processing a GFF file using BioPython, but it is taking an insanely long time (2h+) to parse a file other tools can parse in seconds. I also considered using Ensembl, but there also doesn't seem to be a good API for Python there.

So what would be my best course of action here? Use MySqlDb to hook straight into the Ensembl database? Parse the GFF file myself?

python annotation gff parsing ensembl • 9.1k views

ADD COMMENT • link updated 10.3 years ago by Biostar 20 • written 11.6 years ago by bbio ▴ 90

Ram · Answer 1 · 2012-09-25

7

Entering edit mode

11.6 years ago

Pierre Lindenbaum 161k

You can use bedtools closestBed

closestBed 

Summary: For each feature in A, finds the closest 
     feature (upstream or downstream) in B.

Usage:   closestBed [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>

See also:

How to map a SNP to a gene around +/- 60KB ?

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 11.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

That seems to do exactly what I want, thank you! So I guess the best way to do this in Python is to not use Python...

ADD REPLY • link 11.6 years ago by bbio ▴ 90

0

Entering edit mode

there is a implementation of bedtools incase you need to use a a part of a script. http://packages.python.org/pybedtools/

ADD REPLY • link 11.5 years ago by Lorena Pantano ▴ 380

Ram · Answer 2 · 2012-10-30

3

Entering edit mode

11.5 years ago

brentp 24k

Since noone has mentioned it, you can use BEDTools and python (your cake and eat it too) with pybedtools

With that, the code would be something like:

import sys
from bedtools import BedTool

locs_file = sys.argv[1]
genes_file = sys.argv[2]

for nearest in BedTool(locs_file).closest(genes_file):
    print nearest

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 11.5 years ago by brentp 24k

1

Entering edit mode

Thank you for the suggestion. Do you know how well this performs compared to the command-line bedtools?

ADD REPLY • link 11.4 years ago by bbio ▴ 90

0

Entering edit mode

BedTool(locs_file).closest(genes_file) will be virtually identical in performance to the command-line. Looping over the result will incur some overhead because it is converting each line into a python object. But generally, speed is very good, most of the performance-critical parts of pybedtools are written in C.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 11.4 years ago by brentp 24k

score 2 · Answer 3 · 2012-09-25

2

Entering edit mode

11.6 years ago

DG 7.3k

Pierre's bedtools approach is probably the best. If you are set on developing you're own in-house application with Python I would recommend using bx.python and IntervalTrees. You can parse a feature list (such as a gene list with coordinates and chromosome) in to IntervalTrees (one per chromosome) and look for the closest intervals to your position(s) of interest. However bedtools will work much quicker and be easier "out-of-the-box".

But learning IntervalTrees can be quite handy for dealing with genomic data in general, especially with Python, so it may be something worth looking in to as a learning exercise.

ADD COMMENT • link 11.6 years ago by DG 7.3k

1

Entering edit mode

Yeah intervals trees are a good way to go. I posted some code on a python implementation recently: http://blog.nextgenetics.net/?e=45

ADD REPLY • link 11.6 years ago by Damian Kao 16k

0

Entering edit mode

Looks good. There is the implementation in bx.python, which includes a cython implementation already, I highly recommend checking it out as it is a pretty robust library

ADD REPLY • link 11.6 years ago by DG 7.3k

0

Entering edit mode

I think I will go with the easy solution above for now, but IntervalTrees are definitely something I will have to play around with in the future, so thank you for the suggestion!

ADD REPLY • link 11.6 years ago by bbio ▴ 90

0

Entering edit mode

I agree, I think the bedtools solution is optimal for your current needs. I am using IntervalTrees pretty heavily though for downstream data analysis and additional annotations that I am doing as part of my pipeline.

ADD REPLY • link 11.6 years ago by DG 7.3k