Question: Python: Finding Gene Closest To A Given Location
gravatar for bbio
5.7 years ago by
United Kingdom
bbio80 wrote:

What is the preferred way to, given locations in a genome, find the first genes that are upstream or downstream of these locations?

I have been looking into processing a GFF file using BioPython, but it is taking an insanely long time (2h+) to parse a file other tools can parse in seconds. I also considered using Ensembl, but there also doesn't seem to be a good API for Python there.

So what would be my best course of action here? Use MySqlDb to hook straight into the Ensembl database? Parse the GFF file myself?

ADD COMMENTlink modified 4.5 years ago by Biostar ♦♦ 20 • written 5.7 years ago by bbio80
gravatar for Pierre Lindenbaum
5.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum108k wrote:

You can use bedtools: closestBed


Summary: For each feature in A, finds the closest 
     feature (upstream or downstream) in B.

Usage:   closestBed [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>

See also:

How to map a SNP to a gene around +/- 60KB ?

ADD COMMENTlink written 5.7 years ago by Pierre Lindenbaum108k

That seems to do exactly what I want, thank you! So I guess the best way to do this in Python is to not use Python...

ADD REPLYlink written 5.7 years ago by bbio80

there is a implementation of bedtools incase you need to use a a part of a script.

ADD REPLYlink written 5.6 years ago by Lorena Pantano360
gravatar for brentp
5.6 years ago by
Salt Lake City, UT
brentp22k wrote:

Since noone has mentioned it, you can use BEDTools and python (your cake and eat it too) with pybedtools

With that, the code would be something like:

import sys
from bedtools import BedTool

locs_file = sys.argv[1]
genes_file = sys.argv[2]

for nearest in BedTool(locs_file).closest(genes_file):
    print nearest
ADD COMMENTlink written 5.6 years ago by brentp22k

Thank you for the suggestion. Do you know how well this performs compared to the command-line bedtools?

ADD REPLYlink written 5.6 years ago by bbio80

BedTool(locsfile).closest(genesfile) will be virtually identical in performance to the command-line. Looping over the result will incur some overhead because it is converting each line into a python object. But generally, speed is very good, most of the performance-critical parts of pybedtools are written in C.

ADD REPLYlink written 5.6 years ago by brentp22k
gravatar for Dan Gaston
5.7 years ago by
Dan Gaston7.0k
Dan Gaston7.0k wrote:

Pierre's bedtools approach is probably the best. If you are set on developing you're own in-house application with Python I would recommend using bx.python and IntervalTrees. You can parse a feature list (such as a gene list with coordinates and chromosome) in to IntervalTrees (one per chromosome) and look for the closest intervals to your position(s) of interest. However bedtools will work much quicker and be easier "out-of-the-box".

But learning IntervalTrees can be quite handy for dealing with genomic data in general, especially with Python, so it may be something worth looking in to as a learning exercise.

ADD COMMENTlink written 5.7 years ago by Dan Gaston7.0k

Yeah intervals trees are a good way to go. I posted some code on a python implementation recently:

ADD REPLYlink written 5.7 years ago by Damian Kao14k

Looks good. There is the implementation in bx.python, which includes a cython implementation already, I highly recommend checking it out as it is a pretty robust library

ADD REPLYlink written 5.7 years ago by Dan Gaston7.0k

I think I will go with the easy solution above for now, but IntervalTrees are definitely something I will have to play around with in the future, so thank you for the suggestion!

ADD REPLYlink written 5.7 years ago by bbio80

I agree, I think the bedtools solution is optimal for your current needs. I am using IntervalTrees pretty heavily though for downstream data analysis and additional annotations that I am doing as part of my pipeline.

ADD REPLYlink written 5.7 years ago by Dan Gaston7.0k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2017 users visited in the last hour