Question: Python: Finding Gene Closest To A Given Location
2
gravatar for bbio
5.2 years ago by
bbio80
United Kingdom
bbio80 wrote:

What is the preferred way to, given locations in a genome, find the first genes that are upstream or downstream of these locations?

I have been looking into processing a GFF file using BioPython, but it is taking an insanely long time (2h+) to parse a file other tools can parse in seconds. I also considered using Ensembl, but there also doesn't seem to be a good API for Python there.

So what would be my best course of action here? Use MySqlDb to hook straight into the Ensembl database? Parse the GFF file myself?

ADD COMMENTlink modified 3.9 years ago by Biostar ♦♦ 20 • written 5.2 years ago by bbio80
7
gravatar for Pierre Lindenbaum
5.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum101k wrote:

You can use bedtools: closestBed

closestBed 

Summary: For each feature in A, finds the closest 
     feature (upstream or downstream) in B.

Usage:   closestBed [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>

See also:

How to map a SNP to a gene around +/- 60KB ?

ADD COMMENTlink written 5.2 years ago by Pierre Lindenbaum101k

That seems to do exactly what I want, thank you! So I guess the best way to do this in Python is to not use Python...

ADD REPLYlink written 5.2 years ago by bbio80

there is a implementation of bedtools incase you need to use a a part of a script. http://packages.python.org/pybedtools/

ADD REPLYlink written 5.1 years ago by Lorena Pantano360
2
gravatar for Dan Gaston
5.2 years ago by
Dan Gaston6.9k
Canada
Dan Gaston6.9k wrote:

Pierre's bedtools approach is probably the best. If you are set on developing you're own in-house application with Python I would recommend using bx.python and IntervalTrees. You can parse a feature list (such as a gene list with coordinates and chromosome) in to IntervalTrees (one per chromosome) and look for the closest intervals to your position(s) of interest. However bedtools will work much quicker and be easier "out-of-the-box".

But learning IntervalTrees can be quite handy for dealing with genomic data in general, especially with Python, so it may be something worth looking in to as a learning exercise.

ADD COMMENTlink written 5.2 years ago by Dan Gaston6.9k
1

Yeah intervals trees are a good way to go. I posted some code on a python implementation recently: http://blog.nextgenetics.net/?e=45

ADD REPLYlink written 5.2 years ago by Damian Kao14k

Looks good. There is the implementation in bx.python, which includes a cython implementation already, I highly recommend checking it out as it is a pretty robust library

ADD REPLYlink written 5.2 years ago by Dan Gaston6.9k

I think I will go with the easy solution above for now, but IntervalTrees are definitely something I will have to play around with in the future, so thank you for the suggestion!

ADD REPLYlink written 5.2 years ago by bbio80

I agree, I think the bedtools solution is optimal for your current needs. I am using IntervalTrees pretty heavily though for downstream data analysis and additional annotations that I am doing as part of my pipeline.

ADD REPLYlink written 5.2 years ago by Dan Gaston6.9k
2
gravatar for brentp
5.1 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

Since noone has mentioned it, you can use BEDTools and python (your cake and eat it too) with pybedtools

With that, the code would be something like:

import sys
from bedtools import BedTool

locs_file = sys.argv[1]
genes_file = sys.argv[2]

for nearest in BedTool(locs_file).closest(genes_file):
    print nearest
ADD COMMENTlink written 5.1 years ago by brentp22k
1

Thank you for the suggestion. Do you know how well this performs compared to the command-line bedtools?

ADD REPLYlink written 5.0 years ago by bbio80

BedTool(locsfile).closest(genesfile) will be virtually identical in performance to the command-line. Looping over the result will incur some overhead because it is converting each line into a python object. But generally, speed is very good, most of the performance-critical parts of pybedtools are written in C.

ADD REPLYlink written 5.0 years ago by brentp22k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 695 users visited in the last hour