How to get gene name from coordinates using python.
5
6
Entering edit mode
9.2 years ago
alec_djinn ▴ 380

I have a very long list of coordinates that looks like that: genome_build:chrN:coordinate

I want to know if the given coordinate hits a gene or whatever else. What is the most efficient way to do so using python?

Example: having hg19:chr8:119331103 as input, it should return [SAMD12]

gene coordinates python • 13k views
ADD COMMENT
0
Entering edit mode

Do you have annotation files for each of the genome builds or would you instead need to query that information (e.g., via SQL)?

ADD REPLY
0
Entering edit mode

I don't have the annotation files and I would like to query ensembl online instead of locally.

ADD REPLY
0
Entering edit mode

If you have a long list of coordinates then I would suggest you doing it locally using tools such as bedtools, BEDOPS. See:http://bedtools.readthedocs.org/en/latest/content/tools/window.html

ADD REPLY
0
Entering edit mode

Guys thank you for all your tips. I am trying them out and I will let you know.

ADD REPLY
8
Entering edit mode
9.2 years ago
alec_djinn ▴ 380

Finally, I wanna share the best solution I have found so far. 'Best' because it is extremely simple and pythonic.

All you need is PyEnsembl. Here is how easy it goes:

from pyensembl import EnsemblRelease

data = EnsemblRelease(75, auto_download=True) # will take a while to get the genome data installed in your system
gene_names = data.gene_names_at_locus(contig=6, position=29945884)
print gene_names

>>[u'AC023590.1', u'SAMD12']
ADD COMMENT
0
Entering edit mode

I get TypeError: __init__() got an unexpected keyword argument 'auto_download'

ADD REPLY
0
Entering edit mode

The new version of pyensembl does not use auto_download anymore. Instead, you should first install the genome you need via pyensembl install --release 75 --species human where 75 can be replaced with any version you want

ADD REPLY
0
Entering edit mode

I'm getting this error on Windows with Anaconda prompt.

pyensembl install --release 93 --species homo_sapiens
urllib.error.URLError: <urlopen error ftp error: URLError("ftp error: error_perm('550 Failed to change directory.',)",)>

I tried also downloading it manually but where I should place it or which is the variable path to change? Thanks

ADD REPLY
4
Entering edit mode
9.2 years ago
Tariq Daouda ▴ 220

Hi,

You could use Ensembl's rest API for that using requests.

But if you have a very long list I would suggest that you try pyGeno, it's typically the kind of applications if was made for:

x1 = 119331103
g = Genome(name = "GRCh37.75")
for gene in g.get(Gene, {"start >=": x1, "end <" : x1, "chromosome.number" : "8"}) :
        print gene
ADD COMMENT
0
Entering edit mode

Nice suggestions about pyGeno! An alternative is to query biomart using https://github.com/sebriois/biomart

ADD REPLY
0
Entering edit mode

Thanks man, worked perfectly. However it seems to run only with python2 do you know something similar for python3?

ADD REPLY
6
Entering edit mode
9.2 years ago

Python is probably not a good tool for this. A command-line approach is probably going to be easier, faster, and less to type in.

First, you might download a set of human (hg19) genes, using gtf2bed to make a BED file, e.g.:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gtf.gz \
    | gunzip -c - \
    | gtf2bed - \
    | grep -w gene - \
    > gencode.v21.genes.bed

Second, you might convert your coordinate list (variants?) into a BED file, using sort-bed to ensure the output is in correct sort order, e.g.:

$ awk -F':' '{print $2"\t"($3-1)"\t"$3}' very_long_list.txt \
    | sort-bed - \
    > elements.bed

Third, use bedmap to map the elements in elements.bed to the gene names in gencode.v21.genes.bed, e.g.:

$ bedmap --echo --echo-map-id-uniq --delim '\t' elements.bed gencode.v21.genes.bed > answer.bed
ADD COMMENT
6
Entering edit mode
9.2 years ago
dandan ▴ 370

You could use SolveBio's Python client for this (disclosure, I work there). It's fast and will do exactly what you want. Right now we have GENCODE in our Data Library - GENCODE19 in particular is in hg19/GRCh37. If there's a different transcript annotation library you'd prefer, just request it and we'll add it.

Basically, you sign up for an account with SolveBio (it's free for academics), then run the following commands in your terminal (full tutorial/documentation is available here).

$ pip install solvebio
$ solvebio login
$ solvebio

The last command will open up our Python shell (you can also write a script and do import solvebio - if you're interested, I can write up a sample iPython notebook for that). Once you're in the Python shell, you can just type

gencode = Dataset.retrieve('GENCODE/GENCODE19')
results = gencode.query().filter(feature='gene').range(8,119331103,119331103)
for result in results:
    print result['gene_symbol']

And that'll print out SAMD12 and AC023590.1 (if you change the filter to .filter(feature='gene',gene_status='KNOWN'), you'll get only SAMD12).

You can even do it as a one liner if you want

Dataset.retrieve('GENCODE/GENCODE19').query(fields='gene_symbol').filter(feature='gene',gene_status='known').range(8,119331103,119331103)

It's all very pythonic, so you can iterate through all the results in ways you expect.

ADD COMMENT
0
Entering edit mode
9.2 years ago
piet ★ 1.8k

If you want to code it yourself in python the key concept is calculating the overlap between two ranges of coordinates. If your unknown region of the sequence overlaps with a named feature, you will assign that name to the unknown region.

See here for more discussion

Although the problem seems to be simple it is not easy to implement efficiently. Efficiency matters if you have really large datasets. I ended up with a solution where I used a sqlite3 database within a python program.

ADD COMMENT

Login before adding your answer.

Traffic: 3720 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6