Pythonic way to get gene_id and gene_symbol from a given list of gene/transcript coordinates
7.0 years ago
badribio ▴ 290

Are there any quick methods to get the Ensembl gene_id and gene_symbol for a list of (transcript) coordinates? The post "A: Identify gene symbols given a list of chromosome positions" covers UCSC only.

python • 1.5k views

Hello,

You could use Ensembl's REST API for this, e.g. the Overlap endpoint.

fin swimmer

Any snippet to query the REST API using Python?
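
A minimal sketch of such a query, using only the standard library. It assumes the public server at rest.ensembl.org, the `overlap/region` endpoint with `feature=gene`, and that each returned record carries the gene ID under `id` and the symbol under `external_name`. The function names are made up for illustration.

```python
import json
import urllib.request

SERVER = "https://rest.ensembl.org"  # public Ensembl REST server (assumed)


def overlap_url(chrom, start, end, species="human"):
    """Build the Overlap endpoint URL for one 1-based, inclusive region."""
    # Ensembl names chromosomes "1", "X", ... so drop a UCSC-style "chr" prefix
    # (chrM would still need renaming to MT).
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return f"{SERVER}/overlap/region/{species}/{chrom}:{start}-{end}?feature=gene"


def genes_from_overlap(records):
    """Reduce the decoded JSON list to (gene_id, gene_symbol) pairs."""
    return [(rec["id"], rec.get("external_name")) for rec in records]


def fetch_genes(chrom, start, end, species="human"):
    """Query the server for one region (needs network access)."""
    request = urllib.request.Request(
        overlap_url(chrom, start, end, species),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return genes_from_overlap(json.load(response))
```

Calling `fetch_genes("chr7", 140424943, 140624564)` would return one pair per overlapping gene. The server is rate limited and each region is a separate request, so for thousands of coordinates a local annotation file is probably the better route.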

7.0 years ago

I'd be surprised if this couldn't be done with BioMart.

In fact, here is a very simple example using GRCh38.

BioMart has a limit of 500 queries (I may be wrong here), and I have a lot of lines, since these are transcript-level coordinates.

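BTW, if for some reason you really want a Python-based solution, then download a GTF file and:

pip install deeptools

then in Python:

from deeptoolsintervals import GTF

# transcriptID selects the feature type (GTF column 3) to load;
# transcript_id_designator names the attribute key used as each region's ID.
anno = GTF("foo.gtf", transcriptID="gene", transcript_id_designator="gene_id")

# Return the loaded features overlapping chr1:1-1000.
anno.findOverlaps("chr1", 1, 1000)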

That will get you the gene_id field and coordinate information. The Python wrapper doesn't allow access to the symbol, so you'd need to just download the mapping from BioMart.

If you don't want to perform a bunch of remote queries, then something along those lines would work. I never really intended for others to use that Python module, but if you ever want to, it's documented here.
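
Attaching symbols from a downloaded mapping could look roughly like this. The sketch assumes a tab-separated export whose header columns are named "Gene stable ID" and "Gene name"; adjust `id_col` and `symbol_col` to whatever your file actually contains. Pulling the gene_id out of each overlap result is left out, and the function names are hypothetical.

```python
import csv
import io


def load_symbol_map(handle, id_col="Gene stable ID", symbol_col="Gene name"):
    """Read a tab-separated gene_id -> symbol table into a dict."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: row[symbol_col] for row in reader}


def add_symbols(gene_ids, symbol_map):
    """Pair each gene_id with its symbol, or None when it is not in the map."""
    # GTF gene_ids may carry a version suffix (".14"); the map is assumed
    # to use unversioned IDs, so strip the suffix before the lookup.
    return [(gid, symbol_map.get(gid.split(".")[0])) for gid in gene_ids]


# Tiny in-memory stand-in for the downloaded mapping file.
example = io.StringIO("Gene stable ID\tGene name\nENSG00000157764\tBRAF\n")
symbols = load_symbol_map(example)
print(add_symbols(["ENSG00000157764.14", "ENSG_UNKNOWN"], symbols))
```

With a real file, pass `open("mapping.tsv", newline="")` instead of the `StringIO` object (the file name is a placeholder).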

Thanks, I will try this out. I need a Python solution because I have to modify a pipeline written in Python; otherwise bedtools would have been my first choice. That said, pybedtools should also do the job, if I am not wrong.

Yeah, pybedtools would have been my other suggestion.

At that point, don't you have the transcript IDs? Then you don't need to look anything up with coordinates; you just need to convert the transcript ID to a gene ID (also available on BioMart).

Nope, I only have the coordinates, just like the output in a TopHat junctions.bed file.

I expect that bedtools intersect and a bit of awk will turn out to be the simplest solution :P
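
For a pipeline that has to stay in Python, the same intersect idea can be sketched without external tools. This is a plain-Python stand-in, not bedtools itself: it assumes the GTF has `gene` features carrying `gene_id` and `gene_name` attributes, and it scans genes linearly per chromosome, which is fine for a sketch but slow for large inputs. The sample records (`G1`, `ABC`, `JUNC1`) are invented.

```python
import re
from collections import defaultdict

ATTR = re.compile(r'(\S+) "([^"]*)"')  # key "value" pairs in GTF column 9


def load_genes(gtf_lines):
    """Index GTF 'gene' features by chromosome as (start, end, id, name)."""
    genes = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        attrs = dict(ATTR.findall(cols[8]))
        # GTF is 1-based inclusive; store 0-based half-open, like BED.
        genes[cols[0]].append(
            (int(cols[3]) - 1, int(cols[4]), attrs.get("gene_id"), attrs.get("gene_name"))
        )
    return genes


def annotate_bed(bed_lines, genes):
    """Yield (chrom, start, end, gene_id, gene_name) for every overlap."""
    for line in bed_lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end = line.split("\t")[:3]
        start, end = int(start), int(end)
        for g_start, g_end, gene_id, gene_name in genes.get(chrom, []):
            if start < g_end and g_start < end:  # half-open intervals overlap
                yield chrom, start, end, gene_id, gene_name


# Invented records: one gene and two junction-style BED intervals.
gtf = ['chr1\tsrc\tgene\t101\t200\t.\t+\t.\tgene_id "G1"; gene_name "ABC";\n']
bed = ["chr1\t150\t300\tJUNC1\t5\t+\n", "chr1\t200\t300\tJUNC2\t5\t+\n"]
hits = list(annotate_bed(bed, load_genes(gtf)))
print(hits)
```

Only the first interval is reported here, because the second starts exactly where the gene ends. For real data, pass open file handles for the GTF and the junctions.bed file instead of the lists.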
