Question: Ensembl & biomaRt: extracting in-frame codons near specified position
3.8 years ago by
United States
bsmith030465140 wrote:


My objective is to find the frequency of in-frame codons within 50 bp of a specified location. I have the Ensembl transcript ID and the genomic coordinates, e.g.:

chromosome    position    strand    ensemblTID
chr2    219130603    -    ENST00000538028


I have been trying to use the getBM and getSequence functions, but I'm not even close to getting what I want.

Any help to point me in the correct direction would be appreciated!


biomart codon ensembl • 1.4k views
modified 3.8 years ago by Tariq Daouda210 • written 3.8 years ago by bsmith030465140
3.8 years ago by
Tariq Daouda210
IRIC | Institute for Research in Immunology and Cancer
Tariq Daouda210 wrote:

A very similar example of what you need is on the front page of pyGeno's website.

Here's a way to get what you're asking for:

from pyGeno.Genome import *

ref = Genome(name = "GRCh37.75") #or whatever other ref genome you've chosen to import
exons = ref.get(Exon, {"chromosome.number" : "2", "start >=": 219130603 - 50, "end <=" : 219130603 + 50 } )

# For example, to print the sequences:
for e in exons:
    print(e.sequence)  # the loop variable is e, not exon; print() works in Python 2 and 3
written 3.8 years ago by Tariq Daouda210
Hi Tariq, That looks like an interesting package. I tried your code and got the following error:

Traceback (most recent call last):
  File "/Applications/", line 7, in <module>
  File "/Library/Python/2.7/site-packages/pyGeno/", line 67, in __init__
    pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/pyGeno/", line 83, in __init__
    self.wrapped_object = self._wrapped_class(*args, **kwargs)
  File "/Library/Python/2.7/site-packages/rabaDB/", line 301, in __call__
    raise KeyError("Couldn't find any object that fit the arguments you've prodided to the constructor")
KeyError: "Couldn't find any object that fit the arguments you've prodided to the constructor"

Also, I'm not sure this is quite what I am looking for. I think that with biomaRt and Biostrings I can get the sequence information, given the coordinates.
written 3.8 years ago by bsmith030465140
3.8 years ago by
Emily_Ensembl17k wrote:

I don't think you're going to get the data you need through BioMart. BioMart is a gene-centric tool: it works by defining a list of genes (filters) and then printing information about those genes (attributes). Also, BioMart attributes tend to be things that the community commonly looks for, not very esoteric things.

What you're trying to do is define a locus and get very specific data on the genomic region around it, and BioMart's just not going to do that. I think you're going to have to look at using the Ensembl Perl API. This allows completely flexible access to the Ensembl database, so you can define your regions of interest and get whatever data you like for them. There's an online course on using the API here; you'll just need the Core module.

written 3.8 years ago by Emily_Ensembl17k

Hi Emily,

Thanks for the reply! Hmm...I don't know if my question is too esoteric! I think with a little processing, I may be able to get the answer, but I may be wrong! Anyway, here's how I view the problem:

1. Given the Ensembl transcript ID, identify the coding start and coding end coordinates.

2. Given the coding coordinates, get the DNA sequence.

3. Convert the DNA sequence to the mRNA sequence.

4. From the 5' end, identify where the first codon starts.

5. Given the coordinates of the first codon, keep moving downstream until you hit the region of interest (location plus/minus 50 bp).

6. Identify the codons in this region.


Am I thinking about this correctly? Did I pose the question correctly?

many thanks for all your help!

written 3.8 years ago by bsmith030465140
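
Steps 4 to 6 of the list above can be sketched in plain Python once the spliced coding sequence is in hand. This is only a rough illustration, not a biomaRt or Ensembl API call: the function names in_frame_codons and codon_frequencies are made up for this example, and it assumes the CDS string starts at the first base of the start codon and that the position of interest has already been converted to a 0-based offset within that string.

```python
from collections import Counter


def in_frame_codons(cds, offset, flank=50):
    """Return the in-frame codons of `cds` that overlap offset +/- flank.

    cds    -- spliced coding sequence, 5'->3', starting at the start codon
              (assumption: introns and UTRs are already removed)
    offset -- 0-based index of the position of interest within `cds`
    flank  -- number of bases to look at on each side of `offset`
    """
    lo = max(0, offset - flank)
    hi = min(len(cds) - 1, offset + flank)
    first = (lo // 3) * 3  # start of the codon that contains `lo`
    codons = []
    for start in range(first, hi + 1, 3):
        codon = cds[start:start + 3]
        if len(codon) == 3:  # ignore an incomplete trailing codon
            codons.append(codon)
    return codons


def codon_frequencies(codons):
    """Return a dict mapping each codon to its relative frequency."""
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: float(n) / total for codon, n in counts.items()}


if __name__ == "__main__":
    toy_cds = "ATG" + "GCC" * 10 + "TAA"  # hypothetical toy sequence
    window = in_frame_codons(toy_cds, offset=15, flank=6)
    print(codon_frequencies(window))
```

Codons that only partly overlap the window are kept whole here; whether that is the right choice depends on how "within 50 bp" should be interpreted.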

It's too esoteric for BioMart to be able to get you your answer all by itself, but if you're happy to do post-processing, that's fine. You might find it easier to get the exon attributes, since introns will mess with your coding frame. You can get the exon sequence, phase and coding start/end and this might be easier to work with.

modified 3.8 years ago • written 3.8 years ago by Emily_Ensembl17k
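
As a rough illustration of why the exon-level coding start/end values help: with one genomic coding interval per exon, a genomic position can be mapped to an offset in the spliced CDS, and the offset modulo 3 gives the codon frame without introns getting in the way. The sketch below is plain Python under stated assumptions, not part of BioMart or the Ensembl API: genomic_to_cds_offset is a hypothetical helper, the intervals are assumed to be 1-based inclusive with UTR already excluded, and the coordinates in the usage lines are invented rather than taken from ENST00000538028.

```python
def genomic_to_cds_offset(position, coding_segments, strand):
    """Map a 1-based genomic position to a 0-based offset in the spliced CDS.

    coding_segments -- list of (start, end) genomic intervals, 1-based and
                       inclusive, one per coding part of an exon
    strand          -- "+" or "-"
    Returns None if the position lies in an intron or outside the CDS.
    """
    # Walk the segments in the direction of transcription.
    ordered = sorted(coding_segments, reverse=(strand == "-"))
    consumed = 0  # coding bases seen in earlier segments
    for start, end in ordered:
        if start <= position <= end:
            within = position - start if strand == "+" else end - position
            return consumed + within
        consumed += end - start + 1
    return None


if __name__ == "__main__":
    # Invented coordinates for a two-exon CDS on the minus strand.
    segments = [(100, 109), (200, 209)]
    offset = genomic_to_cds_offset(205, segments, "-")
    print(offset, offset % 3)  # CDS offset and position within its codon
```

The returned offset can then be used to pick out the codons around the position in the spliced coding sequence.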