Hopefully this question isn't too specific. I am using the latest release of the human genome in the Ensembl database (homo_sapiens_core_76_38). I would like to map exons to their DNA sequence. The database schema seems to indicate that I can take the seq_region_id from the exon table and use it to look up the dna table. However, there isn't a DNA sequence for every exon. For example, for the exon with exon_id=28550800, the corresponding seq_region_id does not exist in the dna table. This is my first time using Ensembl, so is there something I'm missing?
Magali answered this on the Ensembl dev list as follows:
Exons and other features tend to be stored on toplevel sequences, which are generally chromosomes.
DNA sequence, however, is stored at the contig level.
The assembly table contains information to map a contig sequence to a chromosome.
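To make the mapping concrete, here is a minimal Python sketch of how assembly rows could be used to stitch a chromosome-level interval together from contig sequences. It is an illustration under assumptions, not the Ensembl API's implementation: the field names follow the assembly table columns in the core schema (asm_start, asm_end, cmp_seq_region_id, cmp_start, cmp_end, ori), the contigs and coordinates are invented toy data, and the function chromosome_slice is a hypothetical helper. It also assumes a single chromosome-to-contig mapping step with non-overlapping rows, which real assemblies do not always satisfy.

```python
from typing import Dict, List, NamedTuple


class AssemblyRow(NamedTuple):
    """One chromosome-to-contig mapping, modelled on the assembly table."""
    asm_start: int          # 1-based inclusive start on the chromosome
    asm_end: int            # 1-based inclusive end on the chromosome
    cmp_seq_region_id: int  # seq_region_id of the contig (key into dna)
    cmp_start: int          # 1-based inclusive start on the contig
    cmp_end: int            # 1-based inclusive end on the contig
    ori: int                # 1 = same orientation, -1 = contig is reversed


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def chromosome_slice(start: int, end: int, rows: List[AssemblyRow],
                     contig_dna: Dict[int, str]) -> str:
    """Forward-strand chromosome sequence for [start, end] (hypothetical helper).

    Positions not covered by any contig are filled with 'N'.
    """
    pieces = []
    pos = start  # next chromosome position still to be emitted
    for row in sorted(rows, key=lambda r: r.asm_start):
        if row.asm_end < start or row.asm_start > end:
            continue  # this contig does not overlap the requested interval
        if row.asm_start > pos:
            pieces.append("N" * (row.asm_start - pos))  # assembly gap
        lo = max(start, row.asm_start)
        hi = min(end, row.asm_end)
        contig = contig_dna[row.cmp_seq_region_id]
        if row.ori == 1:
            c_lo = row.cmp_start + (lo - row.asm_start)
            c_hi = row.cmp_start + (hi - row.asm_start)
            pieces.append(contig[c_lo - 1:c_hi])
        else:
            # Reversed contig: chromosome coordinates run against the contig.
            c_hi = row.cmp_end - (lo - row.asm_start)
            c_lo = row.cmp_end - (hi - row.asm_start)
            pieces.append(reverse_complement(contig[c_lo - 1:c_hi]))
        pos = hi + 1
    if pos <= end:
        pieces.append("N" * (end - pos + 1))  # trailing gap
    return "".join(pieces)


# Toy data: two 10-base contigs tiling chromosome positions 1-20.
contig_dna = {101: "ACGTACGTAC", 102: "GGGTTTAAAC"}
rows = [
    AssemblyRow(1, 10, 101, 1, 10, 1),
    AssemblyRow(11, 20, 102, 1, 10, -1),
]

# An "exon" at chromosome positions 8-13 spans both contigs.
exon_seq = chromosome_slice(8, 13, rows, contig_dna)
```

For an exon on the reverse strand (seq_region_strand = -1 in the exon table), the result would additionally need to be reverse-complemented.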
Retrieving DNA sequence directly from the MySQL schema is tricky in the best of cases.
This is why we recommend using BioMart, the Perl API (http://www.ensembl.org/info/docs/api/index.html) or REST queries (http://rest.ensembl.org) for this type of use.
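As an illustration of the REST route, the following Python sketch builds a request against the sequence/id endpoint of rest.ensembl.org and returns the plain-text sequence. Some assumptions to note: the endpoint expects a stable ID (for exons, an ENSE... identifier, which should be available in the stable_id column of the exon table) rather than the internal exon_id; the ID in the usage comment is only an example; and the helper names sequence_url and fetch_sequence are made up for this sketch.

```python
import urllib.request
from urllib.parse import quote

REST_SERVER = "https://rest.ensembl.org"


def sequence_url(stable_id: str, server: str = REST_SERVER) -> str:
    """Build the URL for a plain-text sequence request (hypothetical helper)."""
    return f"{server}/sequence/id/{quote(stable_id)}?content-type=text/plain"


def fetch_sequence(stable_id: str) -> str:
    """Fetch the sequence for a stable ID; requires network access."""
    with urllib.request.urlopen(sequence_url(stable_id), timeout=30) as resp:
        return resp.read().decode("ascii").strip()


# Usage (not run here because it needs network access):
#     seq = fetch_sequence("ENSE00001154485")  # example exon stable ID
#     print(len(seq))
```

The server handles the contig-to-chromosome mapping itself, so the client only needs the stable ID.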
I wrote a Python module for this kind of query on Ensembl data. It's called pyGeno and it is freely available on GitHub: https://github.com/tariqdaouda/pyGeno
Once you've imported the genome into it you can simply do:
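    from pyGeno.Genome import *

    # Load a genome that has already been imported into pyGeno
    ref = Genome(name = "GRCh37.75")
    # get() returns a list of matches, so take the first one
    exon = ref.get(Exon, id = "EN...")[0]
    # Coding part of the exon
    print(exon.CDS)
    # Full exon sequence
    print(exon.sequence)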
Hope that helps.