Ensembl Gene Sequence Data File?
9.3 years ago
pwg46 ▴ 540

Hello,

I am looking for a sequence file keyed by Ensembl gene identifiers. In particular, I have been searching for something like the CDS (FASTA) file here, which simply maps ENSTs to their coding sequences, but for Ensembl genes instead. I would like to avoid downloading the entire genome. A simple ENSG->sequence mapping file would be ideal, but so far I have not been able to find one. Does anyone know if such a file exists?

data sequence ensg ensembl fasta • 3.2k views
9.3 years ago

I don't think Ensembl produces this file, but there are several ways you could produce one:

  • with the API: extract the chromosome slice between the gene start and the gene end
  • get the gene coordinates from the GTF file and extract the corresponding sequence from the DNA file
  • use Ensembl's BioMart
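As a rough illustration of the second option, here is a minimal Python sketch. It assumes the GTF contains `gene` feature rows with a `gene_id` attribute, and that the chromosome sequences are already loaded into a dict. The function names and the toy identifiers and sequences are made up for the example; they are not Ensembl data or part of any Ensembl tool.

```python
import re

# Translation table used to complement a DNA string.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_gene_coords(gtf_lines):
    """Yield (gene_id, chrom, start, end, strand) for every 'gene' row."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep only gene-level features
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        if match:
            yield (match.group(1), fields[0],
                   int(fields[3]), int(fields[4]), fields[6])


def gene_sequence(chrom_seq, start, end, strand):
    """Slice a gene out of a chromosome; GTF coordinates are 1-based, inclusive."""
    seq = chrom_seq[start - 1:end]
    if strand == "-":
        # Genes on the minus strand are reported as the reverse complement.
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


# Toy inputs (hypothetical IDs and sequence, for illustration only).
toy_gtf = [
    "#!genome-build toy\n",
    '1\ttoy\tgene\t3\t8\t.\t+\t.\tgene_id "ENSG_TOY_PLUS";\n',
    '1\ttoy\texon\t3\t8\t.\t+\t.\tgene_id "ENSG_TOY_PLUS";\n',
    '1\ttoy\tgene\t5\t10\t.\t-\t.\tgene_id "ENSG_TOY_MINUS";\n',
]
toy_genome = {"1": "AACCGGTTACGT"}

gene_seqs = {
    gene_id: gene_sequence(toy_genome[chrom], start, end, strand)
    for gene_id, chrom, start, end, strand in parse_gene_coords(toy_gtf)
}
```

Note that this still needs the DNA file for each chromosome of interest, and the result is the full genomic span of the gene (introns included), not a spliced sequence.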
9.3 years ago
Tariq Daouda ▴ 220

Hi,

There's a Python module for working with Ensembl annotations and sequences that can save you a lot of time. It's called pyGeno, and these few lines should do the trick:

from pyGeno.Genome import *

ref = Genome(name = "GRCh37.75")

# For a specific transcript you can do (get() returns a list of matches):
trans = ref.get(Transcript, id = "ENST...")[0]
print(trans.cDNA)

If you want the sequences for all the transcripts of the genome:

for trans in ref.iterGet(Transcript):
    print(trans.id, trans.cDNA)

But first you'll have to import the human reference genome (this only needs to be done once):

import pyGeno.bootstrap as B
B.importHumanReference()
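To end up with an actual mapping file rather than printed output, the id/sequence pairs from the loop above could be written out as FASTA. This is only a sketch in plain Python, independent of pyGeno: `write_fasta` is a hypothetical helper, and the toy IDs and sequences are made up.

```python
import io


def write_fasta(records, handle, width=60):
    """Write (identifier, sequence) pairs as FASTA, wrapping lines at `width`."""
    for seq_id, seq in records:
        handle.write(">%s\n" % seq_id)
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Toy records; in practice these would come from a loop such as
# ((trans.id, trans.cDNA) for trans in ref.iterGet(Transcript)).
buffer = io.StringIO()
write_fasta([("ENSG_TOY_1", "ACGTACGTAC"), ("ENSG_TOY_2", "GGCC")],
            buffer, width=4)
fasta_text = buffer.getvalue()
```

Passing a file opened with `open("out.fa", "w")` instead of the `StringIO` buffer would write the records to disk.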

pyGeno: https://github.com/tariqdaouda/pyGeno

Cheers
