Question: Getting All Exons From Ensembl...
9
gravatar for Steve Moss
9.1 years ago by
Steve Moss2.3k
United Kingdom
Steve Moss2.3k wrote:

Hi,

I'm interested in whether anyone has worked up a solution for retrieving all unique exons from Ensembl in relation to the gene IDs and not the transcript IDs.

I have already used the Perl Ensembl Core API to retrieve all exons, for all transcripts, for all genes, but this results in redundant data, due to alternative splicing in different transcripts. Some exons therefore overlap or are replicated and therefore the true exon data is exaggerated. I want the number of exons per gene, not the number of exons for all transcripts.

It just confuses me because on the Assembly and Genebuild page for Genome Statistics (e.g. http://www.ensembl.org/Takifugu_rubripes/Info/StatsTable) it has the number of gene exons listed at 322,585, but when I download using BioMart or the Perl API I get nearly 650,000.

I guess I'm going to have to either choose the transcript with the most exons, or work on a solution that removes any redundancy by amending overlapping regions and removing complete duplicates? I suspect the former will be the easiest and hopefully not exhibit too many errors?

Cheers,

Steve

Update

A simple way to do this is by using the canonical_transcript method for the gene! So we call:

my $can_tr = $gene->canonical_transcript();
my $exons = $can_tr->get_all_Exons();
gene ensembl api exon transcript • 12k views
ADD COMMENTlink modified 8.3 years ago • written 9.1 years ago by Steve Moss2.3k
1

I think you need to provide a definition of "unique" for your purposes. Naively this might mean "having an unique sequence within a gene", in which case creating a hash-table with an appropriate key within your script would do it, but it seems you want something more complex.

ADD REPLYlink written 9.1 years ago by iw9oel_ad6.0k

By unique I mean I want to retrieve all the non-redundant i.e. non-overlapping and non-replicated exons for each gene. The exons appear to be defined by transcripts; as new studies merge new transcripts, the exon number can increase and also become redundant.

ADD REPLYlink written 9.1 years ago by Steve Moss2.3k
5
gravatar for Cassj
9.1 years ago by
Cassj1.3k
London
Cassj1.3k wrote:

What did you use as your biomart query? If you take the unique results from a query with no filters and attributes of Ensembl Gene ID, Ensembl Exon ID, Chr Start, Chr End and Chr Name then you get back 322622 exons. Results I got are here: http://mng.iop.kcl.ac.uk/~cassj/martquery_0804162753_537.txt

You'll still get overlapping exons though - I think ensembl defines an exon on the basis of its splice sites, so you can have unique exons that only differ in a few bps at one end.

ADD COMMENTlink written 9.1 years ago by Cassj1.3k
1

But you get slightly closer to the expected answer! Comparing the two sets, our results cover exactly the same genes, though. Possibly some exons have different exon IDs, but the same coordinates. The correct result depends on your definition of "unique".

ADD REPLYlink written 9.1 years ago by iw9oel_ad6.0k

which is exactly what Keith has already said while I was typing. Sorry!

ADD REPLYlink written 9.1 years ago by Cassj1.3k

By unique I mean I want to retrieve all the non-redundant i.e. non-overlapping and non-replicated exons for each gene. I think I will need to do some sort of redundancy checks on the exon coordinates by the look of things to ensure that none of the output contradicts this criteria.

I had done something similar previously, checking for redundancy in repeat elements by creating a hash table (dict in python) for the start and end coordinates with a matching incremental integer value for each pair. I then sorted by the start coordinates and matched up the pairs again before comparing them all.

ADD REPLYlink written 9.1 years ago by Steve Moss2.3k

Not sure if that was the most optimal solution, but always welcome to suggestions :)

ADD REPLYlink written 9.1 years ago by Steve Moss2.3k
4
gravatar for Pierre Lindenbaum
9.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

The SQL tables for ensembl are available here.

Here, you'll find the definition of the tables in ftp://ftp.ensembl.org/pub/current_mysql/homo_sapiens_core_58_37c/homo_sapiens_core_58_37c.sql.gz

There is a table named exons:

CREATE TABLE `exon` (
  `exon_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `seq_region_id` int(10) unsigned NOT NULL,
  `seq_region_start` int(10) unsigned NOT NULL,
  `seq_region_end` int(10) unsigned NOT NULL,
  `seq_region_strand` tinyint(2) NOT NULL,
  `phase` tinyint(2) NOT NULL,
  `end_phase` tinyint(2) NOT NULL,
  `is_current` tinyint(1) NOT NULL DEFAULT '1',
  `is_constitutive` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`exon_id`),
  KEY `seq_region_idx` (`seq_region_id`,`seq_region_start`)
) ENGINE=MyISAM AUTO_INCREMENT=1536937 DEFAULT CHARSET=latin1;

Its data are here.

and the seq_region_id (chromosome) is defined in here.

You can access those data using mysql , see this other question.

Another solution using UCSC, not ensembl:

curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/knownGene.txt.gz" |\
  gunzip -c |\
  awk -F '\t' '{
        split($9,S,"[,]");
        split($10,E,"[,]");
        n=int($8);
        for(i=1;i<=n;++i) printf("%s\t%d\t%d\n",$2,int(S[i]),int(E[i]));
        }'
  chr1    1115    2090
  chr1    2475    2584
  chr1    3083    4121
  chr1    1115    2090
  chr1    2475    4272
  chr1    4268    4692
  chr1    4832    4901
  chr1    5658    5810
  chr1    6469    6628
  chr1    6716    6918
  (...)
ADD COMMENTlink modified 18 days ago by RamRS24k • written 9.1 years ago by Pierre Lindenbaum122k
2

It's because the data from the API and Biomart are denormalized. If you look at the number of rows in the exon_transcript link table (635796), you'll see each exon in the exon table is linked to multiple transcripts.

ADD REPLYlink written 9.1 years ago by iw9oel_ad6.0k

Hi Pierre,

I'm finishing off a PhD students work, who used Ensembl, so not sure UCSC will be relevant, although it is helpful to know about, thank you!

According to a gentleman from the Ensembl helpdesk, there is no way to retrieve all exons per gene using BioMart or the Core Perl API; as it needs to retrieve all transcripts for all gene ids and then all exons for all transcript ids.

ADD REPLYlink written 9.1 years ago by Steve Moss2.3k

Looking at the MySQL tables however (e.g. ftp://ftp.ensembl.org/pub/current_mysql/takifugu_rubripes_core_58_4l/exon.txt.gz) I get more like what I am expecting, so I don't quite understand how BioMart and the Perl Core API are retrieving such vastly different numbers (closer to 650,000 exons for Takifugu).

Cheers,
Steve

ADD REPLYlink modified 18 days ago by RamRS24k • written 9.1 years ago by Steve Moss2.3k

Yeah, I've been investigating it for the last couple of hours and had noticed that. It just makes it a little awkward to retrieve the information I need, but I can get it directly from the MySQL database I guess!?

ADD REPLYlink written 9.1 years ago by Steve Moss2.3k
3
gravatar for Darked89
9.1 years ago by
Darked894.2k
Barcelona, Spain
Darked894.2k wrote:

Few thoughts:

  • transcript with largest number of exons still may miss some of them
  • there are transcripts with equal number of exons but not all of them will be common
  • simple overlap checking will not work in possible cases (I am not sure if these are listed in Ensembl as genes) of reverse transcripts regulating a given gene. If you restrict yourself to only protein coding genes you should be OK at least in vertebrates, I think.
ADD COMMENTlink written 9.1 years ago by Darked894.2k
3
gravatar for iw9oel_ad
9.1 years ago by
iw9oel_ad6.0k
iw9oel_ad6.0k wrote:

You can do this from Biomart, despite what the man on the EnsEMBL helpdesk said, provided you limit your query attributes strictly to those that identify the gene and the exon start, end and strand.

E.g. select Takifugu rubripes genes, select no filters and attributes Structures plus Ensembl Gene ID, Gene Strand, Exon Chr Start (bp) and Exon Chr End (bp). Check the box for unique results only.

This will give a file like this:

Ensembl Gene ID Exon Chr Start (bp)     Exon Chr End (bp)       Strand
ENSTRUG00000000001      2982    3983    -1

This contains 321610 lines (including the header). I don't think this contains the 703 RNA genes, so 321609 + 703 = 322,312. Not sure where the other 273 are to make the 322585 total. Unfortunately, the stats table doesn't say exactly how their total was calculated.

ADD COMMENTlink modified 18 days ago by RamRS24k • written 9.1 years ago by iw9oel_ad6.0k

Many thanks! As with Cass below, I had realised that removing the Transcript ID on BioMart gave me 322,622 exons and I simply assumed this was due to minor revisions since the last release?

ADD REPLYlink written 9.1 years ago by Steve Moss2.3k
0
gravatar for Khader Shameer
9.1 years ago by
Manhattan, NY
Khader Shameer18k wrote:

If you have already downloaded all exons, you may run a CD-HIT at a specific threshold to get non-redundant sequences. I have never tried CD-HIT with HIT, but as per the help page it should work with nucleotide sequence as well.

ADD COMMENTlink written 9.1 years ago by Khader Shameer18k

That might work? I have used CD-HIT previously, although I think UClust 3.x is probably faster now? I have actually developed my own novel solution in Python called remred.py that is faster than both of them, but only works for 100% sequence identity, although that should work in this case, because I aren't wanting any fuzzy matches. It searches for smaller sequences within larger sequences too.

ADD REPLYlink written 9.1 years ago by Steve Moss2.3k

Actually, saying that, if an exon sequence matches in a different gene, then it may result in an under-representation of the true dataset? I could possibly due redundancy checking per gene, but it would mean splitting up my multifasta into separate gene ids first.

ADD REPLYlink written 9.1 years ago by Steve Moss2.3k

Saying that if a sequence matches in a different exon, then it could result in an underestimation of the true dataset? I could possibly do redundancy checks per gene id to get around this? But then, what if exon sequences match within genes? I think checking coordinates may actually be better!

ADD REPLYlink written 9.1 years ago by Steve Moss2.3k

It's more to do with redundancy in the position of the data, as opposed to the actual sequence itself.

ADD REPLYlink written 9.1 years ago by Steve Moss2.3k
0
gravatar for Steve Moss
9.1 years ago by
Steve Moss2.3k
United Kingdom
Steve Moss2.3k wrote:

I've hit another obstacle here in that the DNA sequences for the sequence region ids of each exon, aren't actually in the dna table.

I contacted the Ensembl Help-desk and I am told it would be an extremely complex task to retrieve the exon sequence data using MySQL.

Where are these missing DNA sequences? For danio_rerio_core_58_5d for example I am missing 73,448 to 89,002 (select seq_region_id from dna;), which correlates exactly with the exon sequence region ids.

Has anyone done this previously?

ADD COMMENTlink written 9.1 years ago by Steve Moss2.3k
1

Gawbul: You may add these as a separate question or include in your question for more attention from BioStar members.

ADD REPLYlink written 9.1 years ago by Khader Shameer18k

If you've got the genome position and strand for each exon, could you maybe retrieve the sequences in Bioconductor, there seems to be D.rerio genome package - http://www.bioconductor.org/packages/release/Danio_rerio.html

ADD REPLYlink written 9.1 years ago by Cassj1.3k

Hi Cass, thanks for the comment! I need to retrieve the data for five fish and it looks like only Danio rerio is available on bioconductor? I think I'm going to parse the unique IDs I've retrieved from BioMart and then extract only the information for those exon IDs from the exon fasta file I currently have. I'll then need to build coordinates for the introns from the exons and do the same with the intron fasta file I created using the Perl Core API?

ADD REPLYlink written 9.1 years ago by Steve Moss2.3k

Thanks for the comment Khader. Thinking about etiquette, I was worried about posting lots of messages, if they were all connected in relevance. I know on some forums this can be an issue. I'll certainly bare that in mind for future posts though.

ADD REPLYlink written 9.1 years ago by Steve Moss2.3k

Thanks for the comment Khader. Thinking about etiquette, I was worried about posting lots of questions, if they were all connected in relevance. I know on some forums this can be an issue. I'll certainly bare that in mind for future posts though

ADD REPLYlink written 9.1 years ago by Steve Moss2.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1459 users visited in the last hour