I'm interested in whether anyone has worked up a solution for retrieving all unique exons from Ensembl in relation to the gene IDs and not the transcript IDs.
I have already used the Perl Ensembl Core API to retrieve all exons for all transcripts of all genes, but this gives redundant data, because alternatively spliced transcripts of the same gene share exons. Some exons overlap or are duplicated, so the exon count is inflated. I want the number of exons per gene, not the total across all transcripts.
What confuses me is that the Genome Statistics table on the Assembly and Genebuild page (e.g. http://www.ensembl.org/Takifugu_rubripes/Info/StatsTable) lists 322,585 gene exons, but when I download the exons using BioMart or the Perl API I get nearly 650,000.
I guess I'm going to have to either choose the transcript with the most exons for each gene, or work out a solution that removes the redundancy by merging overlapping regions and dropping exact duplicates. I suspect the former will be the easiest, and hopefully it won't introduce too many errors?
A simple way to do this is to use the canonical_transcript method on the gene, which gives you a single representative transcript per gene. So we call:
my $can_tr = $gene->canonical_transcript();
my $exons = $can_tr->get_all_Exons();
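If you would rather go with the second option from the question (collapsing the redundancy across all transcripts of a gene), here is a minimal pure-Perl sketch of the merging step. The helper name merge_exon_intervals is hypothetical and not part of the Ensembl API, and the coordinates are made up; in practice you would build the pairs from $exon->start and $exon->end for exons on the same strand of one gene. Note that merging produces non-overlapping regions rather than annotated exons, so the counts will not necessarily match the Ensembl statistics table.

```perl
use strict;
use warnings;

# Hypothetical helper (not an Ensembl API call): collapse a list of
# [start, end] exon coordinates (1-based, inclusive, same strand) into
# non-overlapping regions. Exact duplicates are absorbed as a side effect.
sub merge_exon_intervals {
    # Sort by start, then by end, so overlapping exons become neighbours
    my @sorted = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @_;
    my @merged;
    for my $iv (@sorted) {
        if (@merged && $iv->[0] <= $merged[-1][1]) {
            # Overlaps (or duplicates) the previous region: extend it if needed
            $merged[-1][1] = $iv->[1] if $iv->[1] > $merged[-1][1];
        }
        else {
            # No overlap: start a new region (copy so the input is not modified)
            push @merged, [ @$iv ];
        }
    }
    return @merged;
}

# Made-up coordinates standing in for [$exon->start, $exon->end] pairs
my @exons = ([100, 200], [100, 200], [150, 250], [400, 500], [450, 480]);
my @regions = merge_exon_intervals(@exons);

print scalar(@regions), " merged regions\n";                          # 2 merged regions
print join(', ', map { "$_->[0]-$_->[1]" } @regions), "\n";           # 100-250, 400-500
```

Exons that merely touch (one ends at 200, the next starts at 201) are kept separate here; change the comparison to `$iv->[0] <= $merged[-1][1] + 1` if you want those joined as well. If you only need to drop exact duplicates rather than merge overlaps, the Gene object's get_all_Exons method should, as far as I know, already return each distinct exon once.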