How To Download All The Introns From Ensembl
12.7 years ago
Anima Mundi ★ 2.9k

Hi,

I would like to know how I can download all the introns (in FASTA format) of a species from Ensembl via the web.

ensembl fasta intron • 8.9k views
12.7 years ago
Akk ▴ 210

Hi,

As far as I know, we don't store the intron sequences explicitly anywhere.

Here's a piece of Perl code that uses the Ensembl Perl API to fetch all intron sequences for the transcripts overlapping a particular region on the first human chromosome. It can easily be modified to fetch all transcripts in the species and to dump the sequence to a file instead of to the screen:

use strict;
use warnings;

use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::Utils::SeqDumper;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db( '-host' => 'ensembldb.ensembl.org',
                                  '-port' => '5306',
                                  '-user' => 'anonymous',
                                  '-db_version' => '63' );

my $sa = $registry->get_adaptor( 'Human', 'Core', 'Slice' );
my $slice = $sa->fetch_by_region( 'Chromosome', '1', 12_000, 13_000 );

my $dumper = Bio::EnsEMBL::Utils::SeqDumper->new();

# For every transcript overlapping the slice, print each intron as FASTA to STDOUT.
foreach my $transcript ( @{ $slice->get_all_Transcripts() } ) {
  foreach my $intron ( @{ $transcript->get_all_Introns() } ) {
    $dumper->dump( $intron->feature_Slice(), 'FASTA' );
  }
}

I hope this helps.


This helps, thanks!


You should be aware that any particular gene can have multiple transcripts due to alternative splicing. You would need to use the canonical transcript, or do some post-hoc removal of the redundant/overlapping introns, to ensure you aren't overestimating the number of introns retrieved.
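
A minimal sketch of the post-hoc approach, in plain Perl without the Ensembl API: key each intron by its coordinates and keep only the first occurrence. The records and field names (`region`, `start`, `end`, `strand`) are made up for illustration; with API objects the key could presumably be built from the feature's region name, start, end and strand instead. Note that this removes only identical introns, not ones that merely overlap.

```perl
use strict;
use warnings;

# Hypothetical intron records, as they might be collected from several
# transcripts of the same gene. The coordinates are invented.
my @introns = (
    { region => '1', start => 12_228, end => 12_612, strand => 1 },
    { region => '1', start => 12_228, end => 12_612, strand => 1 },  # same intron, second transcript
    { region => '1', start => 12_722, end => 13_220, strand => 1 },
);

# Keep the first occurrence of each distinct region:start:end:strand key.
sub unique_introns {
    my @in = @_;
    my %seen;
    return grep { !$seen{ join ':', @{$_}{qw(region start end strand)} }++ } @in;
}

my @unique = unique_introns(@introns);
printf "%d introns before, %d after de-duplication\n",
    scalar @introns, scalar @unique;
```

Handling introns that overlap without being identical would need an explicit rule (for example, keep the longest), which this sketch does not attempt.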

12.4 years ago

In addition to my comment and Andreas' post, this is how I would deal with intron redundancy:

use strict;
use warnings;

use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::Utils::SeqDumper;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(-host => 'ensembldb.ensembl.org',
                                 -port => 5306,
                                 -user => 'anonymous',
                                 -passwd => undef,
                                 -db_version => 64);

my $gene_adapter = $registry->get_adaptor('Human', 'Core', 'Gene');

my $dumper = Bio::EnsEMBL::Utils::SeqDumper->new();

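# Fetch the list of gene IDs once. Calling list_stable_ids() inside a while/shift
# condition re-runs the query on every iteration and may never advance.
foreach my $gene_id (@{$gene_adapter->list_stable_ids()}) {
    my $gene = $gene_adapter->fetch_by_stable_id($gene_id);
    # Use only the canonical transcript so each gene contributes one set of introns.
    my $canonical_transcript = $gene->canonical_transcript();
    foreach my $intron (@{$canonical_transcript->get_all_Introns()}) {
        # When given a file name, dump() appends to that file.
        $dumper->dump($intron->feature_Slice(), 'FASTA', 'introns.fasta');
    }
}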

Oh, thanks for this important improvement, I will test it soon!

12.7 years ago

You can refer to this biostar thread

You can design your query in BioMart, export it to XML, and use the script I provided to get your introns.

Hope this helps

Radhouane


BioMart won't actually work in this case, as the introns are not stored, and are not in the BioMart database. The API script Andreas provides is the best way.


Thanks a lot, but I will also have to use Ensembl for other tasks, so I would prefer to solve this particular kind of problem within Ensembl.

12.7 years ago

Enter any gene symbol in Ensembl and choose your organism. On the page that follows you will find a Geneatlas link; click on it. You will see all the introns and exons of your particular gene.

Hope this will help you


Thank you too, but I would like to save all the introns of one genome.

12.2 years ago
Biojl ★ 1.7k

Be careful, because lately ENSEMBL has not been working properly and most of the time it does not give you all the information you request. Check your data! I normally cut the first column from the results, sort it, uniq it, and compare it to the ID list I provided, just to be sure that I get at least one line for each ID I requested.
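
The check described above could be sketched in Perl roughly as follows. It assumes the results are tab-separated with the ID in the first column; the IDs and rows are invented placeholders, and `missing_ids` is a name made up for this example.

```perl
use strict;
use warnings;

# Return the requested IDs that never appear in the first column of the results.
sub missing_ids {
    my ($requested, $result_lines) = @_;
    my %returned;
    for my $line (@$result_lines) {
        my ($id) = split /\t/, $line;    # first tab-separated column
        $returned{$id} = 1 if defined $id && length $id;
    }
    return grep { !$returned{$_} } @$requested;
}

# Invented example data: three requested IDs, results for only two of them.
my @requested = qw(ENSG00000000001 ENSG00000000002 ENSG00000000003);
my @results = (
    "ENSG00000000001\t1\t100\t200",
    "ENSG00000000001\t1\t300\t400",
    "ENSG00000000003\t2\t500\t600",
);

my @missing = missing_ids(\@requested, \@results);
print "Missing: @missing\n";
```

Any ID reported as missing is a sign that the query should be repeated, perhaps in smaller batches.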

My advice is to download the FASTA files for the whole genome directly and only ask ENSEMBL for the positions. Then extract the sequences yourself. It may seem like more work, but you'll get exact results and it will save you some trouble when analyzing them.
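
As a rough illustration of the "extract them yourself" step, here is a small Perl sketch. It assumes the chromosome sequence is already loaded into a string and that positions are 1-based and inclusive, as Ensembl coordinates are; `extract_region` and the toy sequence are made up for this example.

```perl
use strict;
use warnings;

# Extract a 1-based, inclusive interval from a chromosome sequence string and
# reverse-complement it when the feature lies on the minus strand.
sub extract_region {
    my ($chrom_seq, $start, $end, $strand) = @_;
    die "bad coordinates\n"
        if $start < 1 || $end < $start || $end > length($chrom_seq);
    my $seq = substr($chrom_seq, $start - 1, $end - $start + 1);
    if ($strand < 0) {
        $seq = scalar reverse $seq;       # reverse the string
        $seq =~ tr/ACGTacgt/TGCAtgca/;    # complement each base
    }
    return $seq;
}

my $toy_chrom = 'AAACCCGGGTTT';
print extract_region($toy_chrom, 4, 6,  1), "\n";   # plus strand: CCC
print extract_region($toy_chrom, 1, 4, -1), "\n";   # minus strand: GTTT
```

For a real genome you would read each chromosome from the downloaded FASTA file first; an indexed reader is likely a better fit than holding whole chromosomes in memory.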


Thanks for the hint. How long has it been working improperly?


For at least 15 months, which is when I first found the issue, and probably longer... I contacted ENSEMBL and they told me they knew this was happening. The only solution they offer is to make smaller queries, but in some cases even that fails.


Biojl is referring to BioMart, not to Ensembl! BioMart is indeed not very good at handling such large genome-wide queries but, as already pointed out above, you cannot retrieve intron information with BioMart anyway. The API script provided by gawbul should work perfectly fine for you and give you all the information you request.
