Question: Downloading all introns from Ensembl for Grch37
0
gravatar for mmacd
4.1 years ago by
mmacd20
United States
mmacd20 wrote:

I was wondering if anyone could give me advice on acquiring all introns from human Grch37 from Ensembl? I can get the introns from human Grch38 through Perl API using the following code:

use strict;
use warnings;

use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::Utils::SeqDumper;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(-host => 'ensembldb.ensembl.org',
                                 -port => 5306,
                                 -user => 'anonymous',
                                 -passwd => undef,
                                 -db_version => '84');

my $gene_adapter = $registry->get_adaptor('human', 'Core', 'Gene');

my $dumper = Bio::EnsEMBL::Utils::SeqDumper->new();

while(my $gene_id = shift(@{$gene_adapter->list_stable_ids()})) {
    my $gene = $gene_adapter->fetch_by_stable_id($gene_id);
    my $canonical_transcript = $gene->canonical_transcript();
    while(my $intron = shift(@{$canonical_transcript->get_all_Introns()})) {
        $dumper->dump($intron->feature_Slice(), 'FASTA', 'introns.fasta');
    }
}

However, when I try changing the -db_version to a db version that uses Ghrc37, such as release 64, I get the following error:

For homo_sapiens_core_64_37 there is a difference in the software release (84) and the database release (64). You should update one of these to ensure that your script does not crash. DBD::mysql::st execute failed: Unknown column 'stable_id' in 'field list' at /home/.../src/ensembl/modules/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 312.
    -------------------- EXCEPTION -------------------- 
MSG: Detected an error whilst executing SQL 'SELECT `stable_id` FROM `gene`': DBD::mysql::st execute failed: Unknown column 'stable_id' in 'field list' at /home/.../src/ensembl/modules/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 312.
    STACK Bio::EnsEMBL::DBSQL::BaseAdaptor::_list_dbIDs /home/.../src/ensembl/modules/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm:313 STACK Bio::EnsEMBL::DBSQL::GeneAdaptor::list_stable_ids /home/.../src/ensembl/modules/Bio/EnsEMBL/DBSQL/GeneAdaptor.pm:152 STACK toplevel get_introns.pl:19 Date (localtime)    = Thu Jul 14 16:04:48 2016 Ensembl API version = 84
    ---------------------------------------------------

Anyone have advice on this or can give an easier way to get the introns? Thank you!

ADD COMMENTlink modified 4.1 years ago by igor11k • written 4.1 years ago by mmacd20
2
gravatar for Denise CS
4.1 years ago by
Denise CS5.1k
UK, Hinxton, EMBL-EBI
Denise CS5.1k wrote:

Get the API for the previous assembly from the Ensembl GRCh37 archive. You can obtain the API code as a gzipped tarball. You should use the homo_sapiens_core_84_37. Also, use port 3337, which will connect to the GRCh37 db on the latest schema.

ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by Denise CS5.1k

Denise is correct about using port 3337, however she is wrong about the API itself. You can use the same API to access GRCh37 as GRCh38.

ADD REPLYlink written 4.1 years ago by Emily_Ensembl21k

So I am using the same code as I posted above but with port 3337 and homo_sapiens_core_84_37 as the db. I am using the newest API. However, now im getting the following error:

 -------------------- WARNING ----------------------
MSG: Human is not a valid species name (check DB and API version)
FILE: Bio/EnsEMBL/Registry.pm LINE: 1327
CALLED BY: Bio/EnsEMBL/Registry.pm  LINE: 1102
Date (localtime)    = Fri Jul 15 14:24:00 2016
Ensembl API version = 84
---------------------------------------------------

-------------------- EXCEPTION --------------------
MSG: Can not find internal name for species 'Human'
STACK Bio::EnsEMBL::Registry::get_adaptor /home/.../src/ensembl/modules/Bio/EnsEMBL/Registry.pm:1104
STACK toplevel get_introns.pl:15
Date (localtime)    = Fri Jul 15 14:24:00 2016
Ensembl API version = 84
---------------------------------------------------

I also tried replacing human with homo sapiens with no luck. Any thoughts?

ADD REPLYlink modified 4.1 years ago by genomax87k • written 4.1 years ago by mmacd20

Oh ignore the post before. If I use port 3337 with no db version specified, I get introns from GRCh37. Thank you both so much for your help!

I am having an additional issue now. My code seems to hang on the first canonical transcript's intron. It prints it multiple times and then hangs.

Here the updated code:

use strict;
use warnings;

use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::Utils::SeqDumper;

sub feature2string
{
    my $feature = shift;
    my $gene_id = shift;

    my $seq_region = $feature->slice->seq_region_name();
    my $start      = $feature->start();
    my $end        = $feature->end();
    my $strand     = $feature->strand();

    return sprintf( "%s %s:%d-%d (%+d)",
        $gene_id, $seq_region, $start, $end, $strand );
}

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(-host => 'ensembldb.ensembl.org',
                                 -port => 3337,
                                 -user => 'anonymous',
                                 -passwd => undef);

my $gene_adapter = $registry->get_adaptor('human', 'Core','Gene');

my $dumper = Bio::EnsEMBL::Utils::SeqDumper->new();

while(my $gene_id = shift @{($gene_adapter->list_stable_ids())}) {

    my $gene = $gene_adapter->fetch_by_stable_id($gene_id);
    my $canonical_transcript = $gene->canonical_transcript();

    while(my $intron = shift @{($canonical_transcript->get_all_Introns())}) {
        $dumper->dump($intron->feature_Slice(), 'FASTA', 'introns.fasta');
        my $can_transcript_id = $canonical_transcript->stable_id();
        my $istring = feature2string($intron, $can_transcript_id);
        print "$istring\n";
    }
}

Any help is appreciated.

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by mmacd20

Changing the while loops to foreach seemed to do the trick.

ADD REPLYlink written 4.1 years ago by mmacd20
0
gravatar for EagleEye
4.1 years ago by
EagleEye6.6k
Sweden
EagleEye6.6k wrote:

Try UCSC table browser,

Bed File With Introns Only

How To Download All The Introns From Ucsc

ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by EagleEye6.6k
1

I want the introns associated with ENSEMBL ids specifically, not UCSC ids (this is actually what we were using originally but UCSC ids are hard to convert to other ids or to give them functional meaning, so we are trying to move away from using them). Thanks though!

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by mmacd20

In table browser you have options for that too. Example follow the first link except in your case select track: 'Ensembl genes' and table: 'ensGene' and you will get result with Ensembl Ids.

Sample Output:

chr1 66999090 66999928 ENST00000237247_intron_0_0_chr1_66999091_f 0 +

chr1 67000051 67091529 ENST00000237247_intron_1_0_chr1_67000052_f 0 +

chr1 67091593 67098752 ENST00000237247_intron_2_0_chr1_67091594_f 0 +

chr1 67098777 67099762 ENST00000237247_intron_3_0_chr1_67098778_f 0 +

chr1 67099846 67105459 ENST00000237247_intron_4_0_chr1_67099847_f 0 +

chr1 67105516 67108492 ENST00000237247_intron_5_0_chr1_67105517_f 0 +

chr1 67108547 67109226 ENST00000237247_intron_6_0_chr1_67108548_f 0 +

Follow this link for complete output (GrCh37):

https://genome-euro.ucsc.edu/cgi-bin/hgTables?hgsid=216552565_gR3lVbaZ3inTsmltiXKxN2MZzFVa&boolshad.hgta_printCustomTrackHeaders=0&hgta_ctName=tb_ensGene&hgta_ctDesc=table+browser+query+on+ensGene&hgta_ctVis=pack&hgta_ctUrl=&fbUpBases=200&fbExonBases=0&fbQual=intron&fbIntronBases=0&fbDownBases=200&hgta_doGetBed=get+BED


For converting UCSC ids to Ensembl, table browser can return table like this,

name value

uc022bqo.2 ENST00000389680

uc004cor.1 ENST00000387342

uc004cos.5 ENST00000387347

uc031tga.1 ENST00000361624

uc011mfi.2 ENST00000361739

By following these three simple steps,

  1. Track: Ensembl genes

  2. Table: KnownToEnsembl

  3. Get Output

It is FUN playing with UCSC table browser. Try it :-)

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by EagleEye6.6k

Tried that too, since keeping UCSC and then converting would have been much easier, but only a very small portion (less than 10%) of my UCSC ids could be converted! The table browser is nice and easy to use, I just hope they get an updated KnownToEnsembl table soon where more ids can be converted.

ADD REPLYlink written 4.1 years ago by mmacd20

But how about the first procedure, did you manage to get introns with ensembl ids ?

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by EagleEye6.6k

Do you know if the coordinates for Hg19 and Grch37 the same? Since that gives me intron coordinates based on Hg19. If they are the same, then i think that would have worked.

ADD REPLYlink written 4.1 years ago by mmacd20

Yes, they are the same. GRCh37 = hg19

ADD REPLYlink written 4.1 years ago by Denise CS5.1k
0
gravatar for igor
4.1 years ago by
igor11k
United States
igor11k wrote:

There is a much simpler solution. If you have the Ensembl GTF, you can use grep to create 2 BED files from it - 1 for "gene" (includes everything) and 1 for "exon" (includes just coding). Then use something like bedtools to subtract exons from genes.

ADD COMMENTlink written 4.1 years ago by igor11k

Seems interesting !! Is this proceture possible to handle transcript level intron information (since you specify 'gene')? It will be really useful, if you don't mind please provide us with step by step procedure with some commands (example). So that it can be reference for future.

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by EagleEye6.6k

A gene is a combination of all transcripts. Thus, if you subtract all exons, you get everything that's not an exon in any transcript. If an exon of one transcript overlaps an intron of another, than that gets subtracted. It really depends on an application to say if that is desired or not.

Anyway, my commands:

cat genes.gtf | cut -f 1-5 | grep -w "gene" | cut -f 1,4,5 > genes.bed
cat genes.gtf | cut -f 1-5 | grep -w "exon" | cut -f 1,4,5 > exons.bed
bedtools subtract -a genes.bed -b exons.bed

You can add the strand column if you need that.

ADD REPLYlink written 4.1 years ago by igor11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 694 users visited in the last hour