Question

Where To Find Ensembl Tss Annotations?

9

Entering edit mode

11.0 years ago

Alex Reynolds 35k

I'd like to find transcription start sites for Ensembl genes (mouse, build 67, NCBIM37). Are these annotations available from Ensembl's site, and if so, what file(s) should I be looking at?

Via FTP

Here's the script I wrote to pull mouse exons from Ensembl, release 67 via FTP and filter them for TSS positions. I treat the start position of the first forward strand exon and the stop position of the first reverse strand exon as the position of the TSS:

#!/usr/bin/env perl

use strict;
use warnings;
use File::Path;

sub trimWhitespace($)
{
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}

sub trimQuotes($)
{
    my $string = shift;
    $string =~ s/^"//;
    $string =~ s/"$//;
    return $string;
}

my $ensemblURL = "ftp://ftp.ensembl.org//pub/release-67/gtf/mus_musculus/Mus_musculus.NCBIM37.67.gtf.gz";
my $dataDir = "../data"; if (! -d $dataDir} { mkpath $dataDir; }
my $ensemblGzFn = "$dataDir/Mus_musculus.NCBIM37.67.gtf.gz";
my $ensemblBedFn = "$dataDir/Mus_musculus.NCBIM37.67.gtf.bed";

if (! -e $ensemblGzFn) {
    print STDERR "retrieving mouse records...\n";
    system ("wget --output-document=$ensemblGzFn $ensemblURL") == 0 or die "could not retrieve Ensembl records from URL: $ensemblURL\n";
}

if (! -e $ensemblBedFn) {
    print STDERR "converting mouse records from GTF to sorted BED...\n";
    system ("gunzip --stdout $ensemblGzFn | gtf2bed - | sort-bed - > $ensemblBedFn") == 0 or die "could not convert Ensembl record to sorted BED\n";

    open my $allRecFh, "<", $ensemblBedFn or die "could not open all records\n";
    while (<$allRecFh>) {
        chomp;
        my ($chr, $start, $stop, $id, $score, $strand, $source, $feature, $frame, $attributesStr) = split("\t", $_);

        if ($feature eq "exon") {
            my ($tssChr, $tssStart, $tssStop, $tssId, $tssScore, $tssStrand);
            my @attributes = map (trimWhitespace($_), split(";", $attributesStr));
            my $attributesRef;
            foreach my $attributePair (@attributes) {
                my @pairItems = split(" ", $attributePair);
                my $key = $pairItems[0];
                my $value = trimQuotes($pairItems[1]);
                $attributesRef->{$key} = $value;
            }
            # find TSS start position
            if (exists $attributesRef->{"gene_name"}) {
                $tssChr = "chr$chr";
                if ($strand eq "+") {
                    $tssStart = $start;
                    $tssStop = $tssStart + 1;
                }
                elsif ($strand eq "-") {
                    $tssStart = $stop;
                    $tssStop = $tssStart + 1;
                }
                $tssId = $attributesRef->{"gene_name"};
                $tssScore = $attributesRef->{"exon_number"};
                $tssStrand = $strand;

                my $tssGeneId = $attributesRef->{"gene_id"};

                if ($tssScore == 1) {
                    print STDOUT join("\t", ($tssChr, 
                                             $tssStart, 
                                             $tssStop,
                                             $tssId,
                                             $tssScore,
                                             $tssStrand,
                                             $tssGeneId
                                             ))."\n";
                }
            }
        }
    }
    close $allRecFh;
}

In running this FTP-sourced data search, I get 97639 records (which matches Malachi's answer via BioMart):

$ ./get_mm9_ensembl_tss_records.pl > ../data/mm9.ensembl67.tss.bed
$ wc -l ../data/mm9.ensembl67.tss.bed
97639

Records seem to match up:

$ grep Gm16627 ../data/mm9.ensembl67.tss.bed
chr17   71274336        71274337        Gm16627 1       -       ENSMUSG00000085299

$ grep Vmn2r-ps113 ../data/mm9.ensembl67.tss.bed
chr17   18186447        18186448        Vmn2r-ps113     1       +       ENSMUSG00000067978
chr17   18201000        18201001        Vmn2r-ps113     1       +       ENSMUSG00000067978

I'll mark both answers as correct, since Malachi gives me a "sanity check" result and Emily provided the FTP URL to the Ensembl mouse data.

Via Perl API

From the Ensembl-dev mailing list, I was pointed to the following version of the ensembl portion of the Ensembl Perl API setup process:

http://www.ensembl.org/cvsdownloads/ensembl-67.tar.gz

The remaining components were installed per Emily's instructions in her answer below.

Additionally, the useastdb.ensembl.org host only supports Releases 71 and 70. It is necessary to use ensembldb.ensembl.org to access older releases or host the databases locally.

After making these changes to my setup, here is the script I used to pull all exons for all genes from mouse release 67, printing them in a BED6-like format:

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

my $host   = 'ensembldb.ensembl.org';
my $user   = 'anonymous';
my $dbname = 'mus_musculus_core_67_37';
my $port   = 5306;

my $db = new Bio::EnsEMBL::DBSQL::DBAdaptor(-host => $host,
                                            -user => $user,
                                            -dbname => $dbname,
                                            -port => $port);

my $slice_adaptor = $db->get_SliceAdaptor();

my $slices = $slice_adaptor->fetch_all('chromosome');
foreach my $slice (@{$slices}) {
    my $chr = "chr".$slice->seq_region_name();
    my $genes = $slice->get_all_Genes();
    foreach my $gene (@{$genes}) {
        my $exons = $gene->get_all_Exons();
        my $id = $gene->external_name();
        my $exon_index = 1;
        my $exon_number = $exon_index;
        my $exon_count = scalar(@{$exons});        
        foreach my $exon (@{$exons}) {
            my $start = $exon->start();
            my $end = $exon->end();
            my $stable_id = $exon->stable_id();
            my $strand = $exon->strand();
            if ($strand == 1) { 
                $strand = "+";
                $exon_number = $exon_index;
            } 
            elsif ($strand == -1) { 
                $strand = "-";
                $exon_number = $exon_count - $exon_index + 1;
            } 
            else { 
                die "unknown value for strand\n"; 
            }
            print STDOUT join("\t", ($chr, $start, $end, $id, $exon_number, $strand))."\n";
            $exon_index++;
        }
    }
}

This does not fully recapitulate the GTF output, and access to this remote database is slow, but it should be enough to filter the output from this script on the score column, retrieving only 1-scored elements and looking at the strand in order to get the position of the TSS, using awk or something else.

I'll need to verify that this can output the correct number of single-exons-via-genes, and locally hosting databases is probably a good idea in order to get a quick answer, but this is probably a pretty decent start.

ensembl • 13k views

ADD COMMENT • link 11.0 years ago by Alex Reynolds 35k

score 5 · Answer 1 · 2013-04-12

5

Entering edit mode

11.0 years ago

Emily 23k

Malachi's right that you can get this kind of data from BioMart, but not at this scale. If you're just looking at a subset of genes, then use the filters in BioMart and Malachi's method will work. However if you try to get data on all transcripts, BioMart will just stop partway through your query, without giving you any warning and you'll just get a partial data file. Due to issues with the BioMart code, we are unable to put in some kind of warning, or make it able to handle all transcripts in a genome.

To get genome-wide data, you can use the Ensembl API. You'll need to use the archive version of the API to access version 67 data. You can download the old API from here:

http://may2012.archive.ensembl.org/info/docs/api/api_installation.html

Note that there is a mistake on this page (fixed in later releases) and you should use bioperl-1.2.3 not bioperl-live.

There's a tutorial on using the API here:

http://may2012.archive.ensembl.org/info/docs/api/core/core_tutorial.html

The documentation is here:

http://may2012.archive.ensembl.org/info/docs/Doxygen/core-api/index.html

The data is also featured in the GTF files for release 67. You may need to use some kind of parsing script to get out the data you need:

ftp://ftp.ensembl.org/pub/release-67/gtf/mus_musculus/

ADD COMMENT • link 11.0 years ago by Emily 23k

0

Entering edit mode

Thanks for the warning. I have observed this in the past as well. Although in this case, I obtained results for 97,639 transcripts. This is the reported number for Ensembl 67 mouse and as far as I could tell there were no missing values in my output. Anyway I agree that it is straight forward to obtain this kind of info via the API, especially if you already have a local MySQL instance of the databases you need to query. On the other hand, if BioMart could be improved to handle larger queries, I'm sure it would be greatly appreciated by the community. I recognize that this can be a challenge though.

ADD REPLY • link 11.0 years ago by Malachi Griffith 19k

0

Entering edit mode

Yes, it's quite unpredictable. Sometimes it does it, sometimes it doesn't. If there were a hard and fast rule as to how and when it gives up then at least we could give sensible warnings

We would love to be able to increase its capacity. Either that or replace it with something that works in a similar way, but with higher capacity.

ADD REPLY • link 11.0 years ago by Emily 23k

0

Entering edit mode

Thanks for your help and the links to the tutorial.

To test the installation, I downloaded/unpacked the BioPerl and Ensembl modules, changed my PERL5LIB environment variable and tried running the following test script:

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
                                 -host => 'useastdb.ensembl.org',
                                 -user => 'anonymous'
                                 );

# should return Dumped contents of reference to list
print Dumper $registry->get_all_DBAdaptors(); 

my @db_adaptors = @{ $registry->get_all_DBAdaptors() };

foreach my $db_adaptor (@db_adaptors) {
    my $db_connection = $db_adaptor->dbc();

    printf(
           "species/group\t%s/%s\ndatabase\t%s\nhost:port\t%s:%s\n\n",
           $db_adaptor->species(),   
           $db_adaptor->group(),
           $db_connection->dbname(), 
           $db_connection->host(),
           $db_connection->port()
           );
}

However, this does not return any data. From the print Dumper $registry->get_all_DBAdaptors() line, I get the following output:

$VAR1 = [];

As this list is empty, the foreach block does not print anything.

Is there something I am missing about the setup? I re-read the instructions and I seem to have covered all the listed steps, with the exception you noted regarding the change from bioperl-live to bioperl-1.2.3.

ADD REPLY • link 11.0 years ago by Alex Reynolds 35k

0

Entering edit mode

Hi Alex. I see Andy from our core team has replied to your query on the Ensembl dev mailing list. Has he solved your problem?

For completeness of answer, would Biostar members like to see what he said written here?

ADD REPLY • link 11.0 years ago by Emily 23k

0

Entering edit mode

I will try out his suggestions and try to follow up with a write-up here. It may be useful for others who wish to learn the Perl API.

ADD REPLY • link 11.0 years ago by Alex Reynolds 35k

score 2 · Answer 2 · 2013-04-11

You can get these from Ensembl BioMart:

Go to Ensembl version 67 in the archive: http://may2012.archive.ensembl.org/index.html
Follow the BioMart link
Select database: Ensembl Genes 67
Select dataset: Mus Musculus Genes
Go to the Attributes tab and select the following: Ensembl Gene ID, Ensembl Transcript ID, Transcript Start (bp), Transcript End (bp), Gene Biotype, Associated Gene Name, Chromosome Name
Export all results to -> File -> TSV -> Unique results only -> GO

This yields positions for 97,639 transcripts. Results should look something like this:

Ensembl Gene ID    Ensembl Transcript ID    Transcript Start (bp)    Transcript End (bp)    Gene Biotype    Associated Gene Name    Chromosome Name
ENSMUSG00000085299    ENSMUST00000142133    71223692    71274336    lincRNA    Gm16627    17
ENSMUSG00000067978    ENSMUST00000176802    18186448    18211184    protein_coding    Vmn2r-ps113    17
ENSMUSG00000067978    ENSMUST00000097386    18201001    18211184    protein_coding    Vmn2r-ps113    17