Question: Perl Script To Fetch Mammals 3'Utrs From Ensembl
2
gravatar for dannyjmh
7.3 years ago by
dannyjmh20
dannyjmh20 wrote:

Hi all. Does anyone know how to fetch all the 3'UTR sequences of the Ensembl EPO mammals using the Ensembl Perl API? Thanx in advance.

api ensembl perl • 3.8k views
ADD COMMENTlink modified 3.6 years ago by Emily_Ensembl18k • written 7.3 years ago by dannyjmh20
11
gravatar for Steve Moss
7.3 years ago by
Steve Moss2.3k
United Kingdom
Steve Moss2.3k wrote:

Here's something I coded that should do the job :)

#!/usr/bin/env perl
# Perl script to retrieve the 3'-UTR sequence for 12 eutherian mammals
# Coded by Steve Moss (gawbul [at] gmail [dot] com)
# http://stevemoss.ath.cx/

# make things easier
use strict;
use warnings;

#import EnsEMBL and BioPerl modules
use Bio::EnsEMBL::Registry;
use Bio::SeqIO;
use Data::Dumper;

# setup array of all epo mammals
my @epo_mammals = ("homo_sapiens", "gorilla_gorilla", "pan_troglodytes", "pongo_abelii", "macaca_mulatta", "callithrix_jacchus", "mus_musculus", "rattus_norvegicus", "equus_caballus", "canis_familiaris", "sus_scrofa", "bos_taurus");

# connect to EnsEMBL
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(    -host => 'ensembldb.ensembl.org',
                                    -user => 'anonymous');

# use Bio::SeqIO for write sequence to STDOUT
my $outseq = Bio::SeqIO->new(    -fh => \*STDOUT,
                            -format => 'FASTA');

# loop over the mammals
foreach my $mammal (@epo_mammals) {
    # get gene adaptor
    my $gene_adaptor = $registry->get_adaptor($mammal, 'Core', 'Gene');

    # fetch all genes
    my @gene_ids = @{$gene_adaptor->list_stable_ids()};

    # traverse through genes
    foreach my $gene_id (@gene_ids) {
        # get gene
        my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);

        # get canonical transcript
        my $transcript = $gene->canonical_transcript();

        # check gene has a transcript associated
        unless (defined $transcript) {
            next;
        }

        # get 3'-UTR - returns a Bio::Seq object
        my $tputr = $transcript->three_prime_utr();

        # check transcript has a 3'-UTR annotated
        unless (defined $tputr) {
            next;
        }

        # print to STDOUT
        print $outseq->write_seq($tputr) . "\n";
    }
}

This currently prints the sequence in FASTA format to STDOUT, but you could replace -fh => \STDOUT,* with -file => ">filename", to write to a file instead.

Also, the default Bio::Seq object returned by the transcript object's three_prime_utr method, contains only the sequence and the associated transcript stable_id as the display_id. I would probably create a custom Seq object with gene stable_id, transcript stable_id, and the start and end coordinates of the 3'-UTR. As the three_prime_utr method only returns the Bio::Seq object, it is necessary to determine the start and end of the 3'-UTR based on the start of the transcript and the start of the coding region respectively (or the end of both, if it is on the alternate strand).

ADD COMMENTlink modified 7.3 years ago • written 7.3 years ago by Steve Moss2.3k

Hello Steve! Thanks so much for the help, the code works just fine. I also need a list of the transcripts for which there are no annotated UTRs. What would be the modification to the code? I'm not familiar with OO Perl at all. Again than you so much.

ADD REPLYlink written 7.3 years ago by dannyjmh20

In the part that says "unless (defined $tputr) {", you could add "print $transcript->stableid();" above the "next;" statement? You might want to add "my $fputr = $transcript->fiveprime_utr();" below the "my $tputr..." bit and check to see if both $tputr and $fputr are defined? It depends if you want to print the transcript ID if both, or one or the other of the UTRs aren't annotated. Bare in mind that it is likely to be annotation, rather than biologically real lack of UTRs.

ADD REPLYlink modified 7.3 years ago • written 7.3 years ago by Steve Moss2.3k

It would be useful to get a basic grasp of Perl though, if you want to get to grips with the API better. You can find information on the individual parts of the (Core) API here http://www.ensembl.org/info/docs/Doxygen/core-api/index.html.

ADD REPLYlink written 7.3 years ago by Steve Moss2.3k

Hey again, Steve! Thanks for all the help, my friend. Yeah, I'll start on object-oriented Perl. I can do some Perl but only for text manipulation, not CGI or DBI. Thanks again.

ADD REPLYlink written 7.3 years ago by dannyjmh20

No problem :) Any of the O'Reilly Perl books are great for getting to grips with things - http://shop.oreilly.com/category/browse-subjects/programming/perl.do and http://search.oreilly.com/?q=bioinformatics&x=0&y=0

ADD REPLYlink modified 7.3 years ago • written 7.3 years ago by Steve Moss2.3k

I'm gonna get some of them no matter what. Two more questions(the last ones i promise): will the script get all of the transcripts reported for each gene? And how to print the resulting fasta file like:

EnsemblTranscriptID|EnsemblGeneID Sequence ? Million thanks.

ADD REPLYlink written 7.3 years ago by dannyjmh20

At the moment this is only getting the canonical transcript that is annotated for each gene. In order to get all transcripts you would need to replace my $transcript = $gene->canonicaltranscript(); with my @transcripts = @{$gene->getall_Transcripts()}; and then iterate over this using foreach my $transcript (@transcripts) { ... } where ... Is the code for retrieving the UTRs.

ADD REPLYlink written 7.3 years ago by Steve Moss2.3k

I think you can just add $tputr->desc = $transcript->stableid . "|" . $gene->stableid; and it should add those to the fasta description line (not currently at the 'puter to check that) :-)

ADD REPLYlink modified 7.3 years ago • written 7.3 years ago by Steve Moss2.3k

Done, Steve. The latter gives the error "Can't modify non-lvalue subroutine call". But I modified the script to print in another file EnsemblTranscriptID|EnsemblGeneID lines (print AMPL ${geneid} . "|" . $transcript->stableid() . "\n";). And added it to each transcript with another simple script. Zillion thanks one more time. Next step for me: Obj-oriented Perl.

ADD REPLYlink written 7.3 years ago by dannyjmh20
4
gravatar for Giovanni M Dall'Olio
7.3 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

Did you look at the Ensembl API Tutorial? There is an example just about retrieving 3' UTRs: http://www.ensembl.org/info/docs/api/core/core_tutorial.html#genes

You can just copy & paste the example in the tutorial and change the name of the species. Note that the example fetches all the UTR in the genes in a chromosomal region: if you want to specify a single gene, you can just use a gene adaptor instead of a slice adaptor.

If you have any specific problem in using the APIs, you can get better answers by editing your question and specifying what is causing you trouble.

ADD COMMENTlink modified 7.3 years ago • written 7.3 years ago by Giovanni M Dall'Olio26k

Hey, Giovanni! Thanks for the tip.

ADD REPLYlink written 7.3 years ago by dannyjmh20
1
gravatar for Guangchuang Yu
7.3 years ago by
Guangchuang Yu2.2k
China/Guangzhou/Southern Medical University
Guangchuang Yu2.2k wrote:

I have a script in R to fetch 3UTR via biomaRt: http://ygc.name/2011/03/02/easiest-utr-sequence/

ADD COMMENTlink written 7.3 years ago by Guangchuang Yu2.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 539 users visited in the last hour