Question: Ensembl Api: How To Print All Transcript Isoforms As A Multifasta Alignment
5
gravatar for 2184687-1231-83-
9.0 years ago by
2184687-1231-83-5.0k wrote:

Hi,

I would like to print out the different transcript isoforms as a multifasta alignment using the Ensembl API by projecting the coding regions in genomic coordinates to the CDS sequences in a multi-alignment. For example, for gene with 5 isoforms, something that looks like this:

http://www.ebi.ac.uk/~avilella/isoforms.fasta

Any ideas?

Cheers

ADD COMMENTlink modified 9.0 years ago by Jeff Hussmann120 • written 9.0 years ago by 2184687-1231-83-5.0k
3
gravatar for Jeff Hussmann
9.0 years ago by
Jeff Hussmann120
United States
Jeff Hussmann120 wrote:

Here is a script to do this. I happened to have recently made the main tool required (code to find the "mask" in genomic coordinates of all coding sequences of all isoforms of a gene), and it was still a little messy to throw together. Attempt to read the code at your own risk.

Give the script an ENSG(d+) stable id at the command line and get ENSG${1}_aligned_isoforms.fasta out. This is essentially instant on a local installation of the Ensembl database but takes ~30 seconds if connecting to ensembldb.ensembl.org. If you need to do this for more than a few genes, you will want to install locally.

#!/usr/bin/perl

use Bio::EnsEMBL::Registry;
use warnings;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
    -port => 5306
);

my $gene_name = shift or die "Give an ENSG at the command line";

# Set up database adaptors
my $gene_adaptor = $registry->get_adaptor("human", "core", "gene");
my $transcript_adaptor = $registry->get_adaptor("human", "core", "transcript");

my $gene = $gene_adaptor->fetch_by_stable_id($gene_name);
open(OUTPUT, ">" . $gene->stable_id . "_aligned_isoforms.fasta");
my $strand = $gene->strand;

# Build the combined set of intervals in genomic coordinates that all isoforms' coding sequences cover 
my ($left_endpoints, $right_endpoints) = get_combined_coding_regions($gene);

# For convenience, compute the partial sums of the lengths of these intervals in the strand-appropriate direction
$partial_lengths = partial_lengths($left_endpoints, $right_endpoints, $strand);

my $transcripts = $gene->get_all_Transcripts;
for my $transcript (@$transcripts) {
    my $has_coding_region = 0;
    # Initialize this isoform's sequence to all gap
    my $sequence = '-' x $partial_lengths->[@$partial_lengths - 1];
    my $exons = $transcript->get_all_Exons;
    for my $exon (@$exons) {
    my $coding_region_start = $exon->coding_region_start($transcript);
    my $coding_region_end = $exon->coding_region_end($transcript);
    if ($coding_region_start) {
        $has_coding_region = 1;
        my $length = $coding_region_end - $coding_region_start + 1; 
        # Find the left edge of this exon's coding sequence in the alignment's coordinate system
        my $aligned_left = slice_coords_to_aligned_coords($coding_region_start, $coding_region_end, $left_endpoints, $right_endpoints, $partial_lengths, $strand);
        # Fill in the appropriate region of this isoform's aligned sequence with this exon's coding sequence 
        substr($sequence, $aligned_left, $length) = $exon->slice->subseq($coding_region_start, $coding_region_end, $strand);
    }
    }
    if ($has_coding_region) {
    print OUTPUT ">" . $gene->stable_id . "-" . $transcript->stable_id . "\n";
    print OUTPUT $sequence;
    print OUTPUT "\n";
    }
}

sub get_combined_coding_regions {
    my $gene = shift @_;
    my $transcripts = $gene->get_all_Transcripts;
    my @left_endpoints;
    my @right_endpoints;
    my $slice = undef;
    my $strand = undef;
    my $sequence = "";
    for my $transcript (@$transcripts) {
    my $exons = $transcript->get_all_Exons;
    for my $exon (@$exons) {
        $slice = $exon->slice unless ($slice);
        $strand = $exon->strand unless ($strand);
        my $crs = $exon->coding_region_start($transcript);
        my $cre = $exon->coding_region_end($transcript);
        if ($crs) {
        insert_endpoints($crs, $cre, \@left_endpoints, \@right_endpoints);
        }
    }
    }
    return (\@left_endpoints, \@right_endpoints);
}

sub insert_endpoints {
    my ($new_left, $new_right, $left_endpoints, $right_endpoints) = @_;
    my $number_existing = @$left_endpoints;  
    my ($new_left_region, $new_right_region);
    my ($new_left_in, $new_right_in) = (0, 0);
    if ($number_existing == 0) {
    push @$left_endpoints, $new_left;
    push @$right_endpoints, $new_right;
    return;
    }
    if ($new_right < $left_endpoints->[0]) {
    $new_left_region = $new_right_region = 0;
    } elsif ($new_left > $right_endpoints->[$number_existing - 1]) {
    $new_left_region = $new_right_region = $number_existing;
    } else {
    my $i = 0;
    while ($right_endpoints->[$i] < $new_left) {
        $i++;
    }
    $new_left_region = $i;
    if ($left_endpoints->[$i] <= $new_left) {
        $new_left_in = 1;
    } 
    $i = $number_existing - 1;
    while ($left_endpoints->[$i] > $new_right) {
        $i--;
    }
    $new_right_region = $i;
    if ($right_endpoints->[$i] >= $new_right) {
        $new_right_in = 1;
    } 
    }
    my $length = $new_right_region - $new_left_region + ($new_left_region != $new_right_region || $new_left_in || $new_right_in ? 1 : 0);
    splice @$left_endpoints, $new_left_region, $length, ($new_left_in ? $left_endpoints->[$new_left_region] : $new_left);
    splice @$right_endpoints, $new_left_region, $length, ($new_right_in ? $right_endpoints->[$new_right_region] : $new_right);
}

sub partial_lengths {
    my ($left_endpoints, $right_endpoints, $strand) = @_;
    my @partials = (0);
    my $length = 0;
    if ($strand == 1) {
    for (my $j = 0; $j <= @$left_endpoints - 1; $j++) {
        $length += $right_endpoints->[$j] - $left_endpoints->[$j] + 1;
        push @partials, $length;
    }
    } else {
    for (my $j = @$left_endpoints - 1; $j >= 0; $j--) {
        $length += $right_endpoints->[$j] - $left_endpoints->[$j] + 1;
        push @partials, $length;
    }
    }
    return \@partials;
}

sub slice_coords_to_aligned_coords {
    my ($left, $right, $left_endpoints, $right_endpoints, $partial_lengths, $strand) = @_;
    my $index_from_front = 0;
    my $index_from_back = @$left_endpoints - 1;
    if ($strand == 1) {
    while ($right_endpoints->[$index_from_front] < $right) {
        $index_from_front++;
    }
    return ($left - $left_endpoints->[$index_from_front]) + $partial_lengths->[$index_from_front];
    } else {
    while ($left_endpoints->[$index_from_back] > $left) {
        $index_from_front++;
        $index_from_back--;
    }
    return ($right_endpoints->[$index_from_back] - $right) + $partial_lengths->[$index_from_front];
    }
}
ADD COMMENTlink modified 9.0 years ago • written 9.0 years ago by Jeff Hussmann120

precisely what I needed. thx!

ADD REPLYlink written 9.0 years ago by 2184687-1231-83-5.0k
0
gravatar for Ketil
9.0 years ago by
Ketil4.0k
Germany
Ketil4.0k wrote:

Not quite what you ask for, but you could put the isoforms in a fasta file, and align them into an ACE file using e.g. CAP3 - probably specifying parameters that allow for relatively large indels and few substitutions. Converting the alignment to fasta should be easy, I wrote a tool for it (ace2fasta from http://hackage.haskell.org/package/clustertools), but you can probably find lots of others.

ADD COMMENTlink written 9.0 years ago by Ketil4.0k

Thanks, but no thanks, aligning is exactly what I want to avoid, I just want to project from the genome reference coordinates.

ADD REPLYlink written 9.0 years ago by 2184687-1231-83-5.0k

Okay. I suppose you might have to write a little script to do this - it shouldn't be hard, I think.

ADD REPLYlink written 9.0 years ago by Ketil4.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2022 users visited in the last hour