Question: How Can I Maintain Feature Counts When Removing Sequences Within Their Boundaries?
gravatar for Steve Moss
9.4 years ago by
Steve Moss2.3k
United Kingdom
Steve Moss2.3k wrote:

I'm currently using integer spans (Set::IntSpan::Fast) as a fast (or the fastest way I can think of) to add start and end co-ordinates for particular genomic co-ordinates to hashes for each gene ID for whole genomes using Perl. This then represents the individual ranges and thus sizes of multiple features (e.g. exons) within the genes.

I also do a similar thing for any repetitive sequence ranges (within the original features), but instead remove this from those original integer spans for each gene (hash), in order to get the unique feature lengths, minus the repetitive sequence (e.g. TEs).

However, this currently creates a bias towards more, smaller gene features, even though the overall feature sequence length is correct. This is because if a sequence is removed from the middle of one of the features, then it will return two smaller segments, even though, what I really want to do, is treat it as one segment still, but use it's smaller length - to build a new frequency distribution of unique sizes!


# add three features of 20, 30 & 40bp respectively
$hash{$gene_id}->add_ranges("1-20", "51-80", "111-150")

# remove 3 repetitive elements of 5, 10 and 20bp
$hash{$gene_id}->del_ranges("11-15", "61-70", "121-140")

This results in 6 ranges of 1-10, 16-20, 51-60, 71-80, 111-120, and 141-150. Where as the overall resulting sequence length is correct, the number of features can't increase, but must stay the same (or reduce if the feature is all repetitive). So I need a way to join the fragments back up, relative to the original ranges.

Can you think of the most efficient way I can do this? Perhaps check the integer span boundaries before I remove the sequence and adjust the removal to the boundary of the feature instead (I don't need to know about start or end coordinates after, just the new unique size and frequency of the original features minus the repeat sequence).

ADD COMMENTlink modified 9.4 years ago by brentp23k • written 9.4 years ago by Steve Moss2.3k
gravatar for brentp
9.4 years ago by
Salt Lake City, UT
brentp23k wrote:

I think if you're going gene-by-gene, you can just use perl arrays and slice out the repetitive sequence. Here's a vanilla perl function that does that given a gene length, an array-ref of exons and an array-ref of ranges to delete:

use strict;

sub del_ranges {
    my $gene_length = shift;
    my $ranges = shift;
    my $dels = shift;
    my @ranges = @$ranges;
    my @dels = @$dels;

    # create a gene of all zeros.
    my @gene = map { 0 } 0 .. $gene_length;
    # put 1's for the ranges.
    map { $gene[$_] = 1 } @ranges;

    # delete out the gaps.
    foreach my $del (@dels){
        my $start = $del->[0];
        my $len = scalar(@$del);
        splice(@gene, $start, $len);

    # starts are transitions from 0 to 1
    my @starts = grep { $gene[$_-1] == 0 && $gene[$_] == 1 } 1..$gene_length;
    # ends are transitions from 1 to 0
    my @ends   = grep { $gene[$_+1] == 0 && $gene[$_] == 1 } 1..$gene_length;

    # pair up as ([start, end], ...)
    my @new_gene = map { [$starts[$_], $ends[$_]] } 0..$#ends;
    return \@new_gene;

my $gene_length = 150;
my @ranges = (1..20, 51..80, 111..150);
my @dels = reverse ([11..15], [61..70], [121..140]);

my $del_gene = del_ranges($gene_length, \@ranges, \@dels);
map { print $_->[0] . "-" . $_->[1] . "\n" } @$del_gene;

This will print out:

ADD COMMENTlink written 9.4 years ago by brentp23k

Thanks brentp :-) I'll give this a whirl whe I get back in the lab on Monday!

ADD REPLYlink written 9.4 years ago by Steve Moss2.3k

It seems to be truncating the code for some reason, although it may be an issue with mobile Safari?

ADD REPLYlink written 9.4 years ago by Steve Moss2.3k

looks ok to me. last line should be: map { print $->[0] . "-" . $->[1] . "n" } @$del_gene;

ADD REPLYlink written 9.4 years ago by brentp23k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2067 users visited in the last hour