Per-Column Conservation Of Multiple Alignment In Perl
3
5
Entering edit mode
10.5 years ago
Chad A. Davis ▴ 150

Given a Bio::SimpleAlign, what is the best way to get per-column conservation scores. E.g. into an array of values in [0:1] where the array length would be the same as $align->length. I don't find anything like this in Bio::SimpleAlign. I'm looking for a function that allows:

my $io = Bio::AlignIO->new(-file=>$file);
my $align = $io->next_aln;
my @cons = $align->percentage_identity_by_column(); # <- does this exist?
print "@cons";
# 0.75 1.0 1.0 1.0 0.64 ....

Or should I just concat the gapped sequence, use substr() to extract the characters and count them with a hash and return the frequency of the most frequent character per column?

It looks like the private method Bio::SimpleAlign::consensusaa() already does most of this, but it returns the character rather than the fraction, which is what I was looking for. Short of submitting a patch for that, is there a better approach?

perl bioperl multiple consensus conservation • 4.0k views
ADD COMMENT
5
Entering edit mode
10.5 years ago
Neilfws 49k

I think you are correct, in that there is no existing method to do this in Bio::SimpleAlign and in thinking about counting residues using a hash.

I would use the slice() method to return each column and extract the residues from it. Here's something I tried which works up to a point, but needs some more work:

use List::Util 'max';
my @scores;
# move along alignment 1 column at a time
for (my $i = 1; $i <= $align->length; $i++) {
    my %count;
    # extract column as a set of Bio::LocatableSeq objects
    my $seqs = $align->slice($i,$i);
    # get each residue and count in a hash
    foreach my $seq ($seqs->each_seq) {
    $count{$seq->seq} += 1;
    }
    push(@scores, max(values %count) / $align->num_sequences);
}

print join(",", @scores);

Note that this will print warnings where a column contains gap characters and exclude the row with the gap from the new alignment object. This means frequencies would be incorrectly calculated if you divided by $seqs->num_sequences: e.g. column "A, -, A" will result in frequency A = 1, not A = 0.67. So we divide by the number of sequences in the original alignment. Presumably "all gaps" columns will result in an empty hash, which you may have to handle so as to return frequency = 0.

ADD COMMENT
2
Entering edit mode
10.5 years ago
Chad A. Davis ▴ 150

This was my final solution, which also marks gap columns as 0% conserved. This is mostly already in Bio::SimpleAlign, so I'll submit this as a patch as well and see what others think.

# Conservation of each non-gap column [0-100]
sub conservation {
    my ($self) = @_;
    my @cons;
    my $num_sequences = $self->num_sequences;
    foreach my $point (0..$align->length-1) { 
        my %hash;
        foreach my $seq ( $self->each_seq() ) {
            my $letter = substr($seq->seq,$point,1);
            ($letter eq $self->gap_char || $letter =~ /\./) && next;
            $hash{$letter}++;
        }
        # max frequency of a non-gap letter
        my $max = (sort {$b<=>$a} values %hash )[0];
        push @cons, 100 * $max / $num_sequences;
    }
    return @cons;
}
ADD COMMENT
0
Entering edit mode

Just a small problem with $align variable which is not declared anywhere. Replaced with $self did the job.

ADD REPLY
0
Entering edit mode
10.5 years ago
Asaf 8.6k

In my opinion computing the entropy of each column will give a better measure. If a column has 50% A and 50% T and another column has 50% and 50% C G and T evenly distributed, their conservation is not the same and entropy will catch these differences. The above holds if the sequences have the same distances from each other, otherwise your computations will be biased towards the more closely related organisms.

   sub entropy{
      my $vec = shift;
      #count the occuransies of the four nucleotides and U
      my %counts;
      foreach my $n (@{$vec}){
        $counts{$n}++;
      }
      #deduce the gaps;
      delete $counts{'-'};
      #sum up the counts;
      my $sum = 0;
      foreach my $key (keys %counts){
        $sum+=$counts{$key};
      }
      #computes the entropy
      my $ent = 0;
      foreach my $key (keys %counts){
        my $prob = $counts{$key}/$sum;
        $ent-=$prob*log2($prob);
      }
      return $ent;
    }
ADD COMMENT

Login before adding your answer.

Traffic: 2938 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6