Question

Per-Column Conservation Of Multiple Alignment In Perl

5

Entering edit mode

12.9 years ago

Chad A. Davis ▴ 150

Given a Bio::SimpleAlign, what is the best way to get per-column conservation scores. E.g. into an array of values in [0:1] where the array length would be the same as $align->length. I don't find anything like this in Bio::SimpleAlign. I'm looking for a function that allows:

my $io = Bio::AlignIO->new(-file=>$file);
my $align = $io->next_aln;
my @cons = $align->percentage_identity_by_column(); # <- does this exist?
print "@cons";
# 0.75 1.0 1.0 1.0 0.64 ....

Or should I just concat the gapped sequence, use substr() to extract the characters and count them with a hash and return the frequency of the most frequent character per column?

It looks like the private method Bio::SimpleAlign::consensusaa() already does most of this, but it returns the character rather than the fraction, which is what I was looking for. Short of submitting a patch for that, is there a better approach?

perl bioperl multiple consensus conservation • 4.8k views

ADD COMMENT • link updated 12.9 years ago by Asaf 10k • written 12.9 years ago by Chad A. Davis ▴ 150

score 5 · Answer 1 · 2011-06-15

I think you are correct, in that there is no existing method to do this in Bio::SimpleAlign and in thinking about counting residues using a hash.

I would use the slice() method to return each column and extract the residues from it. Here's something I tried which works up to a point, but needs some more work:

use List::Util 'max';
my @scores;
# move along alignment 1 column at a time
for (my $i = 1; $i <= $align->length; $i++) {
    my %count;
    # extract column as a set of Bio::LocatableSeq objects
    my $seqs = $align->slice($i,$i);
    # get each residue and count in a hash
    foreach my $seq ($seqs->each_seq) {
    $count{$seq->seq} += 1;
    }
    push(@scores, max(values %count) / $align->num_sequences);
}

print join(",", @scores);

Note that this will print warnings where a column contains gap characters and exclude the row with the gap from the new alignment object. This means frequencies would be incorrectly calculated if you divided by $seqs->num_sequences: e.g. column "A, -, A" will result in frequency A = 1, not A = 0.67. So we divide by the number of sequences in the original alignment. Presumably "all gaps" columns will result in an empty hash, which you may have to handle so as to return frequency = 0.

score 2 · Answer 2 · 2011-06-17

This was my final solution, which also marks gap columns as 0% conserved. This is mostly already in Bio::SimpleAlign, so I'll submit this as a patch as well and see what others think.

# Conservation of each non-gap column [0-100]
sub conservation {
    my ($self) = @_;
    my @cons;
    my $num_sequences = $self->num_sequences;
    foreach my $point (0..$align->length-1) { 
        my %hash;
        foreach my $seq ( $self->each_seq() ) {
            my $letter = substr($seq->seq,$point,1);
            ($letter eq $self->gap_char || $letter =~ /\./) && next;
            $hash{$letter}++;
        }
        # max frequency of a non-gap letter
        my $max = (sort {$b<=>$a} values %hash )[0];
        push @cons, 100 * $max / $num_sequences;
    }
    return @cons;
}

score 0 · Answer 3 · 2011-06-17

In my opinion computing the entropy of each column will give a better measure. If a column has 50% A and 50% T and another column has 50% and 50% C G and T evenly distributed, their conservation is not the same and entropy will catch these differences. The above holds if the sequences have the same distances from each other, otherwise your computations will be biased towards the more closely related organisms.

   sub entropy{
      my $vec = shift;
      #count the occuransies of the four nucleotides and U
      my %counts;
      foreach my $n (@{$vec}){
        $counts{$n}++;
      }
      #deduce the gaps;
      delete $counts{'-'};
      #sum up the counts;
      my $sum = 0;
      foreach my $key (keys %counts){
        $sum+=$counts{$key};
      }
      #computes the entropy
      my $ent = 0;
      foreach my $key (keys %counts){
        my $prob = $counts{$key}/$sum;
        $ent-=$prob*log2($prob);
      }
      return $ent;
    }