Given a Bio::SimpleAlign, what is the best way to get per-column conservation scores. E.g. into an array of values in [0:1] where the array length would be the same as $align->length. I don't find anything like this in Bio::SimpleAlign. I'm looking for a function that allows:
my $io = Bio::AlignIO->new(-file=>$file);
my $align = $io->next_aln;
my @cons = $align->percentage_identity_by_column(); # <- does this exist?
print "@cons";
# 0.75 1.0 1.0 1.0 0.64 ....
Or should I just concat the gapped sequence, use substr() to extract the characters and count them with a hash and return the frequency of the most frequent character per column?
It looks like the private method Bio::SimpleAlign::consensusaa() already does most of this, but it returns the character rather than the fraction, which is what I was looking for. Short of submitting a patch for that, is there a better approach?
I think you are correct, in that there is no existing method to do this in Bio::SimpleAlign and in thinking about counting residues using a hash.
I would use the slice() method to return each column and extract the residues from it. Here's something I tried which works up to a point, but needs some more work:
use List::Util 'max';
my @scores;
# move along alignment 1 column at a time
for (my $i = 1; $i <= $align->length; $i++) {
my %count;
# extract column as a set of Bio::LocatableSeq objects
my $seqs = $align->slice($i,$i);
# get each residue and count in a hash
foreach my $seq ($seqs->each_seq) {
$count{$seq->seq} += 1;
}
push(@scores, max(values %count) / $align->num_sequences);
}
print join(",", @scores);
Note that this will print warnings where a column contains gap characters and exclude the row with the gap from the new alignment object. This means frequencies would be incorrectly calculated if you divided by $seqs->num_sequences: e.g. column "A, -, A" will result in frequency A = 1, not A = 0.67. So we divide by the number of sequences in the original alignment. Presumably "all gaps" columns will result in an empty hash, which you may have to handle so as to return frequency = 0.
This was my final solution, which also marks gap columns as 0% conserved. This is mostly already in Bio::SimpleAlign, so I'll submit this as a patch as well and see what others think.
# Conservation of each non-gap column [0-100]
sub conservation {
my ($self) = @_;
my @cons;
my $num_sequences = $self->num_sequences;
foreach my $point (0..$align->length-1) {
my %hash;
foreach my $seq ( $self->each_seq() ) {
my $letter = substr($seq->seq,$point,1);
($letter eq $self->gap_char || $letter =~ /\./) && next;
$hash{$letter}++;
}
# max frequency of a non-gap letter
my $max = (sort {$b<=>$a} values %hash )[0];
push @cons, 100 * $max / $num_sequences;
}
return @cons;
}
In my opinion computing the entropy of each column will give a better measure. If a column has 50% A and 50% T and another column has 50% and 50% C G and T evenly distributed, their conservation is not the same and entropy will catch these differences.
The above holds if the sequences have the same distances from each other, otherwise your computations will be biased towards the more closely related organisms.
sub entropy{
my $vec = shift;
#count the occuransies of the four nucleotides and U
my %counts;
foreach my $n (@{$vec}){
$counts{$n}++;
}
#deduce the gaps;
delete $counts{'-'};
#sum up the counts;
my $sum = 0;
foreach my $key (keys %counts){
$sum+=$counts{$key};
}
#computes the entropy
my $ent = 0;
foreach my $key (keys %counts){
my $prob = $counts{$key}/$sum;
$ent-=$prob*log2($prob);
}
return $ent;
}
Just a small problem with $align variable which is not declared anywhere. Replaced with $self did the job.