Question: Per-Column Conservation Of Multiple Alignment In Perl
5
gravatar for Chad A. Davis
8.8 years ago by
Chad A. Davis150
Chad A. Davis150 wrote:

Given a Bio::SimpleAlign, what is the best way to get per-column conservation scores. E.g. into an array of values in [0:1] where the array length would be the same as $align->length. I don't find anything like this in Bio::SimpleAlign. I'm looking for a function that allows:

my $io = Bio::AlignIO->new(-file=>$file);
my $align = $io->next_aln;
my @cons = $align->percentage_identity_by_column(); # <- does this exist?
print "@cons";
# 0.75 1.0 1.0 1.0 0.64 ....

Or should I just concat the gapped sequence, use substr() to extract the characters and count them with a hash and return the frequency of the most frequent character per column?

It looks like the private method Bio::SimpleAlign::consensusaa() already does most of this, but it returns the character rather than the fraction, which is what I was looking for. Short of submitting a patch for that, is there a better approach?

ADD COMMENTlink written 8.8 years ago by Chad A. Davis150
3
gravatar for Neilfws
8.8 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

I think you are correct, in that there is no existing method to do this in Bio::SimpleAlign and in thinking about counting residues using a hash.

I would use the slice() method to return each column and extract the residues from it. Here's something I tried which works up to a point, but needs some more work:

use List::Util 'max';
my @scores;
# move along alignment 1 column at a time
for (my $i = 1; $i <= $align->length; $i++) {
    my %count;
    # extract column as a set of Bio::LocatableSeq objects
    my $seqs = $align->slice($i,$i);
    # get each residue and count in a hash
    foreach my $seq ($seqs->each_seq) {
    $count{$seq->seq} += 1;
    }
    push(@scores, max(values %count) / $align->num_sequences);
}

print join(",", @scores);

Note that this will print warnings where a column contains gap characters and exclude the row with the gap from the new alignment object. This means frequencies would be incorrectly calculated if you divided by $seqs->num_sequences: e.g. column "A, -, A" will result in frequency A = 1, not A = 0.67. So we divide by the number of sequences in the original alignment. Presumably "all gaps" columns will result in an empty hash, which you may have to handle so as to return frequency = 0.

ADD COMMENTlink modified 8.8 years ago • written 8.8 years ago by Neilfws48k
2
gravatar for Chad A. Davis
8.8 years ago by
Chad A. Davis150
Chad A. Davis150 wrote:

This was my final solution, which also marks gap columns as 0% conserved. This is mostly already in Bio::SimpleAlign, so I'll submit this as a patch as well and see what others think.

# Conservation of each non-gap column [0-100]
sub conservation {
    my ($self) = @_;
    my @cons;
    my $num_sequences = $self->num_sequences;
    foreach my $point (0..$align->length-1) { 
        my %hash;
        foreach my $seq ( $self->each_seq() ) {
            my $letter = substr($seq->seq,$point,1);
            ($letter eq $self->gap_char || $letter =~ /\./) && next;
            $hash{$letter}++;
        }
        # max frequency of a non-gap letter
        my $max = (sort {$b<=>$a} values %hash )[0];
        push @cons, 100 * $max / $num_sequences;
    }
    return @cons;
}
ADD COMMENTlink written 8.8 years ago by Chad A. Davis150

Just a small problem with $align variable which is not declared anywhere. Replaced with $self did the job.

ADD REPLYlink written 4.8 years ago by julien.roux100
0
gravatar for Asaf
8.8 years ago by
Asaf7.2k
Israel
Asaf7.2k wrote:

In my opinion computing the entropy of each column will give a better measure. If a column has 50% A and 50% T and another column has 50% and 50% C G and T evenly distributed, their conservation is not the same and entropy will catch these differences. The above holds if the sequences have the same distances from each other, otherwise your computations will be biased towards the more closely related organisms.

   sub entropy{
      my $vec = shift;
      #count the occuransies of the four nucleotides and U
      my %counts;
      foreach my $n (@{$vec}){
        $counts{$n}++;
      }
      #deduce the gaps;
      delete $counts{'-'};
      #sum up the counts;
      my $sum = 0;
      foreach my $key (keys %counts){
        $sum+=$counts{$key};
      }
      #computes the entropy
      my $ent = 0;
      foreach my $key (keys %counts){
        my $prob = $counts{$key}/$sum;
        $ent-=$prob*log2($prob);
      }
      return $ent;
    }
ADD COMMENTlink written 8.8 years ago by Asaf7.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2102 users visited in the last hour