Creating A Simple Matrix With Perl
1
0
Entering edit mode
10.9 years ago
loicatraile ▴ 50

Hi! I´m trying to build a 4x4 nucleotide Count Matrix of a sequence alignment and im having some trouble to display it correctly as a Matrix. Here´s what i have so far:

use warnings;
use strict;

my $seq1='ACGATCTGCAGATGCGATGAGGTAGCTGATGCTGAGAGTACGCCGAGGATC';

my $seq2='ACGATCAGCAAAAGACGGCCTTCTCATACTACGTAGTTAGCGATAGCTAAG';

my $matrix={};

for (my $i=0; $i< length $seq1; $i++){

    $matrix->{ substr($seq1,$i,1)}->{substr($seq2,$i,1)}++;
}

foreach my $key1 (keys %$matrix){

    foreach my $key2(keys %{$matrix->{$key1}}){

        print $key2." ", $matrix->{$key1}->{$key2};
    }
    print "\n";
}

exit;

I would like to create something like this (just an example):

   A  C  G  T
A  3  5  5  6
C  3  1  2  5
G  2  3  5  7
T  1  2  4  3

Any Ideas? I would be very grateful!

matrix sequence alignment reference perl • 8.0k views
ADD COMMENT
1
Entering edit mode

Could you please explain why you need this?

ADD REPLY
0
Entering edit mode

Are you trying to find number of substitutions from seq1 to seq2? And in the output A to A is 3. Does it mean that, the number of times A remains the same from seq1 to seq2 is 3? Though, that's not the case in the above example. A -> A is definitely more than 3.

ADD REPLY
0
Entering edit mode

hi... im trying to count the number of matches and mismataches and to display it in a matrix 4 x 4 matrix like in the example above.

ADD REPLY
0
Entering edit mode

Also can you please explain how you can get A-A as 3 instead of 6?

ADD REPLY
2
Entering edit mode
10.9 years ago
SES 8.6k

I think I can see what you are trying to accomplish but it is confusing because the numbers in your example output are not correct (consistent with the comments above). Though, please correct me if you were trying for something else. Be aware that this code works for your pairwise comparison of sequences of the same length with no gaps, but you will need to update this code if you starting making more complicated comparisons. Your code was actually very close.

The most helpful thing I can add is to take a look at the data structure (your %matrix) to see what you would expect. Here is some code to print the matrix:

#!/usr/bin/env perl                                                                                                                                                          
use strict;
use warnings;
use Data::Dump qw(dd);

my $seq1='ACGATCTGCAGATGCGATGAGGTAGCTGATGCTGAGAGTACGCCGAGGATC';

my $seq2='ACGATCAGCAAAAGACGGCCTTCTCATACTACGTAGTTAGCGATAGCTAAG';

my $matrix = {};

for (my $i = 0; $i < length $seq1; $i++){
    $matrix->{substr($seq1,$i,1)}->{substr($seq2,$i,1)}++;
}

dd $matrix;

Running that code shows your data:

$ perl biostars72505.pl
{
  A => { A => 6, C => 2, G => 3, T => 2 },
  C => { A => 3, C => 5, G => 1, T => 1 },
  G => { A => 4, C => 4, G => 5, T => 5 },
  T => { A => 4, C => 1, G => 2, T => 3 },
}

There are many ways to do this, but I like the flattened structure from Data::Dump and I think it is the nicest method for your wrists (less typing). If this were a really large file I were parsing, I would save this data structure and read it in later. That way you don't have to keep reading the file and constructing this data. The full code is different in that you have to control the order of keys by sorting, and change the placement of the print statements.

#!/usr/bin/env perl

use strict;
use warnings;

my $seq1='ACGATCTGCAGATGCGATGAGGTAGCTGATGCTGAGAGTACGCCGAGGATC';

my $seq2='ACGATCAGCAAAAGACGGCCTTCTCATACTACGTAGTTAGCGATAGCTAAG';

my $matrix = {};

for (my $i=0; $i < length $seq1; $i++){
    $matrix->{substr($seq1,$i,1)}->{substr($seq2,$i,1)}++;
}

print "  A C G T\n";
for my $k (sort keys %$matrix) {
    print $k;
    for my $k2 (sort keys %{$matrix->{$k}}) {
        print " ",$matrix->{$k}->{$k2};
    }
    print "\n";
}

This prints your matrix (notice it agrees with the data structure we printed above):

$ perl biostars72505.pl
  A C G T
A 6 2 3 2
C 3 5 1 1
G 4 4 5 5
T 4 1 2 3

This code will be portable and fast but my advice would be to look into BioPerl if you plan to build up your methods to larger comparisons. I couldn't find a BioPerl method to do this exact calculation, though I'm sure it's feasible with a little coding. However, I did find some related methods that may be helpful:

ADD COMMENT
0
Entering edit mode

thank you very much! this was exactly what i was looking for :)

ADD REPLY
1
Entering edit mode

Glad I could help. Please consider up voting and/or marking the question answered so others know that it solved your problem.

ADD REPLY

Login before adding your answer.

Traffic: 2526 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6