Question: Creating A Simple Matrix With Perl
0
gravatar for loicatraile
6.7 years ago by
loicatraile40
Germany
loicatraile40 wrote:

Hi! I´m trying to build a 4x4 nucleotide Count Matrix of a sequence alignment and im having some trouble to display it correctly as a Matrix. Here´s what i have so far:

use warnings;
use strict;

my $seq1='ACGATCTGCAGATGCGATGAGGTAGCTGATGCTGAGAGTACGCCGAGGATC';

my $seq2='ACGATCAGCAAAAGACGGCCTTCTCATACTACGTAGTTAGCGATAGCTAAG';

my $matrix={};

for (my $i=0; $i< length $seq1; $i++){

    $matrix->{ substr($seq1,$i,1)}->{substr($seq2,$i,1)}++;
}

foreach my $key1 (keys %$matrix){

    foreach my $key2(keys %{$matrix->{$key1}}){

        print $key2." ", $matrix->{$key1}->{$key2};
    }
    print "\n";
}

exit;

I would like to create something like this (just an example):

   A  C  G  T
A  3  5  5  6
C  3  1  2  5
G  2  3  5  7
T  1  2  4  3

Any Ideas? I would be very grateful!

ADD COMMENTlink modified 6.7 years ago by SES8.3k • written 6.7 years ago by loicatraile40
1

Could you please explain why you need this?

ADD REPLYlink written 6.7 years ago by Gjain5.4k

Are you trying to find number of substitutions from seq1 to seq2? And in the output A to A is 3. Does it mean that, the number of times A remains the same from seq1 to seq2 is 3? Though, that's not the case in the above example. A -> A is definitely more than 3.

ADD REPLYlink written 6.7 years ago by Jordan1.1k

hi... im trying to count the number of matches and mismataches and to display it in a matrix 4 x 4 matrix like in the example above.

ADD REPLYlink written 6.7 years ago by loicatraile40

Also can you please explain how you can get A-A as 3 instead of 6?

ADD REPLYlink written 6.7 years ago by Gjain5.4k
2
gravatar for SES
6.7 years ago by
SES8.3k
Vancouver, BC
SES8.3k wrote:

I think I can see what you are trying to accomplish but it is confusing because the numbers in your example output are not correct (consistent with the comments above). Though, please correct me if you were trying for something else. Be aware that this code works for your pairwise comparison of sequences of the same length with no gaps, but you will need to update this code if you starting making more complicated comparisons. Your code was actually very close.

The most helpful thing I can add is to take a look at the data structure (your %matrix) to see what you would expect. Here is some code to print the matrix:

#!/usr/bin/env perl                                                                                                                                                          
use strict;
use warnings;
use Data::Dump qw(dd);

my $seq1='ACGATCTGCAGATGCGATGAGGTAGCTGATGCTGAGAGTACGCCGAGGATC';

my $seq2='ACGATCAGCAAAAGACGGCCTTCTCATACTACGTAGTTAGCGATAGCTAAG';

my $matrix = {};

for (my $i = 0; $i < length $seq1; $i++){
    $matrix->{substr($seq1,$i,1)}->{substr($seq2,$i,1)}++;
}

dd $matrix;

Running that code shows your data:

$ perl biostars72505.pl
{
  A => { A => 6, C => 2, G => 3, T => 2 },
  C => { A => 3, C => 5, G => 1, T => 1 },
  G => { A => 4, C => 4, G => 5, T => 5 },
  T => { A => 4, C => 1, G => 2, T => 3 },
}

There are many ways to do this, but I like the flattened structure from Data::Dump and I think it is the nicest method for your wrists (less typing). If this were a really large file I were parsing, I would save this data structure and read it in later. That way you don't have to keep reading the file and constructing this data. The full code is different in that you have to control the order of keys by sorting, and change the placement of the print statements.

#!/usr/bin/env perl

use strict;
use warnings;

my $seq1='ACGATCTGCAGATGCGATGAGGTAGCTGATGCTGAGAGTACGCCGAGGATC';

my $seq2='ACGATCAGCAAAAGACGGCCTTCTCATACTACGTAGTTAGCGATAGCTAAG';

my $matrix = {};

for (my $i=0; $i < length $seq1; $i++){
    $matrix->{substr($seq1,$i,1)}->{substr($seq2,$i,1)}++;
}

print "  A C G T\n";
for my $k (sort keys %$matrix) {
    print $k;
    for my $k2 (sort keys %{$matrix->{$k}}) {
        print " ",$matrix->{$k}->{$k2};
    }
    print "\n";
}

This prints your matrix (notice it agrees with the data structure we printed above):

$ perl biostars72505.pl
  A C G T
A 6 2 3 2
C 3 5 1 1
G 4 4 5 5
T 4 1 2 3

This code will be portable and fast but my advice would be to look into BioPerl if you plan to build up your methods to larger comparisons. I couldn't find a BioPerl method to do this exact calculation, though I'm sure it's feasible with a little coding. However, I did find some related methods that may be helpful:

ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by SES8.3k

thank you very much! this was exactly what i was looking for :)

ADD REPLYlink written 6.7 years ago by loicatraile40
1

Glad I could help. Please consider up voting and/or marking the question answered so others know that it solved your problem.

ADD REPLYlink written 6.7 years ago by SES8.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1205 users visited in the last hour