Question

Calculate Percentage Of Identity Between Protein Sequences Using Clustalw

0

Entering edit mode

12.6 years ago

Reyhaneh ▴ 530

Hi

I am using "Command line ClustalW" for MSA and I can see %score (Percentage of identity score) for the aligned sequences such as below on the screen:

Start of Pairwise alignments

Aligning...

Sequences (1:2) Aligned. Score: 17

Sequences (1:3) Aligned. Score: 21

Sequences (1:4) Aligned. Score: 16

Sequences (1:5) Aligned. Score: 27

Sequences (2:3) Aligned. Score: 29

Any idea what Command line should I use to print this output as a file?

Thanks in advance;

Reyhaneh

clustalw command-line sequence alignment • 14k views

ADD COMMENT • link 12.6 years ago by Reyhaneh ▴ 530

0

Entering edit mode

Did the answer below fit your question? No pm possibilities on BioStar :P

ADD REPLY • link 12.6 years ago by ALchEmiXt ★ 1.9k

0

Entering edit mode

sorry for the late reply. I have found a nice way which I have written below.

ADD REPLY • link 12.5 years ago by Reyhaneh ▴ 530

score 3 · Answer 1 · 2011-09-08

3

Entering edit mode

12.6 years ago

ALchEmiXt ★ 1.9k

On unix or windows systems you could use the redirect options to capture STDOUT into a file like:

command inputfile > outputfile

But that is probably not your question. You want the matrix to be returned?

[edit since this was what you wanted]

I attach an example perl script that will parse an output file given your format into a matrix. You have to specify the number sequences you aligned pair-wise and the inut file. Output will go to STDOUT and can be grabbel by redirects if wanted. The script below will parse this example text file into the shown result;

INPUT file example:

Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 17
Sequences (1:3) Aligned. Score: 21
Sequences (1:4) Aligned. Score: 16
Sequences (1:5) Aligned. Score: 27
Sequences (2:3) Aligned. Score: 29
Sequences (2:4) Aligned. Score: 23
Sequences (2:5) Aligned. Score: 26
Sequences (3:4) Aligned. Score: 21
Sequences (3:5) Aligned. Score: 22
Sequences (4:5) Aligned. Score: 20
Some other lines here probably

Parsed output:

-   17  21  16  27  
-   -   29  23  26  
-   -   -   21  22  
-   -   -   -   20

Done!

The example code:

#!/usr/bin/perl -w

# Parses pairwise alignment data into a tab-delimited matrix
# USAGE: cmd <num_seq> <inputfile>
# output is to STDOUT
#
# alex

use strict;
use warnings;

#get args
my ($num_seq) = $ARGV[0]=~ m/^([A-Z0-9_.-]+)$/ig;
my ($input_file_name) = $ARGV[1]=~ m/^([A-Z0-9_.-]+)$/ig;

open (IN,$input_file_name) || die "Input file error: $!" ;

#grab values
my @values=();
while (<IN>) {
    if ($_=~ /Score: ([0-9]+)$/) {
        push(@values,$1);
    }
}

#print out matrix
my $count=0;
my $score;
for (my $x=1; $x<$num_seq; $x++){
    for (my $y=1; $y<=$num_seq; $y++){
        if ($y<=$x){
            $score="-";
        }
        else {
            $score=$values[$count++];
        }
        print $score."\t";
    }
    print "\n";
}

close (IN);

print "\nDone!\n";

ADD COMMENT • link 12.6 years ago by ALchEmiXt ★ 1.9k

2

Entering edit mode

So you should "valid" his/her answer.

ADD REPLY • link 12.6 years ago by Manu Prestat 4.1k

0

Entering edit mode

Thanks for this. Yes my main idea is to have the matrix in a file.

ADD REPLY • link 12.6 years ago by Reyhaneh ▴ 530

0

Entering edit mode

Did it work out for you :)

ADD REPLY • link 12.6 years ago by ALchEmiXt ★ 1.9k

0

Entering edit mode

yes it did ... but since i am going to use this code in a loop and I am coding in C# I had to write the script... thanks again.

ADD REPLY • link 12.6 years ago by Reyhaneh ▴ 530

0

Entering edit mode

for c# the method stays the same. Just some other syntax but the regexp could do all the work. But why c#? Homework/practise or some other need? Perl if FAST for text.

ADD REPLY • link 12.6 years ago by ALchEmiXt ★ 1.9k

0

Entering edit mode

Well I am used to C# and I am developing a big pipeline which a small part is what I have written here. It is more difficult for me to change my programming language for a small part of the project ;)

ADD REPLY • link 12.5 years ago by Reyhaneh ▴ 530

score 1 · Answer 2 · 2011-10-26

I have found an easier way to do it and it is to create a distance matrix which is basically represents the same information as similarity matrix and shows how much protein sequences differ. In order to achieve this you should do the following steps:

1) create the multiple sequence alignment for the input sequences(seq.txt) using the below command line. The multiple sequence alignment is saved in the seq.aln file.

clustalw2.exe -INFILE=seq.txt -OUTORDER=INPUT

2)Then you need to create a distance matrix using the alignment file:

clustalw2.exe -INFILE=seq.aln -OUTPUTTREE=dist -BOOTLABELS=branch -CLUSTERING=nj -TREE

3) the file with the name seq.dst is created which contains the distance matrix.

Hope it is useful. P.S. I am using Windows platform.