Question: Calculate Percentage Of Identity Between Protein Sequences Using Clustalw
0
gravatar for Reyhaneh
2.6 years ago by
Reyhaneh410
United Kingdom
Reyhaneh410 wrote:

Hi

I am using "Command line ClustalW" for MSA and I can see %score (Percentage of identity score) for the aligned sequences such as below on the screen:

Start of Pairwise alignments

Aligning...

Sequences (1:2) Aligned. Score: 17

Sequences (1:3) Aligned. Score: 21

Sequences (1:4) Aligned. Score: 16

Sequences (1:5) Aligned. Score: 27

Sequences (2:3) Aligned. Score: 29

Any idea what Command line should I use to print this output as a file?

Thanks in advance;

Reyhaneh

ADD COMMENTlink written 2.6 years ago by Reyhaneh410

Did the answer below fit your question? No pm possibilities on BioStar :P

ADD REPLYlink written 2.6 years ago by ALchEmiXt1.4k

sorry for the late reply. I have found a nice way which I have written below.

ADD REPLYlink written 2.5 years ago by Reyhaneh410
3
gravatar for ALchEmiXt
2.6 years ago by
ALchEmiXt1.4k
(close to) Lelystad, The Netherlands
ALchEmiXt1.4k wrote:

On unix or windows systems you could use the redirect options to capture STDOUT into a file like:

command inputfile > outputfile

But that is probably not your question. You want the matrix to be returned?

[edit since this was what you wanted]

I attach an example perl script that will parse an output file given your format into a matrix. You have to specify the number sequences you aligned pair-wise and the inut file. Output will go to STDOUT and can be grabbel by redirects if wanted. The script below will parse this example text file into the shown result;

INPUT file example:

Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 17
Sequences (1:3) Aligned. Score: 21
Sequences (1:4) Aligned. Score: 16
Sequences (1:5) Aligned. Score: 27
Sequences (2:3) Aligned. Score: 29
Sequences (2:4) Aligned. Score: 23
Sequences (2:5) Aligned. Score: 26
Sequences (3:4) Aligned. Score: 21
Sequences (3:5) Aligned. Score: 22
Sequences (4:5) Aligned. Score: 20
Some other lines here probably

Parsed output:

-   17  21  16  27  
-   -   29  23  26  
-   -   -   21  22  
-   -   -   -   20

Done!

The example code:

#!/usr/bin/perl -w

# Parses pairwise alignment data into a tab-delimited matrix
# USAGE: cmd <num_seq> <inputfile>
# output is to STDOUT
#
# alex

use strict;
use warnings;

#get args
my ($num_seq) = $ARGV[0]=~ m/^([A-Z0-9_.-]+)$/ig;
my ($input_file_name) = $ARGV[1]=~ m/^([A-Z0-9_.-]+)$/ig;

open (IN,$input_file_name) || die "Input file error: $!" ;

#grab values
my @values=();
while (<IN>) {
    if ($_=~ /Score: ([0-9]+)$/) {
        push(@values,$1);
    }
}

#print out matrix
my $count=0;
my $score;
for (my $x=1; $x<$num_seq; $x++){
    for (my $y=1; $y<=$num_seq; $y++){
        if ($y<=$x){
            $score="-";
        }
        else {
            $score=$values[$count++];
        }
        print $score."\t";
    }
    print "\n";
}

close (IN);

print "\nDone!\n";
ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by ALchEmiXt1.4k
2

So you should "valid" his/her answer.

ADD REPLYlink written 2.6 years ago by Manu Prestat2.8k

Thanks for this. Yes my main idea is to have the matrix in a file.

ADD REPLYlink written 2.6 years ago by Reyhaneh410

Did it work out for you :)

ADD REPLYlink written 2.6 years ago by ALchEmiXt1.4k

yes it did ... but since i am going to use this code in a loop and I am coding in C# I had to write the script... thanks again.

ADD REPLYlink written 2.6 years ago by Reyhaneh410

for c# the method stays the same. Just some other syntax but the regexp could do all the work. But why c#? Homework/practise or some other need? Perl if FAST for text.

ADD REPLYlink written 2.6 years ago by ALchEmiXt1.4k

Well I am used to C# and I am developing a big pipeline which a small part is what I have written here. It is more difficult for me to change my programming language for a small part of the project ;)

ADD REPLYlink written 2.5 years ago by Reyhaneh410
1
gravatar for Reyhaneh
2.5 years ago by
Reyhaneh410
United Kingdom
Reyhaneh410 wrote:

I have found an easier way to do it and it is to create a distance matrix which is basically represents the same information as similarity matrix and shows how much protein sequences differ. In order to achieve this you should do the following steps:

1) create the multiple sequence alignment for the input sequences(seq.txt) using the below command line. The multiple sequence alignment is saved in the seq.aln file.

clustalw2.exe -INFILE=seq.txt -OUTORDER=INPUT

2)Then you need to create a distance matrix using the alignment file:

clustalw2.exe -INFILE=seq.aln -OUTPUTTREE=dist -BOOTLABELS=branch -CLUSTERING=nj -TREE

3) the file with the name seq.dst is created which contains the distance matrix.

Hope it is useful. P.S. I am using Windows platform.

ADD COMMENTlink written 2.5 years ago by Reyhaneh410

This is the way I ended up using but ALchEmiXt method is also working.

ADD REPLYlink written 2.5 years ago by Reyhaneh410
Please log in to add an answer.

Help
Access
  • RSS
  • Stats
  • API

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.0.0
Traffic: 652 users visited in the last hour