Finding Difference Values Based on Clustal Omega Distance Matrices
7.3 years ago
Bara'a ▴ 260

Hi all :)

I have a question about the distance matrices produced by Clustal Omega.

They represent the similarity between each pair of sequences, as either distances or percentages. The percentage form looks like this:

100.000000 21.944035 22.133939 23.723042 19.750284 20.431328 20.885358 21.679909
21.944035 100.000000 22.827688 21.796760 22.974963 20.324006 21.944035 24.889543
22.133939 22.827688 100.000000 21.152030 22.474032 17.387033 19.830028 20.963173
23.723042 21.796760 21.152030 100.000000 20.437018 24.361493 19.059107 19.436957
19.750284 22.974963 22.474032 20.437018 100.000000 21.414538 20.094259 21.765210
20.431328 20.324006 17.387033 24.361493 21.414538 100.000000 20.432220 20.432220
20.885358 21.944035 19.830028 19.059107 20.094259 20.432220 100.000000 19.018898
21.679909 24.889543 20.963173 19.436957 21.765210 20.432220 19.018898 100.000000


But what if I want the percentage difference between each pair of sequences, based on those matrices?

I'm working on a pipeline that needs to retrieve pairs with similarity >= 90.00 for the left flanking region and pairs with difference >= 50.00 for the right flanking region. Here's the code snippet I wrote for that:

import numpy as np

files = ['Arr-Right(Aestivum_Japonica).dst', 'Arr-Left(Aestivum_Japonica).dst']
for path in files:
    # Keep the text between the first "-" and the first ".", e.g. "Right(Aestivum_Japonica)"
    name = path[path.find("-") + 1:path.find(".")]
    with open("Rtrv-" + name + ".csv", 'w', newline='') as retrieved:
        retrieved.write('{0:^14}\t{1:^8}\t{2:^10}\n'.format("Similarity (%)", "Query ID", "Subject ID"))
        data = np.genfromtxt(path)
        for row_idx, row in enumerate(data):
            for col_idx, element in enumerate(row):
                # The matrix is symmetric: visit only the upper triangle, skipping the diagonal
                if row_idx >= col_idx:
                    continue
                # Left flank: keep pairs whose similarity is at least 90%
                elif "Left" in name and element >= 90.0:
                    retrieved.write('{0:10.6f}\t{1:0d}\t{2:0d}\n'.format(element, row_idx, col_idx))
                # Right flank: keep pairs whose difference (100 - similarity) is at least 50%
                elif "Right" in name and (100 - element) >= 50.0:
                    retrieved.write('{0:10.6f}\t{1:0d}\t{2:0d}\n'.format(element, row_idx, col_idx))


My question is about the correctness of the condition I used: is it simply (100-element)>=50.000000, or am I missing something?

Edited: added the list of file names to the code snippet.

clustal-omega distance-matrix python • 1.6k views

Would someone help me with this, please?

I really need to get the right answer. Thank you all.


Looks good to me, though I don't understand the first 4 lines of code. Maybe explain the code a little bit?


@RamRS... The first 4 lines iterate over a list of matrix file names, strip the prefix I added earlier to distinguish them from other files, add a new prefix to build the name of the result file, open that file for writing, and write a header before the filtering part starts.

I wrote it that way to avoid overwriting files and to keep the final file names free of the old prefixes and suffixes, that's all :)
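For anyone following along, here is a minimal sketch of just the file-name step described above, applied to the two file names from the question. The helper name `output_name` is made up for illustration and is not part of the original snippet.

```python
def output_name(path):
    """Build the result file name for one matrix file (hypothetical helper)."""
    # Keep the text between the first "-" and the first "."
    name = path[path.find("-") + 1:path.find(".")]
    # Add the new prefix and the .csv extension
    return "Rtrv-" + name + ".csv"


for path in ['Arr-Right(Aestivum_Japonica).dst', 'Arr-Left(Aestivum_Japonica).dst']:
    print(path, "->", output_name(path))
```

Note that this relies on the first "-" and the first "." being the prefix separator and the extension dot, which holds for the two names above but may not hold for other naming schemes.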


Oh, I see. Does the code work?


@RamRS... Yes, it works perfectly :D

I'm afraid there is a conceptual error in that equation. Can you please confirm its correctness for me?


That is what I was wondering as well, but I guess 100-similarity is a crude measure of dissimilarity. How else would you find a quantifying parameter for difference from similarity matrices?
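As a rough illustration of that 100-similarity idea, the sketch below applies it to the top-left 3x3 corner of the matrix posted in the question. It assumes the values are percent identities; the variable names are illustrative only, and this is one simple way to do it with NumPy rather than anything prescribed by Clustal Omega.

```python
import numpy as np

# Top-left 3x3 corner of the percent-identity matrix from the question
similarity = np.array([
    [100.000000, 21.944035, 22.133939],
    [21.944035, 100.000000, 22.827688],
    [22.133939, 22.827688, 100.000000],
])

# Crude dissimilarity: percentage of positions that are NOT identical
difference = 100.0 - similarity

# Indices of the upper triangle; k=1 skips the diagonal (self-comparisons)
rows, cols = np.triu_indices_from(similarity, k=1)
pairs = [(int(r), int(c), float(difference[r, c])) for r, c in zip(rows, cols)]

# Keep pairs whose difference is at least 50%
selected = [p for p in pairs if p[2] >= 50.0]
for r, c, d in selected:
    print(f"{d:10.6f}\t{r}\t{c}")
```

With these values every off-diagonal pair is roughly 77 to 78 percent different, so all three pairs pass the 50% threshold.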

7.2 years ago
Bara'a ▴ 260

This is the reply I got from the clustalw team:

So I think the equation is correct, @RamRS!


Good job on asking them and on posting the follow-up!


Thanks :)

Hope this helps others facing the same issue.