Question

Interpret Clustal Omega Distance/Percent Identity Output for Multiple Proteins

2.9 years ago

taojincs ▴ 50

How do we interpret the distance or percent identity output of Clustal Omega?

I ran Clustal Omega on some selected sequences from a homologous cluster, but the percent identity it returns between some pairs of sequences is around 30%, and as low as 19%.

Does that mean these sequences should not even be in a homologous cluster? What is a good threshold on percent identity (as produced by Clustal Omega) for deciding that two sequences are similar? What is the minimum identity that indicates a good match?

Thank you!

Omega Clustal similarity sequence • 2.3k views

updated 2.9 years ago by Mensur Dlakic ★ 27k • written 2.9 years ago by taojincs ▴ 50

score 1 · Answer 1 · 2021-06-15

Distance in Clustal Omega is (1 - perc_identity), with identity expressed as a fraction. Two aligned sequences that share 65% (0.65) identical residues will have a distance of 0.35.
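To make the relationship concrete, here is a minimal Python sketch that computes the identity fraction and the corresponding distance for one pair of already-aligned sequences. The function names and the toy sequences are made up for illustration, and the sketch assumes that columns containing a gap in either sequence are skipped; Clustal Omega's own gap handling may differ, so treat the numbers as an approximation of its output rather than a reimplementation.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Fraction of identical residues over columns where neither sequence has a gap.

    Assumption: gapped columns are ignored entirely. This is one common
    convention, not necessarily the one Clustal Omega uses.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            identical += 1
    return identical / compared if compared else 0.0


def distance(a: str, b: str) -> float:
    """Distance defined as 1 minus the identity fraction."""
    return 1.0 - percent_identity(a, b)


# Toy aligned pair, invented for this example.
seq1 = "MKT-AYIAKQRQISFVKSHF"
seq2 = "MKTLAYLAKQ-QISFIKDHF"

if __name__ == "__main__":
    print(f"identity = {percent_identity(seq1, seq2):.2f}")
    print(f"distance = {distance(seq1, seq2):.2f}")
```

Applied to every pair in an alignment, the same calculation gives a full identity matrix and its matching distance matrix.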

There is homology in an evolutionary sense and homology in a functional sense (let's call the latter orthology). Two proteins can be homologous while sharing only 5-10% identical residues. There are numerous protein superfamilies (ATPases, alpha/beta hydrolases, methyltransferases, nucleases, etc.) whose members are all related by distant ancestry and perform the same general chemical reaction, but not necessarily exactly the same reaction or on the same substrate. In such cases the identity can go well below 30% and the proteins can still be homologous. Orthology, on the other hand, usually requires a higher level of sequence identity because those proteins perform the same reaction on the same substrate, just in different organisms. I don't know what your case is or where you obtained the homologous cluster, but it is possible that all sequences in it are related despite low sequence identity. Unfortunately, there is no fixed threshold of sequence identity at which relatedness can be determined with certainty.

If your goal is to show the alignment of orthologs, you may want to check the species for your sequences, and use only those that can reasonably be expected to be related. But there is no reason to exclude anything if your goal is to show distant homologs. It is possible that all the sequences in your group are related despite low identity, especially if this homologous cluster was created by someone who knows how to find distant homologs.