Similarity matrix to distance matrix for protein sequences
3.4 years ago
kbaitsi • 0

I have used R to calculate a similarity matrix for 11 proteins (histones) from a FASTA file. I now need to turn the similarity matrix into a distance matrix so that I can use it in hclust. I have tried sim2dist, and also dist with all of its methods (euclidean, maximum, manhattan, canberra, binary, minkowski). I have ruled out the binary method, but I am not sure which of the remaining options is the best way to calculate the distance. Any thoughts?

similarity distance protein sequences r • 2.9k views

There are a few common, generic ways of turning a similarity s into a distance d, such as:

  • d = max(s) - s (e.g. if the similarity is cosine then max(s) = 1)
  • d = 1/(s+1)
  • d = exp(-s^a), where a is a parameter

In fact, any strictly decreasing function will do.
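
As a rough illustration of the three conversions listed above, here is a minimal base-R sketch. The 3 x 3 similarity matrix, its values and the labels are invented for the example, and `a = 1` is an arbitrary choice. Note that 1/(s+1) assumes s > -1, which raw alignment scores do not guarantee.

```r
# Hypothetical similarity matrix (values and labels invented for illustration).
labels <- c("H2A", "H2B", "H3")
sim <- matrix(c(10,  7,  2,
                 7, 10,  3,
                 2,  3, 10),
              nrow = 3, byrow = TRUE,
              dimnames = list(labels, labels))

# d = max(s) - s: the most similar pair gets distance 0.
d_max <- max(sim) - sim

# d = 1 / (s + 1): only sensible when every s > -1.
d_inv <- 1 / (sim + 1)

# d = exp(-s^a), with a chosen arbitrarily here.
a <- 1
d_exp <- exp(-sim^a)

# as.dist() keeps only the lower triangle (the diagonal is ignored),
# which is the input form hclust() expects.
hc <- hclust(as.dist(d_max), method = "average")
```

Calling `plot(hc)` would draw the dendrogram. The same `as.dist()` plus `hclust()` step works for `d_inv` and `d_exp`.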

Thank you for your answer. sim2dist does what you wrote in the first bullet. I was just wondering whether there is a preferred way when it comes to protein sequences, or whether it doesn't matter?


What matters most is the choice of the original measure of similarity. It has to capture the notion of proximity/similarity that is relevant to the question you're trying to address. When converting, you need to make sure that the distributional properties that are important for the clustering are preserved.
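
One way to see what a conversion preserves, sketched in base R with made-up similarity values for the six pairs among four hypothetical sequences: two strictly decreasing conversions keep the same ordering of pairs but space the distances differently. Linkages that depend only on that ordering (single, complete) should then merge in the same order, while average or Ward linkage may not. The helper `to_dist` is defined here just for the example.

```r
# Hypothetical similarities for the 6 pairs among 4 sequences (all >= 0).
s <- c(7, 2, 3, 9, 1, 5)

# Helper (illustrative): place pairwise values in a lower triangle
# and convert to a "dist" object.
to_dist <- function(v, n = 4) {
  m <- matrix(0, n, n)
  m[lower.tri(m)] <- v
  as.dist(m)
}

d_lin <- max(s) - s     # d = max(s) - s
d_inv <- 1 / (s + 1)    # d = 1 / (s + 1)

# Both conversions rank the pairs identically ...
same_order <- identical(rank(d_lin), rank(d_inv))

# ... but the gaps between consecutive sorted distances differ.
gaps_lin <- diff(sort(d_lin))
gaps_inv <- diff(sort(d_inv))

# Single linkage uses only the ordering, so the merge sequence matches.
hc_lin <- hclust(to_dist(d_lin), method = "single")
hc_inv <- hclust(to_dist(d_inv), method = "single")
same_merges <- identical(hc_lin$merge, hc_inv$merge)
```

The merge heights still differ between `hc_lin` and `hc_inv`, so anything that reads the height scale, such as cutting the tree at a fixed height, depends on the conversion chosen.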


I have used the pairwise alignment function and a BLOSUM substitution matrix. Thanks a lot for your time and answer.
