Comparing distributions of amino acids
2.1 years ago

I have thousands of sequences that each contain a certain distribution of a specific amino acid (in my case cysteine). I would like a way of grouping these sequences by distribution similarity. Here is an example:

SEQ_A 7 18 32 48 67 100

SEQ_B 26 56 89 112 138 178

SEQ_C 20 44 71 94 120 160

SEQ_D 11 26 44 54 67 94

SEQ_X is the sequence ID, and each number is the position of a cysteine (C) in that sequence.

I would either like to order these by "similarity" or find some way to obtain a score that quantifies the distribution.

How would I go about doing this?

amino acid distribution similarity • 489 views

Is aa the most relevant tag for this post? Not amino acids or distribution or similarity, but aa?

Can't it be done by hierarchical clustering?

2.1 years ago
Joe 19k

I would suggest computing all pairwise cosine distances, treating each set of numbers as a vector.

I’m guessing they’ll be in numerical order already?

It’s fairly trivial to implement (about half a dozen lines), but I’d suggest the scipy implementations for speed:

https://stackoverflow.com/questions/18424228/cosine-similarity-between-2-number-lists

If you do non-redundant comparisons (n choose 2, i.e. each pair once, instead of all-vs-all), you can then plot the distances and perhaps cluster them with k-means or something?
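As a rough sketch of the idea above, here is a plain-Python cosine distance applied once to each pair of sequences. The names `cosine_distance` and `pairwise_distances` are made up for illustration, the positions are the four examples from the question, and the code assumes every sequence has the same number of cysteines (cosine distance needs equal-length vectors). For thousands of sequences, `scipy.spatial.distance.pdist(..., metric="cosine")` should give the same non-redundant set of distances much faster.

```python
import math
from itertools import combinations

# Example positions copied from the question; equal-length vectors are assumed.
positions = {
    "SEQ_A": [7, 18, 32, 48, 67, 100],
    "SEQ_B": [26, 56, 89, 112, 138, 178],
    "SEQ_C": [20, 44, 71, 94, 120, 160],
    "SEQ_D": [11, 26, 44, 54, 67, 94],
}


def cosine_distance(u, v):
    """Return 1 - cos(angle) between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("cosine distance needs vectors of equal length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_distances(vectors):
    """Compute each unordered pair once (n choose 2), most similar pair first."""
    pairs = [
        (cosine_distance(vectors[a], vectors[b]), a, b)
        for a, b in combinations(sorted(vectors), 2)
    ]
    return sorted(pairs)


for dist, a, b in pairwise_distances(positions):
    print(f"{a}\t{b}\t{dist:.4f}")
```

Keep in mind that cosine distance ignores overall magnitude, so two sequences whose cysteine positions differ only by a constant scale factor would come out as nearly identical.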

2.1 years ago
Mensur Dlakic ★ 14k

It should be fairly simple to read through a FASTA file and extract the positions of cysteines in each sequence. I suspect BioPython has I/O functions that can be used for this purpose.
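A minimal sketch of that extraction step, using only the standard library so it runs without extra installs. The helpers `read_fasta` and `residue_positions` and the two toy sequences are invented for illustration; with Biopython installed, `Bio.SeqIO.parse(handle, "fasta")` could replace the hand-written reader. Positions are reported 1-based, which I am assuming matches the numbering in the question.

```python
import io


def read_fasta(handle):
    """Yield (sequence_id, sequence) pairs from a FASTA-formatted text handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            # Use the first word of the header as the ID.
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def residue_positions(sequence, residue="C"):
    """Return 1-based positions of `residue` in `sequence` (case-insensitive)."""
    residue = residue.upper()
    return [i for i, aa in enumerate(sequence.upper(), start=1) if aa == residue]


# Toy input (not real proteins); in practice pass open("your_file.fasta").
example = io.StringIO(">SEQ_1 toy sequence\nMKCAAC\nGGCK\n>SEQ_2\nACDC\n")
cys = {seq_id: residue_positions(seq) for seq_id, seq in read_fasta(example)}
print(cys)
```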

The other thing you are trying to do is more difficult, primarily in terms of interpretation. Similarity of equally-sized distributions can be assessed using a KS (Kolmogorov–Smirnov) test. I suspect your distributions are unequal in size, so you may want to look at the Mann–Whitney U test. scipy.stats has functions to calculate both quantities.
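To make the two statistics concrete, here is a small plain-Python sketch that computes only the test statistics (no p-values) for two lists of positions. The function names are my own and the inputs are SEQ_A and SEQ_D from the question. In practice `scipy.stats.ks_2samp` and `scipy.stats.mannwhitneyu` are the functions to reach for, since they also return p-values; as far as I know, `ks_2samp` accepts samples of different sizes as well.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # The gap can only change at observed values, so checking those is enough.
    gaps = (
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )
    return max(gaps)


def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: each pair with a > b counts 1, each tie 0.5."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in sample_a
        for y in sample_b
    )


seq_a = [7, 18, 32, 48, 67, 100]
seq_d = [11, 26, 44, 54, 67, 94]
print(ks_statistic(seq_a, seq_d), mann_whitney_u(seq_a, seq_d))
```

With only six positions per sequence, either statistic has very little resolution, which is part of the interpretation problem.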

The biggest problem I see is the interpretation. Unless your proteins are somehow enriched in cysteines, you are not going to have very long distributions. Moreover, proteins are very flexible in terms of positional shifting of residues as long as indels are in less structured protein regions. What you are trying to do strikes me as similar to determining the identity between two proteins by comparing their residues that are in multiples of 15. You will get a result from such an exercise, but I am not sure it will be meaningful, and I would imagine the same might be true for what you are trying to do.

Rather than looking at absolute numeric positions of Cs, I suggest you do multiple alignments of all your sequences and represent each C as a 20-residue frequency profile. That way you will get longer distributions (number of Cs × 20 values each) and it will be a more meaningful comparison, as you will be comparing substitution profiles of Cs rather than their absolute positions. If you really want to complicate your life further in terms of execution, but potentially get a more meaningful result, you can look at small windows of 2-5 neighboring residues on both sides of each cysteine and compare those distributions.
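One possible reading of the profile idea, sketched in plain Python: for every alignment column where a chosen reference sequence has a C, count how often each of the 20 amino acids occurs in that column, then concatenate those frequencies into one vector. The toy alignment, the function names, and the choice of the first sequence as the reference are all my assumptions, not something specified above; a real alignment would come from a multiple-alignment program.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(alignment, column):
    """Return frequencies of the 20 amino acids in one column (gaps ignored)."""
    counts = Counter(seq[column].upper() for seq in alignment)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def cysteine_profiles(alignment, reference_index=0):
    """Concatenate column profiles for every column where the reference has a C."""
    reference = alignment[reference_index].upper()
    profile = []
    for column, residue in enumerate(reference):
        if residue == "C":
            profile.extend(column_profile(alignment, column))
    return profile


# Toy aligned sequences of equal length ("-" marks a gap); not real data.
toy_alignment = [
    "MKC-ACG",
    "MRCTSCG",
    "MKSTACG",
    "MKC-ACA",
]
vector = cysteine_profiles(toy_alignment)
print(len(vector))  # number of Cs in the reference x 20
```

Vectors built this way could then be compared with the same distance or test statistics as the raw positions.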