Protein name alignment for comparison and similarity score
7.1 years ago
adi0957 ▴ 10

Hi everyone,

I am fairly new to bioinformatics and I am a bit stumped on how to go about doing a comparison of my data.

I currently have a file containing about 318 protein clusters. It looks something like this:

Cluster 1     CSF2,NRAS,GSK3A,GSK3B,...
Cluster 2     MAP3K7,HLA-DRA,NFKBIA,ZAP70,...
Cluster 3     CSF2, NRAS, GRIN1, CDKN1A,...
...

I wish to compare the proteins in each cluster and assign a similarity score relative to a chosen seed cluster. For example, taking Cluster 1 as the reference that all others are compared to: if half of the proteins in Cluster 2 also occur in Cluster 1, then Cluster 2 would get a 50% similarity score, and so on through the entire list of clusters. The number of proteins differs between clusters, so the score should be based on each individual cluster's total number of proteins. The output can be flexible, for example printing all clusters with a score greater than 60%.

Any advice on how I would go about doing something like this in either R or Python would be greatly appreciated.

Thank you,

Adrian

alignment R python proteins • 1.9k views

Please provide some more details about your input file. For example, in Cluster 1 and Cluster 2 there are no spaces between gene names, but Cluster 3 has spaces. Is this what the original file really looks like?


Hi, the input file is an Excel file with 2 columns. Columns are delimited by tabs, and gene names by commas.

7.1 years ago

Python:

Input file (clusters.txt):

Cluster 1   CSF2,NRAS,GSK3A,GSK3B
Cluster 2   MAP3K7,HLA-DRA,NFKBIA,ZAP70
Cluster 3   CSF2,NRAS,GRIN1,CDKN1A
Cluster 4   GSK3A,GSK3B,NRAS,CSF2

Script (clusters.py):

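import sys

# 1-based cluster number on the command line -> 0-based list index
REFCLUSTNUM = int(sys.argv[2]) - 1

# Read each line as: cluster name <TAB> comma-separated gene names
clusts = []
with open(sys.argv[1]) as fh:
    for line in fh:
        if not line.strip():
            continue  # skip blank lines
        sl = line.strip().split('\t')
        # strip() tolerates spaces after commas, e.g. "CSF2, NRAS"
        clusts.append((sl[0], set(g.strip() for g in sl[1].split(','))))

ref = clusts[REFCLUSTNUM][1]
for i, c in enumerate(clusts):
    if i != REFCLUSTNUM:
        # Jaccard index as a percentage: shared genes / distinct genes in either cluster
        score = len(ref.intersection(c[1])) / float(len(ref.union(c[1]))) * 100
        print('{}\t{:.2f}%'.format(c[0], score))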

Run (first cluster as reference):

python clusters.py clusters.txt 1

Output:

Cluster 2   0.00%
Cluster 3   33.33%
Cluster 4   100.00%

Run (second cluster as reference):

python clusters.py clusters.txt 2

Output:

Cluster 1   0.00%
Cluster 3   0.00%
Cluster 4   0.00%

Run (third cluster as reference):

python clusters.py clusters.txt 3

Output:

Cluster 1   33.33%
Cluster 2   0.00%
Cluster 4   33.33%
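
Note that the script above scores each pair with the Jaccard index: shared genes divided by all distinct genes in either cluster. The question describes a slightly different score, namely the fraction of each cluster's own proteins that also occur in the reference cluster. A minimal sketch of that variant follows, assuming the clusters have already been parsed into (name, set) pairs as in the script. The function name `overlap_scores` and the `THRESHOLD` constant are illustrative and not part of the original answer.

```python
def overlap_scores(clusters, ref_index):
    """Percentage of each cluster's own genes that also occur in the reference.

    `clusters` is a list of (name, set_of_genes) pairs, like `clusts` in the
    script above; `ref_index` is the 0-based position of the reference cluster.
    """
    ref = clusters[ref_index][1]
    scores = []
    for i, (name, genes) in enumerate(clusters):
        if i == ref_index or not genes:
            continue  # skip the reference itself and empty clusters
        # Denominator is the size of the compared cluster, not the union
        scores.append((name, 100.0 * len(genes & ref) / len(genes)))
    return scores


# Sample data from the input file above, already parsed into sets.
clusters = [
    ('Cluster 1', {'CSF2', 'NRAS', 'GSK3A', 'GSK3B'}),
    ('Cluster 2', {'MAP3K7', 'HLA-DRA', 'NFKBIA', 'ZAP70'}),
    ('Cluster 3', {'CSF2', 'NRAS', 'GRIN1', 'CDKN1A'}),
    ('Cluster 4', {'GSK3A', 'GSK3B', 'NRAS', 'CSF2'}),
]

THRESHOLD = 60.0  # assumed cutoff, taken from the ">60%" example in the question
for name, score in overlap_scores(clusters, ref_index=0):
    if score > THRESHOLD:
        print('{}\t{:.2f}%'.format(name, score))
```

With Cluster 1 as the reference, Cluster 3 scores 50% under this definition (2 of its 4 genes are shared) instead of 33.33% under Jaccard, and only Cluster 4 passes the 60% cutoff.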

This is brilliant, thanks. But from what I see, it takes the first line and uses it as the reference point. How would I change it to specify the reference, e.g. to use Cluster 2 as the comparison reference?


Okay, I've just edited my answer. I hope this is doing what you want.


Exactly what I needed. Thank you very much!


Hi, thank you again for your help. I was wondering whether it would be possible to output the results as something like a correlation matrix in a CSV file? Cluster names would be on both rows and columns, with the score at their intersection.


This would probably complicate the present solution. However, if you post your problem as a new question, I will be happy to help. If you do so, please provide a sample input and output.
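
In the meantime, a rough sketch of one possible approach, not a definitive solution: compute the same Jaccard score for every pair of clusters and write the result with Python's standard `csv` module. The function names `jaccard_matrix` and `write_matrix_csv` are assumptions made for illustration, and the clusters are assumed to be parsed into (name, set) pairs already.

```python
import csv
import sys


def jaccard_matrix(clusters):
    """Return a square list of lists with the Jaccard score (in %) for every pair."""
    matrix = []
    for _, a in clusters:
        row = []
        for _, b in clusters:
            union = a | b
            # Guard against two empty clusters (empty union)
            row.append(100.0 * len(a & b) / len(union) if union else 0.0)
        matrix.append(row)
    return matrix


def write_matrix_csv(clusters, matrix, fh):
    """Write the matrix to an open file handle, with cluster names as both
    the header row and the first column."""
    names = [name for name, _ in clusters]
    writer = csv.writer(fh)
    writer.writerow([''] + names)
    for name, row in zip(names, matrix):
        writer.writerow([name] + ['{:.2f}'.format(v) for v in row])


# Sample data from the thread, already parsed into sets.
clusters = [
    ('Cluster 1', {'CSF2', 'NRAS', 'GSK3A', 'GSK3B'}),
    ('Cluster 2', {'MAP3K7', 'HLA-DRA', 'NFKBIA', 'ZAP70'}),
    ('Cluster 3', {'CSF2', 'NRAS', 'GRIN1', 'CDKN1A'}),
    ('Cluster 4', {'GSK3A', 'GSK3B', 'NRAS', 'CSF2'}),
]

matrix = jaccard_matrix(clusters)
write_matrix_csv(clusters, matrix, sys.stdout)
```

To save the matrix to disk instead of printing it, pass a file opened with `open(path, 'w', newline='')` in place of `sys.stdout`. The Jaccard matrix is symmetric; a matrix built from the per-cluster overlap fraction would not be, so the choice of score matters here.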
