Question

Results of Protein Interaction Network Clusterings

8.4 years ago

schererpaulm • 0

I am working on a project involving the clustering of protein interaction networks. Having applied several clustering algorithms to the graphs of interacting proteins, I am unsure how to go about judging whether the resulting clusters are any good.

To put this into context: protein interaction networks represent pairwise connections between proteins, and the aim of clustering is to isolate groups of interacting proteins that participate in the same biological processes or that together perform specific functions. This is significant because many proteins and interactions are unlabelled, so their function can be inferred if many proteins labelled with a certain function fall into one cluster.

Unlike typical supervised machine learning tasks, where a labelled data set shows how many groupings are correct, there is no precedent for what a good clustering of proteins and their interactions looks like. Hypothetically, a clustering where every protein is in its own cluster is as good as one where all proteins are in a single cluster (though there is no informational significance in either). There are of course no feature vectors for distance calculations either, only binary information on whether one protein interacts with another, so this is quite difficult.

This problem is completely exploratory, and it is hard to tell whether a clustering is significant or just bogus.

Most academic papers use cluster analysis techniques to assess how good the clusters and the algorithms are, e.g. whether they are robust to edge deletion or node deletion, cluster correlation, etc., as in http://www.biomedcentral.com/1471-2105/7/488 or http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2999340/ or http://www.cse.buffalo.edu/DBGROUP/bioinformatics/papers/chuan.pdf
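A minimal sketch of what an edge-deletion robustness check could look like, in plain Python. Everything named here is an assumption for illustration: `connected_components` is only a stand-in for whichever clustering algorithm is actually being evaluated, the Rand index is one of several possible ways to compare two clusterings, and the toy network is made up. It is not the procedure used in the papers linked above.

```python
import random
from itertools import combinations


def connected_components(nodes, edges):
    """Stand-in clustering (assumption): each connected component is one cluster."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters


def rand_index(clusters_a, clusters_b, nodes):
    """Fraction of node pairs on which two clusterings agree
    (both put the pair together, or both keep it apart)."""
    label_a = {n: i for i, c in enumerate(clusters_a) for n in c}
    label_b = {n: i for i, c in enumerate(clusters_b) for n in c}
    pairs = list(combinations(nodes, 2))
    agree = sum(
        (label_a[u] == label_a[v]) == (label_b[u] == label_b[v])
        for u, v in pairs
    )
    return agree / len(pairs)


def edge_deletion_robustness(nodes, edges, cluster_fn, fraction=0.1, trials=20, seed=0):
    """Mean Rand index between the original clustering and clusterings
    obtained after randomly deleting `fraction` of the edges."""
    rng = random.Random(seed)
    baseline = cluster_fn(nodes, edges)
    n_keep = len(edges) - int(round(fraction * len(edges)))
    scores = []
    for _ in range(trials):
        kept = rng.sample(edges, n_keep)
        scores.append(rand_index(baseline, cluster_fn(nodes, kept), nodes))
    return sum(scores) / len(scores)


# Toy network (made up): two triangles with no edges between them.
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E"), ("E", "F"), ("D", "F")]
baseline = connected_components(nodes, edges)
score = edge_deletion_robustness(nodes, edges, connected_components, fraction=0.3)
print(f"mean Rand index after deleting 30% of edges: {score:.2f}")
```

A score close to 1 would mean the clustering barely changes when edges are removed. The same loop could delete nodes instead of edges, with the comparison restricted to the nodes that remain.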

I would like to see if there is any information one can fish out using protein databases: say, input a large number of interacting proteins (from one cluster) and see if the labelled ones tend to be involved in the same metabolic process. If a significantly high number of proteins are involved in one metabolic process, one can surmise that the unlabelled proteins in the cluster may be involved in a similar process or function, or, similarly, whether they may be part of a particular protein domain.
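The idea of "a significantly high number" can be made concrete with a hypergeometric test: given how many labelled proteins in the whole network carry an annotation, how likely is it to see at least this many in a cluster of this size by chance? The sketch below uses only the standard library. The function names, the protein identifiers and the annotation dictionary are hypothetical, and restricting the background to labelled proteins is an assumption.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, cluster_size, cluster_annotated):
    """P(X >= cluster_annotated) when drawing `cluster_size` proteins without
    replacement from `population` proteins, `annotated` of which carry the term."""
    max_hits = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(cluster_annotated, max_hits + 1)
    )
    return tail / total


def term_enrichment(cluster, annotations, term):
    """Enrichment p-value of `term` in `cluster`.
    Assumption: only labelled proteins (keys of `annotations`) form the background."""
    labelled = set(annotations)
    with_term = {p for p, terms in annotations.items() if term in terms}
    cluster_labelled = set(cluster) & labelled
    hits = len(cluster_labelled & with_term)
    return hypergeom_enrichment_p(
        len(labelled), len(with_term), len(cluster_labelled), hits
    )


# Made-up annotations: 8 labelled proteins, 3 of them tagged "glycolysis".
annotations = {
    "P1": {"glycolysis"}, "P2": {"glycolysis"}, "P3": {"glycolysis"},
    "P4": {"other"}, "P5": {"other"}, "P6": {"other"},
    "P7": {"other"}, "P8": {"other"},
}
cluster = {"P1", "P2", "P3", "U1"}  # U1 is an unlabelled protein
p_value = term_enrichment(cluster, annotations, "glycolysis")
print(f"p = {p_value:.4f}")
```

When many terms and many clusters are tested, the resulting p-values need a multiple-testing correction before any of them is called significant.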

I have only just begun delving into bioinformatics and research in general, so there is a very high chance that this has been done before and I haven't looked around extensively enough. If this is the case, I would be grateful for links. I would appreciate any help, or ideas on how one could think about this problem.

clusters ppi clustering • 1.8k views

updated 21 months ago by Ram • written 8.4 years ago by schererpaulm

score 2 · Answer 1 · 2015-12-10

Cluster analysis is an exploratory data analysis method. It is unsupervised, meaning you are not adding prior knowledge to the algorithm, so the clusters you get reflect whatever structure your chosen algorithm can identify in your data. Clustering graphs of physical protein interactions should reveal protein complexes. Since there are many known protein complexes, you can easily validate your clusters against this ground truth. Not everybody agrees on the granularity of some complexes (where some see one large complex, others see two or three sub-complexes), but picking one good reference database (e.g. Reactome) would be acceptable.

If your interactions are not physical, they likely represent some form of functional relationship, in which case the clusters would probably represent biological processes. These you can again validate against current knowledge, e.g. using GO annotation enrichment analysis of the clusters, or by comparing clusters to lists of genes known to be involved in various biological processes (available from various databases, e.g. Reactome, Panther, KEGG...).
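As a rough sketch of validating clusters against known complexes, each reference complex can be matched to its best-overlapping cluster using the Jaccard index. The function names, the 0.5 threshold and the toy data are assumptions for illustration only. In practice the reference sets would be loaded from whichever database is chosen.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(clusters, reference_complexes):
    """For each reference complex, find the cluster with the highest Jaccard
    overlap. Returns {complex_name: (cluster_index, jaccard_score)}."""
    results = {}
    for name, members in reference_complexes.items():
        scores = [jaccard(members, c) for c in clusters]
        best = max(range(len(clusters)), key=scores.__getitem__)
        results[name] = (best, scores[best])
    return results


def fraction_recovered(clusters, reference_complexes, threshold=0.5):
    """Share of reference complexes whose best match reaches `threshold`.
    The default threshold is an arbitrary assumption."""
    matches = best_matches(clusters, reference_complexes)
    return sum(score >= threshold for _, score in matches.values()) / len(matches)


# Made-up clusters and reference complexes.
clusters = [{"A", "B", "C"}, {"D", "E"}, {"F", "G", "H", "I"}]
reference_complexes = {
    "complex_1": {"A", "B", "C"},
    "complex_2": {"D", "E", "F"},
}
matches = best_matches(clusters, reference_complexes)
recovered = fraction_recovered(clusters, reference_complexes)
for name, (idx, score) in matches.items():
    print(f"{name}: best cluster {idx}, Jaccard {score:.2f}")
```

Because of the granularity issue mentioned above, a cluster that merges two sub-complexes will score lower against each one separately, so low scores are worth inspecting by hand before concluding the clustering is poor.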