I am working on a project involving the clustering of protein interaction networks. Having implemented several clustering algorithms on graphs of interacting proteins, I am unsure how to assess whether the resulting clusters are any good.
To put this into context: protein interaction networks represent pairwise connections between proteins, and the goal of clustering is to isolate groups of interacting proteins that participate in the same biological processes or that together perform specific functions. This is significant because many proteins and interactions are unlabelled, so inferences about their function can be made if many labelled proteins with a certain function fall in one cluster.
Unlike typical supervised machine learning tasks, where a labelled data set shows how many groupings are correct, there is no precedent for what a good clustering of proteins and their interactions looks like. Hypothetically, a clustering where every protein is in its own cluster is as good as one where all proteins are in a single cluster (though neither carries any information). There are also no feature vectors for distance calculations, only binary information on whether one protein interacts with another, which makes evaluation quite difficult.
This problem is completely exploratory, and it is hard to tell whether a clustering is significant or just bogus.
Most academic papers use cluster analysis techniques to assess how good the clusters and the algorithms are, i.e. whether they are robust to edge deletion or node deletion, cluster correlation, etc., as in http://www.biomedcentral.com/1471-2105/7/488, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2999340/ or http://www.cse.buffalo.edu/DBGROUP/bioinformatics/papers/chuan.pdf
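To make the edge-deletion idea concrete, here is a minimal sketch of how such a robustness check might look. It is not taken from any of the papers above. The function names are hypothetical, connected components stand in for a real clustering algorithm, and the agreement score (Jaccard index over co-clustered protein pairs) is just one possible choice.

```python
import random
from itertools import combinations

def connected_components(nodes, edges):
    """Placeholder clustering: each connected component is one cluster.
    Swap in any function with the same (nodes, edges) -> list-of-sets shape."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        clusters.append(component)
    return clusters

def co_clustered_pairs(clusters):
    """All unordered protein pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

def pair_jaccard(clusters_a, clusters_b):
    """Jaccard index of the co-clustered pair sets of two clusterings."""
    a, b = co_clustered_pairs(clusters_a), co_clustered_pairs(clusters_b)
    if not a and not b:
        return 1.0  # both clusterings are all singletons
    return len(a & b) / len(a | b)

def edge_deletion_robustness(nodes, edges, cluster_fn,
                             fraction=0.1, trials=20, seed=0):
    """Mean agreement between the original clustering and clusterings
    obtained after randomly deleting a fraction of the edges."""
    rng = random.Random(seed)
    edges = list(edges)
    baseline = cluster_fn(nodes, edges)
    keep = len(edges) - int(round(fraction * len(edges)))
    scores = []
    for _ in range(trials):
        kept_edges = rng.sample(edges, keep)
        scores.append(pair_jaccard(baseline, cluster_fn(nodes, kept_edges)))
    return sum(scores) / len(scores)
```

A score close to 1 would suggest the clusters survive the perturbation, while a score close to 0 would suggest they depend on a few specific edges. Node deletion could be handled the same way by sampling nodes instead of edges.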
I would like to see whether there is any information one can fish out using protein databases: say, input a large number of interactions (from one cluster) and see whether the labelled ones tend to be involved in the same metabolic process. If a significantly high number of proteins are involved in one metabolic process, one can surmise that the unlabelled proteins may be involved in a similar process or function, or, similarly, that they may or may not be part of a protein domain.
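One way I could imagine quantifying "significantly high" is a hypergeometric over-representation test, sketched below. This is an assumption on my part about what a suitable test might be, not an established answer to my question. The function names and the `annotations` dictionary (labelled protein to set of process labels, for example terms pulled from an annotation database) are hypothetical.

```python
from math import comb

def hypergeom_enrichment_pvalue(population, annotated, cluster_size, hits):
    """P(X >= hits) when cluster_size labelled proteins are drawn without
    replacement from `population` labelled proteins, of which `annotated`
    carry the label of interest."""
    max_hits = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total

def cluster_enrichment(cluster, annotations):
    """Enrichment p-value for every label seen in the cluster.
    Unlabelled proteins (those missing from `annotations`) are ignored."""
    labelled = set(annotations)
    in_cluster = labelled & set(cluster)
    terms = set()
    for protein in in_cluster:
        terms |= annotations[protein]
    results = {}
    for term in terms:
        annotated = sum(1 for p in labelled if term in annotations[p])
        hits = sum(1 for p in in_cluster if term in annotations[p])
        results[term] = hypergeom_enrichment_pvalue(
            len(labelled), annotated, len(in_cluster), hits
        )
    return results
```

A small p-value for a label would mean the cluster contains more proteins with that label than expected by chance. Since many labels and many clusters would be tested, the p-values would need a multiple-testing correction (e.g. Bonferroni or Benjamini-Hochberg) before drawing conclusions about the unlabelled proteins.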
I have just begun delving into bioinformatics and research in general, so there is a very high chance that this has been done before and I simply haven't looked around extensively enough. If that is the case, I would be grateful for links. I would appreciate any help, or ideas on how one could think about this problem.