Question

Statistical Coupling Analysis -- How To Get A Single Score?

3

12.2 years ago

Fabian ▴ 50

Hi,

I'm experimenting with SCA (http://hhmi.swmed.edu/Labs/rr/sca.html).

My goal is to get a single score for an MSA that represents how statistically coupled the residues are. This score would then be used as one of many scores that together should distinguish "good" templates from "bad" templates. The assumption is that templates with strongly coupled residues are superior to those without.

The problem is that SCA returns a matrix. I tried averaging the elements, but got bad results. I will try "maxing" the elements next.

Is there a smarter way to do this? What would you do?

statistics analysis • 3.3k views

updated 10.7 years ago by Biostar 20 • written 12.2 years ago by Fabian ▴ 50

score 1 · Answer 1 · 2012-02-12

Have you looked into the leading MI-related (mutual information) methods? They are superior at predicting residue-residue contacts. Dickson et al. 2010 demonstrated residue-level statistics derived from an MI graph that detect a few types of alignment errors, namely shifts. I've actually been thinking of writing a structural alignment script that finds alignments by argmax(corr(struct, MI)). Of course this could be performed on SCA and without a structure, but MI also has quite a few attractive theoretical properties and can be written as a simple nested loop.
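To make the "simple nested loop" remark concrete, here is a minimal sketch of a plain MI matrix over alignment columns. It is not the method from Dickson et al. or from any SCA release; the names `column_mi`, `mi_matrix`, and `toy_msa` are made up for illustration, and it assumes equal-length sequences with no gap handling, sequence weighting, or background correction.

```python
from collections import Counter
from math import log2


def column_mi(col_x, col_y):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[a] * count_y[b]))
    return mi


def mi_matrix(msa):
    """Return a symmetric L x L matrix of pairwise column MI values.

    Assumption: msa is a list of equal-length aligned sequences.
    """
    length = len(msa[0])
    cols = [[seq[i] for seq in msa] for i in range(length)]
    matrix = [[0.0] * length for _ in range(length)]
    for i in range(length):
        for j in range(i + 1, length):
            matrix[i][j] = matrix[j][i] = column_mi(cols[i], cols[j])
    return matrix


# Hypothetical toy alignment: columns 0 and 1 covary, column 2 is conserved.
toy_msa = ["ACDA", "ACDC", "GTDA", "GTDC"]
mi = mi_matrix(toy_msa)
```

On real alignments raw MI is biased by column entropy and phylogeny, so a corrected variant would probably be needed before comparing templates.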

If the above paper doesn't suit you, let's rephrase the problem. The most natural representation of coevolutionary relations for a protein is a weighted, multi-edged digraph, where the edges incident to the residues are the elements of the joint probability mass function reshaped from columns in the MSA and possibly normalized. It is pretty nasty. Alternatively, some metrics employ a chi-squared approach of comparing profiles to each column in the MSA. Older (but not newer) versions of SCA, as well as OMES, employ a perturbative approach.

A simplification employed by all coevolutionary metrics is to derive one score per residue pair from all of the edges between the two residues. Is there any data important to alignment that you've lost in this representation? Do you care about how the residues are coevolving? Do you want a higher preference for pairs whose p(residue_x|residue_y) makes them more likely to interact? This is a many-to-1 mapping. Now your data is a matrix, which you're familiar with.

There are many different ways of simplifying a network into a vector that somehow represents our intuitive notion of importance: degree, PageRank, eigenvector or flow centrality, and many more. It depends on what you're trying to find. You might derive these metrics over moving windows. Your original idea, though, could be done by first studying the assortativity of a high-quality, hand-curated multiple sequence alignment compared to an Erdős-Rényi graph of the same size; the goal might then be to find the template corresponding to the MI graph that is most disassortative, for example.
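As a rough illustration of going from a coupling matrix to a per-residue vector and then to one number, the sketch below computes weighted degree and eigenvector centrality in plain Python. The function names and the toy matrix are hypothetical, the matrix is assumed symmetric and non-negative, and using the Rayleigh quotient as the final scalar is just one possible choice, not something SCA or any MI package prescribes.

```python
def strength(matrix):
    """Weighted degree: sum of coupling weights incident to each residue."""
    return [sum(w for j, w in enumerate(row) if j != i)
            for i, row in enumerate(matrix)]


def eigenvector_centrality(matrix, iterations=200, tol=1e-10):
    """Power iteration on (A + I); assumes a symmetric, non-negative matrix."""
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(iterations):
        # The identity shift keeps the iteration from oscillating.
        w = [v[i] + sum(matrix[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
        if max(abs(a - b) for a, b in zip(v, w)) < tol:
            return w
        v = w
    return v


def spectral_score(matrix, centrality):
    """Rayleigh quotient v^T A v for a unit vector v (one candidate scalar)."""
    n = len(matrix)
    return sum(centrality[i] * matrix[i][j] * centrality[j]
               for i in range(n) for j in range(n))


# Hypothetical coupling matrix: only residues 0 and 1 are coupled.
toy = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
deg = strength(toy)
cent = eigenvector_centrality(toy)
score = spectral_score(toy, cent)
```

Whether a score like this actually separates good templates from bad ones would have to be checked against curated alignments; the same matrix could just as well be fed to PageRank or an assortativity measure.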

tl;dr check out information theory and graph theory