Question: Statistical Coupling Analysis -- How To Get A Single Score?
3
8.3 years ago by
Fabian50
Fabian50 wrote:

Hi,

I'm experimenting with SCA (http://hhmi.swmed.edu/Labs/rr/sca.html).

My goal is to get a single score for an MSA that represents how strongly its residues are statistically coupled. This score would then be used as one of many scores that together distinguish "good" templates from "bad" templates. The assumption is that templates with strongly coupled residues are superior to those without.

The problem is that SCA returns a matrix. I tried averaging the elements, but got bad results. I will try "maxing" the elements next.

Is there a smarter way to do this? What would you do?

analysis statistics • 2.4k views
modified 6.8 years ago by Biostar ♦♦ 20 • written 8.3 years ago by Fabian50
1
8.3 years ago by
Jake Mick50
Columbia, Missouri
Jake Mick50 wrote:

Have you looked into the leading MI-based (mutual information) methods? They are superior at predicting residue-residue contacts. Dickson et al. 2010 demonstrated residue-level statistics derived from an MI graph that detect a few types of alignment errors, namely shifts. I've actually been thinking of writing a structural alignment script that finds alignments by argmax(corr(struct, MI)). Of course this could be done with SCA and without a structure, but MI also has quite a few attractive theoretical properties and can be written as a simple nested loop.
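To make the "simple nested loop" concrete, here is a minimal sketch of plain mutual information between every pair of MSA columns. It is not the method from Dickson et al. and applies no background or sampling corrections. The function names (`column_mi`, `mi_matrix`) and the toy alignment are made up for illustration, and gaps are treated as just another symbol, which is an assumption you may not want.

```python
import math
from collections import Counter


def column_mi(col_x, col_y):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_x)
    px = Counter(col_x)             # marginal counts for column x
    py = Counter(col_y)             # marginal counts for column y
    pxy = Counter(zip(col_x, col_y))  # joint counts for the column pair
    mi = 0.0
    for (a, b), count in pxy.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


def mi_matrix(msa):
    """msa: list of equal-length aligned sequences (hypothetical input format).

    Returns a symmetric L x L list of lists with zeros on the diagonal.
    """
    length = len(msa[0])
    columns = [[seq[i] for seq in msa] for i in range(length)]
    matrix = [[0.0] * length for _ in range(length)]
    for i in range(length):              # the nested loop over column pairs
        for j in range(i + 1, length):
            matrix[i][j] = matrix[j][i] = column_mi(columns[i], columns[j])
    return matrix


# Toy alignment: columns 0 and 1 covary perfectly, column 2 is invariant.
toy_msa = ["ACA", "ACA", "GTA", "GTA"]
toy_mi = mi_matrix(toy_msa)
```

In the toy alignment the covarying pair gets a high score and any pair involving the invariant column gets zero, which is also why raw MI says nothing about fully conserved positions.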

If the above paper doesn't suit you, let's rephrase the problem. The most natural representation of coevolutionary relations for a protein is a weighted, multi-edged digraph, where the edges incident to the residues are the elements of the joint probability mass function reshaped from columns in the MSA and possibly normalized. It is pretty nasty. Alternatively, some metrics employ a chi-squared approach, comparing profiles to each column in the MSA. Older, but not newer, versions of SCA, as well as OMES, employ a perturbative approach.

A simplification employed by all coevolutionary metrics is to derive a single score from all of the edges between each pair of residues. Is there any data important to the alignment that you've lost with this representation? Do you care about how the residues are coevolving? Do you want to have higher preference for all p(residues_x|residue_y) more likely to interact? This is a many-to-1 mapping.

Now your data is a matrix, which you're familiar with. There are many different ways of simplifying a network into a vector that somehow represents our intuitive notion of importance: degree, PageRank, eigenvector or flow centrality, and many, many more. It depends on what you're asking to find. You might derive these metrics over moving windows. Your original idea, though, could be done by first studying the assortativity of a quality hand-curated multiple sequence alignment as compared to an Erdős-Rényi graph of the same size. The goal might then be to find the template corresponding to the MI graph that is most disassortative, for example.
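As a rough sketch of the matrix-to-vector step, here are two of the simpler centralities computed directly on a symmetric, non-negative coupling matrix (from SCA, MI, or anything else). The function names and the toy matrix are made up for illustration, and the eigenvector centrality is a plain power iteration on A + I rather than a call into any graph library.

```python
def weighted_degree(matrix):
    """Sum of edge weights incident to each residue (diagonal ignored)."""
    n = len(matrix)
    return [sum(matrix[i][j] for j in range(n) if j != i) for i in range(n)]


def eigenvector_centrality(matrix, iterations=200, tol=1e-10):
    """Power iteration on A + I for a symmetric, non-negative matrix A.

    Adding the identity keeps the same eigenvectors as A but avoids the
    oscillation that plain power iteration can show on bipartite-like graphs.
    The result is normalized to sum to 1.
    """
    n = len(matrix)
    vec = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [
            vec[i] + sum(matrix[i][j] * vec[j] for j in range(n) if j != i)
            for i in range(n)
        ]
        total = sum(nxt)
        nxt = [v / total for v in nxt]
        if max(abs(a - b) for a, b in zip(nxt, vec)) < tol:
            return nxt
        vec = nxt
    return vec


# Toy coupling matrix: residues 0 and 1 are strongly coupled,
# residue 2 is only weakly coupled to both.
toy_coupling = [
    [0.0, 1.0, 0.1],
    [1.0, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
toy_degree = weighted_degree(toy_coupling)
toy_centrality = eigenvector_centrality(toy_coupling)
```

Either vector ranks residues 0 and 1 above residue 2 here. How you then collapse the vector into one number per MSA (maximum, mean of the top few positions, something else) is a separate choice that depends on what you are asking to find.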

tl;dr check out information theory and graph theory

Thanks Jake! I will read the MI papers you mentioned. However, I'm not quite sure you've understood my use case: I am not interested in which residues are co-evolving. Instead, given a bunch of completely unrelated MSAs, I want to pick the one that is most conserved. That is: one MSA, one score.

Check out the second paper; it goes into great detail on an easy-to-spot alignment error that occurs even in high-quality sequence alignments. I know of 2 papers that cite it as demonstrating potential to improve sequence alignment, since no current method I'm aware of takes into account any of the coevolution information in an alignment. Which brings me to my next point: do you want to score the columns of the MSA by conservation? Coevolutionary metrics aren't used for traditional conservation calculations, though you certainly could derive a score for a position being evolutionarily constrained.
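For comparison, a traditional conservation calculation that gives "one MSA, one score" might look like the sketch below: Shannon entropy per column, rescaled so that 1 means fully conserved, then averaged over columns. The function names, the averaging step, and the 20-letter alphabet are assumptions for illustration. Gaps are counted as an ordinary symbol here, and no sequence weighting is applied, so treat it as a starting point only.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


def msa_conservation(msa, alphabet_size=20):
    """Mean per-column conservation for a list of equal-length aligned sequences.

    Each column scores 1 - H / log2(alphabet_size), so an invariant column
    scores 1 and a column with uniformly distributed residues scores near 0.
    """
    length = len(msa[0])
    max_entropy = math.log2(alphabet_size)
    scores = []
    for i in range(length):
        column = [seq[i] for seq in msa]
        scores.append(1.0 - column_entropy(column) / max_entropy)
    return sum(scores) / length


# Two toy alignments: the first is invariant, the second varies in column 1.
conserved_score = msa_conservation(["AC", "AC", "AC", "AC"])
variable_score = msa_conservation(["AC", "AD", "AE", "AF"])
```

The invariant toy alignment scores higher than the variable one. Note that this measures conservation only and uses none of the coevolution information discussed above.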