Question

Defining residues as buried or exposed based on ASAs/RSAs

2.7 years ago

Agenor Neto ▴ 10

Hello everyone,

First, the problem considered here is not new on Biostars, but the earlier questions were asked quite a few years ago, and I would like to update the discussion to help myself and others who may be facing the same question. See similar posts: 1, 2

The question is: accessible surface area (ASA) and relative accessible surface area (RSA) are key measurements when classifying a residue as exposed or buried in a protein. There are tools that calculate these values (NACCESS, Biopython's Bio.PDB.SASA, etc.). But how can we make the classification based on these numbers? Does it depend on the particular proteins being studied? I would also like to know whether there is a way to classify not a single residue but an amino acid stretch of a particular protein and, if so, how (summing the RSA and ASA values?).

I really think there is no single "right answer" to this, but I would appreciate it if we could discuss the problem. It would be great not just for me but for everyone.

structural-bioinformatics protein-biology • 1.0k views

updated 2.7 years ago by Ram 43k • written 2.7 years ago by Agenor Neto ▴ 10

score 2 · Accepted Answer · 2021-07-23

But how can we make the classification based on these numbers?

First, it doesn't have to be a classification - it can be a regression. That means predicting a continuous number (usually relative solvent accessibility, RSA) rather than making a binary choice between buried and exposed. But let's say you want classification, in which case you pick a threshold that delineates the two groups. I think the most commonly used threshold is an RSA of 25%, but I have seen values anywhere in the 15-25% range. When I looked into it about 7-8 years ago, a threshold of 23.79% was the exact midpoint between the two classes in the dataset I used, so that they ended up with an equal number of data points. The dataset would still be fairly balanced at 25%.
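To make the thresholding concrete, here is a minimal sketch in plain Python. The function names are made up for illustration, and treating a residue that sits exactly on the threshold as exposed is my assumption. The maximum ASA table holds the theoretical values from Tien et al. (2013), the same scale Biopython ships as its "Wilke" table; swap in whatever reference scale your tool uses. The ASA numbers in the usage lines are invented, not taken from a real structure.

```python
from statistics import median

# Theoretical maximum ASA per residue type, in square angstroms
# (Tien et al. 2013). Assumption: replace with the scale your tool uses.
MAX_ASA = {
    "ALA": 129.0, "ARG": 274.0, "ASN": 195.0, "ASP": 193.0, "CYS": 167.0,
    "GLN": 225.0, "GLU": 223.0, "GLY": 104.0, "HIS": 224.0, "ILE": 197.0,
    "LEU": 201.0, "LYS": 236.0, "MET": 224.0, "PHE": 240.0, "PRO": 159.0,
    "SER": 155.0, "THR": 172.0, "TRP": 285.0, "TYR": 263.0, "VAL": 174.0,
}


def relative_asa(resname, asa, max_asa=MAX_ASA):
    """Return RSA as a fraction: ASA divided by the residue's maximum ASA, capped at 1."""
    return min(asa / max_asa[resname], 1.0)


def classify(rsa, threshold=0.25):
    """Label a residue from its RSA; a value exactly at the threshold counts as exposed."""
    return "exposed" if rsa >= threshold else "buried"


def balanced_threshold(rsa_values):
    """Return the median RSA, i.e. the cutoff that splits the data into equal halves."""
    return median(rsa_values)


# Invented ASA values, for illustration only.
residues = [("LEU", 12.0), ("LYS", 150.0), ("ALA", 40.0), ("TRP", 30.0)]
for name, asa in residues:
    rsa = relative_asa(name, asa)
    print(name, round(rsa, 3), classify(rsa))
```

Running `balanced_threshold` on the RSA values of your own non-redundant set tells you where the balanced cutoff falls for that data; it need not match the number I quoted.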

Let's say you take a non-redundant set of protein structures such that no sequence is a detectable homolog of any other. To each residue in them we assign a buried or exposed class using NACCESS or DSSP. Next, we build a position-specific scoring matrix (PSSM) for each sequence using PSI-BLAST, HHblits, jackhmmer or some other tool that can iteratively search for homologs and make profiles from multiple sequence alignments. Now we take a window around each residue, say 7 to its left and 7 to its right, and we have 15 x 20 values (a window of 15 residues, each with 20 amino-acid frequencies from the PSSM). These 300 values are associated with a label of 0 or 1 that corresponds to the middle residue's RSA class, and that is how you build your dataset for classification (for regression, the target is the RSA value itself).
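The window construction described above can be sketched as follows. This is an illustration, not the code of any particular predictor: `window_features` and `build_dataset` are hypothetical names, the PSSM is assumed to be already parsed into a list of 20-value rows, padding the termini with zeros is my assumption (the text above does not say how to handle the ends), and so is encoding exposed as 1.

```python
def window_features(pssm, half_window=7):
    """Flatten a (2 * half_window + 1)-residue PSSM window around every position.

    pssm is a list of rows, one per residue, each holding 20 values.
    Positions that fall off either end of the sequence are zero-padded
    (an assumption; other padding schemes are possible).
    """
    n_cols = len(pssm[0])
    pad = [0.0] * n_cols
    features = []
    for i in range(len(pssm)):
        row = []
        for j in range(i - half_window, i + half_window + 1):
            row.extend(pssm[j] if 0 <= j < len(pssm) else pad)
        features.append(row)
    return features


def build_dataset(pssm, rsa, threshold=0.25, half_window=7):
    """Pair each residue's window features with a 0/1 label (1 = exposed)."""
    X = window_features(pssm, half_window)
    y = [1 if r >= threshold else 0 for r in rsa]
    return X, y
```

With the default window, each row of `X` has 15 x 20 = 300 values and `y` holds the class of the middle residue. For regression, keep `rsa` itself as the target instead of thresholding it.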

Now the next thing may seem contradictory, because I explicitly told you above to include a neighborhood of 7 amino acids on each side in this dataset, but you can't meaningfully add things up and get an average accessibility value for a stretch of amino acids. That's because two residues in an alpha-helix that are separated by one other residue can be fully exposed and fully buried, respectively: one of them faces the solvent and the other (half a helix turn away) is packed against the protein core. Averaging accessibilities over a window, especially a large one, would give you fairly uniform values across the whole length, which would not be meaningful. Still, although accessibility is a per-residue quantity, it depends on the residue's immediate neighbors, which is why it makes sense to use a window of PSSM values when predicting.
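A toy calculation shows the flattening effect of averaging. The RSA profile below is synthetic: I assume an idealized helix with 100 degrees of rotation per residue and model RSA as a cosine of that angle, so nothing here comes from a real structure, and the function names are made up.

```python
import math


def helix_like_rsa(n, degrees_per_residue=100.0):
    """Synthetic RSA profile: residues rotate around a helix axis, one face exposed."""
    return [
        0.5 + 0.5 * math.cos(math.radians(degrees_per_residue * i))
        for i in range(n)
    ]


def window_mean(values, half_window=7):
    """Mean over every full window of 2 * half_window + 1 consecutive values."""
    width = 2 * half_window + 1
    return [
        sum(values[i:i + width]) / width
        for i in range(len(values) - width + 1)
    ]


raw = helix_like_rsa(40)
smooth = window_mean(raw)
print("raw range:     ", round(max(raw) - min(raw), 3))
print("averaged range:", round(max(smooth) - min(smooth), 3))
```

In this toy profile the per-residue values swing between almost fully buried and fully exposed, while the 15-residue averages all land close to 0.5, so the averaged number no longer distinguishes one stretch from another.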

Not sure why you think discussing this topic will be great for everyone. There are plenty of other topics that are of greater interest to others, and still hardly any topic that is of interest to everyone. I am happy to help you because this also happens to interest me, but I doubt there are more than 10-20 other people on this website that care about this topic.