But how can we make the classification based on these numbers?
First, it doesn't have to be a classification - it can be a regression. That means predicting a continuous number (usually relative solvent accessibility - RSA) rather than a binary choice between buried / exposed. But let's say you want classification, in which case you pick a threshold that will delineate the two groups. I think the most commonly used threshold is an RSA of 25%, but I have seen it anywhere in the 15-25% range. About 7-8 years ago when I looked a bit into it, a threshold of 23.79% was the exact midpoint, i.e. the value that split the data into two classes with an equal number of data points. The dataset would still be fairly balanced at 25%.
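To make the thresholding concrete, here is a minimal sketch. The RSA values are made up, the function names are mine, and treating a residue exactly at the threshold as exposed is an assumption (conventions differ). The "balanced" cut-off is simply the median of the RSA values.

```python
from statistics import median


def classify_rsa(rsa_percent, threshold=25.0):
    """Return 1 (exposed) if RSA is at or above the threshold, else 0 (buried).

    Counting a value exactly at the threshold as exposed is an assumption.
    """
    return [1 if value >= threshold else 0 for value in rsa_percent]


def balanced_threshold(rsa_percent):
    """Median RSA: the cut-off that splits the data into two equal-sized classes."""
    return median(rsa_percent)


# Made-up RSA values (in percent) for eight residues, purely for illustration.
rsa = [2.0, 5.5, 12.0, 20.0, 31.0, 48.0, 67.5, 90.0]

labels_at_25 = classify_rsa(rsa)             # fixed 25% threshold
cutoff = balanced_threshold(rsa)             # data-driven threshold
labels_balanced = classify_rsa(rsa, cutoff)  # half buried, half exposed
```

For regression you would skip this step and keep the RSA values themselves as targets.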
Let's say you take a non-redundant set of protein structures such that no single sequence is a detectable homolog of any other. To each residue in them we assign a buried or exposed class based on the accessibility calculated by NACCESS or DSSP. Next, we build a position-specific scoring matrix (PSSM) for each sequence using PSI-BLAST, HHblits, jackhmmer or some other tool that can iteratively search for homologs and make profiles from multiple sequence alignments. Now we take a window around each residue, say 7 to its left and 7 to its right, and we have 15 x 20 values (a window of 15 residues, each with 20 per-amino-acid values from the PSSM). These 300 values are paired with the label of the middle residue: 0 or 1 (buried or exposed) for classification, or its RSA for regression. That is how you build your dataset.
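The windowing step can be sketched in a few lines of plain Python. This assumes the PSSM has already been parsed into a list of rows (one per residue, 20 numbers each). The function name is mine, and zero-padding the positions that fall off either end of the sequence is my assumption, since there are other ways to handle the termini.

```python
def window_features(pssm, half_window=7):
    """Flatten a window of PSSM rows around each residue into one feature vector.

    pssm is a non-empty list of rows, one per residue, each with 20 numbers.
    Positions beyond either end of the sequence are zero-padded (an assumption).
    """
    n_cols = len(pssm[0])
    pad = [0.0] * n_cols
    features = []
    for i in range(len(pssm)):
        row = []
        for j in range(i - half_window, i + half_window + 1):
            row.extend(pssm[j] if 0 <= j < len(pssm) else pad)
        features.append(row)
    return features


# Toy "PSSM" for a 30-residue sequence; the numbers are placeholders, not real scores.
pssm = [[float(i + k) for k in range(20)] for i in range(30)]

X = window_features(pssm)  # one vector of 15 x 20 = 300 values per residue
```

Each row of X is then paired with the class (or RSA) of the residue at the center of its window.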
The next point may seem contradictory, because above I explicitly told you to include a neighborhood of 7 amino acids on each side in the dataset: you can't meaningfully add things up and get an average accessibility value for a stretch of amino acids. That's because two residues in an alpha-helix that are separated by a single residue can be fully exposed and fully buried, respectively, since one of them is facing the solvent and the other one (half a helix turn away) is packed against the protein core. Averaging accessibilities over a window, especially a large one, would give you fairly uniform values across the whole length, which would not be meaningful. Still, even though accessibility is a property of an individual residue, it depends on that residue's immediate neighbors, which is why it makes sense to use a window of PSSM values when predicting.
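A quick way to see the flattening effect is to average a made-up RSA profile that alternates between exposed and buried stretches, roughly as you might see along a helix with one face in the solvent. The numbers and the function name are purely illustrative.

```python
def window_mean(values, half_window):
    """Mean over a sliding window, truncated at the ends of the sequence."""
    means = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        means.append(sum(values[lo:hi]) / (hi - lo))
    return means


# Made-up RSA values (percent), alternating between exposed and buried residues.
rsa = [80, 70, 5, 10, 75, 85, 5, 60, 80, 10, 5, 70, 85, 10, 5, 75]

smoothed = window_mean(rsa, half_window=7)

# The per-residue values span a wide range; the averaged ones span a much narrower one.
spread_raw = max(rsa) - min(rsa)
spread_smoothed = max(smoothed) - min(smoothed)
```

The averaged profile loses exactly the residue-to-residue contrast that you are trying to predict.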
Not sure why you think discussing this topic will be great for everyone. There are plenty of other topics that are of greater interest to others, and hardly any topic is of interest to everyone. I am happy to help you because this also happens to interest me, but I doubt there are more than 10-20 other people on this website who care about this topic.