I've recently joined the Bioinformatics community, though my background in Biology is not extensive. At all. In any case, I've been trying to work with the datasets from CBS on peptide:MHCII binding affinity. My intention, put very briefly, is to train a model by implementing my own algorithm to perform the predictions.
What I've identified as the biggest challenge, however, is to properly format the training data so that its features are actually meaningful and hold information that will make the predictions accurate.
According to a variety of papers, namely Nielsen et al., 2003, Nielsen et al., 2009, as well as Luo et al., 2015, the preferred method of feature extraction is protein encoding using BLOSUM matrices. Other methods include sparse encoding and extraction of physicochemical features.
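For comparison, sparse encoding is the one method I think I understand. Here is a minimal sketch, assuming "sparse" simply means one-hot (20 positions per residue, a single 1 marking the amino acid). The function name `sparse_encode`, the alphabet ordering, and the example peptide are my own choices, and the papers may use different on/off values than 1 and 0.

```python
# Assumption: "sparse encoding" = one-hot over the 20 standard amino acids.
# The ordering below is arbitrary; any fixed ordering would work.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def sparse_encode(peptide):
    """Return a flat list of length 20 * len(peptide) with one 1 per residue."""
    vector = []
    for residue in peptide.upper():
        one_hot = [0] * len(AMINO_ACIDS)
        one_hot[AA_INDEX[residue]] = 1  # KeyError for non-standard residues
        vector.extend(one_hot)
    return vector


# Arbitrary 9-residue example peptide (hypothetical, not from the CBS data).
features = sparse_encode("FVNQHLCGS")
print(len(features))  # 9 residues * 20 values each
```

Every peptide of the same length ends up as a vector of the same length, which is what a regression or classification model needs as input.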
Now, what I cannot understand is how the BLOSUM encoding is actually applied. From what I've found so far, BLOSUM matrices are used to score the similarity between evolutionarily divergent protein sequences; yet the studies mentioned describe the encoding as a way of representing a peptide by means of the BLOSUM matrix itself (which is then turned into a vector):
"The peptide core was presented to the network using Blosum encoding, where each amino acid was encoded by the BLOSUM log-odds vector" (Nielsen, 2009)
Meaning that, in some obscure way, each peptide was simply transformed into a vector. How so? Was it aligned and compared to the other peptides that appear in the dataset? Was it aligned using BLAST or another alignment tool against an online database?
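My best guess from the quoted sentence, sketched here so that someone can correct it, is that no alignment is involved at all: each residue is simply replaced by its own row of the BLOSUM matrix (its 20 log-odds scores against every amino acid), and the rows are concatenated. Assumptions on my side: I use BLOSUM62 because it is the variant I could find most easily, and the papers may use a different BLOSUM variant or rescale the values. The name `blosum_encode` and the example peptide are hypothetical.

```python
# BLOSUM62 scores for the 20 standard amino acids, rows and columns in the
# order given by AMINO_ACIDS. Assumption: the papers may use another BLOSUM
# variant and/or rescale these values before feeding them to the network.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
BLOSUM62_ROWS = """
 4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
-1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
-2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
-2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
 0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
-1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
-1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
-2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
-1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
-1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
-1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
-1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
-2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
 1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
 0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
-2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
 0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""

# Map each amino acid letter to its row of 20 scores.
BLOSUM62 = {
    aa: [int(score) for score in row.split()]
    for aa, row in zip(AMINO_ACIDS, BLOSUM62_ROWS.strip().splitlines())
}


def blosum_encode(peptide):
    """Concatenate the BLOSUM62 row of every residue into one flat vector."""
    vector = []
    for residue in peptide.upper():
        vector.extend(BLOSUM62[residue])  # KeyError for non-standard residues
    return vector


# Arbitrary 9-residue example core (hypothetical, not from the CBS data).
core_features = blosum_encode("FVNQHLCGS")
print(len(core_features))  # 9 residues * 20 scores each
```

If this reading is right, the encoding is a per-residue lookup and never compares the peptide to any other sequence. To avoid typing the matrix by hand, Biopython appears to ship it via `Bio.Align.substitution_matrices.load("BLOSUM62")`, though I have not checked how that fits with the rest of my pipeline.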
Bonus: what is the most widely accepted method of representing peptides for classification/regression purposes (given my intention, that is, modeling peptide:MHCII binding affinity)? Some literature as reference would be a good place to start; Python packages with appropriate functions would be even better.
Thanks in advance. I'm trying to learn as much as possible in a reasonably short timeframe, so sorry if the question is too long or could've been split into multiple questions.