Question

Peptide Encoding & Feature Extraction

2

7.7 years ago

carlo.mazzaferro ▴ 20

Hello Biostars!

I've recently joined the bioinformatics community, though my background in biology is minimal. I've been working with the CBS datasets on peptide:MHCII binding affinity. Put briefly, my aim is to train a model, implementing my own algorithm, to predict binding affinity.

The biggest challenge I've identified so far is formatting the training data so that its features are meaningful and carry the information needed for accurate predictions.

According to several papers (Nielsen et al., 2003; Nielsen et al., 2009; Luo et al., 2015), the preferred method of feature extraction is encoding the peptide with BLOSUM matrices. Other methods include sparse encoding and extraction of physicochemical features.
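For reference, sparse encoding is the one method of the three I think I understand: each residue becomes a 20-element indicator vector and the vectors are concatenated. A minimal sketch, where the alphabet ordering and the function name `sparse_encode` are my own choices rather than anything taken from the papers:

```python
# The 20 standard amino acids; the ordering is arbitrary but must stay fixed.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sparse_encode(peptide):
    """Return a flat one-hot vector of length 20 * len(peptide)."""
    vector = []
    for aa in peptide.upper():
        one_hot = [0] * len(AMINO_ACIDS)
        # ValueError here means the residue is not one of the 20 standard ones.
        one_hot[AMINO_ACIDS.index(aa)] = 1
        vector.extend(one_hot)
    return vector


encoded = sparse_encode("ACD")  # hypothetical 3-residue peptide
print(len(encoded))
```

Every residue is equally distant from every other in this representation, which is presumably why the papers prefer something that carries similarity information.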

Now, what I cannot understand is how the BLOSUM encoding is actually applied. As far as I can tell, BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences; yet the studies mentioned describe the encoding as a way of representing a peptide using the BLOSUM matrix itself (which is then turned into a vector):

"The peptide core was presented to the network using Blosum encoding, where each amino acid was encoded by the BLOSUM log-odds vector" (Nielsen, 2009)

Meaning that, in some way I can't work out, each peptide was simply transformed into a vector. How so? Was it aligned and compared to the other peptides that appear in the dataset? Was it aligned using BLAST or another alignment tool against an online dataset?
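The most literal reading of the quoted sentence would involve no alignment at all: each residue is looked up in the matrix, its row of 20 scores is taken as its vector, and the rows are concatenated. A minimal sketch of that reading, which may or may not be what the authors did. BLOSUM62 is used only because it is the most familiar matrix (the papers may use a different one or rescale the scores), the values are transcribed by hand and should be checked against an authoritative copy, and `blosum_encode` and the example peptide are hypothetical:

```python
# Column order of the table below (20 standard amino acids).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 scores, one row per residue in the order above.
# Hand-transcribed: verify against an authoritative copy before real use.
_BLOSUM62_ROWS = """
 4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
-1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
-2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
-2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
 0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
-1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
-1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
-2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
-1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
-1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
-1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
-1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
-2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
 1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
 0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
-2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
 0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""

# Map each residue to its row of 20 substitution scores.
BLOSUM62 = {
    aa: [int(score) for score in row.split()]
    for aa, row in zip(AMINO_ACIDS, _BLOSUM62_ROWS.strip().splitlines())
}


def blosum_encode(peptide):
    """Concatenate the BLOSUM62 row of each residue into one flat list."""
    vector = []
    for aa in peptide.upper():
        # KeyError here means a non-standard residue (e.g. X, B, Z).
        vector.extend(BLOSUM62[aa])
    return vector


core = "FVNQHLCGS"  # hypothetical 9-residue binding core
features = blosum_encode(core)
print(len(features))  # 9 residues * 20 scores each
```

Under this reading a 9-residue core always yields a fixed-length input of 180 numbers, and similar residues (say I and V) get similar vectors, which sparse encoding cannot express.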

Bonus: what is the most widely accepted method of representing peptides for classification/regression purposes (given my intention, that is, modeling peptide:MHCII binding affinity)? Some literature as reference would be a good place to start; Python packages with appropriate functions would be even better.

Thanks in advance. I'm trying to learn as much as possible in a reasonably short timeframe, so sorry if the question is too long or could have been split into multiple questions.

machine learning blosum • 3.1k views

0

I don't know if this is legal, but I'm commenting to bump this up. Any clue would be much appreciated.

ADD REPLY • link 7.7 years ago by carlo.mazzaferro ▴ 20

0

To keep things legal, do this in future: edit your original post, which bumps it up to the main page.

ADD REPLY • link 7.7 years ago by GenoMax 141k

0

See if any of these papers help: https://web.njit.edu/~wangj/publications/ARTICLES/ibm01.pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2868007/
https://www.cs.kent.ac.uk/people/staff/aaf/pub_papers.dir/Curr-Proteo-2008-Davies-preprint.pdf

ADD REPLY • link 7.7 years ago by GenoMax 141k

0

Thanks for those links. I had seen the first two papers but will give them a more thorough look. I'll report back later today.

ADD REPLY • link 7.7 years ago by carlo.mazzaferro ▴ 20