Question: Peptide Encoding & Feature Extraction
2.9 years ago by
carlo.mazzaferro10 wrote:

Hello Biostars!

I've recently joined the bioinformatics community, though my background in biology is not extensive. At all. In any case, I've been trying to work with the CBS datasets on peptide:MHCII binding affinity. My intention, put very briefly, is to train a model by implementing my own algorithm to perform the predictions.

What I've identified as the biggest challenge, however, is to properly format the training data so that its features are actually meaningful and hold information that will make the predictions accurate.

According to a variety of papers, namely Nielsen et al., 2003, Nielsen et al., 2009, and Luo et al., 2015, the preferred method of feature extraction is protein encoding using BLOSUM matrices. Other methods include sparse encoding and extraction of physicochemical features.
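For reference, sparse encoding is usually understood as one-hot encoding: each residue becomes a 20-element vector with a single non-zero entry, and the per-residue vectors are concatenated. A minimal sketch in Python follows. The function name `sparse_encode`, the residue ordering, and the example peptide are illustrative assumptions, and the cited papers may use values other than plain 0 and 1.

```python
# The 20 standard amino acids in a fixed (arbitrary but consistent) order.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def sparse_encode(peptide):
    """One-hot encode a peptide: 20 values per residue, concatenated."""
    vector = []
    for residue in peptide.upper():
        row = [0] * len(AMINO_ACIDS)
        # Raises ValueError for non-standard residues such as 'X'.
        row[AMINO_ACIDS.index(residue)] = 1
        vector.extend(row)
    return vector


# Hypothetical 9-residue peptide, used only to show the output shape.
features = sparse_encode("ACDEFGHIK")
print(len(features))  # 9 residues * 20 values each
```

Every residue is equally distant from every other residue in this representation, which is the limitation that substitution-matrix encodings try to address.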

Now, what I cannot understand is how the BLOSUM encoding is actually applied. As far as I can tell from my searching, BLOSUM matrices are used to score similarity between evolutionarily divergent protein sequences; yet the studies mentioned describe the encoding as a way of representing a peptide using the BLOSUM matrix itself (which is then turned into a vector):

"The peptide core was presented to the network using Blosum encoding, where each amino acid was encoded by the BLOSUM log-odds vector" (Nielsen, 2009)

Meaning that, in some obscure way, each peptide was simply transformed into a vector. How so? Was it aligned and compared to the other peptides that appear in the dataset? Was it aligned using BLAST or another alignment tool against an online database?
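For readers who land here with the same question: one common reading of the quoted sentence is that no alignment is involved at all. Each residue is simply replaced by its own row of the BLOSUM matrix (20 log-odds scores), and the rows are concatenated into one vector. The sketch below illustrates that reading in Python. BLOSUM62 is used purely for illustration, since the thread does not say which BLOSUM matrix or what scaling the papers used; the function name `blosum_encode` and the example peptide are hypothetical.

```python
# Row/column order of the matrix below (standard BLOSUM62 ordering).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 scores for the 20 standard amino acids, one row per residue.
BLOSUM62_ROWS = """
 4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
-1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
-2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
-2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
 0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
-1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
-1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
-2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
-1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
-1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
-1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
-1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
-2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
 1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
 0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
-2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
 0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""

# Map each residue to its row of 20 substitution scores.
BLOSUM62 = {
    residue: [int(score) for score in row.split()]
    for residue, row in zip(AMINO_ACIDS, BLOSUM62_ROWS.strip().splitlines())
}


def blosum_encode(peptide):
    """Replace each residue by its BLOSUM62 row and concatenate the rows."""
    vector = []
    for residue in peptide.upper():
        vector.extend(BLOSUM62[residue])  # KeyError for non-standard residues
    return vector


# Hypothetical 9-residue binding core, used only to show the output shape.
features = blosum_encode("ACDEFGHIK")
print(len(features))  # 9 residues * 20 scores each
```

Under this reading the peptide is never compared against other peptides or against a database; the matrix acts only as a fixed lookup table, so chemically similar residues end up with similar feature vectors. In practice the scores are often rescaled before being fed to a neural network, but the scaling is not specified here.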

Bonus: what is the most widely accepted method of representing peptides for classification/regression purposes (given my intention, that is, modeling peptide:MHCII binding affinity)? Some literature as reference would be a good place to start; Python packages with appropriate functions would be even better.

Thanks in advance. I'm trying to learn as much as possible in a reasonably short timeframe, so sorry if the question is too long or could've been split into multiple questions.

machine learning blosum • 1.2k views
ADD COMMENTlink modified 2.8 years ago • written 2.9 years ago by carlo.mazzaferro10

I don't know if this is legal, but I'm commenting to bump this up. Any clue would be much appreciated.

ADD REPLYlink written 2.8 years ago by carlo.mazzaferro10

To keep things legal, do this in future: editing your original post bumps it up to the main page.

ADD REPLYlink written 2.8 years ago by genomax68k

See if any of these papers help:
https://web.njit.edu/~wangj/publications/ARTICLES/ibm01.pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2868007/
https://www.cs.kent.ac.uk/people/staff/aaf/pub_papers.dir/Curr-Proteo-2008-Davies-preprint.pdf

ADD REPLYlink written 2.8 years ago by genomax68k

Thanks for those links. I had seen the first two papers but will give them a more thorough look. I'll report back later today.

ADD REPLYlink written 2.8 years ago by carlo.mazzaferro10