Question

Generating embeddings for antibody data

22 months ago

tom5 • 0

Hi, I'm trying to set up an ML model to predict antibody developability properties (e.g., aggregation likelihood, viscosity) from sequence. I've previously used the ESM-1b model (https://github.com/facebookresearch/esm) on protein sequences for a toxicity prediction task and got good results. I saw that the AbLang model (https://github.com/oxpig/AbLang) is now available for extracting antibody-specific sequence embeddings. I'm wondering if there's any guidance on using these embeddings for downstream property prediction tasks.

My specific task is as follows:

1. I have access to a dataset of antibody sequences and corresponding descriptor values that were calculated via a more traditional bioinformatics pipeline, and I want to see if I can predict these descriptor values with an ML pipeline. The values are continuous, but can be made categorical (e.g., high vs. low aggregation likelihood).
2. I plan to use a pre-trained language model (e.g., AbLang) to extract an embedding for each of my antibody sequences. Each embedding will be paired with the descriptors for that sequence. That is, for each input antibody sequence, I'll produce one (embedding, descriptor) pair per descriptor associated with that antibody (e.g., aggregation likelihood, viscosity).
3. I will set up a predictive model for each descriptor, using the pairs generated in step two as training data. This will leave me with multiple models, one per descriptor. As an initial step, I was thinking of using just two layers in each predictive model: a fully connected layer and an output layer. To make training easier, I can convert the continuous descriptor values to binary labels (e.g., high vs. low aggregation likelihood).
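
For illustration, the three steps above could look roughly like the sketch below. It makes several assumptions that are not part of the question: the embeddings are random stand-ins (in practice they would come from the language model; the AbLang README describes a `seqcoding` mode that returns one fixed-length vector per sequence), the descriptor is synthetic, the binary labels come from a median split, and the predictor is plain logistic regression, which is simpler than the two-layer network described in step three. It is written in dependency-free Python so it runs anywhere; a real pipeline would more likely use scikit-learn or PyTorch.

```python
import math
import random
import statistics


def binarize_at_median(values):
    """Turn continuous descriptor values into 0/1 labels (1 = above the median)."""
    cutoff = statistics.median(values)
    return [1 if v > cutoff else 0 for v in values]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, epochs=200, lr=0.1):
    """Fit logistic regression with per-sample stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - label  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(w, b, X):
    """Return hard 0/1 predictions using a 0.5 probability threshold."""
    return [1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0
            for x in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# --- Steps 1-2: (embedding, descriptor) pairs. HYPOTHETICAL stand-in data. ---
rng = random.Random(0)
dim = 8  # real embeddings are much wider; kept small here for speed
embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
# Synthetic "descriptor" that depends on two embedding dimensions plus noise.
descriptor = [2.0 * e[0] - 1.0 * e[1] + rng.gauss(0, 0.1) for e in embeddings]
labels = binarize_at_median(descriptor)

# --- Step 3: hold out 20% for testing, train one model for this descriptor. ---
indices = list(range(len(embeddings)))
rng.shuffle(indices)
split = int(0.8 * len(indices))
train_idx, test_idx = indices[:split], indices[split:]
X_train = [embeddings[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
X_test = [embeddings[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]

w, b = train_logistic(X_train, y_train)
test_acc = accuracy(y_test, predict(w, b, X_test))
print(f"held-out accuracy: {test_acc:.2f}")
```

The same loop would be repeated once per descriptor. A linear model like this is a useful floor: if a two-layer network does not beat it on held-out data, the extra capacity is probably not helping at this dataset size.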

Is the approach in step three sufficient for a baseline test of the predictive validity of AbLang sequence embeddings for developability properties? I have around 2,000 antibody sequences in total (training and test). Or will I need to set up a more complex model? Additionally, are there other models you would recommend I look into, either for the developability property prediction task or for the embedding generation task? I've looked into DeepAb as well (https://github.com/RosettaCommons/DeepAb). Thanks for the help!

ML fab Alphafold mab antibody • 564 views
