Generating structural embeddings from the Facebook ESM model
tom5 • 23 months ago

Hi,

I'm trying to generate structural embeddings for peptide sequences, to test whether structural data can improve the performance of an ML model for toxicity prediction. I read that the Facebook ESM model can rapidly generate contact maps which can serve as a proxy for structural data. Here's a link: https://github.com/facebookresearch/esm

However, I'm confused by the output ESM produces. The script itself (extract.py) seems self-explanatory, but which flags should I specify to produce the most relevant structural data? My goal is to represent the contact map as a tensor that can be appended to the feature set of my existing training dataset.

Here's the code I run (pulled from the readme):

python scripts/extract.py esm1b_t33_650M_UR50S examples/data/some_proteins.fasta examples/data/some_proteins_emb_esm1b/ --repr_layers 0 32 33 --include mean per_tok

My questions:

  1. First, the script lets you specify the layer(s) from which to extract embeddings. What's the most appropriate layer? I assume the default (final) layer would have the most predictive relevance for my downstream toxicity prediction model.

  2. I can set either the per_tok flag (which returns embeddings per residue) or the mean flag (which returns embeddings averaged over the entire sequence). Which is more appropriate for my downstream toxicity prediction problem? I'm leaning towards the mean flag, since it yields a tensor of consistent dimensions regardless of sequence length.

  3. Should I directly append the contact map embeddings to my existing sequence embeddings, or does it make sense to normalize/standardize them first?

Any help would be appreciated!

Mensur Dlakic ★ 27k • 23 months ago
  1. A neural network has typically learned more in its distal (later) layers than in its proximal (earlier) ones, so for most applications it makes sense to use the final layer. However, this is something to test for each application individually.

  2. If you are predicting properties of individual residues, it makes sense to use per-residue embeddings. When predicting properties of the whole sequence, the mean flag is more appropriate.

  3. The embeddings usually do not need normalization, but it does depend on the data you wish to append them to. I suggest you build separate predictors from your other data and from the ESM embeddings, and combine their predictions later by blending or stacking (see the sketch below).
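
A minimal sketch of that blending/stacking idea, assuming your existing sequence features and the mean ESM embeddings are already available as NumPy arrays (the file names and estimator choices below are illustrative, not part of ESM):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    # Hypothetical inputs: existing features, mean ESM embeddings, toxicity labels
    X_seq = np.load("sequence_features.npy")   # (n_peptides, n_features)
    X_esm = np.load("mean_embeddings.npy")     # (n_peptides, 1280)
    y = np.load("toxicity_labels.npy")         # (n_peptides,)

    # Train a separate base model on each feature block and collect
    # out-of-fold probabilities so the combination is not overfit
    p_seq = cross_val_predict(RandomForestClassifier(n_estimators=300), X_seq, y,
                              cv=5, method="predict_proba")[:, 1]
    p_esm = cross_val_predict(LogisticRegression(max_iter=1000), X_esm, y,
                              cv=5, method="predict_proba")[:, 1]

    # Simplest blend: average the probabilities; training a meta-model on
    # (p_seq, p_esm) would be the stacking variant
    p_blend = (p_seq + p_esm) / 2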


Thanks for the help! One further question. The model produces a contact map that's a tensor of dimension 1280 for each sequence. What do the features of this output represent? And is the dimension arbitrary? We're trying to set up model interpretability assessments and want to get a better understanding of the contact maps.


Unless you specifically asked for contact map prediction, ESM produces for each residue an embedding, which is a 1280-dimensional vector (for ESM-1b). Those are high-dimensional representations of a particular amino acid in a given sequence context. Averaging those vectors over the whole sequence length gives a single 1280-dimensional vector, which is a high-dimensional representation of the whole sequence. The vectors don't have units, and the 1280 dimensions are an architectural choice of the model rather than anything with a direct physical meaning, although for mean representations the values are usually small fractional numbers on either side of 0. If you are using the extract.py script that comes with ESM, this is what you are getting.
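
A minimal sketch of what that looks like in code, following the example in the ESM README (the peptide sequence is made up; assumes the fair-esm package is installed):

    import torch
    import esm

    # Load ESM-1b; its final-layer (33) representations are 1280-dimensional
    model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
    batch_converter = alphabet.get_batch_converter()
    model.eval()

    data = [("peptide1", "GIGKFLHSAKKFGKAFVGEIMNS")]  # hypothetical peptide
    _, _, tokens = batch_converter(data)

    with torch.no_grad():
        out = model(tokens, repr_layers=[33])

    # Per-residue embeddings: one 1280-d vector per residue
    # (position 0 is the beginning-of-sequence token, so it is sliced off)
    per_residue = out["representations"][33][0, 1 : len(data[0][1]) + 1]

    # Mean embedding: a single 1280-d vector for the whole sequence
    mean_embedding = per_residue.mean(dim=0)
    print(per_residue.shape, mean_embedding.shape)  # (23, 1280), (1280,)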

In case it is not obvious: similar proteins have similar embeddings, both per residues and for the whole sequence. That means the embeddings can be used to train models that predict various sequence properties. They can also be reduced to a smaller number of dimensions, where similar sequences will be close together. Below is a t-SNE embedding of ~2000 proteins that belong to two groups. These are unaligned and only their 1280-dimensional vectors are used for embedding. It should be pretty obvious that there are two groups, although within them there are subgroups as well.

[t-SNE projection of the 1280-dimensional mean embeddings for ~2000 proteins; two main groups are visible, each with subgroups]
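
A projection like the one above can be made with standard tools; here is a minimal sketch, assuming the mean embeddings and group labels are already saved as NumPy arrays (the file names are hypothetical):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    embeddings = np.load("mean_embeddings.npy")  # (n_sequences, 1280)
    labels = np.load("group_labels.npy")         # (n_sequences,)

    # Reduce the 1280-dimensional vectors to 2 dimensions for plotting
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5)
    plt.xlabel("t-SNE 1")
    plt.ylabel("t-SNE 2")
    plt.show()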

Contact maps are predicted from the attention heads of the same network that creates the embeddings. Those predictions are symmetric N x N matrices (N is the number of residues), where the value in each cell is roughly the probability that the two residues are in close contact. You will have to read up more about it on your own if contact maps are your main interest.
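
If you do want the contact predictions, a minimal sketch using the return_contacts option shown in the ESM README (again with a made-up peptide):

    import torch
    import esm

    model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
    batch_converter = alphabet.get_batch_converter()
    model.eval()

    data = [("peptide1", "GIGKFLHSAKKFGKAFVGEIMNS")]  # hypothetical peptide
    _, _, tokens = batch_converter(data)

    with torch.no_grad():
        out = model(tokens, repr_layers=[33], return_contacts=True)

    contacts = out["contacts"][0]  # symmetric (L, L) matrix of contact probabilities
    print(contacts.shape)          # torch.Size([23, 23])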


Thanks! Are the embeddings generated a proxy for structural data? Or rather, can we claim that the embeddings provide some structural information, perhaps because ESM was trained on AlphaFold structural data? Our high-level goal is to use the Facebook ESM model output to add structural information that supplements our current sequence embedding data and improves the accuracy of our toxicity prediction model.


Are the embeddings generated a proxy for structural data?

The titles of the two papers I linked above answer your question directly (Transformer protein language models are unsupervised structure learners). Still, my suggestion is to read those papers beyond the titles if you want to do any serious work with ESM embeddings. There are also several landmark papers from the Rost lab that will likely be useful. Their ProtT5-XL-UniRef50 model is relatively new but may be better than ESM's single-sequence embedder.

Just to get you started, ESMs were not trained on any structural data. The only training data they used were tokenized protein sequences, which basically treats proteins as long sentences in which individual residues are words. Despite that, these transformer models capture all kinds of structural and functional information that goes beyond simple strings of amino acids. ESM embeddings have been used successfully to predict protein localization, function, secondary structure, solvent accessibility, and many other properties. It is very likely they will be useful for toxicity prediction as well, and it would not surprise me if they performed better than the models you have tried before.
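
To make that last point concrete, a minimal sketch of a sequence-level toxicity classifier trained on mean ESM embeddings; the file names, labels, and the choice of logistic regression are assumptions for illustration only:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Hypothetical inputs: mean ESM embeddings and binary toxicity labels
    X = np.load("mean_embeddings.npy")   # (n_peptides, 1280)
    y = np.load("toxicity_labels.npy")   # (n_peptides,)

    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"Mean ROC-AUC over 5 folds: {scores.mean():.3f}")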
