I'm trying to generate structural embeddings for peptide sequences to test whether structural data can improve the performance of an ML model for toxicity prediction. I read that Facebook's ESM model can rapidly generate contact maps, which can serve as a proxy for structural data. Here's a link: https://github.com/facebookresearch/esm
However, I'm confused by the output ESM produces. The script (extract.py) itself seems fairly self-explanatory, but which flags should I specify to produce the most relevant structural data? My goal is to represent the contact map as a tensor that can be appended to the feature set of my existing training dataset.
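For concreteness, this is roughly what I mean by representing the contact map as a tensor. It is only a minimal sketch, assuming the contact map arrives as an L x L array of contact probabilities for a peptide of length L. The names `pad_contact_map` and `max_len` are my own and not part of ESM, and the toy array stands in for real model output.

```python
import numpy as np

def pad_contact_map(contacts, max_len):
    """Zero-pad (or crop) an L x L contact map to max_len x max_len, then flatten it."""
    contacts = np.asarray(contacts, dtype=np.float32)
    n = min(contacts.shape[0], max_len)
    padded = np.zeros((max_len, max_len), dtype=np.float32)
    padded[:n, :n] = contacts[:n, :n]
    return padded.reshape(-1)

# Toy symmetric 3 x 3 "contact map" (made-up values, not real ESM output)
toy = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.7],
                [0.1, 0.7, 1.0]])

features = pad_contact_map(toy, max_len=5)
print(features.shape)  # (25,)
```

Padding to a fixed `max_len` gives every peptide a feature vector of the same length, at the cost of many zero entries for short sequences.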
Here's the command I'm running (taken from the README):
python scripts/extract.py esm1b_t33_650M_UR50S examples/data/some_proteins.fasta examples/data/some_proteins_emb_esm1b/ --repr_layers 0 32 33 --include mean per_tok
First, the script lets you specify the layer(s) from which to extract embeddings. Which layer is the most appropriate? I assume the final layer (the default) would have the most predictive relevance for my downstream toxicity prediction model.
With --include, I can request either per_tok (which returns one embedding per residue) or mean (which returns embeddings averaged over the entire sequence). Which is more appropriate for my downstream toxicity prediction problem? I'm leaning towards mean, since it yields a tensor with the same dimensions for every sequence regardless of length.
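To illustrate why I'm leaning that way, here is a small sketch of the shape difference. It assumes per-residue embeddings come back as an array of shape (sequence length, embedding dimension). The random arrays and the small `embed_dim` are placeholders of mine, not actual ESM output.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 8  # toy value; the real embedding dimension is much larger

# Placeholder per-residue embeddings for two peptides of different lengths
short_per_tok = rng.normal(size=(5, embed_dim))
long_per_tok = rng.normal(size=(12, embed_dim))

# Averaging over the residue axis removes the length dependence
short_mean = short_per_tok.mean(axis=0)
long_mean = long_per_tok.mean(axis=0)

print(short_per_tok.shape, long_per_tok.shape)  # (5, 8) (12, 8)
print(short_mean.shape, long_mean.shape)        # (8,) (8,)
```

The per-residue arrays would need padding or some other pooling before they could be stacked into one feature matrix, whereas the averaged vectors can be stacked directly.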
Should I append the contact map embeddings directly to my existing sequence embeddings, or does it make sense to normalize/standardize them first?
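In case it helps, this is the kind of standardization I have in mind. It is a sketch under my own assumptions: both feature sets are 2-D arrays with one row per peptide, and the mean and standard deviation are computed on the training rows only. The function name `standardize` and all the toy data are hypothetical.

```python
import numpy as np

def standardize(train, other, eps=1e-8):
    """Z-score both arrays using statistics from the training rows only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    return (train - mean) / (std + eps), (other - mean) / (std + eps)

rng = np.random.default_rng(0)
seq_train = rng.normal(size=(6, 4))                          # toy existing sequence features
struct_train = rng.normal(loc=5.0, scale=20.0, size=(6, 3))  # toy structural features, different scale
struct_test = rng.normal(loc=5.0, scale=20.0, size=(2, 3))

struct_train_z, struct_test_z = standardize(struct_train, struct_test)

# Append the standardized structural features to the existing feature set
combined_train = np.concatenate([seq_train, struct_train_z], axis=1)
print(combined_train.shape)  # (6, 7)
```

Reusing the training statistics for the held-out rows is meant to avoid leaking test-set information into the features.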
Any help would be appreciated!