Question: Best feature vector representation of Protein model?
4.1 years ago by
Random0 wrote:

I have a number of protein models of varying lengths in PDB format and I'm trying to do machine learning on them and predict their energy. I have the energy values of each of the protein models.

The problem is that most machine learning algorithms require a fixed-length vector representation, but my protein models all have different lengths.

Does anyone know of a protein vector representation?

machine learning • 1.8k views
modified 3 months ago by Biostar
4.1 years ago by
United States
learnBioinformatics40 wrote:

Maybe you can first calculate a protein similarity matrix and then apply a kernel-based method (such as a kernel SVM) to predict the energy. There are many ways to get the protein similarity, such as Smith-Waterman local alignment scores or BLAST bit scores. With this approach, the length of the protein should not directly influence the prediction (Applicability Domain). For example, if you have 200 proteins, you will get a 200 × 200 similarity matrix, which is then used to build the machine learning model that predicts the corresponding energy values.

Hope this helps.
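
As a rough illustration of the similarity-matrix idea, here is a minimal pure-Python sketch. It assumes a simplified Smith-Waterman scoring scheme (fixed match/mismatch scores and a linear gap penalty, all chosen arbitrarily here) rather than a substitution matrix such as BLOSUM62 with affine gaps, which a real analysis would more likely use. The function names `smith_waterman_score` and `similarity_matrix` are made up for this example.

```python
import math


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Simplified scoring (an assumption for this sketch): fixed match and
    mismatch scores and a linear gap penalty.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity_matrix(seqs):
    """N x N matrix of alignment scores, scaled so that the diagonal is 1.

    Assumes every sequence is non-empty (otherwise the self-score is 0).
    """
    n = len(seqs)
    raw = [[smith_waterman_score(seqs[i], seqs[j]) for j in range(n)]
           for i in range(n)]
    return [[raw[i][j] / math.sqrt(raw[i][i] * raw[j][j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    toy_sequences = ["ACDEFG", "ACDEFGHIK", "WWWW"]  # toy data, not real proteins
    for row in similarity_matrix(toy_sequences):
        print(["%.3f" % value for value in row])
```

A matrix like this could then be handed to a method that accepts a precomputed kernel, for example scikit-learn's `SVR(kernel="precomputed")`. Note that alignment scores are not guaranteed to form a valid (positive semidefinite) kernel, so some correction may be needed in practice.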

2.5 years ago by
ricardo0 wrote:

You may want to consider using features based on residue cluster classes (

These are 26 features based on residue contacts and primary sequence contiguity.

There is an easy-to-follow IPython notebook showing how to use this for structural classification:

The easiest way to get these features from a PDB file is to use a web service:

curl -X POST -F file=@1HIV.pdb ''

where 1HIV.pdb may be any PDB file.

Hope this helps.

4.1 years ago by
linus330 wrote:

Are your sequences related to each other?

If yes:

How about an alignment of them? Afterwards you would have equal-length vectors, which you could, for example, encode very easily with a 20-bit (one-hot) vector for each amino acid position. Alternatively, you could use a BLOSUM-based representation, or pick a set of interesting attributes from 
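
The 20-bit-per-position encoding of aligned sequences could look roughly like the sketch below. It assumes the sequences have already been aligned to a common length by some external tool, and that gaps or non-standard residues are simply encoded as all zeros; both choices, and the function names, are assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(aligned_seq):
    """Encode one aligned sequence as a flat 0/1 vector, 20 entries per column.

    Gap characters and non-standard residues become 20 zeros (an assumption
    of this sketch; a 21st "gap" bit would be another reasonable choice).
    """
    vector = []
    for residue in aligned_seq.upper():
        bits = [0] * len(AMINO_ACIDS)
        index = AA_INDEX.get(residue)
        if index is not None:
            bits[index] = 1
        vector.extend(bits)
    return vector


def encode_alignment(aligned_seqs):
    """Encode all rows of an alignment; every row must have the same length."""
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [one_hot_encode(seq) for seq in aligned_seqs]


if __name__ == "__main__":
    toy_alignment = ["AC-D", "A-WD"]  # toy data, not a real alignment
    features = encode_alignment(toy_alignment)
    print(len(features), "sequences,", len(features[0]), "features each")
```

Every sequence ends up with 20 × (alignment length) features, so the resulting rows can be stacked into a single feature matrix.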

If not:

You say you want to predict their energy. I may be wrong, but isn't the length a crucial part of the energy calculation (depending, of course, on which energy you calculate)? So instead of building a vector that represents the amino acids, you could calculate properties of your proteins, such as the number of helices. (But to be honest, I do not think this will yield good predictions.)


Hi Linus, thanks for the response. The sequences are actually not related to each other. Using properties of the proteins is tough because it would give bad predictions. I am interested in using the distances between the atoms in the protein model. Is there a standard way to represent a protein model as a feature vector based on the atomic distances?

written 4.1 years ago by Random0
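
One possible way (not a standard, and not something suggested in this thread) to turn atomic distances into a fixed-length vector is a normalized histogram of pairwise Cα-Cα distances, which has the same number of bins regardless of protein length. The sketch below assumes standard fixed-column PDB ATOM records and ignores complications such as alternate locations, multiple models, and HETATM records; the bin count and distance cutoff are arbitrary choices, and the function names are made up for this example.

```python
import math


def ca_coordinates(pdb_lines):
    """Extract C-alpha coordinates from PDB ATOM records (fixed columns)."""
    coords = []
    for line in pdb_lines:
        # Atom name is in columns 13-16; x, y, z are in columns 31-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return coords


def distance_histogram(coords, max_dist=40.0, n_bins=20):
    """Fixed-length feature vector: fraction of atom pairs per distance bin.

    Distances beyond max_dist (in angstroms) are counted in the last bin.
    """
    counts = [0] * n_bins
    width = max_dist / n_bins
    n_pairs = 0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            counts[min(int(d / width), n_bins - 1)] += 1
            n_pairs += 1
    if n_pairs == 0:
        return [0.0] * n_bins
    return [count / n_pairs for count in counts]


def pdb_to_features(path):
    """Read a PDB file and return its distance-histogram feature vector."""
    with open(path) as handle:
        return distance_histogram(ca_coordinates(handle))
```

Because the histogram is normalized by the number of pairs, it discards the protein length, so the length may be worth appending as an extra feature if it matters for the energy being predicted.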