Question: Protein sequence representation in neural network
2.3 years ago, JDK_92 wrote:


I am working on a project to build a neural network that takes as input an amino acid sequence (a protein fragment) with a fixed length of 34 residues. The goal is to predict whether or not the input sequence belongs to a certain class of repeat (TPR).

My problem is how to encode the sequence so that it is a proper input for the network. I thought about encoding each amino acid as a vector of 20 bits (one per standard amino acid), with a '1' at the position representing the current amino acid and '0' in the other 19 positions. Concatenating these vectors gives an input vector of length 20 * 34 = 680, which seems quite big.
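To make the encoding described above concrete, here is a minimal sketch in plain Python. The residue ordering in `AMINO_ACIDS`, the function name `one_hot_encode`, and the example fragment are illustrative assumptions, not part of any library; any fixed ordering works as long as it is the same for training and prediction.

```python
# Sketch of one-hot encoding for a fixed-length protein fragment.
# Assumption: only the 20 standard amino acids occur in the input.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # arbitrary but fixed ordering
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(seq, length=34):
    """Return a flat list of length * 20 values (0 or 1) for the sequence."""
    seq = seq.upper()
    if len(seq) != length:
        raise ValueError(f"expected {length} residues, got {len(seq)}")
    encoded = []
    for aa in seq:
        if aa not in AA_TO_INDEX:
            raise ValueError(f"non-standard residue: {aa!r}")
        bits = [0] * len(AMINO_ACIDS)
        bits[AA_TO_INDEX[aa]] = 1  # single '1' marks the current residue
        encoded.extend(bits)
    return encoded


# Hypothetical 34-residue fragment, used only for illustration.
example = "AEAWYNLGNAYYKQGDYDEAIEYYQKALELDPNN"
vector = one_hot_encode(example)
print(len(vector))  # 680 input values, of which exactly 34 are 1
```

The resulting list can be converted to whatever array type the network framework expects. If the data may contain ambiguous residues such as 'X', an extra position per residue would be needed; that case is not handled here.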

Does anybody here have experience with representing an amino acid sequence as input for a neural network?

Thank you!


Your one-hot encoding is commonly used, but you could also try to use physical/chemical properties (look up AAINDEX) to represent the amino acids.
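As a rough sketch of the property-based idea, the snippet below encodes each residue by a single scale, the Kyte-Doolittle hydropathy index (listed in AAindex as KYTJ820101, to the best of my knowledge). The choice of this particular scale, the min-max scaling to [0, 1], the function name `property_encode`, and the example fragment are all assumptions made for illustration; other AAindex entries can be added the same way.

```python
# Sketch: encode residues by a physicochemical property instead of one-hot.
# Values are the Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
LOW, HIGH = min(HYDROPATHY.values()), max(HYDROPATHY.values())


def property_encode(seq, scale=HYDROPATHY, low=LOW, high=HIGH):
    """Return one value per residue, min-max scaled to the range [0, 1]."""
    values = []
    for aa in seq.upper():
        if aa not in scale:
            raise ValueError(f"non-standard residue: {aa!r}")
        values.append((scale[aa] - low) / (high - low))
    return values


# Hypothetical 34-residue fragment, used only for illustration.
example = "AEAWYNLGNAYYKQGDYDEAIEYYQKALELDPNN"
features = property_encode(example)
print(len(features))  # 34 values, one per residue
```

With k property scales this gives k * 34 inputs instead of 20 * 34. The per-residue property values can also be appended to the one-hot vector for each position if both representations are wanted.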

written 2.3 years ago by cschu1812.5k

Thank you. I'll take some properties from AAINDEX along with the one-hot encoding and see what the results will be.

written 2.3 years ago by JDK_92