Protein sequence representation in neural network
5.9 years ago
JDK_92 • 0

Hi,

I am currently working on a project to build a neural network that takes as input an amino acid sequence (a protein fragment) with a fixed length of 34 and predicts whether or not the sequence belongs to a certain class of repeat (TPR).

My problem is how to encode the sequence so that it is a proper input for the network. I thought about encoding each amino acid as a vector of 20 bits (one per amino acid), with a '1' at the position representing the current amino acid and '0' in the other 19 positions. Concatenating these vectors gives a vector of length 20 * 34 = 680, which is quite big.
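For illustration, the one-hot scheme described above could be sketched as follows. This is only a minimal sketch: the alphabetical residue ordering, the function name `one_hot_encode`, and the example fragment are assumptions made for the example, not part of the original question.

```python
# Assumed ordering: the 20 standard amino acids, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_POS = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, flatten=True):
    """One-hot encode an amino acid sequence.

    Returns a flat list of len(sequence) * 20 bits when flatten is True,
    otherwise a list of 20-bit rows (one row per residue).
    """
    rows = []
    for aa in sequence.upper():
        if aa not in AA_TO_POS:
            raise ValueError(f"Unsupported residue: {aa!r}")
        row = [0] * len(AMINO_ACIDS)
        row[AA_TO_POS[aa]] = 1  # single '1' marks the current amino acid
        rows.append(row)
    if flatten:
        return [bit for row in rows for bit in row]
    return rows


# Illustrative 34-residue fragment (an assumption, used only as sample input).
fragment = "AEAWYNLGNAYYKQGDYDEAIEYYQKALELDPNN"
flat = one_hot_encode(fragment)
matrix = one_hot_encode(fragment, flatten=False)
print(len(flat), len(matrix), len(matrix[0]))
```

The flat form suits a fully connected input layer, while the 34 x 20 matrix form is probably the more natural input for a convolutional layer.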

So does anybody here have any experience with how to represent an amino acid sequence so that it can be provided as input to a neural network?

Thank you!

neural network protein python machine learning • 2.9k views

Your one-hot encoding is commonly used, but you could also try to use physical/chemical properties (look up AAINDEX) to represent the amino acids.
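As a rough sketch of the property-based alternative, the snippet below encodes each residue with a single physicochemical value. It uses the Kyte-Doolittle hydropathy scale, which to my knowledge is also available in AAindex; the choice of this one scale, the min-max scaling to [0, 1], and the name `hydropathy_encode` are assumptions for illustration. In practice you would probably combine several properties per residue.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

_LO = min(KYTE_DOOLITTLE.values())
_HI = max(KYTE_DOOLITTLE.values())


def hydropathy_encode(sequence, scale=True):
    """Encode a sequence as one hydropathy value per residue.

    With scale=True the values are min-max scaled to [0, 1], which is an
    assumed preprocessing step, not something the scale itself prescribes.
    """
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    if scale:
        values = [(v - _LO) / (_HI - _LO) for v in values]
    return values


# One number per residue instead of 20 bits per residue.
print(hydropathy_encode("AEAWYNLG", scale=False))
```

With k properties per residue, a 34-residue fragment becomes a vector of length 34 * k, and these columns can be appended to the one-hot columns if both are wanted.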


Thank you. I'll take some properties from AAINDEX along with the one-hot encoding and see what the results will be.


Hi! Were you able to gain some experience with this issue? I am also about to try both options to see which performs better, but I suspect there might be other encoding schemes that are more efficient.

This expectation arises, perhaps naively, because in my case of 12 amino acids the information content is theoretically about 52 bits (= log2(20**12)), whereas one-hot encoding yields 240 bits. That gives a very sparse matrix, which in turn raises doubts about the efficiency of the convolution in later steps.
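The arithmetic behind that comparison can be checked with a few lines. This is only a sketch; the function name `encoding_sizes` and the extra "dense binary code" figure (5 bits per residue, since 2**5 = 32 >= 20) are assumptions added for comparison.

```python
import math


def encoding_sizes(length, alphabet_size=20):
    """Compare the information content of a sequence with its encoded size.

    Returns (information in bits, one-hot bits, dense binary code bits).
    Assumes every residue is equally likely and independent.
    """
    info_bits = length * math.log2(alphabet_size)  # equals log2(alphabet_size ** length)
    one_hot_bits = length * alphabet_size
    dense_bits = length * math.ceil(math.log2(alphabet_size))
    return info_bits, one_hot_bits, dense_bits


for n in (12, 34):
    info, one_hot, dense = encoding_sizes(n)
    print(f"{n} residues: {info:.1f} bits of information, "
          f"{one_hot} one-hot bits, {dense} dense-code bits")
```

Note that a shorter code is not automatically a better network input: a dense binary code imposes arbitrary similarities between residues, while one-hot treats all residues as equally distinct, so the comparison above only concerns size, not learnability.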

Currently reading this: https://pubs.acs.org/doi/10.1021/acs.jcim.0c00073
