Question

Protein sequence representation in neural network

5.9 years ago

JDK_92 • 0

Hi,

I am currently working on a project to build a neural network that takes as input an amino acid sequence (a protein fragment) with a fixed length of 34 residues. The goal is to predict whether or not the input sequence belongs to a certain class of repeat (TPR).

My problem is how to encode the sequence so that it is a proper input for the network. I thought about encoding each amino acid as a vector of 20 bits (one per amino acid), with a '1' at the position representing the current amino acid and '0' in the other 19 positions. Concatenating these vectors gives a vector of length 20 * 34 = 680, which is quite big.
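For reference, the one-hot scheme described above could be sketched roughly as follows. This is only a minimal illustration in plain Python: the alphabet ordering, the function name `one_hot_encode`, and the example fragment are assumptions made for the sketch, not something fixed by the question.

```python
# Assumed ordering: the 20 standard residues, sorted by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, flatten=True):
    """One-hot encode an amino acid sequence.

    Returns a flat list of length 20 * len(sequence) when flatten is True,
    otherwise a list of 20-element rows (one row per residue).
    """
    rows = []
    for residue in sequence.upper():
        if residue not in AA_TO_INDEX:
            # Non-standard symbols (X, B, Z, gaps) are rejected in this sketch.
            raise ValueError(f"unsupported residue: {residue!r}")
        row = [0] * len(AMINO_ACIDS)
        row[AA_TO_INDEX[residue]] = 1
        rows.append(row)
    if flatten:
        return [bit for row in rows for bit in row]
    return rows


# Illustrative 34-residue string, used here only as example input.
fragment = "AEAWYNLGNAYYKQGDYDEAIEYYQKALELDPRS"
encoded = one_hot_encode(fragment)
print(len(encoded))  # 20 * 34 values
```

Keeping the unflattened 34 x 20 form (`flatten=False`) may be more convenient if the network starts with a convolutional layer, since each row then corresponds to one sequence position.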

Does anybody here have experience with representing an amino acid sequence so that it can be provided as input to a neural network?

Thank you!

neural network protein python machine learning • 2.9k views

ADD COMMENT • link updated 21 months ago by alpdeniz • 0 • written 5.9 years ago by JDK_92 • 0

Your one-hot encoding is commonly used, but you could also try to use physical/chemical properties (look up AAINDEX) to represent the amino acids.
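As a rough illustration of the property-based idea, the sketch below encodes each residue with a single physicochemical value, here the Kyte-Doolittle hydropathy scale (one of the indices collected in AAindex). The choice of this particular scale, the min-max scaling to [0, 1], and the function name `hydropathy_encode` are assumptions for the example; in practice you would probably combine several properties.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

_KD_MIN = min(KYTE_DOOLITTLE.values())
_KD_MAX = max(KYTE_DOOLITTLE.values())


def hydropathy_encode(sequence):
    """Map each residue to its hydropathy value, min-max scaled to [0, 1].

    The output has one number per residue, so a 34-residue fragment
    becomes a vector of length 34 instead of 680.
    """
    return [
        (KYTE_DOOLITTLE[residue] - _KD_MIN) / (_KD_MAX - _KD_MIN)
        for residue in sequence.upper()
    ]


print(hydropathy_encode("IRG"))
```

With k properties per residue the input has k * 34 values, and these columns can also be appended to the one-hot columns if you want both.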

ADD REPLY • link 5.9 years ago by cschu181 ★ 2.8k

Thank you. I'll take some properties from AAINDEX along with the one-hot encoding and see what the results will be.

ADD REPLY • link 5.9 years ago by JDK_92 • 0

Hi! Were you able to gather some experience with this issue? I am also about to try both options to see which performs better, but I suspect there might be other encoding schemes that are more efficient.

This expectation arises, perhaps naively, because in my case of 12 AAs the information content is theoretically about 52 bits (= log2(20**12)), but one-hot encoding yields 240 bits. That gives a very sparse matrix, which in turn raises doubts about the efficiency of the convolution in later steps.
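The arithmetic behind that comparison can be checked with a few lines of Python. This is just a sketch; the function names are made up for the example, and the "dense binary" figure assumes each residue is packed into the smallest whole number of bits.

```python
import math


def information_bits(length, alphabet_size=20):
    """Theoretical information content: log2(alphabet_size ** length)."""
    return length * math.log2(alphabet_size)


def one_hot_bits(length, alphabet_size=20):
    """Number of binary inputs used by a one-hot encoding."""
    return length * alphabet_size


def dense_binary_bits(length, alphabet_size=20):
    """Bits needed if each residue is packed into ceil(log2(alphabet_size)) bits."""
    return length * math.ceil(math.log2(alphabet_size))


for n in (12, 34):
    print(n, round(information_bits(n), 1), dense_binary_bits(n), one_hot_bits(n))
```

For 12 residues this gives roughly 52 bits of information against 60 bits for a packed 5-bit code and 240 bits for one-hot. Note that a packed binary code imposes an arbitrary similarity between residues whose codes share bits, which is one reason one-hot remains common despite its sparsity.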

Currently reading this: https://pubs.acs.org/doi/10.1021/acs.jcim.0c00073

ADD REPLY • link 21 months ago by alpdeniz • 0