How to one-hot encode AA sequences/peptides of different lengths (with sklearn)?
2.9 years ago
simone • 0

My question is how AA sequences should be one-hot-encoded.

Do the resulting vectors/arrays all need to have the same dimension (dictated by the longest sequence)? Do you have to signify the end of a sequence? If so, do you add the "stopping letter" to the sequence before the encoding or during?

It would be especially helpful if someone could show me how to encode AA sequences with sklearn.preprocessing.OneHotEncoder, but I am mainly asking about the general approach to one-hot encoding sequences.

one-hot-encoding encoding sklearn peptides python • 3.1k views
2.9 years ago
LChart 5.1k

This depends heavily on what the downstream classifier is doing. For instance, if you're looking to relate different mutations of the same gene to (say) salt tolerance, then you can generally one-hot encode the individual mutations or treat the full genes as categorical, and you don't have to worry at all about differences in gene lengths.

For general tasks, the approach is typically to select a fixed residue length (which could simply be the length of the longest protein) and generate a (N_residue, 21) array for each sequence:

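>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> AA_LIST = ['-', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
>>> MAX_LEN = 125
>>> seq = 'MDKKYSIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFG'
>>> # Truncate to MAX_LEN, or right-pad with '-' up to MAX_LEN
>>> seq_pad = seq[:MAX_LEN] if len(seq) > MAX_LEN else seq + '-' * (MAX_LEN - len(seq))
>>> seq_df = pd.DataFrame([{'position': i, 'aa': x} for i, x in enumerate(seq_pad)])
>>> # categories takes one list per column: positions first, then residues
>>> encoder = OneHotEncoder(categories=[list(range(MAX_LEN)), AA_LIST])
>>> seq_ohe = encoder.fit_transform(seq_df)  # sparse matrix, shape (MAX_LEN, MAX_LEN + 21)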

Note that there are significantly better approaches to selecting the best set of MAX_LEN amino acids than taking the first MAX_LEN residues; the above is just a simple example. The above also one-hot encodes the position (just drop the 'position' column if you don't want that).
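For the residue-only case, a minimal sketch might look like the following. The short MAX_LEN, the example peptide, the pad() helper, and the choice of handle_unknown='ignore' (letters outside AA_LIST become all-zero rows) are assumptions made for illustration, not part of the answer above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

AA_LIST = ['-', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
           'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
MAX_LEN = 10  # hypothetical, kept short for readability


def pad(seq, max_len=MAX_LEN, pad_char='-'):
    """Truncate to max_len, or right-pad with pad_char (hypothetical helper)."""
    return seq[:max_len].ljust(max_len, pad_char)


seq_pad = pad('MDKKYS')

# OneHotEncoder expects a 2D input: one row per residue, one column ('aa').
residues = np.array(list(seq_pad)).reshape(-1, 1)

# One category list per input column; here there is a single column.
encoder = OneHotEncoder(categories=[AA_LIST], handle_unknown='ignore')
seq_ohe = encoder.fit_transform(residues).toarray()

print(seq_ohe.shape)  # (MAX_LEN, 21)
```

The padding character '-' is simply treated as a 21st category here, so the padding is applied before encoding rather than during it.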

Importantly, your input feature matrix has shape (N_sequences, N_residue, 21) (or (N_sequences, N_residue, 21 + MAX_LEN) if you also encode position). That is not compatible with scikit-learn, which expects (N_record, N_features) matrices, so you'll need to reshape it into (N_sequences, N_residue × 21). Neural networks, on the other hand, will easily handle (., N_residue, 21) inputs; indeed, this is one of the input features (but far from the only one) for many protein-oriented network architectures, including AlphaFold.
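The reshape step can be sketched with plain NumPy. The three example peptides, the one_hot() helper, and the choice of padding to the longest sequence are assumptions for illustration; any fixed length would work the same way.

```python
import numpy as np

AA_LIST = ['-', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
           'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
AA_INDEX = {aa: i for i, aa in enumerate(AA_LIST)}


def one_hot(seq, max_len):
    """Return a (max_len, 21) array for one sequence (hypothetical helper)."""
    padded = seq[:max_len].ljust(max_len, '-')
    out = np.zeros((max_len, len(AA_LIST)), dtype=np.float32)
    for pos, aa in enumerate(padded):
        # Raises KeyError for letters outside AA_LIST (e.g. 'X').
        out[pos, AA_INDEX[aa]] = 1.0
    return out


seqs = ['MDKKYS', 'ACD', 'GLDIGTNSVG']  # made-up example peptides
max_len = max(len(s) for s in seqs)     # pad to the longest sequence

# 3D tensor, suitable as neural-network input: (N_sequences, N_residue, 21)
X3d = np.stack([one_hot(s, max_len) for s in seqs])

# Flattened 2D matrix for scikit-learn estimators: (N_sequences, N_residue * 21)
X2d = X3d.reshape(len(seqs), -1)

print(X3d.shape, X2d.shape)
```

In the flattened matrix, column `pos * 21 + k` corresponds to residue `AA_LIST[k]` at position `pos`, so each sequence keeps its positional information even without a separate position encoding.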
