This heavily depends on what the downstream classifier is doing. For instance, if you're looking to relate different mutations of the same gene to (say) salt tolerance, then you can generally one-hot encode the individual mutations or treat the full genes as categorical, and you don't have to worry at all about differences in gene lengths.
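As a minimal sketch of that first case (the column names and values below are hypothetical, just to illustrate the idea):

>>> import pandas as pd
>>> # One row per variant: a categorical mutation label plus the measured phenotype
>>> df = pd.DataFrame({'mutation': ['D10A', 'H840A', 'D10A', 'WT'],
...                    'salt_tolerance': [0, 1, 0, 1]})
>>> X = pd.get_dummies(df['mutation'])   # one indicator column per observed mutation
>>> y = df['salt_tolerance']

Here sequence length never enters the picture: each mutation (or each whole gene) is just a categorical level.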
For general tasks, the approach is typically to select a fixed residue length (which could simply be the length of the longest protein) and generate an (N_residue, 21) array for each sequence:
>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> # 20 amino acids plus '-' as the padding/gap symbol
>>> AA_LIST = ['-', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
>>> MAX_LEN = 125
>>> seq = 'MDKKYSIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFG'
>>> # Truncate long sequences, pad short ones with '-'
>>> seq_pad = seq[:MAX_LEN] if len(seq) > MAX_LEN else seq + '-' * (MAX_LEN - len(seq))
>>> seq_df = pd.DataFrame([{'position': i, 'aa': x} for i, x in enumerate(seq_pad)])
>>> # categories takes one list per column: positions 0..MAX_LEN-1 and the 21 residue symbols
>>> seq_ohe = OneHotEncoder(categories=[list(range(MAX_LEN)), AA_LIST]).fit_transform(seq_df)
Note that there are significantly better approaches to selecting the best set of MAX_LEN amino acids than using the first ones; the above is just a simple example. The above also one-hot encodes the position (just remove 'position' if you don't want that).
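For example (purely as an illustration of doing something other than keeping the first residues; the helper below is hypothetical), you could centre the crop window on the sequence, or derive the window from an alignment of your sequences:

>>> def centre_crop_or_pad(seq, max_len=MAX_LEN, pad='-'):
...     """Keep the central max_len residues, or pad short sequences with '-'."""
...     if len(seq) > max_len:
...         start = (len(seq) - max_len) // 2
...         return seq[start:start + max_len]
...     return seq + pad * (max_len - len(seq))
>>> seq_pad = centre_crop_or_pad(seq)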
Importantly, your input feature matrix is of shape (N_sequences, N_residue, 21) (or (N_sequences, N_residue, 21 + MAX_LEN) with the position encoding), which is not compatible with scikit-learn, which wants (N_record, N_features) matrices; so you'll need to reshape it into (N_sequences, N_residue x 21). Neural networks, on the other hand, will easily handle (., N_residue, 21) inputs; indeed, this is one of the input features (but far from the only one) for many protein-oriented network architectures, including AlphaFold.
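A rough sketch of that flattening step (assuming seqs_pad is a hypothetical list of sequences already padded/truncated to MAX_LEN, and AA_LIST as above):

>>> import numpy as np
>>> aa_to_idx = {aa: i for i, aa in enumerate(AA_LIST)}
>>> eye = np.eye(len(AA_LIST))
>>> # (N_sequences, MAX_LEN, 21): one-hot residue identities, no position encoding here
>>> X = np.stack([[eye[aa_to_idx[aa]] for aa in s] for s in seqs_pad])
>>> # (N_sequences, MAX_LEN * 21): flat per-sequence feature vectors for scikit-learn
>>> X_flat = X.reshape(X.shape[0], -1)

X_flat is the shape scikit-learn estimators expect, whereas the 3-D X is the form you would feed to a sequence-aware neural network.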