This heavily depends on what the downstream classifier is doing. For instance, if you're looking to relate different mutations of the same gene to (say) salt tolerance, then you can generally one-hot encode the individual mutations or treat the full genes as categorical, and you don't have to worry at all about differences in gene lengths.
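As a minimal sketch of that first case (the column names and values below are hypothetical, just to illustrate the idea):

>>> import pandas as pd
>>> # One row per variant: a categorical mutation label plus the measured phenotype
>>> df = pd.DataFrame({'mutation': ['D10A', 'H840A', 'D10A', 'WT'],
...                    'salt_tolerance': [0, 1, 0, 1]})
>>> X = pd.get_dummies(df['mutation'])   # one indicator column per observed mutation
>>> y = df['salt_tolerance']

Here sequence length never enters the picture: each mutation (or each whole gene) is just a categorical level.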
For general tasks, the approach is typically to select a fixed residue length (which could simply be the length of the longest protein) and generate an (N_residue, 21) array for each sequence:
>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> # 20 amino acids plus '-' as the padding/gap symbol
>>> AA_LIST = ['-', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
>>> MAX_LEN = 125
>>> seq = 'MDKKYSIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFG'
>>> # Truncate long sequences, pad short ones with '-'
>>> seq_pad = seq[:MAX_LEN] if len(seq) > MAX_LEN else seq + '-' * (MAX_LEN - len(seq))
>>> seq_df = pd.DataFrame([{'position': i, 'aa': x} for i, x in enumerate(seq_pad)])
>>> # categories takes one list per column: positions 0..MAX_LEN-1 and the 21 residue symbols
>>> seq_ohe = OneHotEncoder(categories=[list(range(MAX_LEN)), AA_LIST]).fit_transform(seq_df)
Note that there are significantly better approaches to selecting the best set of MAX_LEN amino acids than using the first ones; the above is just a simple example. The above also one-hot encodes the position (just remove 'position' if you don't want that).
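For example (purely as an illustration of doing something other than keeping the first residues; the helper below is hypothetical), you could centre the crop window on the sequence, or derive the window from an alignment of your sequences:

>>> def centre_crop_or_pad(seq, max_len=MAX_LEN, pad='-'):
...     """Keep the central max_len residues, or pad short sequences with '-'."""
...     if len(seq) > max_len:
...         start = (len(seq) - max_len) // 2
...         return seq[start:start + max_len]
...     return seq + pad * (max_len - len(seq))
>>> seq_pad = centre_crop_or_pad(seq)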
Importantly, your input feature matrix is of shape (N_sequences, N_residue, 21) (or (N_sequences, N_residue, 21 + MAX_LEN) with the position encoding), which is not compatible with scikit-learn, which wants (N_record, N_features) matrices; so you'll need to reshape it into (N_sequences, N_residue x 21). Neural networks, on the other hand, will easily handle (., N_residue, 21) inputs; indeed, this is one of the input features (but far from the only one) for many protein-oriented network architectures, including AlphaFold.
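A rough sketch of that flattening step (assuming seqs_pad is a hypothetical list of sequences already padded/truncated to MAX_LEN, and AA_LIST as above):

>>> import numpy as np
>>> aa_to_idx = {aa: i for i, aa in enumerate(AA_LIST)}
>>> eye = np.eye(len(AA_LIST))
>>> # (N_sequences, MAX_LEN, 21): one-hot residue identities, no position encoding here
>>> X = np.stack([[eye[aa_to_idx[aa]] for aa in s] for s in seqs_pad])
>>> # (N_sequences, MAX_LEN * 21): flat per-sequence feature vectors for scikit-learn
>>> X_flat = X.reshape(X.shape[0], -1)

X_flat is the shape scikit-learn estimators expect, whereas the 3-D X is the form you would feed to a sequence-aware neural network.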