Can someone explain how to extract the bi-gram feature distribution from a given protein sequence? For example, consider the amino acid sequence MNNPQMNPQRS. The extracted two-gram features are (MN,2), (NN,1), (NP,2), (PQ,2), (QM,1), (QR,1), (RS,1).
The passage I am working from is quoted below for reference.
Reference text:
A. Feature extraction
Proteins (also known as polypeptides) are organic compounds made of amino acids arranged in a linear chain or folded into a globular form. The amino acids are joined together by the peptide bonds between the carboxyl and amino groups of adjacent amino acid residues. In general, the genetic code specifies 20 standard amino acids, Σ = (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y). For protein feature selection, the two-gram features (AA, AC, ..., AY), (CA, CC, ..., CY), ..., (YA, YC, ..., YY) are selected. The total number of possible bi-grams from a set of 20 amino acids is 20^2, that is, 400. The two-gram features represent the majority of the protein features. Two-grams have the advantages of being length invariant, insertion/deletion invariant, not requiring motif finding, and allowing classification based on local similarity [7], [8].
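The counting step described above can be sketched in Python as follows. This is only an illustration, not the paper's implementation: the function names `bigram_counts` and `bigram_vector` are made up for this example, and it assumes that a bi-gram is an overlapping pair of adjacent residues and that the 400-value vector is ordered alphabetically by pair (AA, AC, ..., YY).

```python
from collections import Counter
from itertools import product

# The 20-letter amino acid alphabet listed in the text.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def bigram_counts(sequence):
    """Count overlapping pairs of adjacent residues (hypothetical helper)."""
    return Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))


def bigram_vector(sequence):
    """Return the 400 counts in a fixed order: AA, AC, ..., YY (assumed order)."""
    counts = bigram_counts(sequence)
    return [counts.get(a + b, 0) for a, b in product(AMINO_ACIDS, repeat=2)]


counts = bigram_counts("MNNPQMNPQRS")
print(dict(counts))
# {'MN': 2, 'NN': 1, 'NP': 2, 'PQ': 2, 'QM': 1, 'QR': 1, 'RS': 1}

vector = bigram_vector("MNNPQMNPQRS")
print(len(vector), sum(vector))  # 400 10
```

A sequence of length n has n - 1 overlapping bi-grams, so the 11-residue example gives 10 pairs in total. Whether the counts should then be normalized (for example, divided by n - 1) is not stated in the quoted text.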
Apart from this, bi-grams reflecting the pattern of substitution of amino acids are also extracted. For this purpose, equivalence classes of amino acids that substitute for one another are derived from the percent accepted mutation (PAM) matrix [14]. Exchange grams are similar to two-grams but are based on a many-to-one translation of the amino acid alphabet into a six-letter alphabet that represents six groups of amino acids with high evolutionary similarity. Generally the exchange groups used are: e1 = {H, R, K}, e2 = {D, E, N, Q}, e3 = {C}, e4 = {S, T, P, A, G}, e5 = {M, I, L, V}, and e6 = {F, Y, W}. The exchange groups statistically describe the probability of one amino acid replacing another over time. The total number of possible bi-grams on these six substitution groups is 6^2, that is, 36. Thus the bi-gram measure results in the computation of 436 values: 400 corresponding to the consecutive pairs of amino acids and 36 corresponding to the consecutive pairs of substitution groups. Besides that, the amino acid distribution (20) and exchange group distribution (6) are also taken into account.
Consider the amino acid sequence MNNPQMNPQRS. The extracted two-gram features are (MN,2), (NN,1), (NP,2), (PQ,2), (QM,1), (QR,1), (RS,1). The above sequence can be denoted in terms of the six-letter exchange group alphabet as e5e2e2e4e2e5e2e4e2e1e4. The two-gram features of the exchange groups can be denoted as {(e5e2, 2), (e2e2, 1), (e2e4, 2), (e4e2, 2), (e2e5, 1), (e2e1, 1), (e1e4, 1)}.
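The exchange-group translation and its bi-gram counts can be sketched the same way. Again this is a minimal illustration under assumptions: the names `EXCHANGE_GROUPS`, `to_exchange_groups` and `exchange_bigram_counts` are hypothetical, the group memberships are copied from the text, and the input is assumed to contain only the 20 standard residues (any other letter raises a `KeyError`).

```python
from collections import Counter

# Exchange groups as listed in the text.
EXCHANGE_GROUPS = {
    "e1": "HRK",
    "e2": "DENQ",
    "e3": "C",
    "e4": "STPAG",
    "e5": "MILV",
    "e6": "FYW",
}

# Many-to-one map from residue to group label, e.g. 'M' -> 'e5'.
RESIDUE_TO_GROUP = {
    residue: group
    for group, members in EXCHANGE_GROUPS.items()
    for residue in members
}


def to_exchange_groups(sequence):
    """Translate each residue into its exchange group label."""
    return [RESIDUE_TO_GROUP[residue] for residue in sequence]


def exchange_bigram_counts(sequence):
    """Count overlapping pairs of adjacent exchange group labels."""
    groups = to_exchange_groups(sequence)
    return Counter(a + b for a, b in zip(groups, groups[1:]))


print("".join(to_exchange_groups("MNNPQMNPQRS")))
# e5e2e2e4e2e5e2e4e2e1e4

print(dict(exchange_bigram_counts("MNNPQMNPQRS")))
# {'e5e2': 2, 'e2e2': 1, 'e2e4': 2, 'e4e2': 2, 'e2e5': 1, 'e2e1': 1, 'e1e4': 1}
```

The translation is done on a list of labels, not on a joined string, so that each two-character label such as e5 is treated as one symbol when the pairs are formed.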