Can you please give an example of '"binary" descriptors of amino acid sequences'?
Just like this:
http://ddg-pharmfac.net/AllergenFP/method.html
It is described as follows:
The amino acids in the protein sequences in the data sets were described by five E-descriptors and the strings were transformed into uniform vectors by auto-cross covariance (ACC) transformation.
The E-descriptors for the 20 naturally occurring amino acids, defined by Venkatarajan and Braun (J. Mol. Model. (2001) 7:445-453), were derived by principal component analysis of a data matrix consisting of 237 physicochemical properties. The first principal component (E1) reflects the hydrophobicity of amino acids; the second (E2), their size; the third (E3), their helix-forming propensity; the fourth (E4) correlates with the relative abundance of amino acids; and the fifth (E5) is dominated by the β-strand-forming propensity.
An auto-cross covariance (ACC) transformation was used to make the length of the proteins uniform. ACC is a protein sequence mining method developed by Wold et al. (Anal. Chim. Acta 1993; 277:239-253).
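As a rough illustration of how ACC turns sequences of different lengths into vectors of the same length, here is a minimal Python sketch. It assumes the commonly cited form ACC_jk(lag) = sum_i E_j(i) * E_k(i + lag) / (n - lag), without mean-centering, and it reads "25 x 15" as 5 x 5 descriptor pairs times 15 lags; both are my assumptions. The names acc_transform and PLACEHOLDER_E are hypothetical, and the descriptor table is filled with random numbers, not the published E-values.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# PLACEHOLDER values only: random numbers standing in for the five
# E-descriptors of each amino acid. Substitute the published table.
_rng = np.random.default_rng(0)
PLACEHOLDER_E = {aa: _rng.normal(size=5) for aa in AMINO_ACIDS}


def acc_transform(desc, max_lag):
    """Auto-cross covariance of a (n_residues, n_descriptors) array.

    Returns a flat vector of length n_descriptors**2 * max_lag, so the
    output length does not depend on the sequence length.
    """
    desc = np.asarray(desc, dtype=float)
    n, _ = desc.shape
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    blocks = []
    for lag in range(1, max_lag + 1):
        # cov[j, k] = sum_i desc[i, j] * desc[i + lag, k] / (n - lag)
        cov = desc[:-lag].T @ desc[lag:] / (n - lag)
        blocks.append(cov.ravel())
    return np.concatenate(blocks)


def encode(sequence, table=PLACEHOLDER_E):
    """Map a sequence string to a (length, 5) descriptor array."""
    return np.array([table[aa] for aa in sequence])


short_seq = "MKTAYIAKQRQISFVKSHFS"            # 20 residues
long_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLI"   # 30 residues
v_short = acc_transform(encode(short_seq), max_lag=15)
v_long = acc_transform(encode(long_seq), max_lag=15)
print(v_short.shape, v_long.shape)  # both vectors have 5 * 5 * 15 entries
```

Because every lag contributes a fixed 5 x 5 block, the maximum lag has to be smaller than the shortest protein in the set.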
The subsets of allergens and non-allergens were transformed into matrices with 25 x 15 variables each. The derived matrix consisted of 4854 rows (2427 allergens and 2427 non-allergens) and 25 x 15 columns. Each column was divided into 11 intervals and a 25 x 15 x 11-digit binary fingerprint was generated for each protein. A digit in the fingerprint equals 1 if the ACC value falls into the corresponding interval; otherwise, it is 0. Thus, each protein has a unique binary fingerprint consisting of 25 x 15 ones and (25 x 15 x 11 - 25 x 15) zeros. Tanimoto coefficients were calculated for all protein pairs in the set. A protein was classified as an allergen or non-allergen according to the class of the protein with which it had the highest Tanimoto coefficient.
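The fingerprint and nearest-neighbour steps can be sketched in Python as below. The description does not say how the 11 intervals are chosen, so this sketch assumes equal-width intervals between each column's minimum and maximum. All function names are hypothetical, and this is not AllergenFP's actual code.

```python
import numpy as np


def equal_width_edges(acc_matrix, n_bins=11):
    """Interval boundaries per column, shape (n_vars, n_bins + 1).

    ASSUMPTION: equal-width intervals from the column minimum to maximum.
    """
    acc_matrix = np.asarray(acc_matrix, dtype=float)
    lo, hi = acc_matrix.min(axis=0), acc_matrix.max(axis=0)
    return np.linspace(lo, hi, n_bins + 1, axis=1)


def binary_fingerprint(acc, edges):
    """One-hot encode each ACC value by the interval it falls into."""
    acc = np.asarray(acc, dtype=float)
    n_vars, n_edges = edges.shape
    n_bins = n_edges - 1
    fp = np.zeros((n_vars, n_bins), dtype=np.uint8)
    for v in range(n_vars):
        b = np.searchsorted(edges[v], acc[v], side="right") - 1
        # Values outside the range go to the first or last interval.
        fp[v, min(max(b, 0), n_bins - 1)] = 1
    return fp.ravel()


def tanimoto(a, b):
    """Shared 1-bits divided by 1-bits present in either fingerprint."""
    both = int(np.sum(a & b))
    either = int(np.sum(a | b))
    return both / either if either else 0.0


def classify(query_fp, train_fps, train_labels):
    """Return the label of the training protein with the highest Tanimoto."""
    scores = [tanimoto(query_fp, fp) for fp in train_fps]
    return train_labels[int(np.argmax(scores))]


# Toy example: 2 variables and 2 intervals instead of 25 x 15 and 11.
edges = np.array([[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]])
fp_a = binary_fingerprint([0.05, 0.95], edges)
fp_b = binary_fingerprint([0.10, 0.20], edges)
label = classify(fp_a, [fp_b, fp_a], ["non-allergen", "allergen"])
print(fp_a, fp_b, tanimoto(fp_a, fp_b), label)
```

Since each variable sets exactly one bit, two proteins that agree on m of V variables get a Tanimoto coefficient of m / (2V - m).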
Or like this paper:
Title: Algebraic Encoding and Protein Secondary Structure Prediction.
As far as I can tell, there is currently no software that calculates binary protein descriptors of this kind, although many packages, such as Rcpi on Bioconductor, can calculate real-valued protein descriptors.
Any suggestions would be appreciated.