Question

Tools for calculating binary encoding (descriptors) of protein sequences?

10.8 years ago

newlife.well.2014 ▴ 10

Hello everyone,

Can you recommend some tools for calculating "binary" descriptors of amino acid sequences, similar to the fingerprints of compounds, such as PubChem fingerprints (ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt)?

Thank you.

Kevin

sequence • 2.9k views

updated 3.4 years ago by Ram 45k • written 10.8 years ago by newlife.well.2014 ▴ 10

Can you please give an example of "binary" descriptors of amino acid sequences?

10.8 years ago by Pierre Lindenbaum 166k

Just like this:

http://ddg-pharmfac.net/AllergenFP/method.html

The method is described there as follows:

The amino acids in the protein sequences in data sets were described by five E-descriptors and the strings were transformed into uniform vectors by auto-cross covariance (ACC) transformation.

The E-descriptors for the 20 naturally occurring amino acids, defined by Venkatarajan and Braun (J. Mol. Model (2001) 7:445-453), were derived by principal component analysis of a data matrix consisting of 237 physicochemical properties. The first principal component (E1) reflects the hydrophobicity of amino acids; the second (E2) - their size; the third (E3) - their helix-forming propensity; the fourth (E4) correlates with the relative abundance of amino acids; and the fifth (E5) is dominated by the β-strand forming propensity.

An auto-cross covariance (ACC) transformation was used to make the length of the proteins uniform. ACC is a protein sequence mining method developed by Wold et al. (Anal. Chim. Acta 1993; 277:239-253).
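To make the ACC step concrete, here is a minimal R sketch. It assumes the commonly quoted form of the transformation, where the term for descriptors j and k at a given lag is the sum over residues i of z[i, j] * z[i + lag, k], divided by (n - lag). The function name `acc_transform`, the choice of `max_lag`, and the random input are my own placeholders, not part of the AllergenFP description; real E-descriptor values would have to come from the Venkatarajan and Braun paper.

```r
# Hypothetical helper: ACC transformation of one protein.
# z:       numeric matrix, one row per residue, one column per descriptor
# max_lag: largest lag to use; must be smaller than the sequence length
acc_transform <- function(z, max_lag) {
  n <- nrow(z)
  stopifnot(is.matrix(z), is.numeric(z), max_lag >= 1, max_lag < n)
  per_lag <- lapply(seq_len(max_lag), function(lag) {
    head_part <- z[1:(n - lag), , drop = FALSE]  # residues i
    tail_part <- z[(1 + lag):n, , drop = FALSE]  # residues i + lag
    # entry (j, k) = sum_i z[i, j] * z[i + lag, k] / (n - lag),
    # flattened in column-major order
    as.vector(crossprod(head_part, tail_part) / (n - lag))
  })
  unlist(per_lag)
}

# Placeholder input, NOT real E-descriptors: 30 residues x 5 descriptors
set.seed(1)
z <- matrix(rnorm(30 * 5), nrow = 30, ncol = 5)
acc_vec <- acc_transform(z, max_lag = 3)
length(acc_vec)  # 5 * 5 * 3 = 75, whatever the sequence length
```

The point of the transformation is visible in the last line: the output length depends only on the number of descriptors and the number of lags, so proteins of different lengths end up with vectors of the same size.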

The subsets of antigens and non-antigens were transformed into matrices with 25 x 15 variables each. The derived matrix consisted of 4854 rows (2427 allergens and 2427 non-allergens) and 25 x 15 columns. Each column was divided into 11 intervals and a 25 x 15 x 11-digit binary fingerprint was generated for each protein. A digit in the fingerprint equals 1 if the ACC value falls into the corresponding interval; otherwise, it is 0. Thus, each protein has a unique binary fingerprint consisting of 25 x 15 units and (25 x 15 x 11 - 25 x 15) nulls. Tanimoto coefficients were calculated for all protein pairs in the set. A protein was classified as allergen or non-allergen according to the protein from the pair with the highest Tanimoto coefficient.
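The binning and Tanimoto steps described above could be sketched in R roughly as follows. This is only my reading of the description: I assume equal-width intervals per column (the text does not say how the 11 intervals are chosen), and the helper names `bin_columns`, `tanimoto` and `classify_nn`, as well as the random data and labels, are made up for illustration.

```r
# Hypothetical helper: split each column into n_bins equal-width intervals
# and one-hot encode the interval index (ncol(x) * n_bins digits per row).
bin_columns <- function(x, n_bins = 11) {
  stopifnot(is.matrix(x), is.numeric(x), n_bins >= 1)
  per_col <- lapply(seq_len(ncol(x)), function(j) {
    breaks <- seq(min(x[, j]), max(x[, j]), length.out = n_bins + 1)
    idx <- findInterval(x[, j], breaks, rightmost.closed = TRUE,
                        all.inside = TRUE)
    # one row per protein, one column per interval, exactly one 1 per row
    outer(idx, seq_len(n_bins), "==") * 1L
  })
  do.call(cbind, per_col)
}

# Tanimoto coefficient of two binary vectors: |a AND b| / |a OR b|
tanimoto <- function(a, b) {
  either <- sum(a | b)
  if (either == 0) 0 else sum(a & b) / either
}

# Label of the reference fingerprint most similar to the query
classify_nn <- function(query_fp, ref_fps, ref_labels) {
  sims <- apply(ref_fps, 1, tanimoto, b = query_fp)
  ref_labels[which.max(sims)]
}

# Placeholder data: 6 "proteins" x 4 ACC variables, with made-up labels
set.seed(42)
acc_mat <- matrix(rnorm(6 * 4), nrow = 6)
labels <- rep(c("allergen", "non-allergen"), each = 3)

fps <- bin_columns(acc_mat, n_bins = 11)
dim(fps)      # 6 rows, 4 * 11 = 44 columns
rowSums(fps)  # 4 for every row: one 1 per original variable
classify_nn(fps[1, ], fps[-1, , drop = FALSE], labels[-1])
```

Because every fingerprint contains the same number of ones, the Tanimoto coefficient here only depends on how many variables of two proteins fall into the same interval.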

OR like this paper:

Title: Algebraic Encoding and Protein Secondary Structure Prediction.

As far as I can tell, there is currently no software for calculating binary protein descriptors of this kind, although many packages, such as Rcpi from Bioconductor, can calculate real-valued protein descriptors.

Any suggestions would be appreciated.

updated 3.5 years ago by Ram 45k • written 10.8 years ago by newlife.well.2014 ▴ 10

Ram · Answer 1 · 2014-11-01

Hello, I have solved this problem myself, but there may be a better way to do it. Thanks, everyone.

convertToBinaryDesc <- function(desc_mat, k = 3) {
  # desc_mat: numeric matrix of real-valued descriptors (one row per protein)
  # k: number of equal-width intervals used for every descriptor
  stopifnot(is.matrix(desc_mat), is.numeric(desc_mat), k >= 1)

  # construct k intervals over the global range of the whole matrix
  # (note: not per column)
  interval_vec <- seq(from = min(desc_mat), to = max(desc_mat), length.out = k + 1)
  # number of continuous descriptors
  num_desc <- ncol(desc_mat)

  # More efficient method?
  # converts one row into num_desc * k binary digits
  convertRowDesc <- function(curr_row_vec) {
    # interval index (1..k) of each value; the maximum falls into interval k
    xx <- findInterval(curr_row_vec, interval_vec, rightmost.closed = TRUE)
    xx <- rep(xx, each = k)
    template <- rep(1:k, times = num_desc)
    # 1 where the interval index matches the template position, 0 elsewhere
    as.numeric(xx == template)
  }

  # apply() returns one column per input row, so transpose afterwards
  binary_desc <- apply(desc_mat, 1, convertRowDesc) # may be replaced by a parallel apply
  t(binary_desc)
}
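On the question in the code comment about a more efficient method: one possible alternative, offered only as a sketch, is to skip the row-wise `apply()` and fill the output through matrix indexing. The function name `convertToBinaryDescVec` is my own, and I am assuming the same layout as above (k digits per descriptor, intervals taken over the global range of the matrix).

```r
# Hypothetical vectorised variant: same output layout, no per-row loop.
convertToBinaryDescVec <- function(desc_mat, k = 3) {
  stopifnot(is.matrix(desc_mat), is.numeric(desc_mat), k >= 1)
  n <- nrow(desc_mat)
  p <- ncol(desc_mat)
  breaks <- seq(min(desc_mat), max(desc_mat), length.out = k + 1)
  # interval index of every element, in column-major order
  idx <- as.vector(findInterval(desc_mat, breaks, rightmost.closed = TRUE))
  rows <- rep(seq_len(n), times = p)             # row of each element
  cols <- (rep(seq_len(p), each = n) - 1) * k + idx  # target digit position
  out <- matrix(0, nrow = n, ncol = p * k)
  out[cbind(rows, cols)] <- 1                    # set one digit per element
  out
}

# Small worked example with k = 3, so the intervals are [0,2), [2,4), [4,6]
m <- matrix(c(0, 5,
              2, 1,
              6, 3), nrow = 2)  # row 1 = (0, 2, 6), row 2 = (5, 1, 3)
convertToBinaryDescVec(m, k = 3)
# row 1: 1 0 0  0 1 0  0 0 1
# row 2: 0 0 1  1 0 0  0 1 0
```

Whether this is actually faster depends on the size of the matrix, so it is worth timing both versions on real data before switching.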