Tools for calculating binary encoding (descriptors) of protein sequences?
9.6 years ago

Hello everyone,

Can you recommend tools for calculating "binary" descriptors of amino acid sequences, similar to compound fingerprints such as the PubChem fingerprints (ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt)?

Thank you.

Kevin

sequence • 2.6k views

Can you please give an example of "binary" descriptors of amino acid sequences?


Just like this:

http://ddg-pharmfac.net/AllergenFP/method.html

The method is described there as follows:

The amino acids in the protein sequences in data sets were described by five E-descriptors and the strings were transformed into uniform vectors by auto-cross covariance (ACC) transformation.

The E-descriptors for the 20 naturally occurring amino acids, defined by Venkatarajan and Braun (J. Mol. Model (2001) 7:445-453), were derived by principal component analysis of a data matrix consisting of 237 physicochemical properties. The first principal component (E1) reflects the hydrophobicity of amino acids; the second (E2) their size; the third (E3) their helix-forming propensity; the fourth (E4) correlates with the relative abundance of amino acids; and the fifth (E5) is dominated by the β-strand forming propensity.

An auto-cross covariance (ACC) transformation was used to make the length of the proteins uniform. ACC is a protein sequence mining method developed by Wold et al. (Anal. Chim. Acta 1993; 277:239-253).
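To make the length-uniforming step concrete, here is a minimal R sketch of an ACC transformation. It assumes one common form of the transform, ACC_jk(lag) = sum over i of E[i, j] * E[i + lag, k], divided by (n - lag), without mean-centering; the exact normalization used by the cited method is an assumption here. The function name `acc_transform`, the `max_lag` parameter, and the random descriptor values are all hypothetical placeholders, not real E-descriptors.

```r
# Auto-cross covariance for one sequence (sketch, see assumptions above).
# desc: n x p numeric matrix (one row per residue, one column per descriptor).
# Returns a vector of length p * p * max_lag, independent of sequence length n.
acc_transform <- function(desc, max_lag) {
  n <- nrow(desc)
  stopifnot(is.matrix(desc), max_lag >= 1, max_lag < n)
  out <- numeric(0)
  for (lag in seq_len(max_lag)) {
    head_rows <- desc[1:(n - lag), , drop = FALSE]
    tail_rows <- desc[(1 + lag):n, , drop = FALSE]
    # crossprod(A, B) is t(A) %*% B: entry (j, k) sums desc[i, j] * desc[i + lag, k]
    acc_lag <- crossprod(head_rows, tail_rows) / (n - lag)
    out <- c(out, as.vector(acc_lag))
  }
  out
}

set.seed(1)
# Placeholder values standing in for per-residue descriptors of two proteins
short_seq <- matrix(rnorm(30 * 5), nrow = 30, ncol = 5)
long_seq  <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)
length(acc_transform(short_seq, max_lag = 3))  # 75
length(acc_transform(long_seq, max_lag = 3))   # 75
```

Both sequences end up with 5 x 5 x 3 = 75 variables regardless of their length, which is the point of the transformation.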

The subsets of antigens and non-antigens were transformed into matrices with 25 x 15 variables each. The derived matrix consisted of 4854 rows (2427 allergens and 2427 non-allergens) and 25 x 15 columns. Each column was divided into 11 intervals and a 25 x 15 x 11-digit binary fingerprint was generated for each protein. A digit in the fingerprint equals 1 if the ACC value falls into the corresponding interval; otherwise, it is 0. Thus, each protein has a unique binary fingerprint consisting of 25 x 15 ones and (25 x 15 x 11 - 25 x 15) zeros. Tanimoto coefficients were calculated for all protein pairs in the set. A protein was classified as allergen or non-allergen according to the protein from the pair with the highest Tanimoto coefficient.
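The binning and nearest-neighbour steps in that description could be sketched in R roughly as follows. This is only an illustration under assumptions: equal-width intervals computed per column over that column's own range, ties in similarity broken by taking the first match, and made-up ACC values and labels. The function names (`acc_to_fingerprint`, `tanimoto`, `classify_nearest`) are hypothetical.

```r
# acc_mat: proteins in rows, ACC variables in columns.
# Each column is split into n_bins equal-width intervals over its own range,
# and exactly one bit per column is set for each protein.
acc_to_fingerprint <- function(acc_mat, n_bins = 11) {
  n_var <- ncol(acc_mat)
  fp <- matrix(0L, nrow = nrow(acc_mat), ncol = n_var * n_bins)
  for (j in seq_len(n_var)) {
    breaks <- seq(min(acc_mat[, j]), max(acc_mat[, j]), length.out = n_bins + 1)
    bin <- findInterval(acc_mat[, j], breaks, rightmost.closed = TRUE)
    bin <- pmin(pmax(bin, 1L), n_bins)  # guard against out-of-range indices
    fp[cbind(seq_len(nrow(acc_mat)), (j - 1) * n_bins + bin)] <- 1L
  }
  fp
}

# Tanimoto coefficient of two binary vectors: shared 1-bits / 1-bits in either
tanimoto <- function(a, b) {
  either <- sum(a == 1 | b == 1)
  if (either == 0) return(0)
  sum(a == 1 & b == 1) / either
}

# Label of the reference fingerprint most similar to the query
classify_nearest <- function(query_fp, ref_fp, ref_labels) {
  sims <- apply(ref_fp, 1, tanimoto, b = query_fp)
  ref_labels[which.max(sims)]
}

set.seed(42)
acc_mat <- matrix(rnorm(6 * 4), nrow = 6)  # placeholder ACC values, 6 proteins
labels <- rep(c("allergen", "non-allergen"), each = 3)  # placeholder labels
fp <- acc_to_fingerprint(acc_mat, n_bins = 11)
rowSums(fp)  # one bit per ACC variable, so 4 for every protein
classify_nearest(fp[1, ], fp[-1, , drop = FALSE], labels[-1])
```

With the 25 x 15 ACC variables and 11 intervals from the description, the same function would give 4125-digit fingerprints with 375 ones each.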

OR like this paper:

Title: Algebraic Encoding and Protein Secondary Structure Prediction.

As far as I can tell, there is currently no software that calculates this kind of binary protein descriptor, although many packages, such as Rcpi from Bioconductor, can calculate real-valued protein descriptors.

Any suggestions would be appreciated.

9.5 years ago

Hello, I have solved this problem myself, but there may be a better way to do it. Thanks, everyone.

convertToBinaryDesc <- function(desc_mat, k = 3) {
  # desc_mat: numeric matrix of real-valued descriptors
  #           (rows = proteins, columns = descriptors)
  # k: number of equal-width intervals per descriptor
  stopifnot(is.matrix(desc_mat), is.numeric(desc_mat))

  # Construct k equal-width intervals over the global range of the matrix
  maxvalue <- max(desc_mat)
  minvalue <- min(desc_mat)
  interval_vec <- seq(from = minvalue, to = maxvalue, length.out = k + 1)
  # Number of continuous descriptors
  num_desc <- ncol(desc_mat)

  # Convert one row of descriptors into num_desc * k binary digits.
  # Is there a more efficient method?
  convertRowDesc <- function(curr_row_vec) {
    # Interval index (1..k) of each value; rightmost.closed puts the maximum in interval k
    xx <- findInterval(curr_row_vec, interval_vec, rightmost.closed = TRUE)
    xx <- rep(xx, each = k)
    template <- rep(1:k, times = num_desc)
    # A digit is 1 only where the value's interval matches the template position
    as.numeric(xx == template)
  }

  binary_desc <- apply(desc_mat, 1, convertRowDesc) # could be parallelized
  # apply() returns one column per input row, so transpose back
  t(binary_desc)
}
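
The function's own comment asks whether there is a more efficient method. One possible answer, offered as a sketch and not benchmarked, is to drop the per-row apply() and bin the whole matrix in one call, then set the 1-bits by matrix indexing. The name `convert_to_binary_vectorized` is hypothetical; the layout (k digits per descriptor, intervals over the global range of the matrix) is meant to match the function above.

```r
convert_to_binary_vectorized <- function(desc_mat, k = 3) {
  stopifnot(is.matrix(desc_mat), is.numeric(desc_mat), k >= 1)
  n <- nrow(desc_mat)
  p <- ncol(desc_mat)
  # Same global equal-width intervals as the row-wise version
  breaks <- seq(min(desc_mat), max(desc_mat), length.out = k + 1)
  # findInterval() returns a plain vector of length n * p in column-major order
  bin <- findInterval(desc_mat, breaks, rightmost.closed = TRUE)
  row_idx <- rep(seq_len(n), times = p)
  # Descriptor j occupies output columns (j - 1) * k + 1 .. j * k
  col_idx <- (rep(seq_len(p), each = n) - 1L) * k + bin
  out <- matrix(0, nrow = n, ncol = p * k)
  out[cbind(row_idx, col_idx)] <- 1
  out
}

# Small made-up example: 2 proteins, 3 descriptors, 2 intervals (breaks 0, 0.5, 1)
desc <- matrix(c(0.0, 0.5, 1.0,
                 1.0, 0.2, 0.9), nrow = 2, byrow = TRUE)
convert_to_binary_vectorized(desc, k = 2)
# row 1: 1 0 0 1 0 1
# row 2: 0 1 1 0 0 1
```

Whether this is noticeably faster will depend on the matrix size, so it is worth timing both versions on real data before switching.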