Question: Feature extraction from protein sequences for machine learning classification
1
5.6 years ago by
insilico12310
insilico12310 wrote:

How can I extract features from protein sequences so that each sequence can be converted into a vector for training a machine learning model? In some papers I found methods that use AAindex or PSSM to build the training data, but I could not find a detailed description of how this is done. Please suggest some papers or links that could be helpful.

modified 2.1 years ago by allmotog0 • written 5.6 years ago by insilico12310

From the literature I found the following article:

VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines

It uses auto cross covariance (ACC). I have written the following Python code to calculate it. Please let me know whether it is working correctly.

http://biotoolsinsilico.blogspot.in/2014/07/auto-cross-covariance-python.html

import numpy as np

# The z1, z2 and z3 descriptors are used to represent the protein sequence.
# j indexes the z-scales (j = 1, 2, 3)
# n is the number of amino acids in the sequence
# i is the amino acid position (i = 1, 2, ..., n)
# l is the lag (l = 1, 2, ..., L); a short range of lags is used (L = 5)
Z = np.random.rand(3, 80)  # placeholder data: 3 z-scales x 80 residues
print(Z)
# Z = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
n = Z.shape[1]  # sequence length
print(n)
max_lag = 5

# Autocovariance: same z-scale at positions i and i+l
column = []
for j in range(3):
    row = []
    for l in range(1, max_lag + 1):
        summ = 0
        for i in range(n - l):
            summ += (Z[j, i] * Z[j, i + l]) / (n - l)
        row.append(summ)
    column.append(row)

R = np.array(column)  # shape (3, max_lag)
print(R)

# Cross covariance: z-scale j at position i, z-scale k at position i+l
ja = [0, 1, 2, 0, 1, 2]
ka = [1, 0, 0, 2, 2, 1]
column = []
for j, k in zip(ja, ka):
    row = []
    for l in range(1, max_lag + 1):
        summ = 0
        for i in range(n - l):
            summ += (Z[j, i] * Z[k, i + l]) / (n - l)
        row.append(summ)
    column.append(row)

C = np.array(column)  # shape (6, max_lag)
print(C)
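A more compact way to get all auto- and cross-covariance terms at once, and to flatten them into a single feature vector, could look like the sketch below. The function name `acc_features` and the toy matrix are only for illustration. It assumes `Z` is already a (descriptors x residues) matrix of z-scale values and does not do any scaling or centering.

```python
import numpy as np

def acc_features(Z, max_lag=5):
    """Flatten all auto- and cross-covariance terms for lags 1..max_lag into one vector."""
    Z = np.asarray(Z, dtype=float)
    d, n = Z.shape  # d descriptors, n residues
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    feats = []
    for lag in range(1, max_lag + 1):
        # Entry [j, k] is sum_i Z[j, i] * Z[k, i + lag] / (n - lag).
        # The diagonal holds the autocovariances, the rest the cross covariances.
        cov = Z[:, :n - lag] @ Z[:, lag:].T / (n - lag)
        feats.append(cov.ravel())
    return np.concatenate(feats)

# Toy descriptor matrix (3 descriptors x 4 residues), not real z-scale values.
Z = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], dtype=float)
vec = acc_features(Z, max_lag=2)
print(vec.shape)
```

The vector length is d * d * max_lag whatever the sequence length is, so sequences of different lengths end up as vectors of the same size.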

3
5.6 years ago by
Quak310
United States
Quak310 wrote:

The features you want to extract fall into two groups: 1) sequence-based features, and 2) features extracted from the predicted structure.

Amino acid composition, amino acid properties, amino acid distribution, etc. are in group one. There are mainly two R packages: Seqinr, and BioSeqClass from Bioconductor. I attach a table from my thesis, which summarizes them.
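As a small illustration of a group-one feature, amino acid composition can be computed in a few lines of Python. This is only a sketch, not the Seqinr or BioSeqClass implementation. The function name and the example sequence are made up, and non-standard residue letters are simply ignored.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aa_composition(seq):
    """Fraction of each standard amino acid in seq, in AMINO_ACIDS order."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)  # ignores non-standard letters
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]

features = aa_composition("MKVLAAGIVK")  # made-up example sequence
print(len(features))
```

Every sequence gives a vector of 20 numbers, so the result can be used directly as a row of a training matrix.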

I would recommend reading this paper; the method is implemented in the BioSeqClass package.

Prediction of protein folding class using global description of amino acid sequence. PNAS, 92(19):8700–8704, 1995 (BioSeqClass - Bioconductor package)
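The paper cited above describes a sequence through the composition, transition and distribution of residue classes. A rough Python sketch of the composition and transition parts for a single property might look like this. The three-class hydrophobicity grouping, the function name and the example sequence are assumptions for illustration and should be checked against the paper. This is not the BioSeqClass code, and the distribution descriptors are left out.

```python
# Assumed three-class hydrophobicity grouping; verify against the paper before use.
GROUPS = {
    "P": "RKEDQN",   # polar
    "N": "GASTPHY",  # neutral
    "H": "CLVIMFW",  # hydrophobic
}
CLASS_OF = {aa: g for g, members in GROUPS.items() for aa in members}

def composition_transition(seq):
    """Return [comp_P, comp_N, comp_H, trans_PN, trans_PH, trans_NH]."""
    # Replace each residue by its class letter; unknown letters are skipped.
    encoded = [CLASS_OF[a] for a in seq.upper() if a in CLASS_OF]
    n = len(encoded)
    if n < 2:
        raise ValueError("need at least two classifiable residues")
    # Composition: fraction of residues in each class.
    comp = [encoded.count(g) / n for g in "PNH"]
    # Transition: fraction of neighbouring pairs that switch between two classes,
    # counted in both directions.
    pairs = list(zip(encoded, encoded[1:]))
    trans = []
    for a, b in (("P", "N"), ("P", "H"), ("N", "H")):
        hits = sum(1 for x, y in pairs if {x, y} == {a, b})
        trans.append(hits / (n - 1))
    return comp + trans

ct = composition_transition("MKVLAAGIVK")  # made-up example sequence
print(ct)
```

Repeating this for several properties and concatenating the results gives one fixed-length vector per sequence.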

Hi,

I am trying to get a secondary structure prediction in R. I ran the following code, using predictPROTEUS from BioSeqClass:

PROTEUS = predictPROTEUS(proteinSeq[1:2],proteus2.organism="euk")
Error in file(file, "rt") : cannot open the connection
1: running command 'perl C:\Users\D58B~1\AppData\Local\Temp\Rtmpu8B2vI\file1728128132c4.pl' had status 127
2: In file(file, "rt") :
cannot open file 'C:\Users\D58B~1\AppData\Local\Temp\Rtmpu8B2vI\file1728128132c4.proteus2': No such file or directory

Any suggestions?

Hi Quak,

I'm just starting on this topic and want to do something similar. I'm working in Python, writing descriptors for amino acid sequences. I saw the table from your thesis. My question is about your data, because my data consists of many antibody sequences. Was your data heterogeneous? I mean in terms of sequence length: how does that affect the computation?

Mine were enzyme families, so within a family the sequences are homogeneous, but across families they are (relatively) heterogeneous.

If your sequences are homogeneous, the biological functions are hidden in subtle amino acid differences. In other words, most of the features would be redundant. But you might be able to align them all and see what those subtle differences are...

But if the sequences are heterogeneous, you would have an easier life, since the features are not redundant.

I don't think the length of the sequence matters unless you want to predict the structure...
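For the homogeneous case, one possible way to find the subtle differences after alignment is to score each alignment column by how variable it is. The sketch below uses Shannon entropy per column. It assumes the sequences are already aligned to equal length by some external tool, and the function name and the toy alignment are made up for illustration.

```python
from collections import Counter
import math

def column_entropy(aligned_seqs):
    """Shannon entropy (bits) of each column of equal-length, pre-aligned sequences."""
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    entropies = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        total = len(column)
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

# Made-up toy alignment; real input would come from an alignment program.
aligned = ["QVQLVESG", "QVQLVQSG", "EVQLVESG", "QVQLVESG"]
ent = column_entropy(aligned)
variable_cols = [i for i, h in enumerate(ent) if h > 0]
print(variable_cols)
```

Columns with zero entropy are identical in every sequence and carry no information for a classifier, so features can be restricted to the variable columns.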

0
4.3 years ago by
Belgium
Ibrahim Tanyalcin1.0k wrote:

The usefulness of my answer depends on how many different proteins you are interested in. For single proteins, you can generate circular graphs of your sequence using I-PV (http://i-pv.org/). Then you can extract features either based on chemical property or by directly choosing amino acids.

In the first example I extract the sequence of aromatic residues, 50 amino acids per line. Watch it below:

http://i-pv.org/gifs/featureExtraction1.gif

In the second example, I first select some amino acids to display on the text track, then make the font size a bit bigger. I then show them on the scatter track underneath by clicking "sequence display" in the drop-down menu. Finally, I extract these features based on sequence, 100 amino acids per line. Here is how I did it:

http://i-pv.org/gifs/featureExtraction2.gif

I hope this helps,

Good luck,

0
2.1 years ago by
allmotog0
allmotog0 wrote:

The "table from my thesis" in the first answer is not working. Please send me the table at allmotog@gmail.com. Thanks in advance.

"Amino acid composition, amino acid property, amino acid distribution and etc are in group one. There are mainly two R packages, Seqinr and BioSeqClass from Bioconductor. I attach a table from my thesis, which summarize this threat."