Question: Feature extraction from protein sequences for machine learning classification
insilico123 wrote, 4.2 years ago:

How can I extract features from protein sequences so that they can be converted into vectors for training a machine learning model? In some papers I found methods such as AAindex and PSSM being used to build the training data, but I was unable to find a detailed description of the method behind them. Please suggest some papers or links that could be helpful.


From the literature I found the following article:

VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines

It uses auto cross covariance (ACC). I have written the following Python code to calculate it. Please let me know whether it is working correctly.

http://biotoolsinsilico.blogspot.in/2014/07/auto-cross-covariance-python.html

 

import numpy as np

# The z1, z2 and z3 descriptors (z-scales) represent the protein sequence.
# j indexes the z-scales (j = 1, 2, 3),
# n is the number of amino acids in the sequence,
# i is the amino acid position (i = 1, 2, ..., n),
# l is the lag (l = 1, 2, ..., L); a short range of lags is used (L = 5).
Z = np.random.rand(3, 80)  # placeholder data: 3 z-scales x 80 residues
print(Z)
# Z = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
n = Z.shape[1]
print(n)
L = 5

# Auto covariance: the same z-scale j at positions i and i + l
column = []
for j in range(3):
    row = []
    for l in range(1, L + 1):
        summ = 0
        for i in range(n - l):
            summ += Z[j, i] * Z[j, i + l] / (n - l)
        row.append(summ)
    column.append(row)

R = np.array(column)
print(R)

# Cross covariance: z-scale j at position i, z-scale k at position i + l
ja = [0, 1, 2, 0, 1, 2]
ka = [1, 0, 0, 2, 2, 1]
column = []
for j, k in zip(ja, ka):
    row = []
    for l in range(1, L + 1):
        summ = 0
        for i in range(n - l):
            summ += Z[j, i] * Z[k, i + l] / (n - l)
        row.append(summ)
    column.append(row)

C = np.array(column)
print(C)
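
The lagged sums described in the comments above (Z[j, i] times Z[k, i + l], averaged over n - l positions) can also be written with NumPy slicing instead of explicit position loops. This is only a minimal sketch, assuming Z is a descriptors-by-positions array; the function name acc_features and the choice to return every (j, k) pair in one flat vector (auto terms where j == k, cross terms otherwise) are illustrative and not taken from the VaxiJen paper.

```python
import numpy as np

def acc_features(Z, max_lag=5):
    """Auto and cross covariance of a (descriptors x positions) matrix.

    Returns a flat vector of length d * d * max_lag, where entry
    (j, k, lag) is sum_i Z[j, i] * Z[k, i + lag] / (n - lag).
    """
    Z = np.asarray(Z, dtype=float)
    d, n = Z.shape
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    features = np.empty((d, d, max_lag))
    for lag in range(1, max_lag + 1):
        left = Z[:, : n - lag]    # values at positions i
        right = Z[:, lag:]        # values at positions i + lag
        # (d x d) matrix: rows index j, columns index k
        features[:, :, lag - 1] = left @ right.T / (n - lag)
    return features.reshape(-1)

rng = np.random.default_rng(0)
Z = rng.random((3, 80))           # placeholder for real z-scale values
vec = acc_features(Z, max_lag=5)
print(vec.shape)                  # (45,)
```

Because the output length depends only on the number of descriptors and on max_lag, sequences of different lengths map to vectors of the same size, which is what most classifiers expect.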
    

 

(comment written 4.2 years ago by insilico123)
Quak (United States) wrote, 4.2 years ago:

The features you can extract fall into two groups: 1) sequence-based features, and 2) features extracted from the predicted structure.

Amino acid composition, amino acid properties, amino acid distribution, etc. are in group one. There are mainly two R packages for this: seqinr, and BioSeqClass from Bioconductor. I attach a table from my thesis, which summarizes this topic.

I would recommend reading this paper; its method is implemented in the BioSeqClass package.

Prediction of protein folding class using global description of amino acid sequence. PNAS, 92(19):8700–8704, 1995 (BioSeqClass - Bioconductor package)
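
As an illustration of a group-one feature, amino acid composition can be computed in a few lines. This is a minimal Python sketch and not the BioSeqClass implementation; the function name, the residue ordering, and the decision to ignore non-standard characters (such as X) are assumptions made for the example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid, in AMINO_ACIDS order."""
    counts = Counter(sequence.upper())
    # Only standard residues contribute to the denominator (assumption).
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

features = amino_acid_composition("MKTAYIAKQR")  # toy sequence, not real data
print(len(features))                             # 20
```

The result is a 20-dimensional vector whatever the sequence length, so it can be fed directly to a classifier.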


Hi,

I am trying to get a secondary structure prediction in R. I ran predictPROTEUS from BioSeqClass:

 PROTEUS = predictPROTEUS(proteinSeq[1:2],proteus2.organism="euk")
Error in file(file, "rt") : cannot open the connection
In addition: Warning messages:
1: running command 'perl C:\Users\D58B~1\AppData\Local\Temp\Rtmpu8B2vI\file1728128132c4.pl' had status 127 
2: In file(file, "rt") :
  cannot open file 'C:\Users\D58B~1\AppData\Local\Temp\Rtmpu8B2vI\file1728128132c4.proteus2': No such file or directory

Any suggestions?

(comment written 3.4 years ago by m016648)

Hi Quak,

I'm just starting on this topic and want to do something similar. I'm working in Python, writing descriptors for amino acid sequences. I saw the table from your thesis. My question is about your data, because my data consists of a lot of antibody sequences. Was your data heterogeneous? I mean in terms of sequence length: how does that affect the computation?

 

 

(comment written 3.4 years ago by victorfica)

Mine was enzyme families, so within families the sequences are homogeneous, but across families they are (relatively) heterogeneous.

If your sequences are homogeneous, the biological function is hidden in subtle amino acid differences. In other words, most of the features would be redundant. But you might be able to align them all and see what those subtle differences are...

If the sequences are heterogeneous, you will have an easier life, since the features are not redundant.

I don't think the length of the sequence matters unless you want to predict the structure...
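
The suggestion to align homogeneous sequences and look at the subtle differences can be sketched as follows. This is a minimal Python example, assuming the sequences have already been aligned by an external tool so that all strings have the same length (with '-' for gaps); the function names and the toy alignment are hypothetical.

```python
def variable_columns(aligned):
    """Return indices of alignment columns containing more than one symbol."""
    lengths = {len(s) for s in aligned}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return [i for i, col in enumerate(zip(*aligned)) if len(set(col)) > 1]

def encode_columns(aligned, columns):
    """One-hot encode the chosen columns; one row per sequence."""
    # Symbols observed in each chosen column, in a fixed (sorted) order.
    symbols = {i: sorted({s[i] for s in aligned}) for i in columns}
    rows = []
    for seq in aligned:
        row = []
        for i in columns:
            row.extend(1 if seq[i] == sym else 0 for sym in symbols[i])
        rows.append(row)
    return rows

aligned = ["ACDKF", "ACEKF", "ACD-F"]   # toy alignment, not real data
cols = variable_columns(aligned)         # conserved columns are dropped
X = encode_columns(aligned, cols)
print(cols)
print(X)
```

Dropping the conserved columns removes the redundant features, and the remaining positions give every sequence a vector of the same length.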

(comment written 3.3 years ago by Quak)
Ibrahim Tanyalcin (Belgium) wrote, 2.9 years ago:

The usefulness of my answer depends on how many different proteins you are interested in. For single proteins, you can generate circular graphs of your sequence using I-PV (http://i-pv.org/). You can then extract features either based on chemical property or by directly choosing amino acids.

In the first example I extract the sequence of aromatic residues, 50 amino acids per line. Watch it below:

http://i-pv.org/gifs/featureExtraction1.gif

In the second example I first select some amino acids to display on the text track, then make the font size a bit bigger. Next I show them on the scatter track underneath by clicking on "sequence display" in the drop-down menu. Finally, I extract these features based on sequence, 100 amino acids per line. Here is how I did it:

http://i-pv.org/gifs/featureExtraction2.gif

I hope this helps,

Good luck,

allmotog wrote, 7 months ago:

In the first answer, the link to "a table from my thesis" is not working. Please send me the table: allmotog@gmail.com. Thanks in advance.

"Amino acid composition, amino acid property, amino acid distribution and etc are in group one. There are mainly two R packages, Seqinr and BioSeqClass from Bioconductor. I attach a table from my thesis, which summarize this threat."
