
How can I extract features from protein sequences so that each sequence can be converted into a vector for training a machine learning model? In some papers I found methods that use AAindex or PSSM to build the training data, but I was unable to find a detailed description of how this is done. Please suggest some papers or links that would be helpful.
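To make concrete what "converting a sequence into a vector" can mean, here is a minimal sketch, not taken from any of the papers. It assumes two simple descriptors: the amino acid composition (20 fractions) and the mean Kyte-Doolittle hydropathy, which is the kind of per-residue property scale that AAindex collects. The function name `sequence_features` is made up for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def sequence_features(seq):
    """Hypothetical helper: 20 composition fractions + mean hydropathy."""
    # Keep only the 20 standard residues; anything else (X, B, Z, ...) is skipped
    residues = [aa for aa in seq.upper() if aa in KD_HYDROPATHY]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    composition = [counts[aa] / len(residues) for aa in AMINO_ACIDS]
    mean_hydropathy = sum(KD_HYDROPATHY[aa] for aa in residues) / len(residues)
    return composition + [mean_hydropathy]

features = sequence_features("MKTAYIAKQR")
print(len(features))
```

Every sequence, whatever its length, maps to the same number of values (21 here), which is what most machine learning libraries expect as input.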

written 4.5 years ago by insilico123


From the literature I found the following article:

VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines

It uses auto cross covariance (ACC). I have written the following Python code to calculate it. Please let me know whether it is working correctly.

http://biotoolsinsilico.blogspot.in/2014/07/auto-cross-covariance-python.html
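For reference, these are the ACC terms as I understand them from the paper, written in the notation of the code comments (z-scale indices j and k, position i, lag l, sequence length n). Please correct me if I have misread the definition:

```latex
\begin{align}
A_{jj}(l) &= \sum_{i=1}^{n-l} \frac{Z_{j,i}\, Z_{j,i+l}}{n-l}, \\
C_{jk}(l) &= \sum_{i=1}^{n-l} \frac{Z_{j,i}\, Z_{k,i+l}}{n-l}, \qquad j \neq k.
\end{align}
```

With three z-scales and lags l = 1, ..., 5 this gives 3 × 5 autocovariance terms and 6 × 5 cross-covariance terms, that is 45 values per sequence regardless of its length.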

import numpy as np

# The z1, z2 and z3 descriptors (z-scales) represent the protein sequence.
# j indexes the z-scales (j = 1, 2, 3),
# n is the number of amino acids in the sequence,
# i is the amino acid position (i = 1, 2, ..., n),
# l is the lag (l = 1, 2, ..., L); a short range of lags is used (L = 5).

# Placeholder data: replace with the real 3 x n z-scale matrix of a sequence.
Z = np.random.rand(3, 80)
print(Z)
#Z = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])

n = Z.shape[1]  # sequence length (not reduced by 1; the loops handle the bounds)
print(n)
L = 5  # maximum lag

# Autocovariance: A_jj(l) = sum_i Z[j,i] * Z[j,i+l] / (n - l)
column = []
for j in range(0, 3):
    row = []
    for l in range(1, L + 1):      # lags 1..L, not 0..L-1
        summ = 0
        for i in range(0, n - l):  # i + l stays inside the sequence
            rightsum = (Z[j, i] * Z[j, i + l]) / (n - l)  # offset is the lag l, not 1
            summ = summ + rightsum
        row.append(summ)
    column.append(row)
R = np.array(column)  # shape (3, L)
print(R)

# Cross covariance: C_jk(l) = sum_i Z[j,i] * Z[k,i+l] / (n - l), with j != k
ja = [0, 1, 2, 0, 1, 2]
ka = [1, 0, 0, 2, 2, 1]
column = []
for j, k in zip(ja, ka):
    row = []
    for l in range(1, L + 1):
        summ = 0
        for i in range(0, n - l):
            rightsum = (Z[j, i] * Z[k, i + l]) / (n - l)
            summ = summ + rightsum
        row.append(summ)
    column.append(row)
C = np.array(column)  # shape (6, L)
print(C)
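Since the goal is a fixed-length vector for machine learning, the same terms can also be computed without explicit loops over positions. The following is only a sketch under my reading of the ACC definition (lags 1 to 5, division by n - l); the function name `acc_features` and the ordering of the output are my own choices, not something taken from the paper.

```python
import numpy as np

def acc_features(Z, max_lag=5):
    """Hypothetical helper: flatten all auto and cross covariance terms.

    Z is a (d x n) array with one row per descriptor (e.g. z1, z2, z3)
    and one column per residue.
    """
    Z = np.asarray(Z, dtype=float)
    d, n = Z.shape
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    features = []
    for lag in range(1, max_lag + 1):
        # M[j, k] = sum_i Z[j, i] * Z[k, i + lag] / (n - lag)
        # diagonal entries are autocovariances, off-diagonal ones cross covariances
        M = Z[:, :n - lag] @ Z[:, lag:].T / (n - lag)
        features.append(M.ravel())
    return np.concatenate(features)

rng = np.random.default_rng(0)
Z = rng.random((3, 80))  # placeholder for real z-scale values
vec = acc_features(Z)
print(vec.shape)  # (45,)
```

For lag l (counting from 1) and descriptors j, k (counting from 0), the term sits at index `(l - 1) * 9 + 3 * j + k` of the returned vector, so the values can be compared one by one with the loop-based R and C matrices.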
