Question: Protein Sequence Descriptors
8.5 years ago by
Funuser10 wrote:

I am looking for a way to describe protein sequences for use with a neural network. However, I am still missing descriptors I can use. Do you know of any free descriptors or, better, a descriptor package I could use?

Edit: My actual problem is this: different parts of the sequence have an influence on the protein. I want to go over the sequence and predict this influence. The influence can be an assay result or really anything. In the end, a model should have been trained for this that can predict the influence for a new sequence. As I am only doing this to get acquainted with Weka and related tools, I don't really have an idea what to use as the assay :).

sequence protein analysis • 3.4k views
modified 8.5 years ago by Khader Shameer18k • written 8.5 years ago by Funuser10
8.5 years ago by
Chris1.6k wrote:

I assume you're talking about feature extraction? I once wrote a Python tool that turns sequence-based features (predicted secondary structure, solvent accessibility, evolutionary information, predicted PPI interfaces, Pfam data, biochemical propensities...) into position-specific, normalized numerical features. Output formats are Weka ARFF and libsvm/liblinear-compliant datasets. However, this tool depends heavily on predictprotein [1], a command-line wrapper for all kinds of sequence-based predictors developed in our group. It's available as a machine image (a complete Linux OS) or as Debian packages. Let me know if this sounds appealing to you.
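To make the two output formats concrete, here is a minimal sketch of how per-residue features could be min-max normalized and written as libsvm lines or as a Weka ARFF file. This is not the tool described above; the function names, the feature names and the example values are all made up for illustration.

```python
def min_max_normalize(rows):
    """Scale each feature column to [0, 1]; constant columns become 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def to_libsvm(rows, labels):
    """One line per residue: '<label> 1:<v1> 2:<v2> ...' (1-based indices)."""
    lines = []
    for label, row in zip(labels, rows):
        feats = " ".join(f"{i}:{v:g}" for i, v in enumerate(row, start=1))
        lines.append(f"{label} {feats}")
    return "\n".join(lines)


def to_arff(rows, labels, feature_names, relation="residues"):
    """Numeric attributes plus a nominal class attribute, then the data rows."""
    class_values = ",".join(str(c) for c in sorted(set(labels)))
    header = [f"@RELATION {relation}", ""]
    header += [f"@ATTRIBUTE {name} NUMERIC" for name in feature_names]
    header += [f"@ATTRIBUTE class {{{class_values}}}", "", "@DATA"]
    data = [
        ",".join(f"{v:g}" for v in row) + f",{label}"
        for row, label in zip(rows, labels)
    ]
    return "\n".join(header + data)


# Hypothetical raw features for three residues (values invented for the example).
raw = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
labels = [1, 0, 1]
normalized = min_max_normalize(raw)

print(to_libsvm(normalized, labels))
print(to_arff(normalized, labels, ["feature_a", "feature_b"]))
```

Each row stands for one sequence position, so the resulting file can be loaded directly by Weka (ARFF) or by the libsvm/liblinear command-line tools.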


written 8.5 years ago by Chris1.6k

Sounds good, actually. I would love to play around with Weka. How could I get this toolchain running?

written 8.5 years ago by Funuser10

Try getting the predictprotein image running. Let me know when you have succeeded and contact me again (see my webpage).

written 8.5 years ago by Chris1.6k

Thanks a lot, will do once I get it running :)

written 8.5 years ago by Funuser10

Hi Chris!

The Python tool you mention seems interesting and I would like to explore it for my work on disease/druggability gene predictions. However, I would appreciate it if you could help me get started with using the tool in batch mode, because my current set consists of ~20k proteins with unique UniProt IDs.

written 6.4 years ago by kandoigaurav150
8.5 years ago by
Manhattan, NY
Khader Shameer18k wrote:

You can use the AAINDEX database to derive descriptors from a protein sequence. AAINDEX provides amino acid indices, substitution matrices and pairwise contact potentials.

Background on amino acid index from AAINDEX:

AAindex is a database of numerical indices representing various physicochemical and biochemical properties of amino acids and pairs of amino acids. AAindex consists of three sections now: AAindex1 for the amino acid index of 20 numerical values, AAindex2 for the amino acid mutation matrix and AAindex3 for the statistical protein contact potentials. All data are derived from published literature.

A manuscript describing the current version of AAINDEX is available here.

The current version (ver. 9.1) includes 544 amino acid indices, 94 amino acid mutation matrices and 47 contact potential matrices.

You can use this data as a normalized score for the whole protein chain or use it to derive hybrid features. Please refer to the following papers, which used AAINDEX-derived features/descriptors to develop Support Vector Machine and Random Forest based machine learning approaches for the prediction of 3D domain swapping.
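As a rough illustration of the "score for the whole protein chain" idea, the sketch below averages one amino acid index over a sequence and also over sliding windows, which gives position-specific values. It uses the Kyte-Doolittle hydropathy scale, which to my knowledge is one of the AAindex1 entries (accession KYTJ820101); the function names and the example peptides are hypothetical, and in practice you would parse the index of your choice from the AAindex files instead of hard-coding it.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982), one per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def chain_mean(sequence, index):
    """Average index value over the whole chain (one descriptor per protein)."""
    values = [index[aa] for aa in sequence]  # KeyError on non-standard residues
    return sum(values) / len(values)


def window_means(sequence, index, window=5):
    """Average index value in each sliding window (one descriptor per window)."""
    values = [index[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Made-up example peptide, only to show the calls.
peptide = "ACDEFGH"
print(chain_mean(peptide, KYTE_DOOLITTLE))
print(window_means(peptide, KYTE_DOOLITTLE, window=3))
```

Repeating this for several indices gives a fixed-length feature vector per protein (chain means) or per position (window means), which can then be fed to an SVM, a Random Forest or a neural network.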

written 8.5 years ago by Khader Shameer18k

The question now is: how do I form this into a model I can use?

written 8.5 years ago by Funuser10

Funuser: IMHO, that should be a separate question.

written 8.5 years ago by Khader Shameer18k