Question: Support vector machine in bioinformatics.
5.2 years ago by
United States
pujapatel5400 (0) wrote:

Hello all,

My background is in computer science and data analytics. I have learned some machine learning and am trying to solve a biological problem with a support vector machine (SVM). I have a data set of amino acid sequences. Proteins are built from 20 standard amino acids, and for each sequence I have computed the amino acid composition: I count how many times each of the 20 amino acids occurs and express each count as a percentage of the sequence. Now I have to build an SVM model using these compositions as features. Can anyone give me some idea of how to build the SVM model?

Thanks in advance

The sequences look like this:

>HMPREF9352_0002 rod shape-determining protein MreD [Streptococcus gallolyticus subsp. gallolyticus TX20005]

From an amino acid sequence like the one above, I computed the amino acid composition.
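The composition described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `aa_composition`, the fixed alphabetical residue order, and the short peptide used in the example are all made up for this sketch and are not the poster's data or code. Characters outside the 20 standard one-letter codes are ignored.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes, in a fixed (alphabetical) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Return the percentage of each standard amino acid in `sequence`.

    The result is a list of 20 floats in the order of AMINO_ACIDS.
    Non-standard characters (e.g. X, *, whitespace) are ignored.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy peptide, not a real protein from the thread.
toy_sequence = "MKTLLVAGAL"
features = aa_composition(toy_sequence)

# Show only the residues that actually occur.
for aa, pct in zip(AMINO_ACIDS, features):
    if pct > 0:
        print(aa, pct)
```

Each sequence becomes one row of 20 numbers, so a data set of n sequences becomes an n x 20 feature matrix.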


machine learning matlab svm R • 2.1k views
modified 5.2 years ago by Jean-Karim Heriche (23k) • written 5.2 years ago by pujapatel5400 (0)

Could you please give a better outline of the problem you are working on? It would be helpful if you could answer the following questions:

1) What is the sample size of the dataset that you have?

2) What is the SVM model trying to predict? Or is it trying to classify your dataset into categories?

3) What is the dataset like? Attach a sample of the dataset in your reply if you can.


Before building an SVM model, the most important thing to know is whether an SVM is suited to the problem or whether other machine learning techniques would be more useful. Moreover, in any machine learning problem, it is very important to first get an idea of what function we are trying to learn, what the data set is like, what the limitations of the data at hand are, and what features would be best, given the nature of the dataset and the function being predicted :)


written 5.2 years ago by spandan.madan (60)

What biological problem are you trying to solve?  What data do you have?  What outcome are you trying to predict using SVM?

written 5.2 years ago by Sean Davis (26k)

Hello sir, thank you for replying. I have a number of amino acid sequences, and from them I computed the composition, that is, the count of each amino acid as a percentage. Right now I want to build an SVM model. For example, I have 6 sequences and their compositions as input, and of those 6 sequences I want to take 3 as positive and 3 as negative. Which ones are positive and which are negative is not decided by the train function. The composition itself serves as the feature for the SVM.

written 5.2 years ago by pujapatel5400 (0)

You still haven't told us what question you're trying to address. Are you trying to classify proteins (into two or more classes, and if so, which ones?) or are you trying to predict some property from the sequence?

written 5.2 years ago by Jean-Karim Heriche (23k)

If you have only 6 sequences, machine learning approaches really do not apply.  Is this just a subset of a much larger dataset?  

written 5.2 years ago by Sean Davis (26k)

Hello sir,

What I want to do is this: we have 6 sequences; let's say that 3 of these are positive cases while the other 3 are negative. Now, based on the AA composition of these sequences, we need to build a model using an SVM. This model would be used to predict the class of new sequences, e.g. given a new sequence, we should be able to predict whether it belongs to the positive or the negative class.

The sequences look like the one mentioned above, and from them I computed the amino acid composition.

modified 5.2 years ago • written 5.2 years ago by pujapatel5400 (0)

It sounds like a homework problem, 6 sequences and 3 positive 3 negative is a textbook ML question. Real problems would have an odd number of each. 

written 5.2 years ago by karl.stamm (3.8k)

This is obviously an assignment.

written 5.2 years ago by SmallChess (530)

Do you know what you are talking about? Machine learning never applies to such a ridiculously small amount of data. I think you've misunderstood the question or your intention.

written 5.2 years ago by SmallChess (530)

Homework problems are often toy-sized to help the student see all the moving parts and handle all decisions with pencil and paper. When teaching matrix multiply, you don't give a 100x100, you give a 3x3.  

The problem really is that it feels like pujapatel doesn't know where to start, how to represent her data, basic computer science stuff. 

written 5.2 years ago by karl.stamm (3.8k)

If you're doing this to learn how to make an SVM, you should look up a tutorial for the e1071 R package on Google. If this is not for an assignment or just to learn, then I suggest you don't use an SVM. If I assume your sequences are at least 50 amino acids long and there are 20 amino acids, a conjecture used by some machine learning groups says that you will need AT LEAST 50 * log2(20) samples, which is roughly 216. And this is only to get a representative sample of your sample space. So I suggest you drop the idea of an SVM, or even a neural network, in that case. In biological terms, you need more samples to make a robust prediction :)
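The arithmetic behind the 50 * log2(20) estimate can be checked with a short snippet. The variable names are illustrative only, and the rule itself is the conjecture quoted in this comment, not an established bound.

```python
import math

sequence_length = 50   # assumed minimum sequence length, as in the comment
alphabet_size = 20     # number of standard amino acids

# log2(20) is the number of bits needed to encode one residue (about 4.32).
bits_per_residue = math.log2(alphabet_size)

# The quoted rule of thumb: sequence length times bits per residue.
min_samples = sequence_length * bits_per_residue

print(round(bits_per_residue, 2))  # 4.32
print(round(min_samples))          # 216
```

Either way, the estimate is far above the 6 sequences discussed in this thread.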

written 5.2 years ago by spandan.madan (60)
5.2 years ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche (23k) wrote:

If I understand correctly, you want to classify proteins into two classes, using the percentages of amino acids as the feature vector that represents each protein. However, as mentioned in the comments above, if you only have 6 samples in your training set, you won't be able to train a reliable classifier. As a rule of thumb, you need at least as many training samples as there are features (here 20), and ideally far more. Also as mentioned above, I would consider whether amino acid percentages are suitable features for the classes you're interested in. Anyway, for a small training set, I would try LDA or logistic regression instead of an SVM. For SVM, you may find this guide useful.

written 5.2 years ago by Jean-Karim Heriche (23k)

Is that feature vector one-dimensional? Or the percentages of amino acids could be a length-23 vector of real numbers in [0, 1]. Then the result classes would be "stuff with helices" vs "stuff with beta sheets".

written 5.2 years ago by karl.stamm (3.8k)

Hello ,

Thank you for your reply.

That SVM guide is closely related to my problem and I went through it, but I have 20 amino acids as features and could not work out how to start coding with 20 features. Could you please give me a little idea about that?

Thank you so much for your reply and for any further replies.
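As a rough starting point for coding with 20 features: each sequence becomes one row of 20 numbers, each row gets a label (+1 or -1), and the classifier learns one weight per feature plus a bias. The sketch below is a minimal, dependency-free linear SVM trained by subgradient descent on the hinge loss. It is not what e1071 or any other library does internally, and it is not the method of anyone in this thread. The six toy sequences, their labels, the function names, and the settings `lam`, `lr` and `epochs` are all invented for illustration, and fractions are used instead of percentages to keep the step size simple.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def composition(sequence):
    """Return 20 fractions (summing to 1) in the order of AMINO_ACIDS."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]


def decision_value(w, b, x):
    """Linear decision function w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=1000):
    """Minimise lam/2 * |w|^2 + mean hinge loss by full-batch subgradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Only samples inside the margin contribute to the hinge subgradient.
            if yi * decision_value(w, b, xi) < 1:
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical toy data: 3 "positive" and 3 "negative" sequences (made up).
positives = ["AAALLLAAGL", "ALALALAVAL", "LLAALLAAIL"]
negatives = ["KKEEKKEEDK", "EKEKEKEKRE", "KKKEEEKDEK"]

X = [composition(s) for s in positives + negatives]   # 6 rows x 20 features
y = [1] * len(positives) + [-1] * len(negatives)      # class labels

w, b = train_linear_svm(X, y)

# Classify a new (also made-up) sequence from its composition.
print(predict(w, b, composition("AALLAALLAA")))
```

With only six training sequences this shows the mechanics and nothing more; as the replies above point out, a classifier trained on so few samples should not be trusted on real data.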

written 5.2 years ago by pujapatel5400 (0)