Question

Support vector machine in bioinformatics.

0

8.8 years ago

pujapatel5400 • 0

Hello all,

My background is in computer science and data analytics. I have studied machine learning, and I am working on a biological problem using a support vector machine (SVM). I have a data set of amino acid sequences. There are 20 standard amino acids, and each sequence is a string of these amino acids. I have computed the amino acid composition of each sequence: I count how often each amino acid occurs in the sequence and express each count as a percentage. Now I have to build an SVM model using these compositions as features. Can anyone give me some idea of how to build the SVM model?

Thanks in advance

The sequences look like this:

>HMPREF9352_0002 rod shape-determining protein MreD [Streptococcus gallolyticus subsp. gallolyticus TX20005]
MIKVKFYKNKYFLLLLLFLLMLIDGQLSFLASSIFSYHLKVSSHLLLLAVLYFYHDKNKY
FMFISSLVLGGIFDIYYLNRIGLVIFLLPILVIFTSKISKNFFVSNFQTLIFYIIVLFLF
EIVGELGAILLGMTTMSMTYFIAYCFAPTLIYNILMYLIFQKVFKKVFLES

From the amino acid sequence above, I computed the amino acid composition.
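As a minimal sketch of the composition step described in the question, the percentages can be computed with the Python standard library alone. The function name `aa_composition` and the alphabetical ordering of the 20 one-letter codes are assumptions made for illustration; the original post does not say how the composition was computed.

```python
from collections import Counter

# The 20 standard amino acids as one-letter codes, in a fixed order so that
# every sequence maps to a feature vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(sequence):
    """Return the percentage of each standard amino acid in `sequence`."""
    counts = Counter(sequence.upper())
    # Only standard residues contribute to the total (an assumption).
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]

# The MreD sequence from the post, joined into a single string.
mred = (
    "MIKVKFYKNKYFLLLLLFLLMLIDGQLSFLASSIFSYHLKVSSHLLLLAVLYFYHDKNKY"
    "FMFISSLVLGGIFDIYYLNRIGLVIFLLPILVIFTSKISKNFFVSNFQTLIFYIIVLFLF"
    "EIVGELGAILLGMTTMSMTYFIAYCFAPTLIYNILMYLIFQKVFKKVFLES"
)

features = aa_composition(mred)
for aa, pct in zip(AMINO_ACIDS, features):
    print(f"{aa}: {pct:.2f}%")
```

Each sequence then becomes one 20-number feature vector, which is the form a classifier expects.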

svm r matlab machine-learning • 3.5k views

ADD COMMENT • link updated 17 months ago by Ram 43k • written 8.8 years ago by pujapatel5400 • 0

6

Could you please give a better outline of the problem you are working on? It would be helpful if you could answer the following questions:

What is the sample size of your dataset?
What is the SVM model trying to predict? Or is it trying to classify your dataset into categories?
What is the dataset like? Attach a sample of the dataset in your reply if you can.

Before building an SVM model, the most important thing to know is whether an SVM is suited to the problem or whether other machine learning techniques would be more useful. Moreover, in any machine learning problem, it is very important to first get an idea of what function we are trying to learn, what the dataset is like, what the limitations of the data at hand are, and what features would be best, given the nature of the dataset and the function being predicted :)

ADD REPLY • link updated 17 months ago by Ram 43k • written 8.8 years ago by spandan.madan ▴ 60

2

What biological problem are you trying to solve? What data do you have? What outcome are you trying to predict using SVM?

ADD REPLY • link updated 17 months ago by Ram 43k • written 8.8 years ago by Sean Davis 26k

0

Hello sir,

Thank you for replying.

I have a number of amino acid sequences, and from them I computed the composition, that is, the count of each amino acid as a percentage.

Right now I want to build an SVM model.

For example, I have 6 sequences and have computed their compositions, so these are my inputs. Of those 6 sequences, I want to take 3 as positive and 3 as negative. Which ones are positive and which are negative is not decided by the training function. For the SVM, the composition itself serves as the feature.

ADD REPLY • link updated 17 months ago by Ram 43k • written 8.8 years ago by pujapatel5400 • 0

2

You still haven't told us what the question you're trying to address is. Are you trying to classify proteins (into two or more classes, if so which ones?) or are you trying to predict some property from the sequence?

ADD REPLY • link updated 17 months ago by Ram 43k • written 8.8 years ago by Jean-Karim Heriche 27k

1

If you have only 6 sequences, machine learning approaches really do not apply. Is this just a subset of a much larger dataset?

ADD REPLY • link updated 17 months ago by Ram 43k • written 8.8 years ago by Sean Davis 26k

0

Hello sir,

What I want to do is this: say we have 6 sequences, 3 of which are positive cases and the other 3 negative. Based on the amino acid (AA) composition of these sequences, we need to build a model using SVM. This model would then be used to predict the class of new sequences: given a new sequence, we should be able to predict whether it belongs to the positive or the negative class.

The sequences look like the one shown above, and from them I computed the amino acid composition.
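One way to picture the labelled training set described here is as a feature matrix `X` (one row of 20 percentages per sequence) plus a label list `y` (+1 or -1 per sequence). The sketch below assumes FASTA-formatted input like the example in the question; the record names, the toy sequences, and the `labels` dictionary are hypothetical, since the thread never says what the positive and negative classes are.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def composition(seq):
    """Percentage of each standard amino acid (assumes standard residues only)."""
    return [100.0 * seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

# Hypothetical toy records, not real proteins.
fasta_text = """>pos_1 hypothetical
LLLLLLAVFI
>pos_2 hypothetical
LLLLL
AAVVF
>neg_1 hypothetical
DDDEEEGSKK
>neg_2 hypothetical
DDEEEEGSSK
"""

# The class labels come from the user, not from the training function.
labels = {"pos_1": 1, "pos_2": 1, "neg_1": -1, "neg_2": -1}

X, y = [], []
for header, seq in parse_fasta(fasta_text):
    name = header.split()[0]      # first token of the header line
    X.append(composition(seq))    # one 20-feature row per sequence
    y.append(labels[name])

print(len(X), "sequences,", len(X[0]), "features each, labels:", y)
```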

ADD REPLY • link updated 17 months ago by Ram 43k • written 8.8 years ago by pujapatel5400 • 0

1

It sounds like a homework problem: 6 sequences, 3 positive and 3 negative, is a textbook ML question. Real problems would have an odd number of each.

ADD REPLY • link updated 17 months ago by Ram 43k • written 8.8 years ago by karl.stamm 4.1k

1

This is obviously an assignment.

ADD REPLY • link 8.8 years ago by scchess ▴ 640

0

Do you know what you are talking about? Machine learning never applies to such a ridiculously small amount of data. I think you've misunderstood the question or your intention.

ADD REPLY • link 8.8 years ago by scchess ▴ 640

0

Homework problems are often toy-sized to help the student see all the moving parts and handle all decisions with pencil and paper. When teaching matrix multiply, you don't give a 100x100, you give a 3x3.

The real problem is that pujapatel doesn't seem to know where to start or how to represent her data, which is basic computer science stuff.

ADD REPLY • link updated 17 months ago by Ram 43k • written 8.8 years ago by karl.stamm 4.1k

0

If you're doing this to learn how to build an SVM, you should look up a tutorial for e1071 on Google. If this is not for an assignment or just for learning, then I suggest you don't use an SVM. If I assume your sequences are at least 50 amino acids long, and given that there are 20 amino acids, a conjecture used by some machine learning groups says that you will need AT LEAST 50 * log2(20) samples, which is about 216. And this is only to get a representative sample of your sample space. So I suggest you drop the idea of an SVM, or even a neural network, in that case. In biological terms, you need more samples to make a robust prediction :)
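For reference, the expression 50 * log2(20) quoted in this comment can be evaluated directly. This only checks the arithmetic under the comment's own assumptions (sequence length 50, alphabet of 20); it says nothing about whether the conjecture itself is a good guide.

```python
import math

seq_length = 50      # minimum sequence length assumed in the comment
alphabet_size = 20   # number of standard amino acids

min_samples = seq_length * math.log2(alphabet_size)
print(f"{min_samples:.1f}")  # about 216.1
```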

ADD REPLY • link updated 17 months ago by Ram 43k • written 8.8 years ago by spandan.madan ▴ 60

Ram · Answer 1 · 2015-07-19

2

8.8 years ago

Jean-Karim Heriche 27k

If I understand correctly, you want to classify proteins into two classes, using the percentages of amino acids as the feature vector representing each protein. However, as mentioned in the comments above, if you only have 6 samples in your training set, you won't be able to train a reliable classifier. As a rule of thumb, the number of training samples should be at least on the order of the size of the feature vector, and ideally far greater. Also as mentioned above, I would consider whether amino acid percentages are suitable features for the classes you're interested in. Anyway, for a small training set, I would try LDA or logistic regression instead of SVM. For SVM, you may find this guide useful.

ADD COMMENT • link updated 17 months ago by Ram 43k • written 8.8 years ago by Jean-Karim Heriche 27k

1

Is that feature vector one-dimensional? Or the percentages of amino acids could be a length-23 vector of real numbers in [0, 1]. Then the resulting classes would be "stuff with helices" vs "stuff with beta sheets".

ADD REPLY • link updated 17 months ago by Ram 43k • written 8.8 years ago by karl.stamm 4.1k

0

Hello,

Thank you for your reply.

That SVM paper is closely related to my problem. I went through it, but I have 20 amino acids as features and could not work out how to start coding with 20 features. Can you please give me some idea about that?

Thank you so much for your reply, and for any further replies.
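To make the "20 features" question concrete, here is a teaching sketch of a linear SVM trained by stochastic subgradient descent on the regularised hinge loss, written in plain Python. It is not the algorithm used by e1071 or any other library mentioned in this thread, and in practice a library implementation would be the usual choice. The toy sequences, the class labels, and the hyperparameters (`lam`, `lr`, `epochs`) are all hypothetical, and, as several replies point out, six samples are far too few for a reliable classifier.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each standard amino acid: one 20-number feature vector."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200, seed=0):
    """Minimise lam/2 * |w|^2 + hinge loss by stochastic subgradient descent."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (dot(w, X[i]) + b)
            # The regularisation term shrinks w on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            # The hinge term only contributes when the margin is below 1.
            if margin < 1.0:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical toy data: 3 "positive" and 3 "negative" sequences.
positives = ["LLLLLLAVFI", "LLLLLAAVVF", "LLLLLLLAIF"]
negatives = ["DDDEEEGSKK", "DDEEEEGSSK", "DDDDEEGGSK"]

X = [composition(s) for s in positives + negatives]
y = [1] * len(positives) + [-1] * len(negatives)

w, b = train_linear_svm(X, y)

# Classify a new (also hypothetical) sequence from its composition.
new_sequence = "LLLLAAVVFI"
print(new_sequence, "->", predict(w, b, composition(new_sequence)))
```

Nothing in the training loop depends on the number of features: with 20 amino acids, `w` simply has 20 entries, one weight per amino acid.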

ADD REPLY • link updated 17 months ago by Ram 43k • written 8.8 years ago by pujapatel5400 • 0