Question

Confused about data sets for SVM algorithm

0

7.7 years ago

spacegeek1212 ▴ 30

I want to run a binary SVM classifier on a dataset of about 10,000 small-molecule drugs. The classifier should determine which of these drugs are structural analogues of a specific drug. Each instance (drug) will have five attributes, each holding some type of molecular descriptor.

For the training set, should I train the classifier on a dataset of drugs that are already known to be structural analogues of the drug? More specifically, a dataset with an equal number of known structural analogues and non-analogues? And do I then test my classifier on the 10,000 small-molecule drugs that have not yet been determined to be analogues?

svm support vector machine molecules • 1.7k views

ADD COMMENT • link updated 5.7 years ago by Biostar 20 • written 7.7 years ago by spacegeek1212 ▴ 30

score 2 · Answer 1 · 2016-08-11

2

7.7 years ago

Asaf 10k

You'd better start by defining the question and then see which tool (or tools) is most appropriate. You might find that unsupervised learning (clustering) works better for you.
That said, the input to an SVM (the training set) is a table of features for the observations plus a binary vector specifying the group of each observation. The table must have one row per entry in the vector. You can train your classifier on this data; people usually run a "leave X out" cross-validation to test the classifier's performance. The procedure trains the classifier on part of the training set and tests it on the part that was set aside, repeating so that each part is held out in turn.
If you are convinced that your classifier is good, you can then predict the results for the 10,000 small molecules you are testing.
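A minimal sketch of the workflow above, assuming Python with scikit-learn (neither is mentioned in the thread). The data are synthetic random numbers standing in for real molecular descriptors, and the kernel, `C` value, scaling step and 5-fold split are illustrative choices rather than recommendations.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for a labelled training set:
# 100 known analogues and 100 known non-analogues, 5 descriptors each.
n_per_class = 100
X_analogues = rng.normal(loc=1.0, scale=1.0, size=(n_per_class, 5))
X_others = rng.normal(loc=-1.0, scale=1.0, size=(n_per_class, 5))
X_train = np.vstack([X_analogues, X_others])
y_train = np.array([1] * n_per_class + [0] * n_per_class)  # 1 = analogue

# The feature table and the label vector must have the same number of rows.
assert X_train.shape[0] == y_train.shape[0]

# Scale the descriptors, then fit an SVM (hyperparameters are assumptions).
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# "Leave X out" cross-validation: each fold is held out once for testing.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=cv)
print("mean cross-validated accuracy:", scores.mean())

# If the scores look acceptable, fit on all labelled data and
# predict the unlabelled molecules (random placeholders here).
model.fit(X_train, y_train)
X_unlabelled = rng.normal(size=(10_000, 5))
predictions = model.predict(X_unlabelled)
```

With real data, `X_train` would hold the descriptors of molecules whose analogue status is already known, and `X_unlabelled` the descriptors of the 10,000 molecules to be screened.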

ADD COMMENT • link 7.7 years ago by Asaf 10k

0

Thanks! So would I have to manually specify the group of each observation?

ADD REPLY • link 7.7 years ago by spacegeek1212 ▴ 30

1

You should already have this information. The groups are analogues and non-analogues. If you don't have these labels for a set of molecules, then you don't have a training set.
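
As a rough illustration of what "having this information" means in practice, here is a small Python sketch that assembles the feature table and the matching label vector from two lists of molecules whose status is already known. The descriptor values and the variable names are made up for the example.

```python
# Hypothetical descriptor vectors (five values per molecule); values are made up.
known_analogues = [
    [0.9, 1.2, 0.8, 1.1, 1.0],
    [1.1, 0.7, 1.3, 0.9, 1.2],
]
known_non_analogues = [
    [-1.0, -0.8, -1.2, -0.9, -1.1],
    [-0.7, -1.3, -0.9, -1.0, -0.8],
]

# Feature table: one row per molecule.
features = known_analogues + known_non_analogues

# Label vector: 1 for analogues, 0 for non-analogues, in the same row order.
labels = [1] * len(known_analogues) + [0] * len(known_non_analogues)

# One label per row, and five descriptors per molecule.
assert len(features) == len(labels)
assert all(len(row) == 5 for row in features)
```

The labels come from prior knowledge about each molecule, not from the classifier; molecules without a known label belong in the set to be predicted, not in the training set.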

ADD REPLY • link 7.7 years ago by Asaf 10k