Confused about data sets for SVM algorithm
7.7 years ago

I am seeking to run a binary SVM classifier on a dataset of about 10,000 small molecule drugs. I want the classifier to determine which of these drugs are structural analogues of a specific drug. Each instance, or drug, will have five attributes, each containing some type of molecular descriptor.

For the training set, should I train the classifier on a dataset of drugs that are already known to be structural analogues of the drug? More specifically, a dataset with equal numbers of known structural analogues and non-analogues? And then do I test my classifier on the 10,000 small molecule drugs which have not yet been determined to be analogues?

svm support vector machine molecules • 1.7k views
7.7 years ago
Asaf 10k

You'd better start by defining the question and then see which tool (or tools) is most appropriate. You might find that unsupervised learning (clustering) works better for you.
That said, the input to an SVM (the training set) is a table of features for the observations plus a binary vector specifying the group of each observation. The table must have as many rows as the vector has entries. You can train your classifier on this data; people usually do a "leave X out" cross-validation to test the classifier's performance. The procedure trains the classifier on part of the training set and tests it on the part left aside.
Once you are convinced that your classifier is good, you can predict the results for the 10,000 small molecules you are testing.
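
A minimal sketch of that workflow, assuming Python with NumPy and scikit-learn (neither is named in this thread). The descriptor values are random placeholders rather than real molecular data, and the set sizes, kernel, and 5-fold split are illustrative choices only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical labelled training set: 50 known analogues and 50 known
# non-analogues, each described by 5 molecular descriptors (placeholder values).
X_analogues = rng.normal(loc=1.0, scale=1.0, size=(50, 5))
X_non_analogues = rng.normal(loc=-1.0, scale=1.0, size=(50, 5))

# Feature table (one row per drug) and binary group vector (1 = analogue).
X_train = np.vstack([X_analogues, X_non_analogues])
y_train = np.array([1] * 50 + [0] * 50)
assert X_train.shape[0] == y_train.shape[0]  # rows must match labels

# Scale the descriptors, then fit an SVM (RBF kernel is an assumption).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# "Leave X out" cross-validation: train on 4/5 of the data, test on the rest.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X_train, y_train, cv=cv)
print("accuracy per fold:", scores)

# If the scores look acceptable, fit on all labelled data and predict
# the group of the 10,000 unlabelled molecules (placeholder values again).
clf.fit(X_train, y_train)
X_unlabelled = rng.normal(size=(10_000, 5))
predictions = clf.predict(X_unlabelled)
print("predicted analogues:", int(predictions.sum()))
```

With real data, `X_train` would hold the descriptors of molecules whose analogue status is already known, and `X_unlabelled` the descriptors of the 10,000 candidates.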


Thanks! So would I have to manually specify the group of each observation?


You should already have this information. The groups are analogues and non-analogues. If you don't have these labels for a set of molecules, then you don't have a training set.
