I am currently building a binary classification model and have created an input file for svm-train. The input file has 453 lines, 4 features, and 2 classes (0 and 1), i.e.
0 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:40.91 3:36.36 4:0.0
1 1:31.82 2:27.27 3:22.73 4:18.18
0 1:22.73 2:13.64 3:36.36 4:27.27
1 1:30.43 2:39.13 3:13.04 4:17.39
...
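For context, a minimal sketch of how a file in this one-line-per-sample libsvm format can be written from plain label and feature arrays (the values below are placeholders, not my real data):

# Sketch: write vectors in libsvm's sparse "label index:value ..." format.
# Feature values here are invented placeholders.
labels = [0, 1, 1]
features = [
    [15.0, 40.0, 30.0, 15.0],
    [22.73, 40.91, 36.36, 0.0],
    [31.82, 27.27, 22.73, 18.18],
]
with open("train.libsvm", "w") as f:
    for label, vec in zip(labels, features):
        pairs = " ".join(f"{i}:{v}" for i, v in enumerate(vec, start=1))
        f.write(f"{label} {pairs}\n")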
My problem is that when I count the lines in the model generated by svm-train, it contains 12 fewer data vectors than the input file. The model file has 450 lines in total, of which the first 9 are a header showing the various parameters generated, i.e.
svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV
Therefore 12 lines in total from the original input of 453 have gone. I am new to SVMs and was hoping that someone could shed some light on why this might have happened.
Thanks in advance
Do you know what the output file is supposed to be/contain? This depends on the software implementation you're using, which you haven't told us. My guess is that it contains the model, and hence the vectors in it are the support vectors. If so, this means that your model needs almost all of the training data to represent the training set, which suggests either that there's room for improvement or that there's not much structure in the data.
Yes, the output file is the model. However, I now believe that in generating the model, it has removed lines where the label and all the feature values are exactly the same. To explain: my input is a set of miRNAs which have been classified as 1 or 0 depending on whether or not they are involved in a particular process (1 = yes, 0 = no). The input file looks something like:
0 1:22 2:30 3:14 4:16
1 1:26 2:15 3:17 4:25
0 1:22 2:30 3:14 4:16
Here, lines one and three are exactly the same and as a result will be removed from the output model. My question is then both why svm-train would do this and how I can get around it (whilst using the same features)?
In the output line you gave, I noticed this: total_sv 441, which means that your model has 441 support vectors. If your training set had 453 vectors, then 453 - 441 = 12 and there is nothing missing. I don't see why you would want the whole training set in the model. Now, if your input has duplicates, they are usually removed (but again, this is software dependent). So if the support vectors are the whole training set minus the duplicates, the model is probably not very good, or your data doesn't have enough structure to separate the two classes. Also, the formats of the input and output files depend on the software you're using.
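To make this concrete, here is a small sketch using scikit-learn's SVC, which wraps the same libsvm library underlying svm-train (the data below are random placeholders). It shows that the fitted model stores only the support vectors, yet still produces a prediction for every input vector:

import numpy as np
from sklearn.svm import SVC

# Placeholder data: 453 samples, 4 features, random labels.
rng = np.random.default_rng(0)
X = rng.random((453, 4)) * 100
y = rng.integers(0, 2, size=453)

model = SVC(kernel="rbf", gamma=1.0)
model.fit(X, y)

# The model keeps only the support vectors, typically fewer than 453.
print("support vectors kept:", model.support_vectors_.shape[0])
print("per class:", model.n_support_)

# Predictions are still made for every sample, support vector or not.
print("predictions for all 453 samples:", model.predict(X).shape)

Either way, the model file is not meant to list every training vector; svm-predict can still be run on the full input file to classify all 453 vectors.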
My apologies, I don't follow what you are saying. The training set (input file) has 453 vectors and the model (output file) has 441 support vectors (i.e. 453 - 12).
NOTE: these are not actually duplicates. Each vector in the input file is in fact a different miRNA; however, the vectors that svm-train removed when generating the output model do have identical labels and identical values for every feature. My features are the A, U, G and C percentage content of each individual miRNA.
Is there any way around the removal of these "duplicates", as I will want to classify all the miRNAs? Or are you saying that the features (i.e. A, U, G and C percentage content) are insufficient and I may need to use different features?
Now I don't understand what the problem is. Do you mean that the support vectors in the model are all identical? I think it would be more useful if you told us what software you use and showed what you've done.

As a first step, you could check whether nucleotide composition has any chance of separating the two classes. Because your data has only 4 dimensions, you could try visualizing the classes in a scatterplot matrix (e.g. the pairs() function in R) or in an MDS or PCA projection.

Another consideration is that for compositional data, a transformation (e.g. a log-ratio) may be useful. This is because you have redundant information due to the constraint that the features of each vector sum to 100%.
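For example, here is a Python sketch of both suggestions (the R route would be pairs(); the data below are random placeholders standing in for your A/U/G/C percentages). It draws a scatterplot matrix coloured by class, then applies a centred log-ratio transform to remove the sum-to-100% constraint:

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Placeholder compositional data: each row sums to 100% (A, U, G, C content).
rng = np.random.default_rng(0)
raw = rng.random((453, 4))
X = 100 * raw / raw.sum(axis=1, keepdims=True)
y = rng.integers(0, 2, size=453)  # placeholder class labels

df = pd.DataFrame(X, columns=["A", "U", "G", "C"])

# Scatterplot matrix coloured by class: the Python analogue of R's pairs().
scatter_matrix(df, c=y, diagonal="hist", figsize=(8, 8))
plt.show()

# Centred log-ratio (CLR) transform to break the sum-to-100% constraint.
# A small pseudo-count guards against zero percentages (e.g. the 4:0.0 above).
Z = np.log(X + 1e-6)
clr = Z - Z.mean(axis=1, keepdims=True)

If the two classes show no separation at all in these plots, that would support the suspicion that composition alone is not informative enough and different features are needed.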