Question

What is the format of the input test data for the svmlight classifier, and how do I generate it?

9.5 years ago

karimkhan.it • 0

I am using the SVM classifier svmlight.

The sample example takes an input test file in this format:

4 qid:4 1:1 2:0 3:0 4:0.2 5:1
3 qid:4 1:1 2:1 3:0 4:0.3 5:0
2 qid:4 1:0 2:0 3:0 4:0.2 5:1
1 qid:4 1:0 2:0 3:1 4:0.2 5:0

But we generally classify plain input text, so how is the above format produced?

In other words, how do I convert plain input text to this specific format?

svmlight classification svm machinelearning • 7.0k views

updated 3.1 years ago by Ram • written 9.5 years ago by karimkhan.it

score 0 · Answer 1 · 2014-10-11

Assuming you want to classify DNA/RNA/protein sequences (otherwise this question belongs on StackOverflow), the first step is to build your feature dictionary. The simplest choice is a k-mer dictionary: for a DNA sequence and k=4, the entries are AAAA, AAAT, AAAG, AAAC, AATA, ..., for 4^4 = 256 features in total. If k-mer #1 (AAAA) is present in your sequence, set feature 1 to 1 (1:1); if it is absent, set it to 0 (1:0), and so on for every other k-mer. If the sequence contains ambiguous letters, e.g. K (G or T) in AAAK, you can use weights instead of 0/1, setting the AAAG and AAAT features to 0.5 each.
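
To make this concrete, here is a minimal Python sketch of the k-mer encoding described above. It is not part of svmlight: the function names (build_kmer_index, kmer_features, to_svmlight_line) are made up for illustration. The alphabet order A, T, G, C is chosen only to reproduce the AAAA, AAAT, AAAG, AAAC numbering above. Two further choices are my assumptions: an ambiguous window splits a weight of 1 evenly over the k-mers it could stand for, and when a k-mer is seen more than once the largest weight is kept. The output relies on two properties of the svmlight file format: feature numbers must appear in increasing order, and zero-valued features may simply be left out. The qid field in the question's sample is only used in ranking mode, so plain classification lines do not need it.

```python
import math
from itertools import product

# Alphabet order is an assumption, picked to match AAAA, AAAT, AAAG, AAAC, ...
ALPHABET = "ATGC"

# IUPAC nucleotide codes: each letter maps to the bases it can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def build_kmer_index(k):
    """Map every possible k-mer to a 1-based feature number."""
    kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return {kmer: number for number, kmer in enumerate(kmers, start=1)}


def kmer_features(seq, k, index):
    """Return {feature_number: weight} for the k-mers present in seq."""
    features = {}
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        if any(base not in IUPAC for base in window):
            continue  # assumption: skip windows with unknown characters
        choices = [IUPAC[base] for base in window]
        # Split a weight of 1 evenly over all k-mers this window could be.
        weight = 1.0 / math.prod(len(c) for c in choices)
        for expansion in product(*choices):
            number = index["".join(expansion)]
            # Assumption: keep the largest weight seen for a repeated k-mer.
            features[number] = max(features.get(number, 0.0), weight)
    return features


def to_svmlight_line(label, features):
    """Format one example; zero-valued features are omitted (sparse format)."""
    pairs = " ".join(f"{n}:{v:g}" for n, v in sorted(features.items()))
    return f"{label} {pairs}"


index = build_kmer_index(4)
print(to_svmlight_line(1, kmer_features("AAAK", 4, index)))  # 1 2:0.5 3:0.5
```

For real sequences you would call kmer_features once per sequence and write one line per example, with the class label (e.g. 1 or -1) in the first column. Counting k-mer occurrences instead of recording presence is an equally reasonable variant.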