Question

Scikit-learn feature selection: should it be done on the training set only?

8.9 years ago

hrbrt.sch ▴ 10

Hello,

I'm using scikit-learn for machine learning. I have 800 samples with 2048 features, so I want to reduce the number of features in the hope of getting better accuracy.

It is a multiclass problem (classes 0-5), and the features consist of 1s and 0s: [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0....,0]

I'm using the Random Forest Classifier.

Should I just feature select the training data? And is it enough to use this code:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

clf = RandomForestClassifier(n_estimators=200, warm_start=True, criterion='gini', max_depth=13)
clf.fit(X_train, y_train).transform(X_train)

predicted = clf.predict(X_test)
expected = y_test
confusionMatrix = metrics.confusion_matrix(expected, predicted)

I ask because the accuracy didn't improve. Is everything OK in the code, or am I doing something wrong?

I'll be very grateful for your help.

Machine-learning Python Scikit-Learn • 3.7k views

Updated 15 months ago by Ram • written 8.9 years ago by hrbrt.sch

Ram · Answer 1 · 2015-05-28

8.9 years ago

learnBioinformatics ▴ 60

Should I just feature select the training data?

Yes, feature selection should be fit on the training set only. Once the important features have been picked based on the training set, you can use those same features in the test set.
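To make that concrete, here is a minimal sketch of one way to do it, assuming a recent scikit-learn where `SelectFromModel` is available. Note that the code in the question discards the return value of `transform`, so the classifier is still trained and tested on all 2048 features. The random data below is only a stand-in for the real `X` and `y`, and the names `selector`, `X_train_sel` and `X_test_sel` are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the data in the question (an assumption):
# 800 samples, 2048 binary features, classes 0-5.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(800, 2048))
y = rng.randint(0, 6, size=800)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the selector on the training split only, so the test split
# has no influence on which features are kept.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, max_depth=13, random_state=0)
)
selector.fit(X_train, y_train)

# Apply the same feature mask to both splits and keep the results.
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Train the final classifier on the reduced training data only.
clf = RandomForestClassifier(n_estimators=200, max_depth=13, random_state=0)
clf.fit(X_train_sel, y_train)
predicted = clf.predict(X_test_sel)
```

With its default settings, `SelectFromModel` keeps the features whose importance is at least the mean importance; the `threshold` argument can be changed if that keeps too many or too few. Whether this improves accuracy on your data is something only cross-validation can tell you.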

As for accuracy, many factors can affect it, for example feature normalization and class imbalance.
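If class imbalance turns out to be the issue, something along these lines may help you check for it and compensate. This is only a sketch under assumptions: the labels below are synthetic and deliberately skewed, not your data, and `class_weight="balanced"` is one option among several.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels (an assumption, not the asker's data).
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(800, 50))
y = rng.choice(6, size=800, p=[0.5, 0.2, 0.1, 0.1, 0.05, 0.05])

# Count samples per class to see how skewed the labels are.
class_counts = np.bincount(y, minlength=6)
print(dict(enumerate(class_counts)))

# stratify=y keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" weights classes inversely to their frequency.
clf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
clf.fit(X_train, y_train)

# Balanced accuracy averages recall over classes, so rare classes count equally.
score = balanced_accuracy_score(y_test, clf.predict(X_test))
```

Plain accuracy can look fine while the rare classes are mostly misclassified, which is why the sketch reports balanced accuracy alongside the per-class counts.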

Hope this helps

Kevin

Updated 15 months ago by Ram • written 8.9 years ago by learnBioinformatics