Question

Filtering before multivariate analysis

8.1 years ago

fansili2013 ▴ 30

I am a newbie in statistics and this may be a silly question.

For an n << p problem, is it reasonable to first filter variables by some criterion, for example the correlation between each independent variable and Y, and then use the remaining variables to build a model and select variables according to that model? I apologize if the question is unclear. Let me give an example.

Suppose we have 100 samples, 1000 variables, and one continuous dependent variable (Y). First, I exclude, one by one, every variable that has no (linear or non-linear) relationship with Y. Suppose 300 variables are left after this filtering. I then use these 300 variables to build a partial least squares (PLS) model, and finally use that PLS model to select the important variables according to their VIP scores.

Another example: I use all 1000 variables to build a random forest, select the top 50 variables according to the forest's variable importance, and then build a new random forest on these 50 variables, which I use as the final model for prediction, etc.

I think there must be something wrong with these two examples, but I just don't know what it is.
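One way to probe what might be wrong with the first example is a small null simulation. The sketch below is illustrative only: it uses pure-noise data (so no variable is truly related to Y), a plain Pearson correlation filter, and a simple signed-sum score as a stand-in for a real model such as PLS. The helper names (`pearson`, `top_k_by_correlation`, `composite_score`), the 50/50 split, and `k = 20` are all assumptions made for the illustration.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def top_k_by_correlation(X, y, rows, k):
    """Pick the k columns most correlated with y, using only the given rows.
    Returns a list of (column index, sign of the correlation)."""
    ys = [y[i] for i in rows]
    scored = []
    for j in range(len(X[0])):
        r = pearson([X[i][j] for i in rows], ys)
        scored.append((abs(r), j, 1 if r >= 0 else -1))
    scored.sort(reverse=True)
    return [(j, s) for _, j, s in scored[:k]]

def composite_score(X, rows, selected):
    """Signed sum of the selected columns: a crude stand-in for a fitted model."""
    return [sum(s * X[i][j] for j, s in selected) for i in rows]

rng = random.Random(0)
n, p, k = 100, 1000, 20
# Pure noise: X and y are generated independently of each other.
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]
train, test = list(range(50)), list(range(50, 100))
y_test = [y[i] for i in test]

# Filter on ALL samples (test rows included), then evaluate on the test rows.
leaky = top_k_by_correlation(X, y, train + test, k)
r_leaky = pearson(composite_score(X, test, leaky), y_test)

# Filter on the training rows only, then evaluate on the untouched test rows.
honest = top_k_by_correlation(X, y, train, k)
r_honest = pearson(composite_score(X, test, honest), y_test)

print(f"filter saw test rows:  r = {r_leaky:.2f}")
print(f"filter saw train only: r = {r_honest:.2f}")
```

If the filter is allowed to see the rows later used for evaluation, the score should look clearly associated with Y even though the data are noise, whereas filtering on the training rows alone should give a correlation near zero. The same concern would apply to choosing the top 50 variables from a random forest fitted on all samples.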

multivariate analysis filtering • 1.6k views

updated 8.1 years ago by dario.garvan ▴ 530 • written 8.1 years ago by fansili2013 ▴ 30

score 1 · Answer 1 · 2016-03-23

8.1 years ago

dario.garvan ▴ 530

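You should not filter based on a variable's association with the dependent variable: the same Y values would then be used both to choose the variables and to fit and assess the model, which makes the results look better than they really are. Independent filtering, which ranks variables by a criterion computed without reference to Y (for example, overall variance), is a well-developed technique for your needs. There is also a Bioconductor software package that can run independent filtering on your dataset.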
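As a rough illustration of the idea, here is a minimal sketch of a filter that never looks at Y. It is not the Bioconductor package's API. The choice of overall variance as the filter statistic, the function name `variance_filter`, and the simulated data are assumptions for the example; the figure of 300 kept variables is borrowed from the question.

```python
import random

def variance(values):
    """Sample variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def variance_filter(X, keep):
    """Keep the `keep` columns with the largest overall variance.
    Y is not an argument, so the filter cannot use it.
    Returns (sorted kept column indices, list of per-column variances)."""
    p = len(X[0])
    col_var = [variance([row[j] for row in X]) for j in range(p)]
    ranked = sorted(range(p), key=lambda j: col_var[j], reverse=True)
    return sorted(ranked[:keep]), col_var

rng = random.Random(1)
n, p, keep = 100, 1000, 300
# Simulated data: each column gets its own (hypothetical) spread.
scales = [rng.uniform(0.1, 2.0) for _ in range(p)]
X = [[rng.gauss(0, scales[j]) for j in range(p)] for _ in range(n)]

kept, col_var = variance_filter(X, keep)
kept_set = set(kept)
dropped = [j for j in range(p) if j not in kept_set]

print(f"kept {len(kept)} of {p} variables")
print(f"smallest kept variance:    {min(col_var[j] for j in kept):.3f}")
print(f"largest dropped variance:  {max(col_var[j] for j in dropped):.3f}")
```

Because the filter statistic is computed from X alone, the kept variables can then be passed to a model that does use Y. Any later selection step that uses Y (VIP scores, random forest importance) would still need to sit inside the cross-validation loop.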
