I am a newbie in statistics and this may be a silly question.
For an n << p problem, I wonder whether it is feasible to first filter the variables according to some criterion, for example the correlation between each independent variable and Y, and then use the remaining variables to build a model and select variables according to that model. I apologize if this is not a clear question. Let me give an example.
Suppose we have 100 samples and 1000 variables, with one continuous dependent variable (Y). First, I go through the variables one by one and exclude each one that has no (linear or non-linear) relationship with Y. Suppose that after this filtering I have 300 variables left. I then use these 300 variables to build a partial least squares (PLS) model, and finally use this PLS model to select the important variables according to their VIP (variable importance in projection) scores.
Another example: I use all 1000 variables to build a random forest. Then, according to the variable importance (VIP) from this random forest, I select the top 50 variables, build a new random forest using only these 50 variables, and use this second random forest as the final model for prediction, etc.
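The second example would look roughly like the sketch below. Again the data are made up, and I am assuming scikit-learn's impurity-based `feature_importances_` as the importance measure and 100 trees per forest; other importance measures (such as permutation importance) could be substituted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up data (assumption): 100 samples, 1000 variables,
# Y depends on the first 5 variables plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))
y = X[:, :5].sum(axis=1) + rng.normal(size=100)

# Step 1: random forest on all 1000 variables.
rf_full = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Step 2: keep the 50 variables with the highest importance.
top50 = np.argsort(-rf_full.feature_importances_)[:50]

# Step 3: refit a new random forest on those 50 variables only
# and use it as the final model.
rf_final = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[:, top50], y)
preds = rf_final.predict(X[:, top50])
```

Here too, the same 100 samples are used both to choose the 50 variables and to fit the final model.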
I think there must be something wrong with these two examples, but I just don't know what it is.