Feature Selection method dependent on response variable?
6.2 years ago

I have Infinium 450k DNA methylation microarray data and I want to predict age (a continuous response variable) from it. As there are far more features than samples, I will have to apply feature selection. Is my choice of feature selection method partially determined by the type of response variable I have? For example, if I had age classes instead, would I have to choose a different feature selection method?

feature selection modeling microarray data
6.2 years ago
Steven Lakin ★ 1.8k

You wouldn't necessarily have to change your feature selection method (though you may want to depending on the question you're asking). You should change the way your target classes are represented though, if you're using machine learning classifiers.

Continuous variables can be predicted by regression, as you stated. Discrete variables can also be predicted this way if you don't care about strict boundaries between classes (e.g. 2.5 is an acceptable answer), but if you're looking for a potentially more accurate output, you could use indicator variables to create a separate classification boundary for each class.

So let's say you had age classes 1, 2, and 3. You would create a target label matrix of dimension M x 3, where M = # of instances you wish to classify and there is one column per class:

[ [ 1 0 0 ],
[ 0 1 0 ],
[ 0 0 1 ],
[ 0 1 0 ],
....
[ 1 0 0 ] ]


Each column represents a true/false value for one age category. This way, feature selection will less often pick features that push predictions toward an in-between value such as age class 2.5, and will instead favor features that correctly classify samples into age class 2 or 3.
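A minimal sketch of building such an indicator matrix with NumPy. The function name `one_hot` and the five example labels are made up for illustration; they are not from any particular library or dataset.

```python
import numpy as np

def one_hot(labels, classes):
    """Return an M x K indicator matrix: row i has a 1 in the column of labels[i]."""
    labels = np.asarray(labels)
    classes = np.asarray(classes)
    # Broadcasting compares every label against every class -> (M, K) boolean matrix.
    return (labels[:, None] == classes[None, :]).astype(int)

# Hypothetical age-class labels for five samples.
age_classes = [1, 2, 3, 2, 1]
Y = one_hot(age_classes, classes=[1, 2, 3])
print(Y)
```

Each row sums to 1, so every sample belongs to exactly one age class.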

All of this depends on which feature selection strategy you use though.


Thanks Steven,

What if I don't want to discretize my response variable into groups? In that case it would definitely influence which feature selection techniques are applicable to this problem, right? I'm currently reading the article by Hira & Gillies (2015), which distinguishes between FS techniques for classification and regression problems. So my current hypothesis is that it does indeed depend on whether you have class labels as the response or a continuous variable like age.

So maybe my question was a bit unclear, but I want to use continuous age. I only want to know whether there are FS techniques that should be applied in this setting instead of the ones used when you have class labels as the response variable.


It sounds to me like you want regression in the end but might be more interested in dimensionality reduction to begin with. Here is a brief synopsis of the three most relevant output measures for classification/regression:

• Regression - A mapping of any number of input measures to a continuous output variable: y = f(x1, x2, ..., xp), where y is continuous
• Logistic regression - A mapping of any number of input measures to a probability for each class: P(yk|d) = f(x1, x2, ..., xp) for k = 1, ..., n, where P(yk|d) is the probability of class yk given the data d (still continuous, since it's a probability, but it corresponds to discrete classes)
• Classification - Methods used to partition the data space into categories: {y1, y2, ..., yn} = f(x1, x2, ..., xp), where f determines segregating boundaries in the data space so that inputs are assigned to classes. These methods typically include support vector machines, linear discriminant analysis, neural networks, etc.
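To make the difference between the three output types concrete, here is a small sketch with NumPy. The data and the weights `w` and `W` are random placeholders, not fitted models, so only the form of each output is meaningful. Taking the most probable class is just one way of getting hard labels; it stands in for the boundary-based classifiers listed above.

```python
import numpy as np

def softmax(scores):
    """Turn a (samples x classes) score matrix into class probabilities per row."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))   # 4 synthetic samples, 6 synthetic features
w = rng.normal(size=6)        # placeholder weights for a regression model
W = rng.normal(size=(6, 3))   # placeholder weights for 3 classes

y_regression = X @ w                        # regression: one continuous value per sample
class_probs = softmax(X @ W)                # logistic-style: one probability per class
class_labels = class_probs.argmax(axis=1)   # classification: one discrete label per sample

print(y_regression)
print(class_probs)
print(class_labels)
```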

Any number of techniques can do this, and most people traditionally use generalized linear models (GLMs) to do the actual regression part. However, since you're interested in feature selection, you may want to do this in multiple steps:

1. dimensionality reduction
2. regression on the reduced data space

Dimensionality reduction can again include any number of techniques (e.g. Principal Components Analysis, Locally Linear Embedding, Laplacian Eigenmaps, etc. -- the list is quite long). Another paper out of my university that might interest you (especially for microarray data) uses Support Vector Machines with a sparse 1-norm penalty to select batches of features, then uses those reduced features to do regression: http://www.ncbi.nlm.nih.gov/pubmed/24274115
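As a rough sketch of the two-step idea (reduce, then regress), the following uses PCA via an SVD followed by ordinary least squares, both with plain NumPy. PCA is only one of the options mentioned, and this is not the SVM-based method from the linked paper. The data are synthetic, and the helper names (`pca_fit`, `pca_transform`, `fit_linear`, `predict_linear`) and the choice of 10 components are assumptions made for the example.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the feature means and the top principal directions of X."""
    mean = X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project samples onto the principal directions."""
    return (X - mean) @ components.T

def fit_linear(Z, y):
    """Ordinary least squares with an intercept on the reduced data."""
    A = np.column_stack([np.ones(len(Z)), Z])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(Z, coef):
    return coef[0] + Z @ coef[1:]

# Synthetic stand-in for methylation data: far more features than samples.
rng = np.random.default_rng(0)
n_samples, n_features = 40, 500
X = rng.normal(size=(n_samples, n_features))
age = 50 + 10 * X[:, 0] + rng.normal(scale=1.0, size=n_samples)

# Step 1: dimensionality reduction. Step 2: regression on the reduced space.
mean, comps = pca_fit(X, n_components=10)
Z = pca_transform(X, mean, comps)
coef = fit_linear(Z, age)
pred = predict_linear(Z, coef)
train_rmse = np.sqrt(np.mean((pred - age) ** 2))
print(Z.shape, train_rmse)
```

Note that PCA ignores the response, so the leading components are not guaranteed to carry the age signal. The error printed here is training error only, which says little about how well the model predicts new samples.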

The most important aspect of what you want to achieve is to predict age (continuous) as accurately as possible from your high-dimensional data, so you want to make sure that the measures you use are the best possible measures. Dimensionality reduction followed by regression is the way to go for that. As for which methods you should choose, you'll have to base that on testing how good your regression is (some measure of error). If that error falls, then you're doing better; a standard way of doing this with a single data set is k-fold cross-validation, looking at how your error behaves on each validation fold.
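Here is a minimal sketch of k-fold cross-validation reporting the error on each validation fold, using plain NumPy. Ridge regression is used only as a simple stand-in for whichever model is being compared. The synthetic data, the helper names, and the penalty values are all assumptions for illustration. Everything that learns from the data (here the centering and the ridge fit) is done on the training folds only, and the same should apply to any feature selection step.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k validation folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def ridge_fit(X, y, lam):
    """Closed-form ridge regression on centered data (stand-in model)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return x_mean, y_mean, w

def ridge_predict(X, model):
    x_mean, y_mean, w = model
    return y_mean + (X - x_mean) @ w

def cv_rmse(X, y, lam, k=5):
    """Return the RMSE on each of the k validation folds."""
    errors = []
    for test_idx in kfold_indices(len(y), k):
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        # Fit on the training folds only, then score on the held-out fold.
        model = ridge_fit(X[train_mask], y[train_mask], lam)
        pred = ridge_predict(X[test_idx], model)
        errors.append(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))
    return np.array(errors)

# Synthetic data: 60 samples, 200 features, age driven by two of them.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 200))
age = 50 + 8 * X[:, 0] - 5 * X[:, 1] + rng.normal(size=60)

for lam in (0.1, 10.0, 1000.0):
    fold_errors = cv_rmse(X, age, lam)
    print(lam, fold_errors.round(2), fold_errors.mean().round(2))
```

The setting with the lowest average validation error, and reasonably consistent errors across folds, would be the one to prefer.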

Sorry I can't be of more help here; the answer to big data questions is usually that there are many choices that could work equally well. They have to each be tested on the data to determine their value.