Question: Finding Class-Discriminant Genes In A Microarray Experiment
gravatar for Michael Schubert
8.4 years ago by
Cambridge, UK
Michael Schubert6.9k wrote:

I have a microarray expriment (Affymetrix) with 50 samples (each belonging to one of 5 disease subtypes) and about 5000 genes after filtering. Now, I would like to find a list of, let's say, 20 genes whose expression can be used to correctly classify the disease subtypes.

For now, I have created a design matrix in R that defines which subtype a sample belongs to. I then create a contrast matrix to distinguish between class A and the other classes, use eBayes to get the p-values, adjust them for multiple testing:

contrast.matrix = makeContrasts(A-(B+C+D+E)/4, levels=design)
fit = eBayes(, design), contrast.matrix))

This is repeated for each class, so I end up with a 5000x5 matrix of p-values (where one column corresponds to how significant the current class expression compared to the average of all others is). How can I use this matrix to extract my class-discriminant genes? Or [preferred] how can I apply statistical testing on all classes instead of just 2 at a time?

Note: this is (1) for all classes at once (I'm not searching for pairwise comparison) and (2) taking the 4 highest p-values for each class might not be the smartest approach (or is it?).

ADD COMMENTlink written 8.4 years ago by Michael Schubert6.9k
gravatar for Laurent Gautier
8.4 years ago by
Laurent Gautier810 wrote:

Finding "class discriminant genes" is sometimes referred to as "feature selection". A search engine will point you to a number of references.

RandomForests (recommended by Sean) are rather well suited for the task with the note that when building a classifier it can be really tricky to infer a mechanistic/biological explanation to it (you can also use it to perform a feature selection). For a reference in the context of microarrays, there is

Prior to all that, and in order to further reduce the number of features, you can also study the correlation structure for your 5000 genes and decide to group the strongly correlated ones together (the resulting groups are sometimes called "meta-genes").

ADD COMMENTlink written 8.4 years ago by Laurent Gautier810
gravatar for David Quigley
8.4 years ago by
David Quigley11k
San Francisco
David Quigley11k wrote:

If you're willing to consider non-parametric approaches, there is a large literature on this topic in machine learning. If you're not familiar with them, you might consider applying k-nearest-neighbor approaches (e.g. the class package in CRAN) or PAM ( KNN would be particularly applicable if you know a priori that you have 5 phenotype groups. Most of these packages have cross-validation built in, as well as tools to identify features with the greatest contribution to the predictor.

As an aside, it is unlikely that you are sufficiently powered to build a classifier that will stand up to testing with an independent test data set, but that depends on the signal present in your data.

ADD COMMENTlink written 8.4 years ago by David Quigley11k

I often use randomForests for this type of thing, though I cannot say that I have objective evidence that it is better than any other. It gives ranked gene lists as output (importance) and can also give sample weights for each class.

ADD REPLYlink written 8.4 years ago by Sean Davis25k

I'm currently classifying with SVMs on the whole filtered set, if there is a way to read the predictor strengths from the model this would already be great - I'll look into it, thanks.

ADD REPLYlink written 8.4 years ago by Michael Schubert6.9k

SVM weight extraction:

ADD REPLYlink written 8.4 years ago by Michael Schubert6.9k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1186 users visited in the last hour