I'm looking for a simple feature selection approach for "meaningful" genes given a dendrogram of expression samples. In short, I'm looking for an algorithm to choose a small number of genes that help elucidate the relationships among the samples.
There are many packages and a lot of literature on feature selection for gene expression-based classification. There is basic filtering of uninformative features, and there are filter and wrapper approaches for gene selection when there is a small number of class labels (diseased, etc.). A lot of this is built into popular packages. There are also very sophisticated and computationally intensive methods for hierarchical feature selection and feature selection for sparse clustering.
I am not trying to classify or cluster. I'm looking to provide visualization or data exploration. I want to visualize the relationship of the samples (dendrogram) using a heatmap where the genes are meaningful features that help the biologist to understand how the samples differ.
Concretely, I'm looking at situations where the number of samples (actually aggregated single-cell expression samples) is small (< 20) and the number of genes is very large (thousands). The question arises: what are the meaningful genes that explain the clustering?
A couple of approaches come to mind. (1) I could cut the tree at different depths, label the samples by the resulting nodes, use any of the many feature selection / classification tools on those labels, and keep the union of the selected feature sets. (2) I could walk the tree, choosing at each bifurcation one or more genes that minimize the entropy or maximize the correlation between the two groups. (2a) As a variation, at each bifurcation (except the root), I could use three class labels (left, right, and non-descendants of that node) to identify genes that might be informative "within" and "without".
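A minimal sketch of approach (2), under some assumptions: the samples are rows of a NumPy matrix, the dendrogram comes from SciPy's `linkage`, and the per-gene score is a simple standardized difference of group means. That score is a stand-in I chose for illustration, not the entropy or correlation criterion itself. The function name `genes_per_split` and its parameters are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree


def genes_per_split(X, k=2, method="average", metric="euclidean", eps=1e-9):
    """Walk a sample dendrogram and pick the top-k genes at each bifurcation.

    X is an (n_samples, n_genes) expression matrix.
    Returns a list of (left_samples, right_samples, gene_indices) tuples,
    one per internal node, starting with the root split.
    """
    Z = linkage(X, method=method, metric=metric)
    root = to_tree(Z)

    results = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.is_leaf():
            continue

        # pre_order() returns the ids of the leaves (samples) under a node
        left = node.get_left().pre_order()
        right = node.get_right().pre_order()

        # Illustrative score (an assumption): absolute difference of group
        # means, scaled by the spread over the samples under this node
        sub = X[left + right]
        diff = np.abs(X[left].mean(axis=0) - X[right].mean(axis=0))
        score = diff / (sub.std(axis=0) + eps)

        top = np.argsort(score)[::-1][:k]
        results.append((sorted(left), sorted(right), top.tolist()))

        stack.extend([node.get_left(), node.get_right()])
    return results
```

The union of the gene indices over all splits would then give the rows of the heatmap. Swapping in another score (mutual information with the left/right labels, for example) only changes the `score` lines. With groups as small as one or two samples, any such score is noisy, so this is only a starting point.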
This feels like it should be well-trodden territory. After all, one of the first things most biologists do is create a heatmap with dendrograms on both axes and then ask which markers explain their data. Ideas?