2.9 years ago by
Hasso Plattner Institute, Potsdam, Germany
I would like to bring this topic up again. From what I have read so far in papers, on the Internet, and in Bioconductor workflows, gene expression data sets are usually preprocessed first (filtering, normalization, log-transformation, ...), then a differential expression analysis is run (DESeq2, edgeR, ...), and afterwards a pattern mining approach (e.g. clustering) is applied. For that last step, a feature selection method is used to reduce the number of genes. A common example seems to be the rowVars function from the genefilter R package, used here to keep the 50 genes with the highest variance:
library(genefilter)  # provides rowVars
topVarGenes <- head(order(rowVars(dataset), decreasing = TRUE), 50)
I have also seen other approaches, e.g. applying InformationGain, ReliefF, etc., all of which are well-established methods. I was wondering, however, why the results of the differential expression analysis are not used for feature selection, as originally suggested here. Or are they used, and this is just poorly documented? What is the state of the art here?
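To make the two options concrete, here is a minimal sketch in base R on simulated data. Everything in it is an assumption for illustration: the matrix `expr`, the `group` labels, and the spiked-in genes are made up; `apply(expr, 1, var)` stands in for `rowVars`; and the per-gene Welch t-test with Benjamini-Hochberg adjustment is only a placeholder for a real DESeq2 or edgeR result table, not what those packages compute.

```r
# Simulated log-expression matrix: 200 genes x 12 samples (hypothetical data).
set.seed(1)
n_genes <- 200
group <- factor(rep(c("control", "treated"), each = 6))
expr <- matrix(rnorm(n_genes * 12), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# Shift the first 10 genes in the treated samples so they are truly differential.
treated <- group == "treated"
expr[1:10, treated] <- expr[1:10, treated] + 3

# (a) Unsupervised filter: the 50 genes with the highest variance across samples.
# apply(expr, 1, var) is a base R stand-in for genefilter::rowVars.
top_var <- head(order(apply(expr, 1, var), decreasing = TRUE), 50)

# (b) Supervised filter: rank genes by a per-gene Welch t-test between groups.
# Placeholder only; a real analysis would rank by the DESeq2/edgeR results.
pvals <- apply(expr, 1, function(x) t.test(x ~ group)$p.value)
padj <- p.adjust(pvals, method = "BH")
top_de <- head(order(padj), 50)

# Number of genes selected by both approaches.
length(intersect(top_var, top_de))
```

Note that (b) uses the group labels to pick the genes, whereas (a) does not look at the labels at all.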