Question: Feature Selection Methods For RNA-seq Data?
antass (United States), 6.8 years ago, wrote:

I am working with RNAseq data - raw counts from HTSeq as well as RPKM from Cufflinks - and want to apply feature selection. For microarray data, I would usually look into using linear modeling, random forest, or R packages like glmnet.

Are there any feature selection experts out there who could recommend RNA-seq specific FS methods, preferably implemented in R?

rna-seq • 3.9k views
Sean Davis (National Institutes of Health, Bethesda, MD), 6.8 years ago, wrote:

Linear models are possible using edgeR and DESeq2, among others. Random forests should still be applicable. If you use something like voom (limma) or vst (DESeq) to transform the counts to more bell-shaped data, many other approaches are probably applicable as well.
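A minimal sketch of the transform-then-select route described above, assuming the edgeR, limma, and glmnet packages are installed. The count matrix, group labels, and all variable names are simulated and hypothetical; voom is used here only to obtain log2-CPM values, and a lasso-penalized logistic regression stands in as one possible selector, not as a recommended method.

```r
library(edgeR)   # DGEList, calcNormFactors
library(limma)   # voom
library(glmnet)  # cv.glmnet

set.seed(1)
n_genes <- 200
n_samples <- 40

# Simulated raw counts (genes in rows, samples in columns); illustrative only.
counts <- matrix(rnbinom(n_genes * n_samples, mu = 100, size = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", seq_len(n_samples))))
group <- factor(rep(c("A", "B"), each = n_samples / 2))

# Assumption for the demo: the first 10 genes are higher in group B.
counts[1:10, group == "B"] <- counts[1:10, group == "B"] * 4L

# Normalize library sizes, then transform to log2-CPM with voom.
dge <- calcNormFactors(DGEList(counts = counts))
logcpm <- voom(dge)$E

# Lasso-penalized logistic regression; glmnet expects samples in rows.
cvfit <- cv.glmnet(x = t(logcpm), y = group, family = "binomial", alpha = 1)

# Genes with non-zero coefficients at the cross-validated lambda.
beta <- as.matrix(coef(cvfit, s = "lambda.min"))
selected <- setdiff(rownames(beta)[beta[, 1] != 0], "(Intercept)")
```

Here `selected` holds the names of the genes the lasso kept. On real data the selection would normally be repeated inside each resampling fold so that performance estimates are not inflated.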


I was planning on using the voom transform and edgeR. I haven't used DESeq2 or the vst transform - I'll look into those. Thanks Sean!

(reply written 6.8 years ago by antass)

Probably obvious, but just for posterity's sake: one would not want to use voom in concert with edgeR for analysis, since edgeR needs raw counts. You could process with voom and then use limma, though; alternatively, you could use the raw counts from HTSeq as direct input to edgeR.
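The two routes in this reply can be sketched side by side. This is an illustration under stated assumptions, not a full analysis: it assumes the edgeR and limma packages are installed, and the counts, sample names, and the two-group design are simulated and hypothetical.

```r
library(edgeR)  # DGEList, calcNormFactors, estimateDisp, glmQLFit, glmQLFTest, topTags
library(limma)  # voom, lmFit, eBayes, topTable

set.seed(1)
# Simulated raw counts: 500 genes x 6 samples (illustrative only).
counts <- matrix(rnbinom(500 * 6, mu = 50, size = 10), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), paste0("s", 1:6)))
group <- factor(rep(c("ctrl", "trt"), each = 3))
design <- model.matrix(~ group)

dge <- DGEList(counts = counts, group = group)
dge <- calcNormFactors(dge)

# Route 1: voom transform, then limma's linear models.
v <- voom(dge, design)
fit <- eBayes(lmFit(v, design))
limma_res <- topTable(fit, coef = 2, number = Inf)

# Route 2: raw counts straight into edgeR's negative binomial GLM.
dge <- estimateDisp(dge, design)
qlfit <- glmQLFit(dge, design)
edger_res <- topTags(glmQLFTest(qlfit, coef = 2), n = Inf)$table
```

Both `limma_res` and `edger_res` are per-gene tables for the same contrast, so their rankings can be compared directly. Note that the voom output `v` is never passed to an edgeR function.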

(reply written 6.8 years ago by Sean Davis)

Ah, yes - I wasn't very specific. I am not using these together. I am actually developing a biomarker, so I will try to test multiple combinations of parameters and methods (ones that will hopefully make sense as a combination). Thanks for the reminder.

(reply written 6.8 years ago by antass)
cindy.perscheid (Hasso Plattner Institute, Potsdam, Germany), 2.9 years ago, wrote:

I would like to bring this topic up again. From what I have read so far in papers, on the Internet, and in Bioconductor workflows, it seems that gene expression data sets are first preprocessed (filtering, normalization, log-transformation, ...), then a differential expression analysis is done (DESeq2, edgeR, ...), and afterwards an approach for pattern mining (e.g. clustering) is applied. For the latter, a feature selection method is used. A common example seems to be the rowVars function from the genefilter R package:

library(genefilter)  # provides rowVars
topVarGenes <- head(order(rowVars(dataset), decreasing = TRUE), 50)  # row indices of the 50 most variable genes
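For readers who want to run the line above without Bioconductor, here is a small self-contained sketch of the same variance filter. It is an approximation under stated assumptions: `dataset` is a simulated, already-transformed expression matrix (genes in rows), and base R's `apply(..., 1, var)` is used as a stand-in for `rowVars`.

```r
set.seed(1)
# Simulated transformed expression values: 1000 genes x 12 samples (illustrative only).
dataset <- matrix(rnorm(1000 * 12), nrow = 1000,
                  dimnames = list(paste0("gene", 1:1000), paste0("s", 1:12)))

# Assumption for the demo: the first 50 genes vary more across samples.
dataset[1:50, ] <- dataset[1:50, ] * 5

# Per-gene variance across samples (base-R stand-in for genefilter::rowVars).
gene_var <- apply(dataset, 1, var)

# Row indices of the 50 most variable genes, highest variance first.
topVarGenes <- head(order(gene_var, decreasing = TRUE), 50)

# Reduced matrix that would be passed on to clustering or a heatmap.
top_matrix <- dataset[topVarGenes, ]
```

This filter is unsupervised: it never looks at sample labels, so it ranks genes by overall spread rather than by their association with a condition.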

I have also seen other approaches, e.g. applying InformationGain, ReliefF, etc. - well-established methods. I was wondering, however, why the results from the differential expression analysis are not used for feature selection, as originally suggested here. Or are they used, but just poorly documented? What is the state of the art here?
