Question: What is the best way to combine machine learning algorithms for feature selection such as Variable importance in Random Forest with differential expression analysis?
19 months ago by
elmahy200540 wrote:

I am applying different feature selection algorithms, such as lasso and variable importance from a Random Forest fit, to RNA-seq data in order to find the best set of genes for classifying a participant as normal or diseased. Alternatively, I could have run a differential expression analysis, e.g. with DESeq2, and obtained a set of genes that also distinguishes disease from normal.

My question is: how do these two approaches differ from each other, and what is the best way to combine them? Any suggested reading is very welcome.

rna-seq • 1.9k views
modified 10 hours ago by CY410 • written 19 months ago by elmahy200540
19 months ago by
Kevin Blighe50k wrote:

NB - this answer has been updated January 21st, 2019

While things like RandomForest®, lasso penalised regression (and both elastic-net and ridge regression), deep learning, machine learning, AI, neural networks, etc. may each sound great, from what I have seen, they do not perform better than a well-performed / curated differential expression analysis followed by gene signature refinement of differentially expressed genes (DEGs) through further modeling. In fact, I frequently find that these algorithms perform worse. The 'craze' surrounding these 'buzz' words has come, but it will fade away once everyone realises that they won't bring about the next revolution in healthcare (hard work and tackling bureaucracy will).

If one could conduct the perfect study, one would do the following:

  1. Differential expression analysis to identify DEGs using a program (e.g., edgeR, DESeq2, limma/voom for RNA-seq) that sufficiently normalises your data and deals with the anticipated sources of bias.
  2. Further refinement / validation of the DEGs, likely using downstream methods (e.g. PCA, clustering, et cetera) applied to the transformed normalised counts from the same cohort, and / or in an independent dataset (e.g., from TCGA, CCLE for cancer, or some SRA/GEO/ArrayExpress study for other diseases). Here, people may also sometimes just manually pick the top 50 or 100 differentially expressed genes. Others, I have observed, pick them based on their own knowledge and experience of the field in which they conduct research. There are many ways to go in an analysis from this point.
  3. Your top genes should then be validated in a separate cohort of higher sample n and using a separate technology, such as high throughput PCR (Fluidigm), NanoString, or even a customised microarray. The number of top DEGs that you choose from Step 2 will be dictated by the chosen technology here.
  4. Further refine the top genes through regression modeling using either stepwise regression (forward, backward, or both) or by testing each gene independently and then keeping all those that are statistically significant independent predictors of your outcome. There will likely be clinical parameters included in these models at this stage, too, possibly as covariates (e.g. BMI status, smoking status, clinical grade, lab markers of inflammation, histology scores, et cetera).
  5. Test your final regression model (or models) through various metrics and processes, such as R² shrinkage, cross-validation, Cook's distance (for outliers / influential samples), ROC analysis, and the derivation of precision, accuracy, sensitivity, and specificity. Again, there may be clinical parameters mixed with genes in these final models.
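
As a rough illustration of the testing described in the last step, the sketch below fits a logistic model on simulated data and derives sensitivity, specificity, precision, accuracy, an AUC, and Cook's distances using base R only. The gene names (`geneA`, `geneB`), the 0.5 probability cut-off, the single train/test split, and the simulated effect sizes are all assumptions made for the example, not part of the workflow above.

```r
set.seed(42)
n <- 300

# Simulated expression for two hypothetical genes and a binary disease label
d <- data.frame(geneA = rnorm(n), geneB = rnorm(n))
d$disease <- rbinom(n, 1, plogis(1.5 * d$geneA - 1.0 * d$geneB))

# One train/test split; k-fold cross-validation would be the fuller option
train <- d[1:150, ]
test  <- d[151:300, ]

fit  <- glm(disease ~ geneA + geneB, family = binomial, data = train)
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)   # 0.5 cut-off is an arbitrary choice

# Confusion-matrix counts on the held-out samples
tp <- sum(pred == 1 & test$disease == 1)
tn <- sum(pred == 0 & test$disease == 0)
fp <- sum(pred == 1 & test$disease == 0)
fn <- sum(pred == 0 & test$disease == 1)

metrics <- c(
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  precision   = tp / (tp + fp),
  accuracy    = (tp + tn) / nrow(test)
)

# AUC via the rank-sum (Mann-Whitney) identity
n1  <- sum(test$disease == 1)
n0  <- sum(test$disease == 0)
auc <- (sum(rank(prob)[test$disease == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Cook's distance on the training fit; 4/n is a common rule-of-thumb threshold
influential <- which(cooks.distance(fit) > 4 / nrow(train))

print(round(metrics, 3))
print(auc)
```

With real data, the same metrics would be computed on the independent validation cohort rather than on a split of simulated samples.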

If you can do all of that and produce a robust gene signature, then you're talking about stuff that is the equivalent to, for example, OncoType DX® and MammaPrint® in breast cancer.

Note that the 'final model' to which I refer here may look something like (example for lm() and glm()):

lm(MutationLoad ~ TP53 + TumourGrade + CCNB1 + ATM + POLE)

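glm(ArthritisStatus ~ ESR + CD20 + Age + SmokingStatus, family = binomial)

MutationLoad is continuous; ArthritisStatus is binary / categorical, so the glm() call needs family = binomial (logistic regression) rather than the default gaussian family.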
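
To make the two calls concrete, here is a minimal sketch in base R that fits both kinds of model on simulated data. The data frame names (`tumours`, `patients`), the simulated values, and the effect sizes are invented for illustration; only the variable names come from the formulas above. The logistic fit sets `family = binomial` because the outcome is binary.

```r
set.seed(1)
n <- 200

# Hypothetical cohort 1: continuous outcome, ordinary linear regression
tumours <- data.frame(
  TP53        = rnorm(n),
  CCNB1       = rnorm(n),
  ATM         = rnorm(n),
  POLE        = rnorm(n),
  TumourGrade = factor(sample(1:3, n, replace = TRUE))
)
tumours$MutationLoad <- 5 + 1.5 * tumours$TP53 - 0.8 * tumours$ATM + rnorm(n)

fit_lm <- lm(MutationLoad ~ TP53 + TumourGrade + CCNB1 + ATM + POLE,
             data = tumours)

# Hypothetical cohort 2: binary outcome, logistic regression
patients <- data.frame(
  ESR           = rnorm(n),
  CD20          = rnorm(n),
  Age           = round(runif(n, 30, 80)),
  SmokingStatus = factor(sample(c("never", "ever"), n, replace = TRUE))
)
lin_pred <- -0.5 + 1.2 * patients$ESR + 0.9 * patients$CD20
patients$ArthritisStatus <- rbinom(n, 1, plogis(lin_pred))

fit_glm <- glm(ArthritisStatus ~ ESR + CD20 + Age + SmokingStatus,
               family = binomial, data = patients)

summary(fit_lm)$coefficients   # estimates, standard errors, p-values
exp(coef(fit_glm))             # odds ratios from the logistic model
```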


My recommendations should not dissuade you from nevertheless trying out the RandomForest®. However, I am highly confident that it will not perform better than the method I have presented above.

Let me know if I can assist further


PS - if you elect for stepwise regression in Step 4, which is somewhat automated and based on AIC or BIC, then you may face criticism from a statistician. Still, this is much better than just chucking all of your data into a 'machine learning' algorithm and letting that do everything for you.
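
For reference, automated stepwise selection of this kind is available in base R through step(). The sketch below is only illustrative: the data are simulated, the names `gene1` to `gene6` are placeholders, and only `gene1` and `gene2` are given a real effect. Setting `k = log(n)` switches the penalty from AIC to BIC.

```r
set.seed(7)
n <- 200

# Six hypothetical candidate genes; only gene1 and gene2 drive the outcome
d <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(d) <- paste0("gene", 1:6)
d$status <- rbinom(n, 1, plogis(1.5 * d$gene1 - 1.5 * d$gene2))

null_fit <- glm(status ~ 1, family = binomial, data = d)   # intercept only
full_fit <- glm(status ~ ., family = binomial, data = d)   # all candidates

search_scope <- list(lower = ~ 1, upper = formula(full_fit))

# Stepwise selection in both directions, starting from the null model
aic_fit <- step(null_fit, scope = search_scope, direction = "both", trace = 0)
bic_fit <- step(null_fit, scope = search_scope, direction = "both", trace = 0,
                k = log(n))

formula(aic_fit)   # terms kept under the AIC penalty
formula(bic_fit)   # terms kept under the (stricter) BIC penalty
```

The selected model should still be checked on independent samples, since the selection itself is tuned to the data it was run on.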

modified 8 months ago • written 19 months ago by Kevin Blighe50k

Thanks for the detailed explanation. I have been giving some thought to using DEGs (as opposed to other feature types such as somatic mutations or methylation) to build classification / regression models. I initially thought that, because gene expression is a dynamic process, its variation is rather large between or within phenotypes (perhaps even between different time points in the same individual). This would make DEGs an inferior feature type compared with more definitive features, for example differentially methylated regions (DMRs).

However, on second thought, I guess this variation in gene expression is accounted for by the statistical analysis during differential expression analysis. That is exactly the whole point of differential expression analysis, right? Can you share some comments on this? Thanks

written 10 hours ago by CY410