Question: What is the best way to combine machine learning feature selection methods, such as variable importance in Random Forest, with differential expression analysis?
15 months ago, elmahy2005 wrote:

I am applying different feature selection algorithms to RNA-seq data, such as lasso and variable importance from a Random Forest fit, to find the best set of genes for classifying a participant as normal or diseased. On the other hand, I could have done a differential expression analysis, e.g. with DESeq2, and obtained a set of genes that also distinguishes disease from normal.

My question is: how do these two approaches differ from each other, and what is the best way to combine them? Any suggested reading is very welcome.

15 months ago, Kevin Blighe wrote:

NB - this answer has been updated January 21st, 2019

While things like RandomForest®, lasso-penalised regression (and both elastic-net and ridge regression), deep learning, machine learning, AI, neural networks, etc. may each sound great, from what I have seen they do not perform better than a well-performed / curated differential expression analysis followed by refinement of the differentially expressed genes (DEGs) into a gene signature through further modeling. In fact, I frequently find that these algorithms perform worse. The 'craze' surrounding these 'buzz' words has come, but it will fade away once everyone realises that they won't bring about the next revolution in healthcare (hard work and tackling bureaucracy will).

If one could conduct the perfect study, one would do the following:

  1. Differential expression analysis to identify DEGs using a program (e.g., edgeR, DESeq2, limma/voom for RNA-seq) that sufficiently normalises your data and accounts for the anticipated sources of bias.
  2. Further refinement / validation of the DEGs, likely using downstream methods (e.g. PCA, clustering, et cetera) applied to the transformed normalised counts from the same cohort, and / or in an independent dataset (e.g., from TCGA, CCLE for cancer, or some SRA/GEO/ArrayExpress study for other diseases). Here, people may also sometimes just manually pick the top 50 or 100 differentially expressed genes. Others, I have observed, pick them based on their own experience of the field in which they conduct research. There are many ways to proceed from this point.
  3. Your top genes should then be validated in a separate cohort with a larger sample size (n) and using a separate technology, such as high throughput PCR (Fluidigm), NanoString, or even a customised microarray. The number of top DEGs that you choose from Step 2 will be dictated by the chosen technology here.
  4. Further refine the top genes through regression modeling, using either stepwise regression (forward, backward, or both) or by testing each gene independently and then keeping all those that are statistically significant independent predictors of your outcome. Clinical parameters will likely be included in these models at this stage, too, possibly as covariates (e.g. BMI status, smoking status, clinical grade, lab markers of inflammation, histology scores, et cetera).
  5. Test your final regression model (or models) through various metrics and processes, such as R² shrinkage, cross-validation, Cook's distance (for outliers), ROC analysis, and the derivation of precision, accuracy, sensitivity, and specificity. Again, there may be clinical parameters mixed with genes in these final models.
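
The checks named in Step 5 can be sketched in base R as follows. This is only a minimal illustration on simulated data, not the workflow of any particular study: the gene names (GeneA, GeneB), the Age covariate, the 0.5 probability cut-off, the 4/n rule of thumb for Cook's distance, and the helper auc_rank() are all assumptions introduced here.

```r
# Simulated cohort: two hypothetical genes plus one clinical covariate.
set.seed(1)
n <- 200
dat <- data.frame(
  GeneA = rnorm(n),
  GeneB = rnorm(n),
  Age   = rnorm(n, mean = 50, sd = 10)
)
# Toy assumption: disease status depends on GeneA and GeneB only.
dat$Disease <- rbinom(n, size = 1, prob = plogis(1.5 * dat$GeneA - 1.0 * dat$GeneB))

# Logistic regression with genes and a clinical parameter in one model.
fit  <- glm(Disease ~ GeneA + GeneB + Age, data = dat, family = binomial)
prob <- predict(fit, type = "response")
pred <- as.integer(prob > 0.5)  # assumed cut-off

# Confusion-matrix counts and the metrics derived from them.
tp <- sum(pred == 1 & dat$Disease == 1)
tn <- sum(pred == 0 & dat$Disease == 0)
fp <- sum(pred == 1 & dat$Disease == 0)
fn <- sum(pred == 0 & dat$Disease == 1)
metrics <- c(
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  precision   = tp / (tp + fp),
  accuracy    = (tp + tn) / n
)

# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity.
auc_rank <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc <- auc_rank(prob, dat$Disease)

# Cook's distance: samples above a 4/n rule of thumb deserve a closer look.
influential <- which(cooks.distance(fit) > 4 / n)

# 5-fold cross-validation: refit on four folds, score the held-out fold.
folds   <- sample(rep(1:5, length.out = n))
cv_prob <- numeric(n)
for (k in 1:5) {
  fit_k <- glm(Disease ~ GeneA + GeneB + Age,
               data = dat[folds != k, ], family = binomial)
  cv_prob[folds == k] <- predict(fit_k, newdata = dat[folds == k, ],
                                 type = "response")
}
cv_auc <- auc_rank(cv_prob, dat$Disease)

print(round(metrics, 3))
print(round(c(apparent_auc = auc, cv_auc = cv_auc), 3))
```

The apparent (in-sample) AUC is usually optimistic, so the cross-validated value, or better still performance in an independent cohort, is the one to report.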

If you can do all of that and produce a robust gene signature, then you're talking about something equivalent to, for example, OncoType DX® and MammaPrint® in breast cancer.

Note that the 'final model' to which I refer here may look something like (example for lm() and glm()):

lm(MutationLoad ~ TP53 + TumourGrade + CCNB1 + ATM + POLE)

glm(ArthritisStatus ~ ESR + CD20 + Age + SmokingStatus, family = binomial)

MutationLoad is continuous; ArthritisStatus is binary / categorical
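
To make the two model forms concrete, here is a minimal sketch that fits them to simulated data. All values are random, so the coefficients carry no biological meaning. The data frame d and the effect sizes are assumptions made for illustration, and family = binomial is chosen to match a binary outcome (glm() otherwise defaults to a gaussian family).

```r
# Simulated predictors named after the variables in the two formulas.
set.seed(42)
n <- 100
d <- data.frame(
  TP53 = rnorm(n), CCNB1 = rnorm(n), ATM = rnorm(n), POLE = rnorm(n),
  TumourGrade = factor(sample(1:3, n, replace = TRUE)),
  ESR = rnorm(n, mean = 20, sd = 5), CD20 = rnorm(n),
  Age = rnorm(n, mean = 55, sd = 12),
  SmokingStatus = factor(sample(c("never", "ever"), n, replace = TRUE))
)
# Assumed toy relationships between predictors and outcomes.
d$MutationLoad    <- 10 + 2 * d$TP53 + rnorm(n)                        # continuous
d$ArthritisStatus <- rbinom(n, 1, plogis(0.1 * (d$ESR - 20) + d$CD20)) # binary

# Continuous outcome: ordinary linear regression.
fit_lm <- lm(MutationLoad ~ TP53 + TumourGrade + CCNB1 + ATM + POLE, data = d)

# Binary outcome: logistic regression.
fit_glm <- glm(ArthritisStatus ~ ESR + CD20 + Age + SmokingStatus,
               data = d, family = binomial)

print(coef(summary(fit_lm)))   # estimates, standard errors, p-values
print(exp(coef(fit_glm)))      # odds ratios per unit change in each predictor
```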


My recommendations should not dissuade you from trying out RandomForest® nevertheless. However, I am highly confident that it will not perform better than the method I have presented above.

Let me know if I can assist further


PS - if you elect for stepwise regression in Step 4, which is somewhat automated and based on AIC or BIC, then you may face criticism from a statistician. Still, this is much better than just chucking all of your data into a 'machine learning' algorithm and letting that do everything for you.
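
For reference, stepwise selection by AIC or BIC can be sketched with step() from base R. The six gene names and the simulated outcome below are hypothetical, and only Gene1 and Gene2 truly affect disease status in this toy setup. Treat it as an illustration of the mechanics under those assumptions, not as a recommended default.

```r
# Simulated expression for six hypothetical candidate genes.
set.seed(7)
n <- 150
genes <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(genes) <- paste0("Gene", 1:6)
# Toy assumption: only Gene1 and Gene2 drive the outcome.
genes$Disease <- rbinom(n, 1, plogis(1.5 * genes$Gene1 - 1.2 * genes$Gene2))

# Start from the intercept-only model; allow any of the six genes to enter.
null_fit <- glm(Disease ~ 1, data = genes, family = binomial)
upper    <- reformulate(paste0("Gene", 1:6))  # ~ Gene1 + ... + Gene6

# direction = "both" adds and drops terms; the default penalty k = 2 is AIC.
step_aic <- step(null_fit, scope = list(lower = ~ 1, upper = upper),
                 direction = "both", trace = 0)

# Setting k = log(n) turns the criterion into BIC, which penalises size more.
step_bic <- step(null_fit, scope = list(lower = ~ 1, upper = upper),
                 direction = "both", k = log(n), trace = 0)

print(formula(step_aic))
print(formula(step_bic))
```

Because the selection is data-driven, the p-values of the final model are optimistic, which is part of why a statistician may object. Validating the chosen genes in a separate cohort addresses this.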
