**NB - this answer has been updated January 28th, 2020**

**Update: It is important to point out that the assumptions of RandomForest® differ from those of, e.g., a regression model. So, RandomForest® and other classification algorithms certainly must be considered. Treat my pointers here as just that, i.e., general guidance.**

While things like RandomForest®, lasso penalised regression (and both elastic-net and ridge regression), deep learning, machine learning, AI, neural networks, etc. may each sound great, from what I have seen, they do not perform better than a well-performed / curated differential expression analysis followed by gene signature refinement of differentially expressed genes (DEGs) through further modeling. In fact, I frequently find that these algorithms perform *worse*. The 'craze' surrounding these 'buzz' words has come, but it will fade away once everyone realises that they won't bring about the next revolution in healthcare (hard work and tackling bureaucracy *will*).

If one could conduct the perfect study, one would do the following:

- Differential expression analysis to identify DEGs using a program (e.g., edgeR, DESeq2, limma/voom for RNA-seq) that sufficiently normalises your data and deals with the anticipated sources of bias
- Further refinement / validation of the DEGs, likely using downstream methods (e.g., PCA, clustering, *et cetera*) applied to the transformed normalised counts from the same cohort, and / or in an independent dataset (e.g., from TCGA, CCLE for cancer, or some SRA/GEO/ArrayExpress study for other diseases). Here, people may also sometimes just manually pick the top 50 or 100 differentially expressed genes. Others, I have observed, pick them based on their own experienced knowledge of the field in which they conduct research. There are many ways to proceed in an analysis from this point.
- Your top genes should then be validated in a separate cohort of larger sample size (*n*) and using a separate technology, such as high-throughput PCR (Fluidigm), NanoString, or even a customised microarray. The number of top DEGs that you carry forward from **Step 2** (the refinement step above) will be dictated by the technology chosen here.
- Further refine the top genes through regression modeling, using either stepwise regression (forward, backward, or both) or by testing each gene independently and then keeping all those that are statistically significant independent predictors of your outcome. Clinical parameters will likely be included in these models at this stage, too, possibly as covariates (e.g., BMI status, smoking status, clinical grade, lab markers of inflammation, histology scores, *et cetera*).
- Test your final regression model (or models) through various metrics and processes, such as R² shrinkage, cross-validation, Cook's distance (for influential outliers), ROC analysis, and the derivation of precision, accuracy, sensitivity, and specificity. Again, there may be clinical parameters mixed with genes in these final models.
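
As a rough illustration of the last two steps, here is a minimal sketch in base R. It runs on simulated data, so every gene name, effect size, and cutoff below is hypothetical; it shows the general shape of the workflow, not a definitive implementation.

```r
# Simulated data: all names, effect sizes, and cutoffs are hypothetical
set.seed(1)
n <- 200
genes <- matrix(rnorm(n * 6), ncol = 6,
                dimnames = list(NULL, paste0("gene", 1:6)))
dat <- data.frame(genes, Age = rnorm(n, mean = 60, sd = 10))
# In this simulation only gene1 and gene2 truly influence the outcome
dat$Status <- rbinom(n, 1, plogis(1.5 * dat$gene1 - 1.2 * dat$gene2))

# Option A: test each gene independently and keep those with p < 0.05
pvals <- sapply(colnames(genes), function(g) {
  fit <- glm(reformulate(g, response = "Status"), data = dat, family = binomial)
  coef(summary(fit))[g, "Pr(>|z|)"]
})
keep <- names(pvals)[pvals < 0.05]

# Option B: stepwise selection (both directions), starting from the full model
full  <- glm(Status ~ ., data = dat, family = binomial)
final <- step(full, direction = "both", trace = 0)

# Cook's distance to flag influential samples
# (the 4/n threshold is a common rule of thumb, not a formal test)
cooks <- cooks.distance(final)
influential <- which(cooks > 4 / n)

# In-sample classification metrics at a 0.5 probability cutoff
pred <- as.integer(fitted(final) > 0.5)
tp <- sum(pred == 1 & dat$Status == 1)
tn <- sum(pred == 0 & dat$Status == 0)
fp <- sum(pred == 1 & dat$Status == 0)
fn <- sum(pred == 0 & dat$Status == 1)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)
accuracy    <- (tp + tn) / n

# Simple 5-fold cross-validation of the chosen formula.
# Strictly, the gene selection itself should be repeated inside each fold.
folds <- sample(rep(1:5, length.out = n))
cv_accuracy <- sapply(1:5, function(k) {
  fit <- glm(formula(final), data = dat[folds != k, ], family = binomial)
  p <- predict(fit, newdata = dat[folds == k, ], type = "response")
  mean((p > 0.5) == dat$Status[folds == k])
})
mean(cv_accuracy)
```

The in-sample metrics will be optimistic, which is why the cross-validated accuracy (and, ideally, an independent cohort) is the figure to trust. For ROC curves, an add-on package would be the usual route; that part is left out here to keep the sketch dependency-free.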

If you can do all of that and produce a robust gene signature, then you're talking about stuff that is the equivalent to, for example, OncoType DX® and MammaPrint® in breast cancer.

Note that the '*final model*' to which I refer here may look something like (example for `lm()` and `glm()`):

```
lm(MutationLoad ~ TP53 + TumourGrade + CCNB1 + ATM + POLE)
glm(ArthritisStatus ~ ESR + CD20 + Age + SmokingStatus, family = binomial)
```

`MutationLoad` is continuous; `ArthritisStatus` is binary / categorical.
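
To make the distinction concrete, here is a small sketch that fits both kinds of model on simulated data. The data frames `df` and `ra`, and all of the values in them, are made up purely for illustration; only the formulae come from the example above.

```r
set.seed(42)
n <- 150

# Hypothetical tumour data: expression values and grade are simulated
df <- data.frame(
  TP53 = rnorm(n), CCNB1 = rnorm(n), ATM = rnorm(n), POLE = rnorm(n),
  TumourGrade = factor(sample(1:3, n, replace = TRUE))
)
df$MutationLoad <- 50 + 8 * df$POLE + 5 * df$TP53 + rnorm(n, sd = 5)

# Continuous outcome: ordinary linear regression
fit_lm <- lm(MutationLoad ~ TP53 + TumourGrade + CCNB1 + ATM + POLE, data = df)
summary(fit_lm)$coefficients

# Hypothetical arthritis data: lab values and clinical covariates are simulated
ra <- data.frame(
  ESR = rnorm(n, mean = 20, sd = 8), CD20 = rnorm(n),
  Age = rnorm(n, mean = 55, sd = 12),
  SmokingStatus = factor(sample(c("never", "current"), n, replace = TRUE))
)
ra$ArthritisStatus <- rbinom(n, 1, plogis(0.1 * (ra$ESR - 20) + 0.8 * ra$CD20))

# Binary outcome: logistic regression, which needs family = binomial
# (glm() otherwise defaults to a gaussian family)
fit_glm <- glm(ArthritisStatus ~ ESR + CD20 + Age + SmokingStatus,
               data = ra, family = binomial)
exp(coef(fit_glm))  # coefficients expressed as odds ratios
```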

---

My recommendations should not dissuade you from nevertheless trying out the RandomForest®. However, I am highly confident that it will not perform better than the method I have presented above.

Let me know if I can assist further

Kevin

PS - if you elect for stepwise regression when refining your genes, which is somewhat automated and based on AIC or BIC, then you may face criticism from a statistician. Still, this is much better than just chucking all of your data into a 'machine learning' algorithm and letting that do everything for you.
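
For what it is worth, here is a minimal sketch of how the AIC / BIC choice plays out with base R's `step()`, again on simulated data (the predictors `X1` to `X8` and the outcome `y` are hypothetical). By default `step()` penalises each parameter with `k = 2`, which corresponds to AIC; setting `k = log(n)` gives the BIC penalty.

```r
set.seed(7)
n <- 120
d <- data.frame(matrix(rnorm(n * 8), ncol = 8))  # columns are named X1..X8
# In this simulation only X1 and X2 carry real signal
d$y <- 2 * d$X1 - 1.5 * d$X2 + rnorm(n)

full <- lm(y ~ ., data = d)

# AIC-based stepwise selection (default penalty k = 2)
sel_aic <- step(full, direction = "both", trace = 0)

# BIC-based stepwise selection (penalty k = log(n))
sel_bic <- step(full, direction = "both", trace = 0, k = log(n))

formula(sel_aic)
formula(sel_bic)
```

BIC penalises extra terms more heavily, so it tends to return the leaner model of the two, although a stepwise search does not guarantee that.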