NB - this answer has been updated January 28th, 2020
Update: It is important to point out that the assumptions of random forests differ from those of, e.g., a regression model, so random forests and other classification algorithms certainly merit consideration. Just treat my general pointers here as exactly that, i.e., general - they are a guide.
While things like random forests, lasso-penalised regression (and both elastic-net and ridge regression), deep learning, machine learning, AI, neural networks, etc. may each sound great, from what I have seen they do not perform better than a well-performed / curated differential expression analysis followed by refinement of the differentially expressed genes (DEGs) into a gene signature through further modeling. In fact, I frequently find that these algorithms perform worse. The 'craze' surrounding these 'buzz' words has come, but it will fade away once everyone realises that they won't bring about the next revolution in healthcare (hard work and tackling bureaucracy will).
If one could conduct the perfect study, one would do the following:
- Differential expression analysis to identify DEGs using a program (e.g., edgeR, DESeq2, or limma/voom for RNA-seq) that sufficiently normalises your data and deals with the anticipated sources of bias
- Further refinement / validation of the DEGs, likely using downstream methods (e.g. PCA, clustering, et cetera) applied to the transformed normalised counts from the same cohort, and / or in an independent dataset (e.g., from TCGA, CCLE for cancer, or some SRA/GEO/ArrayExpress study for other diseases). Here, people may also sometimes just manually pick the top 50 or 100 differentially expressed genes. Others, I have observed, pick them based on their own expert knowledge of the field in which they conduct research. There are many ways to go from this point in an analysis.
- Your top genes should then be validated in a separate cohort of higher sample n, using a separate technology, such as high-throughput PCR (Fluidigm), NanoString, or even a customised microarray. The number of top DEGs that you choose from step #2 will be dictated by the chosen technology here.
- Further refine the top genes through regression modeling, either via stepwise regression (forward, backward, and/or both) or by testing each gene independently and then keeping all those that are statistically significant independent predictors of your outcome. Clinical parameters may well be included in these models at this stage, too, possibly as covariates (e.g. BMI status, smoking status, clinical grade, lab markers of inflammation, histology scores, et cetera).
- Test your final regression model (or models) through various metrics and processes, such as R2 shrinkage, cross-validation, Cook's distance (for outliers), ROC analysis, and the derivation of precision, accuracy, sensitivity, and specificity. Again, there may be clinical parameters mixed with genes in these final models.
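As a toy sketch of steps #1-2 (illustrative only: the gene names and data are simulated, and a real analysis would use edgeR / DESeq2 / limma on raw counts rather than the plain t-tests used here):

```r
# Minimal stand-in for the DE step: per-gene t-tests on simulated log-scale
# expression, BH correction, top-gene selection, then a PCA sanity check.
set.seed(1)
n_genes <- 200; n_per_group <- 10
group <- factor(rep(c("control", "disease"), each = n_per_group))
expr <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
expr[1:20, group == "disease"] <- expr[1:20, group == "disease"] + 2  # spike in 20 true DEGs

# Step 1: test each gene, then correct for multiple testing
pvals <- apply(expr, 1, function(x) t.test(x ~ group)$p.value)
padj  <- p.adjust(pvals, method = "BH")
degs  <- names(sort(padj))[1:50]   # e.g. take the top 50 forward

# Step 2: sanity-check the candidate signature with PCA on the selected genes;
# PC1 should now largely separate control from disease samples
pca <- prcomp(t(expr[degs, ]), scale. = TRUE)
```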
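And a minimal sketch of the modeling in steps #4-5, again with simulated data and made-up gene names (g1-g6); the 'test each gene independently' route is the one shown:

```r
# Screen candidate genes one at a time with logistic regression, keep the
# significant ones, fit a combined model, then summarise crude performance.
set.seed(2)
n <- 120
outcome <- rbinom(n, 1, 0.5)
# 6 candidate genes; only the first 2 actually track the outcome
genes <- data.frame(
  g1 = rnorm(n, mean = outcome * 1.5),
  g2 = rnorm(n, mean = outcome * 1.2),
  g3 = rnorm(n), g4 = rnorm(n), g5 = rnorm(n), g6 = rnorm(n))

# Step 4: test each gene independently; keep those significant at p < 0.05
pvals <- sapply(genes, function(g)
  summary(glm(outcome ~ g, family = binomial))$coefficients[2, 4])
keep <- names(pvals)[pvals < 0.05]

# Final model on the retained genes (clinical covariates could be added here)
final <- glm(reformulate(keep, response = "outcome"),
             data = cbind(outcome = outcome, genes), family = binomial)

# Step 5 (crude, in-sample): sensitivity / specificity at a 0.5 cut-off;
# use cross-validation and ROC analysis in practice
pred <- as.integer(fitted(final) > 0.5)
sens <- sum(pred == 1 & outcome == 1) / sum(outcome == 1)
spec <- sum(pred == 0 & outcome == 0) / sum(outcome == 0)
```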
If you can do all of that and produce a robust gene signature, then you're talking about stuff that is the equivalent to, for example, OncoType DX® and MammaPrint® in breast cancer.
Note that the 'final model' to which I refer here may look something like the following (examples for lm() and glm()):

lm(MutationLoad ~ TP53 + TumourGrade + CCNB1 + ATM + POLE)
glm(ArthritisStatus ~ ESR + CD20 + Age + SmokingStatus, family = binomial)

MutationLoad is continuous; ArthritisStatus is binary / categorical (hence family = binomial in the glm() call)
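To make those two calls concrete, here is a self-contained toy version; the variable names mimic the examples above but all of the data are simulated, so the fitted coefficients mean nothing biologically:

```r
# Simulated 'cohort' for the linear model: MutationLoad driven by TP53 and
# TumourGrade; CCNB1 / ATM / POLE are noise covariates.
set.seed(3)
n <- 100
cancer <- data.frame(TP53 = rnorm(n), CCNB1 = rnorm(n), ATM = rnorm(n),
                     POLE = rnorm(n), TumourGrade = sample(1:3, n, replace = TRUE))
cancer$MutationLoad <- 50 + 10 * cancer$TP53 + 5 * cancer$TumourGrade + rnorm(n, sd = 5)
fit_lm <- lm(MutationLoad ~ TP53 + TumourGrade + CCNB1 + ATM + POLE, data = cancer)

# Simulated 'cohort' for the logistic model: binary ArthritisStatus with
# ESR as the real driver; note family = binomial for the binary outcome.
arthritis <- data.frame(ESR = rnorm(n), CD20 = rnorm(n),
                        Age = rnorm(n, 50, 10),
                        SmokingStatus = factor(rbinom(n, 1, 0.3)))
arthritis$ArthritisStatus <- rbinom(n, 1, plogis(-0.5 + arthritis$ESR))
fit_glm <- glm(ArthritisStatus ~ ESR + CD20 + Age + SmokingStatus,
               data = arthritis, family = binomial)
```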
------------------------------
My recommendations should not dissuade you from nevertheless trying out a random forest. However, I am highly confident that it will not perform better than the method I have presented above.
Let me know if I can assist further
Kevin
PS - if you elect for stepwise regression in step #4, which is somewhat automated and based on AIC or BIC, then you may face criticism from a statistician. Still, this is much better than just chucking all of your data into a 'machine learning' algorithm and letting that do everything for you.
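For what it's worth, the stepwise selection mentioned above can be sketched with base R's step(); the default penalty is AIC, and setting k = log(n) switches it to a BIC-style penalty (the variables g1-g4 are made up, and the data are simulated):

```r
# Toy stepwise selection: y truly depends on g1 and g2 only, so both
# penalties should retain those two predictors.
set.seed(4)
n <- 150
d <- data.frame(g1 = rnorm(n), g2 = rnorm(n), g3 = rnorm(n), g4 = rnorm(n))
d$y <- 2 * d$g1 - 1.5 * d$g2 + rnorm(n)

full    <- lm(y ~ g1 + g2 + g3 + g4, data = d)
fit_aic <- step(full, direction = "both", trace = 0)             # AIC penalty
fit_bic <- step(full, direction = "both", trace = 0, k = log(n)) # BIC-style penalty
```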
Thanks for the detailed explanation. I have been giving some thought to using DEGs (as opposed to other features, such as somatic mutations or methylation) to build a classification / regression model. I initially thought that gene expression is a dynamic process and that the variation in gene expression is rather large between and within phenotypes (perhaps even between different time points for the same individual). This would make DEGs an inferior feature type compared with more definitive features, for example differentially methylated regions (DMRs).
However, on second thought, I guess this variation in gene expression is offset by the statistical analysis performed during differential expression analysis. That is exactly the whole point of differential expression analysis, right? Can you share some comments on this? Thanks
Hi Kevin, thanks for the very informative answer to this question!
Quick question on step #4: You mention using stepwise regression or testing each gene independently. I'm curious what you see as the advantage of using each of these methods.
Kevin, thanks for your valuable answer to the query. I am a big fan of your posts on Biostars. Applying genomic classifiers is drawing more and more researchers' attention. While the original answer was clear to me, I have doubts about the main message of your updated comment. Would you please elaborate more on the topic of the steps that need to be undertaken in order to build genomic classifiers?
Hey Hamid. Thank you. The main theme of my original answer is that one does not need 'fancy' classification algorithms, like deep learning, RandomForest, etc., in order to build a predictive model.
In my 'update' comment, then, I am just saying that RandomForest and other classification algorithms can also be used, if one wishes to use these.
As an example, one could, technically, use RandomForest as a feature selection approach, and then further refine these features via logistic regression modeling.
Dear Prof. Kevin Blighe, many thanks for your kind comment. Regarding the 4th point, I don't know how to deal with this step. As a common approach, we filter the DEGs down to the top 100 or 500, which may be acceptable, and then filter those top genes into a gene panel via LASSO. But I don't understand when we should combine the clinical parameters. As clinical parameters are not high-dimensional data, should we just compare them via t-test and add the significant parameters to the gene expression profiles for the LASSO together?
Hi Di Wu, there is no real order in which these steps must be performed. They just serve to foment ideas.
The point at which to introduce clinical parameters is at the very end when you have just a few genes that are key, e.g., 5 genes.
Like in the 2 examples provided above.
Dear Prof. Kevin Blighe, thanks again for your kind reply. In addition, we usually use machine learning after splitting one dataset, instead of having two separate datasets. If so, regarding the 1st point, should we split the dataset (into training and testing sets) and then identify DEGs using a program in the training set only? Or could we first do the DE analysis and then split the dataset, which would make sure that the DE genes are significantly different in both the training and testing sets? Could you please give me a suggestion about this? Thanks again.