Hi, I've had a 20-year career in AI/DL, most recently applied in finance, where the results elevated me to Head of Research. Having left finance almost two years ago, I'd been looking to apply my skillset to the non-finance world, and recently came across the 33-cancer TCGA gene-expression dataset. I'd also recently developed a novel approach to sparsity on high-dimensional problems that does away entirely with the L1 and L2 weight penalties of LASSO and Elastic Net, so this dataset seemed a worthy testbed.
Although the initial sparse optimiser I built was for linear regression, it still gave excellent results on the UCI PANCAN 5-cancer subset, simply training one-vs-rest (OvR) with MSE on +1/-1 targets. After working out the corresponding solutions for the loss functions of Logistic Regression, SVM, and SVM with squared hinge loss, I set about applying my three classification-based sparse optimisers to the full 33-cancer dataset.
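To make that framing concrete, here's a minimal Python sketch of the OvR setup using plain ridge regression on +1/-1 targets as a stand-in. To be clear, this is not my sparse optimiser (the ridge penalty is only a placeholder); it just shows the scaffolding the optimiser drops into.

import numpy as np
from sklearn.linear_model import Ridge

def fit_ovr_mse(X, y, classes, alpha=1.0):
    """One linear model per class, trained with squared error on +/-1 targets."""
    models = {}
    for c in classes:
        t = np.where(y == c, 1.0, -1.0)           # +1 for this cancer, -1 for the rest
        models[c] = Ridge(alpha=alpha).fit(X, t)  # plain MSE fit; the L2 penalty is just a placeholder
    return models

def predict_ovr(models, X):
    """Assign each sample to the class whose model gives the highest score."""
    labels = list(models)
    scores = np.column_stack([models[c].predict(X) for c in labels])
    return np.asarray(labels)[scores.argmax(axis=1)]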
The specific dataset I chose consists of the batch-adjusted files at https://gdc.cancer.gov/about-data/publications/pancanatlas .
Unfortunately, not all the TCGA samples are labelled in the info text file there, but I managed to fill in the gaps with another TCGA sample list from elsewhere, bringing my sample count to 10283 after excluding all the off-site normal tissues and duplicates. I also removed all 4196 genes that had one or more NAs (values removed for some samples by the batch-correction process), leaving 20531 - 4196 = 16335 features in total.
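For anyone wanting to reproduce the filtering, this is roughly what I mean (a sketch only; the file names below are placeholders for whatever you download from the PanCanAtlas page and for your own sample-label list):

import numpy as np
import pandas as pd

# genes x samples batch-adjusted expression matrix (placeholder file name)
expr = pd.read_csv("pancan_batch_adjusted_geneExp.tsv", sep="\t", index_col=0)
# sample barcode -> cancer-type label (placeholder file name)
labels = pd.read_csv("sample_labels.tsv", sep="\t", index_col=0)["cancer_type"]

expr = expr.dropna(axis=0)                      # drop the ~4196 genes with any NA from batch correction
keep = expr.columns.intersection(labels.index)  # keep labelled samples (label list assumed to exclude normals/duplicates)
X = np.log2(1.0 + expr[keep].T.values)          # samples x genes, log2(1+x) transform
y = labels.loc[keep].values
print(X.shape)                                  # roughly (10283, 16335) with my sample list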
From what I can see, the current state of the art on the PANCAN dataset is the DL method MI_DenseNetCAM (although I couldn't quite align my sample set perfectly with their 10267-sample set):
https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2021.670232/full
It achieves 96.8% multi-class classification accuracy on the full 33-cancer dataset under 10-fold cross-validation, using a shared set of 3600 genes per cancer type/class and a total of 13.9M parameters.
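For concreteness, this is the standard way such a 10-fold accuracy figure is computed (a sketch; the stratified folds are an assumption on my part, and you plug in whichever classifier is being evaluated):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def cv_accuracy(make_model, X, y, n_splits=10, seed=0):
    """Mean held-out accuracy over n_splits stratified folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accs = []
    for train_idx, test_idx in skf.split(X, y):
        model = make_model().fit(X[train_idx], y[train_idx])
        accs.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(accs))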
I have seen other papers report higher accuracy on the PANCAN dataset, but they were all on much smaller subsets of cancers, and they often did not appear to be as robust in their procedures either.
Using 10-fold cross-validation, my sparse optimisation method* achieves 97.2% accuracy on the full 33-cancer dataset, and it does so with an average of only ~400 genes per cancer and ~13K parameters in total (one weight per selected gene plus a bias per class), i.e. roughly 1/1000th of the parameters of MI_DenseNetCAM and 1/9th of its per-cancer gene count.
Even more remarkable* is that 96.4% accuracy can still be achieved with only ~800 parameters in total (not per cancer), an average of just 24 genes per cancer type. This level of accuracy seems unprecedented for such sparse models.
My method also achieves 69% accuracy on READ, a cancer on which most other models score 0% because it is so difficult to differentiate from COAD.
One other interesting observation is that, in the sparsest model, there is very little overlap between the genes selected for each cancer type; each cancer appears to have its own smallish set of signature genes.
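If it helps the discussion, that overlap claim is easy to quantify, e.g. as the mean pairwise Jaccard similarity between the per-cancer gene sets (gene_sets here is a hypothetical dict mapping each cancer code to the set of genes its sparse model selected):

from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two gene sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def mean_pairwise_jaccard(gene_sets):
    """Average overlap across all pairs of cancer types."""
    pairs = list(combinations(gene_sets.values(), 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)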
As someone with no background in bioinformatics or genomics, it appears to me that my algorithm's ability to home in on these cancer signatures should be very useful in the development of targeted treatments and efficient diagnostic tools. I've come to this forum to ask for advice, guidance and thoughts on what my next steps should be, what the likely applications of my method are, and what challenges I may still need to overcome. Whether it's to write a paper, open-source it, licence it, or raise investment and start a company, I'm open to all well-argued opinions!
I'm happy to provide more details on the sample sets and other experimental setup details, along with full in-sample/out-of-sample stats. I'm also very happy to share any of the sparse model files for discussion on any of the individual 33-cancers.
An example of a very sparse set of genes that classified the cancer LAML with 100% accuracy OOS is below:
"NFE2|4778" : {"weight": 0.152544975, "mean": 3.99823, "stddev": 2.22415},
"ATP8B4|79895" : {"weight": 0.119082607, "mean": 5.8553, "stddev": 1.62709},
"RPL7|6129" : {"weight": 0.0841606408, "mean": 10.4455, "stddev": 1.09871},
"MTA3|57504" : {"weight": -0.0870735943, "mean": 9.6818, "stddev": 0.734529},
"LGMN|5641" : {"weight": -0.13933, "mean": 11.2215, "stddev": 1.1614},
"BCAR1|9564" : {"weight": -0.165008873, "mean": 10.5392, "stddev": 1.3575},
"bias" : {"weight": -1.44177818}
The same six genes above were selected by the best optimiser in each of the 10-fold cross-validation runs. The specific weights shown were obtained by training on the full dataset. The 'mean' and 'stddev' values are those of the training-set features after the log2(1+x) transformation and are used to standardise the data before optimisation. Note that the bottom three gene expressions have negative weights.
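To show how I read such a model file at prediction time, here's a minimal sketch (the function and variable names are mine, and treating a positive score as LAML reflects the one-vs-rest convention described above):

import numpy as np

def laml_score(model, sample_expr):
    """model: dict like the snippet above; sample_expr: gene id -> raw expression value."""
    score = model["bias"]["weight"]
    for gene, p in model.items():
        if gene == "bias":
            continue
        z = (np.log2(1.0 + sample_expr[gene]) - p["mean"]) / p["stddev"]  # standardise the log2(1+x) value
        score += p["weight"] * z
    return score  # > 0 => classified as LAML in the one-vs-rest setup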
I look forward to your comments and thoughts!
Mark
*Achieved using my modified SVM approach with squared hinge loss, although my modified Logistic Regression method is only marginally less good. My modified SVM with the standard hinge loss is a little worse than the others.
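For reference, these are the standard forms of the three per-sample losses mentioned in that footnote (y is the +/-1 target, f is the linear score; a sketch for the discussion, not my optimiser itself):

import numpy as np

def logistic_loss(y, f):
    return np.logaddexp(0.0, -y * f)          # log(1 + exp(-y*f)), numerically stable

def hinge_loss(y, f):
    return np.maximum(0.0, 1.0 - y * f)       # standard SVM hinge loss

def squared_hinge_loss(y, f):
    return np.maximum(0.0, 1.0 - y * f) ** 2  # squared hinge loss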
I remember this pan-cancer classification task being done to death a while back - here with miRNAs only. It would be good to know how this compares: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5389567/