There is a lot of literature on distinguishing driver mutations from passengers, and I am trying to build my own deep learning model to do the same. I am facing some potential issues. First, I downloaded COSMIC mutation data and used the FATHMM labels to designate drivers (positive examples) in my dataset. I am sceptical about using passengers from COSMIC, as they may be false negatives, so I turned to the 1000 Genomes Project to download SNVs for my negative examples. I am unsure whether this is correct; however, I have seen some papers do the same. Do I need to apply any filters to the 1000 Genomes SNV data to construct the final dataset? One such paper describes using SNVs with a global minor allele frequency ≤ 1%.
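For what it's worth, the MAF ≤ 1% filter from that paper could be applied with a few lines of Python. This is only a sketch: it assumes an uncompressed VCF whose INFO column carries an `AF=` entry (as in the 1000 Genomes phase 3 release), and it only looks at the first ALT allele.

```python
# Sketch: keep 1000 Genomes SNVs whose global minor allele frequency <= 1%.
# Assumption: the VCF INFO column contains an "AF=" entry (1000 Genomes phase 3 style).

def minor_allele_freq(info):
    """Extract AF from a VCF INFO string and fold it to the minor allele."""
    for field in info.split(";"):
        if field.startswith("AF="):
            af = float(field[3:].split(",")[0])  # first ALT allele only
            return min(af, 1.0 - af)
    return None  # no AF annotation present

def rare_snvs(vcf_lines, max_maf=0.01):
    """Yield VCF data lines whose global MAF is <= max_maf."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        cols = line.rstrip("\n").split("\t")
        maf = minor_allele_freq(cols[7])  # INFO is the 8th VCF column
        if maf is not None and maf <= max_maf:
            yield line
```

In practice the same filter can be done with `bcftools view -i 'INFO/AF<=0.01 || INFO/AF>=0.99'` before the data ever reaches Python.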
Many thanks for your detailed explanation. I am indeed using FATHMM-MKL and not FATHMM; sorry for the confusion. I will definitely go through the papers you mentioned and try to get their data. But I was wondering whether to stick with COSMIC to construct the labels for my training set. It is a sizeable dataset with enough examples of both classes.
In the COSMIC link that you shared above, the cutoffs given are for FATHMM-MKL. Can I use these cutoffs to build learning models? I was thinking of using stricter thresholds, such as ≥ 0.9 for drivers and ≤ 0.1 for passengers, to weed out false positives. Please let me know what you think.
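A minimal sketch of that labelling scheme, with the scores between the two cutoffs dropped as ambiguous (the cutoff values here are the ones from my question, not COSMIC's official ones):

```python
# Sketch: turn FATHMM-MKL scores into driver/passenger labels, with a
# "grey zone" between the two cutoffs excluded to reduce label noise.

def label_mutation(score, driver_cut=0.9, passenger_cut=0.1):
    """Return 1 (driver), 0 (passenger), or None (ambiguous, excluded)."""
    if score >= driver_cut:
        return 1
    if score <= passenger_cut:
        return 0
    return None  # ambiguous scores are left out of the training set

def build_labels(scores, **cuts):
    """Pair each score with its label and drop the ambiguous ones."""
    labeled = [(s, label_mutation(s, **cuts)) for s in scores]
    return [(s, y) for s, y in labeled if y is not None]
```

Keeping the cutoffs as parameters also makes it cheap to re-run the whole labelling step while experimenting with different thresholds.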
There is no right or wrong here, and doing something like what you are aiming to do will always involve a lot of trial and error; the result may be dataset-specific and may not generalize to external datasets. It can be a frustrating area in which to work. You could divide the training data into training and validation sets, of course, but this still may not carry over to new data.
I am actually working on one right now (not exactly what you are doing, though), and am today going all the way back to the start, because the thresholds being used will have to be modified. For this reason, ensuring that your code is flexible is obviously beneficial, so that you can re-run it quickly.
COSMIC is a bit of a 'catch-all' database... a lot of those variants / mutations may not, in fact, be somatic mutations, and may even be false-positive variant calls. I think the Sanger Institute mentions these pitfalls somewhere on their site.
So basically, play with the thresholds and pick the one that gives the most generalizable model. I am, in fact, worried about how all of this will play with the reviewers, but, as you said, it is safer to build models on already published datasets. Thanks again for the links.
Sure, Collin's answer is good, too. At least we now have two people highlighting the perils of the 1000 Genomes approach. I am aware that there are publications that have used that data, though.
In CHASMplus, I used somatic mutation calls from the TCGA MC3 effort (PMID: 29596782), which provided a completely standardized way of calling mutations. Their pipeline was automated, so there are likely some incorrect variant calls, but at least the data are consistent across many tumors.
Another "gotcha" from COSMIC is artifact "hotspot" mutations. One reason for this is that the same tumor sample can have its mutations reported more than once when a second study reuses data from a previous study. Another reason is that a single study may sequence multiple regions of the same tumor, resulting in passenger mutations also appearing to be "hotspots".
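One simple guard against both issues is to count recurrence by *distinct samples* rather than by rows. A sketch (the tuple layout here is hypothetical; COSMIC's actual export has many more columns):

```python
# Sketch: count how many distinct tumor samples carry each mutation,
# so duplicated reports of the same sample do not inflate "hotspots".
from collections import defaultdict

def recurrence_by_sample(records):
    """records: iterable of (chrom, pos, ref, alt, sample_id) tuples.
    Returns {mutation: number of distinct samples carrying it}."""
    samples = defaultdict(set)
    for chrom, pos, ref, alt, sample_id in records:
        samples[(chrom, pos, ref, alt)].add(sample_id)  # set deduplicates
    return {mut: len(s) for mut, s in samples.items()}
```

This does not fix the multi-study double-reporting problem by itself (the same tumor can appear under different sample IDs in different studies), but it removes the easiest source of inflation.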