2.4 years ago
Gene_MMP8 ▴ 210

There is a large body of literature on distinguishing driver mutations from passengers, and I am trying to build my own deep learning model to do the same. I have run into some potential issues. First, I downloaded COSMIC mutation data and used the FATHMM labels to designate drivers (positive examples) in my dataset. I am sceptical about using passengers from COSMIC, as they may be false negatives, so I turned to the 1000 Genomes Project to download SNVs for my negative examples. I am unsure whether this is correct, although I have seen some papers do the same. Do I need to apply any filters to the 1000 Genomes SNV data to construct the final dataset? One such paper uses SNVs with a global minor allele frequency ≤ 1%.

passenger mutations driver mutations cancer • 700 views
2.4 years ago

This area of research is difficult because each person has their own opinion about what constitutes a 'driver' and 'passenger' mutation. For example, I disagree with the 1000 Genomes approach because I already know that some 1000 Genomes polymorphisms that have appreciable minor allele frequencies of around 15% in Caucasians can drive ER-positive breast cancer. To be frank: we don't have a clue what >90% of the variants listed in dbSNP / 1000 Genomes are doing. A large proportion of them could ultimately be driving cancer and other diseases.

If I were you, I would not try to define on my own what is / is not a driver and passenger. Why not utilise the work that has already been published and then follow their guidance for your deep learning model?

Look at this paper, published in the highly reputable Cell journal:

Here is another one, published a couple of weeks ago in Nature Genetics:

I also just stumbled upon this online platform, which focuses on drivers:

Build upon the work that is already out there. Then, at least, if you try to publish your work, it will be more difficult for reviewers to criticise you.

By the way, I am not sure why you are using FATHMM. Or did you mean FATHMM-MKL, as mentioned HERE on COSMIC's page? GWAVA and Funseq2 were designed for somatic mutations (see here: A: pathogenicity predictors of cancer mutations).

Kevin


Many thanks for your detailed explanation. I am indeed using FATHMM-MKL and not FATHMM. Sorry for the confusion. I will definitely go through the papers you mentioned and try to get their data. But I was wondering whether to stick to COSMIC to construct the labels for my training set. It is a sizeable dataset with enough positive/negative classes.

The functional scores for individual mutations from FATHMM-MKL are in the form of a single p-value, ranging from 0 to 1. Scores above 0.5 are deleterious, but in order to highlight the most significant data in COSMIC, only scores ≥ 0.7 are classified as 'Pathogenic'. Mutations are classed as 'Neutral' if the score is ≤ 0.5.


In the COSMIC link that you shared above, the cutoffs are given for FATHMM-MKL. Can I use these cutoffs to build a learning model? I was thinking of using stricter thresholds, such as ≥ 0.9 for drivers and ≤ 0.1 for passengers, to ward off false positives. Please let me know what you think.


There is no right or wrong here. What you are aiming to do will always involve a lot of trial and error, and the resulting model may be dataset-specific and not applicable to external datasets. It can be a frustrating area in which to work. You could divide your data into training and validation sets, of course, but the model still may not generalise to new data.

I am actually working on one right now (not exactly what you are doing, though) and am today going all the way back to the start, where thresholds for High and Low are being used; they will have to be modified. For this reason, keeping your code flexible is obviously beneficial, so that you can re-run it quickly.
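As a minimal sketch of what "flexible" could look like here, the labelling step can take the two cutoffs as parameters, so that switching between the COSMIC cutoffs quoted above (≥ 0.7 and ≤ 0.5) and the stricter 0.9 / 0.1 pair is a one-line change. The function names, the dictionary layout and the example scores below are hypothetical and only for illustration; they are not part of COSMIC or FATHMM-MKL.

```python
from typing import Dict, Optional


def label_from_score(score: float,
                     driver_min: float = 0.7,
                     passenger_max: float = 0.5) -> Optional[int]:
    """Return 1 (driver), 0 (passenger) or None (ambiguous, to be excluded)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must lie in [0, 1], got {score}")
    if score >= driver_min:
        return 1
    if score <= passenger_max:
        return 0
    return None  # falls between the two cutoffs


def build_labels(scores: Dict[str, float],
                 driver_min: float = 0.7,
                 passenger_max: float = 0.5) -> Dict[str, int]:
    """Label every mutation and drop those that fall between the cutoffs."""
    labelled = {}
    for mutation_id, score in scores.items():
        label = label_from_score(score, driver_min, passenger_max)
        if label is not None:
            labelled[mutation_id] = label
    return labelled


# Hypothetical scores, invented for this example.
example_scores = {"mut_a": 0.95, "mut_b": 0.72, "mut_c": 0.55,
                  "mut_d": 0.30, "mut_e": 0.05}

cosmic_style = build_labels(example_scores)  # cutoffs 0.7 / 0.5
strict = build_labels(example_scores, driver_min=0.9, passenger_max=0.1)

print(cosmic_style)
print(strict)
```

Stricter cutoffs shrink the training set, so it is worth printing the class counts each time the thresholds change.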

COSMIC is a bit of a 'catch all' database... a lot of those variants / mutations may not, in fact, be somatic mutations, and may even be false-positive variant calls. I think that the Sanger Institute mentions these pitfalls on their site somewhere.


So basically, play with the thresholds and pick the ones that give the most generalizable model. I am in fact worried about how all of this will go down with the reviewers but, as you said, it is safer to build models on already published datasets. Thanks again for the links.


Sure, Collin's answer is good, too. At least we now have 2 people highlighting the peril of the 1000 Genomes approach. I am aware that there are publications that have used that data, though.


In CHASMplus, I used somatic mutation calls from the TCGA MC3 effort (PMID: 29596782), which provided a completely standardized way of mutation calling. Their pipeline was automated, so there likely are incorrect variant calls, but at least the data is consistent across many tumors.

Another "gotcha" in COSMIC is artifact "hotspot" mutations. One reason for these is that the same tumor sample can have its mutations reported more than once, because a second study reuses the data from a previous study. Another reason is that a single study can sequence multiple regions of the same tumor, so a passenger mutation ends up looking like a "hotspot".
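One way to guard against both cases, sketched below, is to count how many distinct tumors carry a mutation and not how many records mention it. This assumes each record can be tied back to a tumor identifier; the field names (`mutation`, `tumor_id`, `sample_id`, `study_id`) and the example records are hypothetical and do not reflect the actual COSMIC export format.

```python
from collections import Counter, defaultdict


def recurrence_per_tumor(records):
    """Count, for each mutation, the number of distinct tumors that carry it."""
    tumors_by_mutation = defaultdict(set)
    for rec in records:
        tumors_by_mutation[rec["mutation"]].add(rec["tumor_id"])
    return {mut: len(tumors) for mut, tumors in tumors_by_mutation.items()}


# Hypothetical records, invented for this example.
example_records = [
    # Same tumor reported by two studies, plus a second region of that tumor.
    {"mutation": "GENE1:p.A10V", "tumor_id": "T1", "sample_id": "T1_r1", "study_id": "S1"},
    {"mutation": "GENE1:p.A10V", "tumor_id": "T1", "sample_id": "T1_r1", "study_id": "S2"},
    {"mutation": "GENE1:p.A10V", "tumor_id": "T1", "sample_id": "T1_r2", "study_id": "S1"},
    # Genuinely recurrent across three different tumors.
    {"mutation": "GENE2:p.R5H", "tumor_id": "T1", "sample_id": "T1_r1", "study_id": "S1"},
    {"mutation": "GENE2:p.R5H", "tumor_id": "T2", "sample_id": "T2_r1", "study_id": "S1"},
    {"mutation": "GENE2:p.R5H", "tumor_id": "T3", "sample_id": "T3_r1", "study_id": "S3"},
]

# Naive per-record counts make both mutations look equally recurrent.
naive_counts = dict(Counter(rec["mutation"] for rec in example_records))
deduplicated = recurrence_per_tumor(example_records)

print(naive_counts)
print(deduplicated)
```

In this toy data both mutations appear in three records, but only the second one is seen in more than one tumor.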

2.4 years ago
Collin ▴ 1000

You'll first need to clarify what exactly you mean by driver mutation vs passenger mutation. 1) Are you concerned with somatic mutations or germline mutations (most often it is somatic, if you say "driver mutation")? 2) Are you concerned mostly with protein-coding mutations or non-coding mutations?

1) If you are interested in somatic driver mutations, I highly recommend that you do NOT use germline variants from sources such as the 1000 Genomes Project as "passengers". Germline variants are systematically different from somatic mutations because they undergo substantially more negative selection than somatic mutations do (PMID: 19654296). If you are interested in "pathogenic" germline variants that relate to cancer, then using high allele frequency germline variants as "passengers" is an OK but not perfect solution (see Kevin's comments).

2) If you are interested in protein-coding driver mutations, I highly recommend that you do not develop a method that also tries to predict non-coding driver mutations. The recent PCAWG papers have highlighted that most of the somatic driver mutations in cancer do indeed happen in protein-coding regions. Moreover, non-coding prediction methods applied to protein-coding regions generally don't fare well in benchmarks (see PMID: 32079540). The top methods at predicting driver mutations in that recent benchmark were CHASM, CTAT-cancer (see my paper, PMID: 29625053), DEOGEN2 and PrimateAI. I recently released CHASMplus (PMID: 31202631), which is a substantial improvement over the top-performing CHASM at predicting driver mutations, and my training labels are available (https://chasmplus.readthedocs.io/en/latest/faq.html , see "Where can I obtain the training data for CHASMplus?").


I am mainly interested in coding regions only (substitutions: missense, nonsense and coding silent). I am really confused about whether to use COSMIC as my primary source of data to build a model (could you kindly go through the comment I made on @KevinBlighe's answer and let me know what you think?). Thanks a lot for sharing your method and the training labels. But you had a considerably imbalanced dataset (1:300). From what I read, you used undersampling followed by random forests to build your model. Did you try any ensemble technique too (bagging etc.)?


There are only one to several driver mutations per tumor, so identifying driver mutations is a classic "needle in the haystack" problem. In my case, since I was using random forest, the undersampling was done through the bagging procedure of the random forest (randomForest R package). If you are thinking about deep learning, you could do something similar by randomly balancing your dataset within the batches of stochastic gradient descent.
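A rough sketch of that batch-balancing idea in plain Python is below. It is not what CHASMplus does (that relies on the random forest's bagging); the function name, the choice to draw the rare positives with replacement, and the toy 10 vs 3000 example are all assumptions made for illustration.

```python
import random


def balanced_batches(positives, negatives, batch_size, n_batches, seed=0):
    """Yield shuffled mini-batches with equal numbers of drivers and passengers.

    Each batch is a list of (example, label) pairs, with label 1 for drivers
    and 0 for passengers.
    """
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even to split it 50/50")
    half = batch_size // 2
    if len(negatives) < half:
        raise ValueError("not enough negatives to fill half a batch")
    rng = random.Random(seed)
    for _ in range(n_batches):
        # Drivers are rare, so draw them with replacement.
        pos = rng.choices(positives, k=half)
        # Passengers are plentiful, so undersample them without replacement.
        neg = rng.sample(negatives, k=half)
        batch = [(x, 1) for x in pos] + [(x, 0) for x in neg]
        rng.shuffle(batch)
        yield batch


# Toy data at the 1:300 ratio mentioned above: 10 drivers, 3000 passengers.
toy_drivers = [f"driver_{i}" for i in range(10)]
toy_passengers = [f"passenger_{i}" for i in range(3000)]

for batch in balanced_batches(toy_drivers, toy_passengers, batch_size=8, n_batches=3):
    n_pos = sum(label for _, label in batch)
    print(len(batch), n_pos)
```

Because every batch is 50/50, the model's predicted probabilities will not reflect the true driver prevalence, so evaluation should still be done on data with the original class ratio.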