There is a large body of literature on distinguishing driver mutations from passengers, and I am trying to build my own deep learning model to do the same. I have run into some potential issues with constructing the dataset. First, I downloaded COSMIC mutation data and used the FATHMM labels to designate drivers (positive examples). I am wary of using passengers from COSMIC as negatives, since some of them may be false negatives (unrecognized drivers). So I turned to the 1000 Genomes Project and downloaded SNVs to construct my negative examples. I am unsure whether this is correct, although I have seen some papers do the same. Do I need to apply any filters to the 1000 Genomes SNV data when constructing the final dataset? One such paper mentions using SNVs with a global minor allele frequency ≤ 1%.
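To make the filter I am asking about concrete, here is a minimal sketch of what I have in mind. It assumes the SNVs come as VCF lines whose INFO column carries a global allele frequency under an `AF` key, and it keeps only biallelic SNVs whose minor allele frequency is at or below a cutoff. The function names, the `max_maf` parameter, and the example records are all made up for illustration, and the 1% cutoff is simply the value quoted from the paper, not a recommendation.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple


def parse_info(info: str) -> Dict[str, str]:
    """Parse the key=value entries of a VCF INFO column (flags are ignored)."""
    parsed = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            parsed[key] = value
    return parsed


def global_maf(info: str) -> Optional[float]:
    """Return min(AF, 1 - AF), or None if AF is missing or not a single number.

    Assumption: the global allele frequency is stored under the INFO key 'AF'.
    """
    af = parse_info(info).get("AF")
    if af is None or "," in af:
        return None
    try:
        freq = float(af)
    except ValueError:
        return None
    return min(freq, 1.0 - freq)


def filter_snvs(
    vcf_lines: Iterable[str], max_maf: float = 0.01
) -> Iterator[Tuple[str, int, str, str, float]]:
    """Yield (chrom, pos, ref, alt, maf) for biallelic SNVs with maf <= max_maf."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        chrom, pos, _id, ref, alt = fields[:5]
        # Keep single-base substitutions only (drops indels and multiallelic sites).
        if len(ref) != 1 or len(alt) != 1 or ref not in "ACGT" or alt not in "ACGT":
            continue
        maf = global_maf(fields[7])
        if maf is not None and maf <= max_maf:
            yield chrom, int(pos), ref, alt, maf


# Hypothetical records for illustration only; these are not real variants.
EXAMPLE_LINES = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\trs1\tA\tG\t100\tPASS\tAC=5;AF=0.001;AN=5008",
    "1\t2000\trs2\tC\tT\t100\tPASS\tAF=0.25",
    "1\t3000\trs3\tG\tGA\t100\tPASS\tAF=0.002",
    "1\t4000\trs4\tT\tC,G\t100\tPASS\tAF=0.001,0.002",
    "1\t5000\trs5\tT\tC\t100\tPASS\tAF=0.995",
]

kept = list(filter_snvs(EXAMPLE_LINES, max_maf=0.01))
kept_positions = [record[1] for record in kept]
```

With these made-up records, only the rare biallelic SNVs pass; the common variant, the indel, and the multiallelic site are dropped. Whether a rare-variant cutoff like this is the right way to define negatives is exactly what I am unsure about, so `max_maf` is left as a parameter.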