Question: TCGA driver mutation data
I would like to download driver mutation data for TCGA patients (in particular lung cancer, but ideally 'pan-cancer').

For example, I would like to find the proportion of patients with adenocarcinoma of the lung who have driver mutations in KRAS, EGFR, TP53, etc.

I came across this paper, 'Comprehensive Characterization of Cancer Driver Genes and Mutations', where they produced a database of 9423 exomes annotated with various putative drivers. Does anyone know how to access their dataset? I can't seem to find any instructions.

If not, which source would you recommend for driver mutation data, and why?

Thanks in advance

There are a number of different algorithms, using diverse approaches, that try to identify driver mutations in cancer mutation data. Check out a recent one: MutPanning (v2.0) from the Dana-Farber Cancer Institute and the Broad Institute. Here is the paper. It also includes a benchmark of sorts.

My understanding is that the question refers to driver mutations. MutPanning is a gene-based method. Identifying which specific mutations within those genes are actually driver mutations is a much harder task, as driver genes contain a mixture of passenger and driver mutations.

Many clinical interpretation guidelines clearly state that a missense mutation in a known disease gene is not, in and of itself, sufficient evidence to label it oncogenic/pathogenic.

Thank you for this. It may be useful as a complementary resource to Collin's. I will take a look.

I'm one of the first authors of that paper.

The data is available on the Genomic Data Commons website for our paper. Please see the file described as "Mutation Scores and tool aggregation" (Mutation.CTAT.3D.Scores.txt). It contains scores for all missense mutations (~750k mutations).

To get the filtered dataset, you only need to filter based on the flag column for each of CTAT-population ("New_Linear (functional) flag"), CTAT-cancer ("New_Linear (cancer-focused) flag"), and structural clustering ("New_3D mutational hotspot flag"). By convention, a value of "1" indicates a flag for a potential driver mutation according to that approach. The figure of 3,437 counts any mutation flagged by at least two of the approaches. The raw scores for CTAT-cancer and CTAT-population are found in columns "eigenscore (cancer)" and "eigenscore (functional)", respectively.
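As a rough illustration of the "at least two approaches agree" filter described above, here is a minimal Python sketch using only the standard library. It assumes the file is tab-delimited with a header row (I have not stated the delimiter above, so check the file first), and the "gene" column and the demo rows are made up purely for the example; only the three flag column names come from the description above.

```python
import csv
import io

# Flag columns named in the answer above.
FLAG_COLUMNS = (
    "New_Linear (functional) flag",
    "New_Linear (cancer-focused) flag",
    "New_3D mutational hotspot flag",
)


def is_flagged(value):
    """True if the cell holds 1 (tolerates "1" or "1.0"; blanks/NA count as unflagged)."""
    try:
        return float(value) == 1.0
    except (TypeError, ValueError):
        return False


def filter_putative_drivers(handle, min_flags=2):
    """Return rows flagged by at least `min_flags` of the three approaches.

    Assumption: `handle` is a tab-delimited text stream with a header row.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row
        for row in reader
        if sum(is_flagged(row.get(col)) for col in FLAG_COLUMNS) >= min_flags
    ]


# Tiny made-up table standing in for the real file (hypothetical "gene" column).
demo_text = "\n".join(
    [
        "\t".join(("gene",) + FLAG_COLUMNS),
        "GENE_A\t1\t1\t0",
        "GENE_B\t0\t1\t0",
        "GENE_C\t1\t1\t1",
        "GENE_D\t0\t0\t0",
    ]
)
demo_hits = filter_putative_drivers(io.StringIO(demo_text))
demo_genes = [row["gene"] for row in demo_hits]
```

On the real file you would pass an open file handle instead, for example `open("Mutation.CTAT.3D.Scores.txt", newline="")`, and set `min_flags=1` if you want mutations flagged by any single approach.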

For loss-of-function mutations in tumor suppressors, you might look at the genes annotated as tumor suppressors in Table S1. Most variant annotation databases regard frameshift indels, nonsense mutations, essential splice site mutations, and stop-loss or start-loss mutations as likely oncogenic in tumor suppressor genes.
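A minimal sketch of that loss-of-function rule, assuming your mutation calls are in MAF-style records with "Hugo_Symbol" and "Variant_Classification" fields (an assumption; adapt the field names to your own table). The tumor suppressor set below is a placeholder for illustration only; in practice you would load the genes annotated as tumor suppressors in Table S1.

```python
# MAF Variant_Classification values treated here as likely loss-of-function:
# frameshift indels, nonsense, essential splice site, stop loss, start loss.
LOF_CLASSES = frozenset(
    {
        "Frame_Shift_Del",
        "Frame_Shift_Ins",
        "Nonsense_Mutation",
        "Splice_Site",
        "Nonstop_Mutation",
        "Translation_Start_Site",
    }
)


def likely_lof_in_tumor_suppressors(records, tumor_suppressors):
    """Keep records that are LoF-type mutations in a listed tumor suppressor gene.

    Assumption: each record is a dict with MAF-style "Hugo_Symbol" and
    "Variant_Classification" keys.
    """
    return [
        rec
        for rec in records
        if rec["Hugo_Symbol"] in tumor_suppressors
        and rec["Variant_Classification"] in LOF_CLASSES
    ]


# Made-up records for illustration; the gene set is a placeholder, not Table S1.
example_tsgs = {"TP53"}
example_records = [
    {"Hugo_Symbol": "TP53", "Variant_Classification": "Nonsense_Mutation"},
    {"Hugo_Symbol": "TP53", "Variant_Classification": "Missense_Mutation"},
    {"Hugo_Symbol": "KRAS", "Variant_Classification": "Frame_Shift_Del"},
]
example_lof = likely_lof_in_tumor_suppressors(example_records, example_tsgs)
```

Missense mutations are deliberately left out here, since those are what the flag columns in the scores file are meant to handle.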

Lastly, if you also want to predict driver missense mutations in new tumor samples outside of the TCGA, you could try CHASMplus. Its results were highly consistent with our results from the TCGA PanCanAtlas study, but it substantially simplifies the scoring process (available via OpenCRAVAT).

This is great - I didn't know that resource page for TCGA existed. This resource/paper is very useful because, as you say, the leap from variants within genes to annotation of 'driver' is difficult - thank you!

Glad to help. Hopefully this can also help anybody else that had the same question as you.

Insightful Collin. Thanks

