Tutorial: Exploring cancer mutation data portals
3.9 years ago by
Washington University School of Medicine, St. Louis, USA
Malachi Griffith

This tutorial describes examples of data portals (visual interfaces, APIs, etc.) that allow a user to mine publicly available cancer sequence data for somatic and/or germline mutations.  These resources allow the user to assess the recurrence of specific mutations within cancer subtypes, their sequence identity, predicted functional consequence, etc.  Example questions one might ask of such resources:

  • What are the most significantly mutated genes in a particular cancer type?
  • What mutations tend to co-occur or are mutually exclusive with each other in a tumor?
  • What positions or domains within the amino acid sequence of a gene are most frequently mutated?  i.e. where are the mutation 'hotspots'? 
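
As a rough illustration of the first two questions, here is a minimal sketch that computes per-gene mutation frequencies and a one-sided Fisher's exact test for co-occurrence or mutual exclusivity. It assumes you have already exported a table of (sample, gene) mutation calls from one of the portals. The sample IDs, gene assignments, and function names below are made up for illustration and do not come from any real dataset or portal API.

```python
from collections import defaultdict
from math import comb

# Toy (sample_id, gene) mutation calls. Illustrative values only, not real data.
CALLS = [
    ("S1", "KRAS"), ("S1", "TP53"),
    ("S2", "KRAS"), ("S2", "TP53"),
    ("S3", "EGFR"),
    ("S4", "EGFR"), ("S4", "TP53"),
    ("S5", "KRAS"),
    ("S6", "EGFR"),
]
# Every sequenced sample, including those with no calls in these genes.
ALL_SAMPLES = {"S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"}


def samples_by_gene(calls):
    """Map each gene to the set of samples carrying at least one mutation."""
    by_gene = defaultdict(set)
    for sample, gene in calls:
        by_gene[gene].add(sample)
    return dict(by_gene)


def mutation_frequency(calls, all_samples):
    """Fraction of samples mutated in each gene."""
    by_gene = samples_by_gene(calls)
    return {g: len(s) / len(all_samples) for g, s in by_gene.items()}


def contingency(gene_a, gene_b, calls, all_samples):
    """Return (both, only_a, only_b, neither) sample counts for two genes."""
    by_gene = samples_by_gene(calls)
    a = by_gene.get(gene_a, set())
    b = by_gene.get(gene_b, set())
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = len(all_samples) - both - only_a - only_b
    return both, only_a, only_b, neither


def fisher_one_sided(both, only_a, only_b, neither):
    """One-sided Fisher's exact p-values: (co-occurrence, mutual exclusivity)."""
    n = both + only_a + only_b + neither
    n_a = both + only_a
    n_b = both + only_b

    def pmf(k):
        # Hypergeometric probability of exactly k samples mutated in both genes.
        return comb(n_a, k) * comb(n - n_a, n_b - k) / comb(n, n_b)

    lowest = max(0, n_a + n_b - n)
    highest = min(n_a, n_b)
    p_cooccur = sum(pmf(k) for k in range(both, highest + 1))
    p_exclusive = sum(pmf(k) for k in range(lowest, both + 1))
    return p_cooccur, p_exclusive


if __name__ == "__main__":
    print(mutation_frequency(CALLS, ALL_SAMPLES))
    table = contingency("KRAS", "EGFR", CALLS, ALL_SAMPLES)
    print(table, fisher_one_sided(*table))
```

With only a handful of samples the p-values are not meaningful, and this sketch ignores multiple-testing correction and differences in background mutation rate between samples, both of which matter on real cohorts.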

Some relevant posts:

Here are some resources that I already know about and have used:

The first four are fantastic resources along the lines I am looking for.  Please comment below if I am missing others.  For example, there may be others that are less well known or that are more focused on a specific question.

Relevant reviews, primary articles, open-source software projects, etc. would also be welcome.  I'm most interested in resources that create a platform for performing complex queries of the raw data, provide summaries and visualizations, etc.  I will try to update this tutorial with examples and feedback from the community.

Here are some related resources that we have created ourselves to complement some of those resources listed above:

A nice introductory tutorial (video) on Cancer Variant Knowledgebases: "Introduction to Publicly Available Knowledgebases to Aid Interpretations of Genomic Findings in Oncology".


thanks very very very much!!

You are most welcome. I just updated this post to include the Genomic Data Commons. This is a great resource for accessing the raw data, variant call files, etc.

Thank you very much!

Are you solely interested in tools that just visualize/download publicly available data sets? What about resources where you can actually submit your own mutations and annotate, analyze, and visualize?

3 months ago by
rafi.zon

That's an excellent list of the different available databases. I'm doing research on driver mutations vs. passenger mutations and I'm not sure which database to use to get a list of driver mutations that are known to cause cancer and not just appear in cancer samples.
Which databases or available datasets would you recommend the most for this purpose?

Part of the reason these portals exist is that it is still an open research question which mutations are definitively drivers versus passengers. That being said, for your needs you might want to approach this problem from the perspective of more established tumor suppressors and oncogenes.

Here is a companion post to this one that covers those: Database Of Tumor Suppressors And/Or Oncogenes

Probably the most popular answer is to use the Cancer Gene Census.

Malachi, Thanks again!

I already searched many of the databases for cancer driver mutations to use in supervised learning. The Cancer Gene Census is along the lines of what I'm looking for. However, what it provides is a list of cancer genes, and it doesn't differentiate between the passenger and driver mutations that could be present within the same cancer gene. Isn't there a list of well-known driver mutations that are known to directly have a carcinogenic effect? I believe the following portals are most closely related to my search:

Would these be reliable resources altogether for my 'list' of driver mutations in cancer or are there limitations for using them?

Other options that are in the vein of DoCM: Cancer Hotspots

Other options that are in the vein of CIViC: Jackson Lab's JAX CKB, MSKCC's ONCOKB, Cornell's PMKB, IRB's CGI

If you would like to learn more about efforts to harmonize the cancer variant interpretation resources, you can check out cancervariants.org.

I wish I could say that there are not significant limitations to these resources, but there are. With all the tumor genome/exome sequence data that is now out there, it is starting to be possible to come up with lists of specific mutation sites that are significant hotspots. These are highly suggestive of activating mutations that are key drivers in cancer. We are still discovering new hotspots in rarer cancer types, though.

But the bigger problem is that of tumor suppressors (TS). It is relatively easy to define a TS that is significantly mutated at the gene level. But there are so many ways to break the function of a gene that we don't see the hotspot pattern we see with oncogenes. Yet these mutations can be just as critical to carcinogenesis. When we see a new mutation in BRCA1 or TP53 it can be hard to know for sure whether that mutation is pathogenic/functional. It might not have been seen previously. Until we functionally characterize it or see it enough times in cancers of that type, we are unsure whether it is just a random passenger that happens to be in a cancer gene or a true driver. Thus the cataloging of these mutations is highly incomplete.

Resources like the BRCA Exchange have expended enormous effort to try to do a decent job of tracking down the functional pathogenic vs. benign variants for just two cancer genes. Other databases focus on other genes. Coming up with one grand list that is comprehensive and high quality remains a major research challenge. Efforts such as the GA4GH and VICC driver project (cancervariants.org) are trying to harmonize the various efforts underway and at least make it easier to combine knowledge from across the many relevant resources and data sets out there. This is still very much a problem to be solved, though.
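
To make the oncogene vs. tumor suppressor contrast concrete, here is a minimal sketch of naive hotspot calling: count how many samples are mutated at each protein position and flag positions that recur. The gene names (`GENE_A`, `GENE_B`), positions, thresholds, and function names are all hypothetical and chosen only to illustrate the pattern. Published hotspot methods model background mutation rates and test for significance, which this simple counting does not attempt.

```python
from collections import Counter, defaultdict

# Toy (sample_id, gene, protein_position) calls. Hypothetical values, not real data.
# GENE_A mimics an oncogene-like pattern (mutations pile up at one residue);
# GENE_B mimics a tumor-suppressor-like pattern (mutations scattered along the gene).
CALLS = [
    ("S1", "GENE_A", 12), ("S2", "GENE_A", 12), ("S3", "GENE_A", 12),
    ("S4", "GENE_A", 61),
    ("S5", "GENE_B", 30), ("S6", "GENE_B", 145),
    ("S7", "GENE_B", 210), ("S8", "GENE_B", 88),
]


def position_counts(calls):
    """Count distinct mutated samples at each position of each gene."""
    seen = set()
    counts = defaultdict(Counter)
    for sample, gene, pos in calls:
        key = (sample, gene, pos)
        if key in seen:
            continue  # do not count the same sample twice at one position
        seen.add(key)
        counts[gene][pos] += 1
    return counts


def hotspots(calls, min_samples=3, min_fraction=0.5):
    """Flag positions hit in at least min_samples samples and accounting for
    at least min_fraction of that gene's mutations. Both thresholds are
    arbitrary illustration values, not a statistical test."""
    result = {}
    for gene, counter in position_counts(calls).items():
        total = sum(counter.values())
        hits = [
            (pos, n)
            for pos, n in sorted(counter.items())
            if n >= min_samples and n / total >= min_fraction
        ]
        if hits:
            result[gene] = hits
    return result


if __name__ == "__main__":
    print(hotspots(CALLS))
```

In this toy input both genes are mutated in four samples, yet only `GENE_A` produces a hotspot. That is the limitation described above: gene-level recurrence can flag a tumor suppressor, but position-level counting says little about which of its individual mutations are drivers.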

Thanks a lot for your elaborate answer, Malachi. Your answer gave me a better overview of the problem at hand. Let's hope with time and more research efforts we will understand the pathogenic pathways (as for the TS) much better, leading to more harmonization.

2.2 years ago by
Seattle, WA USA
Alex Reynolds

If you have specific genes you are interested in, I wrote a tool to explore expression between tissues.

Nice! This looks awesome and performs very well.  The above list is very DNA focused.  Maybe we should create a separate post on Exploring cancer *expression* data portals...

Good idea, I'll delete this post.

