Question: Retrieving Data From TCGA Database
7.1 years ago
Iraz70


First, I would like to say that I am not a bioinformatician: I have no computational biology background or skills, I do not know any programming languages, and I have no Unix background. But I do need to retrieve some data from the TCGA database, and I could not find any search option on the website to look for a particular mutation.

What I need to do is to get a list of all the somatic mutations of a "Gene X" that were found in all the cancer types and in how many cases this mutation is found.

My list would basically go like this:

Mutation //// Cancer Type ///// number of cases

I would like to learn how to generate a list like this from the TCGA website.

Best regards.

The cBio portal for cancer genomics enables you to do just that.

Did you try to address this question to the TCGA guys?

ICGC is better organized and much more straightforward than the actual TCGA.

7.1 years ago
Sean Davis
National Institutes of Health, Bethesda, MD

The ICGC has compiled data, including TCGA, into this site that you might find could answer some of your questions:

I personally find the ICGC site a bit more intuitive to navigate, and the data easier to consume, than the current TCGA data portal, particularly for folks without some scripting skills.

I agree that ICGC's website is more intuitive to use and, directed by your comment, I went there to retrieve TCGA data. Unfortunately, it seems that not all of the TCGA data has made it into their databases; I was looking at somatic mutations in the TCGA Ovarian Cancer cohort and found that they only have 88 donors with SSM data available. Any idea why this would be the case? (I asked ICGC directly a few days ago but haven't heard back.)

For TCGA Ovarian, a total of 462 TN-pairs were exome-sequenced by Broad, Baylor and WashU. 316 were completed in time for the first marker paper - 88 from WashU, 80 Baylor, and 148 Broad. But these were early days of TCGA... so coverage/quality wasn't great, and many of these cases were scrapped in subsequent analyses. WashU later sequenced the last 146 TN-pairs, and in this publication used only 429 of the 462 TN-pairs, though 3 of these have no detected exonic mutations, despite decent coverage and depth.

OV is also the only TCGA tissue type where a consensus 3-center MAF was never uploaded to the DCC. So the latest available TCGA OV MAF must be grabbed from publications. Click here for a list of the best available variant lists per tumor type that I try to maintain. For an automatically maintained list, try the Firehose MAF dashboard... there are only a few cases where their script doesn't list the best available MAF.

Cyriac, this is the most helpful post I've ever read on a bioinformatics forum.

Haha... glad to help. To standardize TCGA MAFs for analyses, try Annotating TCGA MAFs with the latest Ensembl/Gencode transcripts

I may be a little off from the actual discussion on this question, but I would like to add a comment that I think is important for funding agencies to hear. NIH, like any funding agency, has a policy in place for public access to data. However, what I have seen is that whenever an investigator outside the TCGA group wants access, a group of people gets to say yes or no in the name of controlled access. The tactic is to keep objecting, to ask for it in writing that the investigator will not publish, to ask for all the finer details of the hypothesis, and so on. In the end they enjoy the authority to reject the application. So the requester is either frustrated by the time it takes, rejected, or forced to include someone from TCGA in the grant and publication. That is very bad and against the policies of public access. The people who did TCGA were supposed to use the funding to generate the data, leaving the rest for other investigators to analyze; for the TCGA big bosses to forcibly include themselves, or to pick and choose, is against the policy. I am nobody and do not have the power to mend the rules, but I hope someone in this forum can pass word to the appropriate department so that something can be streamlined.

Yes, your comment is way off topic, but I'll briefly respond to say that I haven't seen this behavior. In grad school, I was in a small lab with no direct connection to TCGA and we had no problems getting access to the protected data. Yes, you need to state a rough research plan so that they can verify that you'll safeguard protected patient information. Yes, you also need to wait until the marker paper is published, as the people who worked so hard to generate the data get the first shot at one general paper describing the dataset. I don't feel like that's unreasonable.

Well, that is exactly my point: if you are part of a TCGA group lab, you have the cake. However, when TCGA was funded, the core concept was to distribute funds so that a few groups could archive the data for public usage. My concern is that, from a business point of view, those groups were paid and in return they got illustrious papers with 200 authors in Nature, NEJM and so on. Why doesn't an independent committee (the funding agency, NIH, or someone else) regulate access to the data, instead of those few labs controlling it? Chris, you may not have faced any issue because you are within a group which is very well funded and is part of TCGA. However, in the next few months this demand will get increasing attention. As Sean indirectly noted, it is easier to get access to the same data from ICGC than from dbGaP. One should not use these public data to get mileage from having been part of TCGA; that defeats the aim of the funding mechanism and the public access policy.

I think you misread my comment. Several years ago, I was not in a TCGA group and still got access and published using TCGA data. No one is hoarding data - it is freely accessible via the TCGA data portal and CGHub. (Here, for example, are all somatic mutations found in the AML cohort:) When such data contains information that is potentially identifying (like raw sequence reads and germline variant calls), the NIH requires that you fill out a short form so that they can verify you're using the data for research. This is not a difficult hurdle.

The model you describe in your third sentence is exactly what is set up. A Data Access Committee coordinated through dbGaP is in place to review requests for data that are protected for privacy reasons. All other available data from TCGA are publicly accessible.

My answer (too flip) has to do with the fact that finding data on the TCGA website can be a bit challenging, particularly for a non-computational person, as the original poster states. I DID NOT mean to imply that the TCGA data were not accessible, as clearly they are. You seem to imply that one needs to "apply" for access to somatic mutation data; in fact, TCGA somatic mutation data are not protected. As for the "protection" process you talk about, that protection process is in place to protect the privacy of those data that are considered to allow identification of the patients that volunteered for TCGA. The process to gain access to these protected data types is not as quick as we might all hope, but I do not think it is constructive to argue that it should not be there. So, to summarize, the data in question ARE freely and publicly available from the TCGA website. I find that the ICGC site is more useful for biologists to navigate than the current TCGA data portal and my answer was meant to reflect my own opinion on the matter.

7.1 years ago
Chris Miller
Washington University in St. Louis, MO

There is no webpage that currently displays the exact information you're looking for. That said, it would be pretty straightforward to do it like this:

1) for each cancer type, download all of the MAF files that describe the somatic mutations in that cancer

2) combine them into one big list and pull out per-gene counts. Ideally, you'd do this with some scripting, but you could even use a spreadsheet program (but be careful!)
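For anyone who does have a little Python available, the two steps above could be sketched roughly as follows. This is only a minimal illustration, not an official TCGA tool. It assumes each MAF is a tab-delimited text file with the standard MAF columns Hugo_Symbol, Chromosome, Start_Position, Reference_Allele, Tumor_Seq_Allele2, Variant_Classification and Tumor_Sample_Barcode, and it treats each distinct tumor sample barcode as one case. The function names, the mutation label format, and the file names in the usage note are made up for illustration.

```python
import csv
from collections import defaultdict


def mutation_label(row):
    """Describe one MAF row as 'chrom:pos ref>alt (classification)'."""
    return "{}:{} {}>{} ({})".format(
        row["Chromosome"],
        row["Start_Position"],
        row["Reference_Allele"],
        row["Tumor_Seq_Allele2"],
        row["Variant_Classification"],
    )


def count_gene_mutations(maf_handles, gene):
    """Tabulate mutations of one gene across several MAF files.

    maf_handles: dict mapping a cancer type label (e.g. "OV") to an open
                 text handle (or any iterable of lines) for that MAF.
    gene:        gene symbol to look for in the Hugo_Symbol column.

    Returns a list of (mutation, cancer_type, n_cases) tuples, sorted by
    descending n_cases. A "case" here is a distinct Tumor_Sample_Barcode
    (an assumption; adjust if you need per-patient counts).
    """
    samples = defaultdict(set)
    for cancer_type, handle in maf_handles.items():
        # MAF files may start with comment lines such as '#version 2.4'.
        lines = (line for line in handle if not line.startswith("#"))
        reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["Hugo_Symbol"] == gene:
                key = (mutation_label(row), cancer_type)
                # A set ignores duplicate rows for the same sample.
                samples[key].add(row["Tumor_Sample_Barcode"])
    table = [(mut, ct, len(ids)) for (mut, ct), ids in samples.items()]
    table.sort(key=lambda r: (-r[2], r[0], r[1]))
    return table
```

To use it, you would open one downloaded MAF per cancer type and pass them in, for example `count_gene_mutations({"OV": open("ov.maf"), "LAML": open("laml.maf")}, "TP53")` (hypothetical file names), then print each tuple as "Mutation //// Cancer Type //// number of cases". Doing this in a script rather than a spreadsheet also avoids having the spreadsheet program silently alter any of the values.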

Alternately, find someone with basic scripting skills and get them to help you. Many bioinformaticians would be happy to help you out for money, authorship, or booze (but not necessarily in that order!)

Edit: You may also find information worth exploring on Synapse, which the TCGA Pan-cancer project is using to track files: Synapse:syn300013

