I am new to TCGA and require some help with a problem described below.

The total number of cases for BRCA in TCGA is 1098, but 'Mutations' data is associated with only 986 cases. When I look at the same data through FireBrowse, it shows that only 977 cases have mutation data. Also, attempting to download the data gives access to two files, namely Mutations_Packager_Calls (MD5) and Mutations_Packager_Oncotated_Calls (MD5).

Now, if I take one specific case and analyse the mutation data, it is not the same across the three resources (TCGA and the two files from FireBrowse). Kindly help me understand why. Could you also explain the difference between the mutation data files available in FireBrowse?

Welcome to the TCGA data. The more you work with this data, the more inconsistencies you will find, so care is required.

Any one or more of the following could explain what you have found:

  1. the TCGA dataset contains multiple biopsies from the same original tumour, which were then removed in the FireBrowse data
  2. as the TCGA data was processed in different centers, in some cases the same biopsy from the same tumour was sequenced twice (or more) in different centers
  3. some tumour samples that were from FFPE were removed in the FireBrowse dataset
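
If you want to check whether duplicates of this kind explain the different case counts, one option is to collapse the sample barcodes to the patient level. Below is a minimal sketch, assuming standard TCGA barcodes in which the first three dash-separated fields identify the patient and the first two digits of the fourth field give the sample type (01 = primary tumour, 06 = metastatic). The barcodes and function names are made up for illustration.

```python
from collections import defaultdict


def patient_id(barcode: str) -> str:
    """Return the patient-level part of a barcode (first three fields)."""
    return "-".join(barcode.split("-")[:3])


def sample_type(barcode: str) -> str:
    """Return the two-digit sample type code from the fourth field."""
    return barcode.split("-")[3][:2]


def group_by_patient(barcodes):
    """Map each patient ID to the list of barcodes that belong to it."""
    groups = defaultdict(list)
    for bc in barcodes:
        groups[patient_id(bc)].append(bc)
    return dict(groups)


# Made-up barcodes, for illustration only
barcodes = [
    "TCGA-XX-0001-01A-11D-A001-09",
    "TCGA-XX-0001-01B-11D-A002-09",  # second vial from the same patient
    "TCGA-XX-0002-01A-11D-A003-09",
    "TCGA-XX-0002-06A-11D-A004-09",  # metastatic sample, same patient
]

groups = group_by_patient(barcodes)
# Patients that appear more than once in the mutation data
duplicated = {p: b for p, b in groups.items() if len(b) > 1}
print(len(barcodes), "barcodes from", len(groups), "patients")
```

Running something like this on the Tumor_Sample_Barcode column of each resource should show whether the counts differ because of extra samples per patient rather than extra patients.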

On point 2, you'd think that at least the multiple centers would use the same analysis pipeline, but they didn't. The open access TCGA mutation data is an agglomeration of somatic variant calls from different variant callers, which introduces bias, of course.
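
To see how much two call sets actually differ for one case, you could compare the variants as sets. This is only a sketch, assuming both files are tab-delimited MAF files with the usual columns Tumor_Sample_Barcode, Chromosome, Start_Position, Reference_Allele and Tumor_Seq_Allele2; the function name and toy data are invented for the example.

```python
import csv
import io

KEY_COLUMNS = ("Chromosome", "Start_Position",
               "Reference_Allele", "Tumor_Seq_Allele2")


def variant_keys(maf_handle, case_prefix):
    """Collect (chrom, pos, ref, alt) tuples for one case from a MAF handle."""
    # MAF files may start with '#' comment lines, which are skipped here
    lines = (ln for ln in maf_handle if not ln.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    return {
        tuple(row[c] for c in KEY_COLUMNS)
        for row in reader
        if row["Tumor_Sample_Barcode"].startswith(case_prefix)
    }


# Toy data standing in for two real files (assumption: same column names)
header = "\t".join(("Tumor_Sample_Barcode",) + KEY_COLUMNS)
maf_a = "\n".join([
    "#version 2.4",
    header,
    "TCGA-XX-0001-01A\t1\t100\tA\tT",
    "TCGA-XX-0001-01A\t2\t200\tG\tC",
]) + "\n"
maf_b = "\n".join([
    header,
    "TCGA-XX-0001-01A\t1\t100\tA\tT",
    "TCGA-XX-0001-01A\t3\t300\tC\tA",
]) + "\n"

calls_a = variant_keys(io.StringIO(maf_a), "TCGA-XX-0001")
calls_b = variant_keys(io.StringIO(maf_b), "TCGA-XX-0001")

shared = calls_a & calls_b   # called in both resources
only_a = calls_a - calls_b   # called only in the first
only_b = calls_b - calls_a   # called only in the second
print(len(shared), len(only_a), len(only_b))
```

With real files you would pass `open(path)` instead of `io.StringIO(...)`. Variants that appear in only one set are the ones worth inspecting for caller or filtering differences.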

I quickly checked FireBrowse and, when you select to download the raw mutation data and the new small window opens, you will see some text at the top of the window that says:

Files may also be downloaded here, or with firehose_get, or exported to GenomeSpace with the SendTo tab.


Click on the link 'here', and you will then be taken to an FTP server where you can get the data. The files that you downloaded are just MD5 checksums that are used to check the integrity of the main files after they have been downloaded.
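
Once you have the main archive and its checksum file, you can verify the download yourself. A minimal sketch, assuming the .md5 file begins with the hexadecimal digest (optionally followed by the file name); the function names and the paths in the usage comment are placeholders.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Compare a file's MD5 with the digest stored in a .md5 file."""
    with open(md5_path) as fh:
        # Assumption: the first whitespace-separated token is the digest
        expected = fh.read().split()[0].lower()
    return md5_of_file(data_path) == expected


# Usage (placeholder paths):
# ok = matches_md5_file("archive.tar.gz", "archive.tar.gz.md5")
```

If the function returns False, the download is incomplete or corrupted and should be repeated.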

Aside from everything that I've mentioned here, my recommendation is to go with the FireBrowse data because you can then at least cite FireBrowse and avoid having to deal with the many issues related to the data taken directly from the TCGA GDC Data Portal.


Thanks for this detailed reply Kevin!

I have one follow-up query. The two files available for download at FireBrowse are also slightly different from each other. Is that because different variant callers were used?

Thanks again for your help!

You mean the 'oncotated' versus the other? For information on that, you should take a look here: Mutation Pipelines

