I am trying to download mutation data from the TCGA LUSC project. My goal is to be able to programmatically calculate the mutation rate in this cohort for a given gene (e.g. TP53).
Using the (excellent) online browser at https://portal.gdc.cancer.gov and selecting only the LUSC project, I can see there are 195,215 mutations in 504 patient samples. 432 of 504 samples have a TP53 mutation (85%).
I can't reproduce these numbers from the raw data available. Within the LUSC project there are four publicly available variant call files, one per caller: MuTect, VarScan, SomaticSniper and MuSE. Each file lists around 60-90k mutations from 492 patients. When I combine all four files I find 169,508 mutations, which is still short of the number in the TCGA browser (though it comes from 12 fewer patients). Each individual file gives a very low TP53 mutation rate of around 40%, but when I combine the four files I get a realistic rate of 85%. For the other genes I have tried, the combined rate is also similar to the one shown in the browser.
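To make the combined calculation concrete, here is a minimal sketch of the union approach described above, not a statement of how the GDC portal does it. It assumes the files are tab-separated MAFs with the standard `Hugo_Symbol` and `Tumor_Sample_Barcode` columns and `#` comment lines, and that the first 12 characters of a TCGA barcode identify the patient. The function names and the toy barcodes are hypothetical, for illustration only.

```python
import csv
import io


def read_maf(handle):
    """Yield one dict per row of a tab-separated MAF, skipping '#' comment lines."""
    lines = (line for line in handle if not line.startswith("#"))
    yield from csv.DictReader(lines, delimiter="\t")


def patient_id(barcode):
    # Assumption: the first 12 characters of a TCGA barcode are the patient ID.
    return barcode[:12]


def mutation_rate(mafs, gene):
    """Fraction of patients with at least one call in `gene` in any of the MAFs.

    The denominator is every patient seen in any file, so a patient with no
    called mutations at all would be missing from it.
    """
    all_patients = set()
    mutated = set()
    for rows in mafs:
        for row in rows:
            pid = patient_id(row["Tumor_Sample_Barcode"])
            all_patients.add(pid)
            if row["Hugo_Symbol"] == gene:
                mutated.add(pid)
    return len(mutated) / len(all_patients) if all_patients else 0.0


# Toy stand-ins for two caller outputs (made-up barcodes, not real data).
MAF_A = (
    "#version 2.4\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\n"
    "TP53\tTCGA-AA-0001-01A\n"
    "KRAS\tTCGA-AA-0002-01A\n"
)
MAF_B = (
    "Hugo_Symbol\tTumor_Sample_Barcode\n"
    "TP53\tTCGA-AA-0002-01A\n"
    "EGFR\tTCGA-AA-0003-01A\n"
    "EGFR\tTCGA-AA-0004-01A\n"
)

combined_rate = mutation_rate(
    [read_maf(io.StringIO(MAF_A)), read_maf(io.StringIO(MAF_B))], "TP53"
)
print(combined_rate)
```

With real data, each `io.StringIO(...)` would be replaced by an open file handle for one of the four downloaded files. Passing a single file in the list gives the per-caller rate for comparison.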
Can anyone please explain how the TCGA browser calculates mutation rates? I am guessing that they combine these four files (as I have done) and that the extra samples are not publicly available. If I want to look up rates programmatically, is it reasonable to use a combination of these files as I have done?