Question

TCGA calling pipeline query

7.3 years ago

adampennycuick ▴ 140

Hi all,

I am trying to download mutation data from the TCGA LUSC project. My goal is to be able to programmatically calculate the mutation rate in this cohort for a given gene (e.g. TP53).

Using the (excellent) online browser at https://portal.gdc.cancer.gov and selecting only the LUSC project, I can see there are 195,215 mutations in 504 patient samples. 432 of 504 samples have a TP53 mutation (85%).

I can't seem to get these numbers from the raw data available. Within the LUSC project there are 4 publicly available variant call files - MuTect, VarScan, SomaticSniper and MuSE. Each file lists around 60-90k mutations, from 492 patients. When I combine all four files I find 169,508 mutations - still short of the number in the TCGA browser (but from 12 fewer patients). Each individual file has a very low TP53 mutation rate - around 40% each - but when I combine the four files I get a realistic rate of 85%. Other genes I have tried also give a similar rate to the browser.

Can anyone please explain how the TCGA browser calculates mutation rates? I am guessing that they combine these four files (as I have done) but some samples are not publicly available. If I want to look up rates programmatically, is it a reasonable approach to use a combination of these files as I have done?

Many thanks

TCGA • 2.8k views

ADD COMMENT • link updated 7.3 years ago by 乙 ▴ 240 • written 7.3 years ago by adampennycuick ▴ 140

Have you tried looking here? I found it difficult to find my way to all of these pages, but maybe your answer is in there.

ADD REPLY • link 7.3 years ago by Mathias ▴ 90

Hi there,

Do you mind if I ask how you combined the four files, and how you calculated the TP53 mutation rate?

I think this is a good question. It is important first to understand what the Exploration page reports:

# Affected cases in cohort: Breakdown of affected cases in cohort. Number of cases where Gene is mutated / number of cases tested for Simple Somatic Mutations.

This means that 495 of the 504 samples have been analyzed for somatic mutations.

# Mutations: Number of unique mutations in the Gene in cohort.

This means that, after combining the four different files, 292 distinct mutations are found in TP53 (indels and single-base substitutions).

You also mentioned that you took the raw files and tried to combine them. Did you re-annotate the raw files? If so, note that the tool versions you used should match theirs as closely as possible, and that they used three different annotation tools (VEP, SIFT and PolyPhen). If not, I assume you used their annotated files, in which case you need to look at how you combined them.

Can anyone please explain how the TCGA browser calculates mutation rates? I am guessing that they combine these four files (as I have done) but some samples are not publicly available.

I think it is best to email them at support@nci-gdc.datacommons.io and ask this question. They usually take 1 to 2 business days and will provide you with accurate answers.

If I want to look up rates programmatically, is it a reasonable approach to use a combination of these files as I have done?

I think this is an open question; it is a grey area where opinions are mixed. In my opinion, it is best to take the consensus of at least two to three tools (see this paper) to make sure that you get (significant) results. If you run multiple tools on your sample and take the overlap, you will end up with very few but "trusted" mutations; if you use only one tool, you may get false-positive mutations that could be misleading.
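To make the difference between the union and a consensus concrete, here is a minimal Python sketch. It assumes each caller's output has already been reduced to a set of mutation keys; the function name `consensus_calls`, the key layout and all the data are invented for illustration, not taken from the GDC pipeline.

```python
from collections import defaultdict


def consensus_calls(calls_by_caller, min_callers=2):
    """Map each mutation key to the callers reporting it, keeping only
    keys supported by at least `min_callers` callers."""
    support = defaultdict(set)
    for caller, keys in calls_by_caller.items():
        for key in keys:
            support[key].add(caller)
    return {
        key: sorted(callers)
        for key, callers in support.items()
        if len(callers) >= min_callers
    }


# Invented keys: (sample, chromosome, position, ref, alt).
call_a = ("S1", "chr17", 7675088, "C", "T")
call_b = ("S2", "chr17", 7674220, "C", "T")
call_c = ("S3", "chr12", 25245350, "C", "A")
call_d = ("S4", "chr17", 7673802, "G", "A")

calls = {
    "mutect": {call_a, call_b},
    "varscan": {call_a, call_c},
    "muse": {call_a, call_b},
    "somaticsniper": {call_d},
}

union = set().union(*calls.values())
at_least_two = consensus_calls(calls, min_callers=2)
at_least_three = consensus_calls(calls, min_callers=3)
print(len(union), len(at_least_two), len(at_least_three))  # prints: 4 2 1
```

Raising `min_callers` shrinks the call set, which is the trade-off described above: fewer mutations, but each one supported by more than one algorithm.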

I recently posted a topic comparing the output of two well-known callers and discussing the poor overlap between them. I haven't received a concrete reply yet, but I believe that this ultimately depends on the algorithm each caller uses.

Best regards, Alaa

ADD REPLY • link 7.3 years ago by 乙 ▴ 240

Thanks both. Mathias - that link was very helpful (and hard to find!) - from reading this I think the discrepancy is that the openly accessible files I am using are masked MAF files, which don't contain all mutations, as germline mutations are masked. So to replicate their results I think I'd have to go through the controlled data access process. It's still not quite clear from the document how they combine the four files to get to the numbers on the browser, but I can ask them about this. The fact that I have pretty similar results suggests to me that I am on the right track.

Alaa, to answer your questions - the files downloaded using GenomicDataCommons are already annotated and are in a common format, so I combined them simply using rbind, then searched for duplicated mutations (i.e. same patient, chromosome, position, alt, ref) and removed them. You talk about using overlap of multiple tools - this seems like a more rigorous method, but does not seem to be what the TCGA team have done - each of their tools only gives a TP53 mutation rate of 40% yet they quote an overall rate of 85%, so they must be using an additive rather than reductive method to combine them.

Also thanks for pointing out that 495 rather than 504 samples have been profiled - that was an error on my part.
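For anyone following along, the stack-and-deduplicate approach described above can be sketched in plain Python roughly as follows. The original was done in R with rbind; this is only an illustrative equivalent. The column names follow the GDC MAF format, while the helper names (`union_of_calls`, `mutation_rate`), the sample IDs and the denominator are made up for the example.

```python
# Columns that together identify one mutation in one sample.
KEY_COLUMNS = (
    "Tumor_Sample_Barcode",
    "Chromosome",
    "Start_Position",
    "Reference_Allele",
    "Tumor_Seq_Allele2",
)


def union_of_calls(tables):
    """Stack the per-caller tables and keep one row per mutation key."""
    merged = {}
    for rows in tables:
        for row in rows:
            key = tuple(row[col] for col in KEY_COLUMNS)
            merged.setdefault(key, row)  # the first occurrence wins
    return list(merged.values())


def mutation_rate(rows, gene, n_cases_tested):
    """Fraction of tested samples with at least one call in `gene`."""
    affected = {r["Tumor_Sample_Barcode"] for r in rows if r["Hugo_Symbol"] == gene}
    return len(affected) / n_cases_tested


def make_row(sample, gene, chrom, pos, ref, alt):
    return {
        "Tumor_Sample_Barcode": sample,
        "Hugo_Symbol": gene,
        "Chromosome": chrom,
        "Start_Position": pos,
        "Reference_Allele": ref,
        "Tumor_Seq_Allele2": alt,
    }


# Invented toy tables standing in for two of the four caller files.
mutect = [
    make_row("S1", "TP53", "chr17", 7675088, "C", "T"),
    make_row("S2", "KRAS", "chr12", 25245350, "C", "A"),
]
varscan = [
    make_row("S1", "TP53", "chr17", 7675088, "C", "T"),  # same call as MuTect
    make_row("S3", "TP53", "chr17", 7674220, "C", "T"),
]

combined = union_of_calls([mutect, varscan])
tp53_rate = mutation_rate(combined, "TP53", n_cases_tested=4)
print(len(combined), tp53_rate)
```

Note that the denominator is the number of cases tested for somatic mutations (495 in the LUSC browser figures discussed here), not the total number of cases in the project.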

ADD REPLY • link 7.3 years ago by adampennycuick ▴ 140

Hi Adam, getting the controlled access data would probably be a good idea in any case, but you would have to apply for access. If you download the open access MAF files, then the calls in each (as you've mentioned) will have been produced by different somatic variant callers used at each center where the sequencing was performed. Also as you've found, the same sample may even have been sequenced at 2 or more centers. You have to be sure that your way of identifying these duplicate calls is robust through the creation of a unique 'key'. I downloaded the entire UCEC dataset a few months ago and found that some samples were sequenced in up to 3 different centers.
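As a rough illustration of what such a key could look like, here is a small Python sketch. It assumes duplicates should be collapsed at the patient level, using the first three dash-separated fields of the TCGA barcode, and that chromosome naming and allele case may differ between files. The barcodes and the helper names are invented for the example.

```python
def patient_id(barcode):
    """Project, tissue source site and participant fields of a TCGA barcode."""
    return "-".join(barcode.split("-")[:3])


def mutation_key(row):
    """Normalised key so that the same call from two centers compares equal."""
    chrom = row["Chromosome"]
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return (
        patient_id(row["Tumor_Sample_Barcode"]),
        chrom,
        int(row["Start_Position"]),
        row["Reference_Allele"].upper(),
        row["Tumor_Seq_Allele2"].upper(),
    )


# Two invented rows: the same patient and mutation, different aliquots/centers.
row_center_1 = {
    "Tumor_Sample_Barcode": "TCGA-XX-0001-01A-01D-0001-08",
    "Chromosome": "chr17",
    "Start_Position": "7675088",
    "Reference_Allele": "C",
    "Tumor_Seq_Allele2": "T",
}
row_center_2 = {
    "Tumor_Sample_Barcode": "TCGA-XX-0001-01A-01D-0002-09",
    "Chromosome": "17",
    "Start_Position": 7675088,
    "Reference_Allele": "c",
    "Tumor_Seq_Allele2": "t",
}

unique_keys = {mutation_key(r) for r in (row_center_1, row_center_2)}
print(len(unique_keys))  # prints: 1
```

Comparing the raw barcodes would treat these two rows as different samples, so the choice of key changes both the mutation count and the per-gene rate.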

ADD REPLY • link 6.6 years ago by Kevin Blighe 89k

Also be wary of the flag panel_of_normals in the FILTER column.
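One way to inspect that flag is sketched below in Python. It assumes the FILTER value is either PASS or a semicolon-separated list of flags, as in VCF; check this against the files you actually downloaded. The rows and helper names are invented for illustration.

```python
def has_flag(filter_value, flag="panel_of_normals"):
    """True if `flag` is one of the semicolon-separated FILTER entries."""
    return flag in filter_value.split(";")


def split_by_flag(rows, flag="panel_of_normals"):
    """Separate rows carrying `flag` from the rest so they can be reviewed."""
    kept, flagged = [], []
    for row in rows:
        (flagged if has_flag(row["FILTER"], flag) else kept).append(row)
    return kept, flagged


# Invented rows; only the FILTER column matters here.
rows = [
    {"Hugo_Symbol": "TP53", "FILTER": "PASS"},
    {"Hugo_Symbol": "TP53", "FILTER": "panel_of_normals"},
    {"Hugo_Symbol": "KRAS", "FILTER": "clustered_events;panel_of_normals"},
]

kept, flagged = split_by_flag(rows)
print(len(kept), len(flagged))  # prints: 1 2
```

The sketch only separates the flagged calls rather than dropping them, since whether to exclude them depends on the analysis.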

ADD REPLY • link 7.3 years ago by Kevin Blighe 89k

score 0 · Answer 1 · 2018-04-12

Hey again,

Back to your question:

Can anyone please explain how the TCGA browser calculates mutation rates?

The Gene/Mutation data for these exploration visualizations come from the Open-Access MAF files on the GDC Portal. If you take the union of the mutations in these files, you should arrive at the numbers you see on the GDC portal. The same goes for the number of mutations in a particular gene and their impact. The impact is taken from the canonical transcript. You can find out more about the MAF format here: https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/

Best regards, Alaa