Question

Where is the pathogenicity in TCGA MAF files?

1

7.7 years ago

pel ▴ 20

There do not appear to be any pathogenicity prediction results in most of the MAF files from TCGA. Should one use only nonsense mutations?

TCGA somatic mutations pathogenicity • 2.4k views

0

Thanks - very helpful. Lately we have been fetching TCGA data from cBioPortal at your institution, so a couple of questions (thanks in advance).

- The Provisional "_tcga" datasets seem to be widely used. Are the other datasets for the same cancer subtypes ones that either reused the provisional sets or generated their own data?

- Regarding gene expression, there does not seem to be much expression data, and we have come across groups who state that expression and methylation are less persistent throughout tumor progression than somatic mutations and CNAs in driver genes, hence a tendency not to focus on expression for tumor progression. We think differently, because we have found that the expression of some genes correlates strongly with mutations or CNAs in the provisional breast dataset.

- Last, if you use only pathogenic somatic mutations from COSMIC, selected by FATHMM score, for the TCGA cases (say, breast cancer, provisional), the mutation frequencies (over samples) of the more popular driver genes from, e.g., DRIVERDB will not match the most frequently mutated driver genes from COSMIC (where driver genes for the COSMIC data are based on calculations that infer the order of mutations).

Reply by pel ▴ 20, 7.6 years ago

score 5 · Answer 1 · 2016-08-24

The standard 34 MAF columns described here don't include any pathogenicity prediction data beyond Variant_Classification. However, the various centers that generated these TCGA MAFs added their own columns, with no consistent format. These include scores from tools such as SIFT, PolyPhen, MutationAssessor, and MutationTaster. For consistent annotation, I would recommend that you "Fetch MAFs from Firehose" as described in "Annotating TCGA MAFs with the latest Ensembl/Gencode transcripts".
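Since the extra columns differ from center to center, it can help to look at a MAF header before deciding how to proceed. The following is only a minimal sketch: it assumes a tab-delimited MAF with optional `#` comment lines, and both the keyword list and the function name `find_prediction_columns` are illustrative choices, not part of any TCGA tooling.

```python
import io

# Substrings to look for in column names; an illustrative, non-exhaustive list.
KEYWORDS = ("sift", "polyphen", "mutationassessor", "mutationtaster")


def find_prediction_columns(handle, keywords=KEYWORDS):
    """Return header columns whose names contain any keyword (case-insensitive)."""
    for line in handle:
        # Skip blank lines and comment lines such as "#version 2.4"
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\r\n").split("\t")
        # Underscores are ignored so that "Mutation_Assessor" also matches
        return [c for c in columns
                if any(k in c.lower().replace("_", "") for k in keywords)]
    return []  # no header line found


# Made-up header, for demonstration only
example = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tVariant_Classification\tSIFT_score\tPolyPhen\tMutation_Assessor\n"
)
print(find_prediction_columns(example))
```

An empty result suggests the file carries only Variant_Classification, in which case re-annotating it is probably the more reliable route.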

You can also run maf2maf with VEP as described in that tutorial. Of the more than 100 columns it generates, the following are useful for your purposes:

Consequence - consequence type of this variation as defined here, in decreasing order of severity.

CLIN_SIG - clinical significance of variant per ClinVar. Look for terms like drug_response and/or pathogenic.

PUBMED - PubMed ID(s) of publications that cite this variant. Be wary of large-scale screens without functional validation.

IMPACT - the impact modifier for the consequence type as described here. HIGH means important.

Existing_variation - known IDs of variant (if the variant was seen in some other somatic/germline DB, its ID will be listed here).

SIFT - the SIFT prediction and/or score.

PolyPhen - the PolyPhen prediction and/or score.
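As a rough illustration of how these columns might be combined, here is a sketch in Python that keeps variants flagged by at least one of them. The rule itself is an assumption made for the example, not an established standard: IMPACT equal to HIGH, or a CLIN_SIG term of interest, or SIFT and PolyPhen both predicting damage. It also assumes that SIFT and PolyPhen values look like `deleterious(0.01)` and `probably_damaging(0.99)`, and that multiple CLIN_SIG terms are separated by one of `,`, `&`, `/`, `;` or `|`. The names `is_candidate` and `filter_maf` are hypothetical.

```python
import csv
import io
import re

# ClinVar terms to keep; "drug_response" and "pathogenic" follow the list above,
# "likely_pathogenic" is an added assumption.
CLIN_SIG_TERMS = {"pathogenic", "likely_pathogenic", "drug_response"}


def is_candidate(row):
    """Heuristic: True if any one annotation suggests functional impact."""
    if (row.get("IMPACT") or "") == "HIGH":
        return True
    # Split on assumed delimiters so "benign" does not match "pathogenic"
    terms = re.split(r"[,&/;|]", (row.get("CLIN_SIG") or "").lower())
    if CLIN_SIG_TERMS.intersection(t.strip() for t in terms):
        return True
    sift = (row.get("SIFT") or "").lower()
    polyphen = (row.get("PolyPhen") or "").lower()
    # Require both predictors to agree (also accepts deleterious_low_confidence)
    return sift.startswith("deleterious") and polyphen.startswith("probably_damaging")


def filter_maf(handle):
    """Read a tab-delimited MAF, skip '#' comment lines, return candidate rows."""
    lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if is_candidate(row)]


# Made-up rows with placeholder gene names, for demonstration only
example = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tVariant_Classification\tIMPACT\tCLIN_SIG\tSIFT\tPolyPhen\n"
    "GENE_A\tNonsense_Mutation\tHIGH\t\t\t\n"
    "GENE_B\tMissense_Mutation\tMODERATE\tpathogenic\tdeleterious(0)\tprobably_damaging(0.99)\n"
    "GENE_C\tMissense_Mutation\tMODERATE\t\ttolerated(0.3)\tbenign(0.01)\n"
    "GENE_D\tMissense_Mutation\tMODERATE\t\tdeleterious(0.01)\tbenign(0.1)\n"
)
print([row["Hugo_Symbol"] for row in filter_maf(example)])
```

Whether to require agreement between SIFT and PolyPhen, or to accept either one, is a judgment call; a stricter or looser rule only needs a change to the last line of `is_candidate`.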