Update: (2/8/2017) This tutorial applies to TCGA MAFs in the GDC Legacy Archive. Most of this tutorial is still valid, but I'll need to update some notes and broken links.
For folks familiar with the VCF format, TCGA's MAF files can be quite a pain to work with. You might just download the latest MAFs, pull loci and alleles for each variant, and redo annotations with ANNOVAR, snpEff, or Ensembl's VEP. Problem solved, right? Nope. You don't know the half of it! There are lots of caveats you should know about, and I try to document them below. Most of these caveats are handled with safe solutions in the MAFs at this page, and the specificity of variant calls is made more comparable across MAFs at this page.
How TCGA MAFs are made
- Tumor-specific Analysis Working Groups (AWGs) take the auto-generated variant calls from the Genome Sequencing Centers (GSCs), and remove false-positive variants, or recover those missed by the GSCs. This is either done manually by experts, or by automated scripts or filters based on the extensive domain knowledge of said experts. So the quality of variant calls may differ wildly between TCGA GSCs and AWGs, and across samples as the pipelines get better over time
- In some tumor-types, and for some subsets of samples, mutations called from the first pass of sequencing (usually exome-seq) are targeted by custom capture arrays and re-sequenced. This results in much higher read-depths, which are used to validate first-pass calls, to get more accurate variant allele fractions (VAFs), or to find calls that were missed
- The GSC-generated or AWG-curated MAFs are uploaded to the DCC under folders nested as
Finding the correct MAF to use
The Ding lab at TGI, WashU tracks the best available MAFs in this spreadsheet for use in MuSiC analyses. TCGA tumor-types COAD (Colon) and READ (Rectum) are treated as the same tumor type "COADREAD" (Colorectal), resulting in 19 MAFs across 20 tumor-types. At this link, folks at Broad Institute maintain a list of MAFs to feed into Firehose
- The 16th column in a MAF lists tumor sample barcodes in the form "TCGA-02-0021-01A-01D-0002-04". Different barcodes may be used for the same sample, if it was sequenced several times or by several GSCs. Be careful not to treat variants repeated under such barcodes as observed recurrence across patients
- Reducing the tumor IDs to the form "TCGA-XX-XXXX-XX" is useful to enumerate the real number of tumors studied, or to de-duplicate variants from the same tumor. And reducing IDs to the form "TCGA-XX-XXXX" helps to enumerate the distinct patients in the cohort. But note that MAF files will not list IDs of samples with zero reported mutations. So it helps to have a separate list of all samples in the cohort
- The breakdown of a TCGA barcode is here and their designations are tabulated here
- To re-annotate variants in a MAF (e.g. using maf2maf.pl), you'll only need column 4 for the reference genome used, columns 5-7 for genomic loci, and columns 11-13 for reference and variant alleles
- The variant alleles in columns 12 and 13 are not always reliable for determining zygosity (some variant callers or curators make bad assumptions). Whichever allele differs from column 11 (the reference allele) is the variant allele, and it may either be heterozygous or homozygous
- Almost everything in the TCGA MAFs is from targeted exome sequencing. 50 of the 200 LAML tumors were whole-genome sequenced, and the putative calls were targeted with custom capture arrays
- Per-sample coverage data can be found here. Genomic loci with sufficient depth to detect variants are listed in BED files, which can be used to correct for variant caller sensitivity. Exome-seq coverage can differ quite a bit between samples, and across the exome, as it has improved over time. But more importantly, several proprietary exome-capture technologies have been used across TCGA, with different targeted regions. And the 3 TCGA GSCs are not very transparent about their internal protocols either! So please make use of this coverage data if you can. It was not easy to collect/clean/liftover!
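The barcode and allele notes above can be sketched in a few lines of Python. This is only a rough illustration, not how any TCGA pipeline does it: the function names are made up, each MAF row is assumed to be already split on tabs into a list of fields, and the 0-based indices are just the column numbers mentioned above minus one.

```python
def tumor_id(barcode):
    """Reduce a full barcode to the TCGA-XX-XXXX-XX form (vial letter dropped)."""
    parts = barcode.split("-")
    return "-".join(parts[:3] + [parts[3][:2]])


def patient_id(barcode):
    """Reduce a full barcode to the TCGA-XX-XXXX form."""
    return "-".join(barcode.split("-")[:3])


def variant_allele(ref, allele1, allele2):
    """Return whichever tumor allele differs from the reference.

    If both match the reference, the reference is returned unchanged,
    so callers may want to flag such rows for a closer look.
    """
    return allele2 if allele2 != ref else allele1


def dedup_variants(maf_rows):
    """Yield one key per distinct variant per tumor.

    Assumed field positions (0-based): 4-6 for the locus, 10-12 for the
    alleles, 15 for the tumor sample barcode.
    """
    seen = set()
    for f in maf_rows:
        key = (
            tumor_id(f[15]),
            f[4], f[5], f[6],
            f[10],
            variant_allele(f[10], f[11], f[12]),
        )
        if key not in seen:
            seen.add(key)
            yield key
```

Keying on the reduced tumor ID rather than the full barcode means the same variant reported under two aliquot barcodes of one tumor is counted only once.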
These caveats may change over time, but I'm listing what I know as of today, August 9th, 2014:
- The latest MAFs for PAAD, PRAD, THCA, and UCS are auto-generated GSC MAFs, and have not yet gone through AWG curation. Avoid these if your analysis is too sensitive to false positives
- Four of the tumors in the BRCA MAF are actually metastases (TCGA-XX-XXXX-06) of four other primary tumors (TCGA-XX-XXXX-01) in the same MAF. Be careful not to treat their variants as observed recurrence between different patients
- Similarly, the SKCM MAF has 2 metastases for 2 other primary tumors. But note that most samples in SKCM (melanoma) are from metastases, because primary SKCM tumors are hard to pinpoint
- Of the 200 sequenced LAML tumors, 3 had no reported exonic mutations. So you won't find them listed in the MAF. This is important to note when measuring something like mutation frequency across samples. Of all the TCGA tumor types, LAML (Acute Myeloid Leukemia) has the lowest overall mutation frequency, and several known/suspected drivers are in the regulome. This was also the motive behind doing whole-genome sequencing for 50 of the 200 LAML tumors, while everything else in TCGA went through exome-seq
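Two of the caveats above are easy to guard against in code. Here is a minimal Python sketch, assuming tumor IDs have already been reduced to the TCGA-XX-XXXX-XX form and that you have a separate list of every tumor in the cohort. The function names are hypothetical, and the IDs in the usage example are invented for illustration.

```python
from collections import defaultdict


def mutated_fraction(mutated_tumors, all_tumors):
    """Fraction of cohort tumors that carry a mutation of interest.

    The denominator is the full cohort list, not the set of tumors seen
    in the MAF, so tumors with zero reported mutations still count.
    """
    cohort = set(all_tumors)
    return len(set(mutated_tumors) & cohort) / len(cohort)


def patients_with_primary_and_metastasis(tumor_ids):
    """List patients that have both a primary (-01) and a metastasis (-06)."""
    types_by_patient = defaultdict(set)
    for tid in tumor_ids:
        patient, sample_type = tid.rsplit("-", 1)
        types_by_patient[patient].add(sample_type)
    return sorted(
        patient
        for patient, types in types_by_patient.items()
        if {"01", "06"} <= types
    )
```

For example, with made-up IDs:

```python
cohort = ["TCGA-AA-0001-01", "TCGA-AA-0001-06", "TCGA-BB-0002-06", "TCGA-CC-0003-01"]
print(mutated_fraction(["TCGA-AA-0001-01", "TCGA-BB-0002-06"], cohort))
print(patients_with_primary_and_metastasis(cohort))
```

Variants shared by the tumors of a patient flagged by the second function should be counted once for that patient, not as recurrence between patients.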