Update: (2/8/2017) This tutorial applies to TCGA MAFs in the GDC Legacy Archive. Most of this tutorial is still valid, but I'll need to update some notes and broken links.
For folks familiar with the VCF format, TCGA's MAF files can be quite a pain to work with. You might just download the latest MAFs, pull loci and alleles for each variant, and redo annotations with ANNOVAR, snpEff, or Ensembl's VEP. Problem solved, right? Nope. You don't know the half of it! There are lots of caveats you should know about, and I try to document them below. Most of these caveats are handled with safe solutions in the MAFs at this page, and the specificity of variant calls is made more comparable across MAFs at this page.
How TCGA MAFs are made
- Tumor-specific Analysis Working Groups (AWGs) take the auto-generated variant calls from the Genome Sequencing Centers (GSCs), and remove false-positive variants, or recover those missed by the GSCs. This is either done manually by experts, or by automated scripts or filters based on the extensive domain knowledge of said experts. So the quality of variant calls may differ wildly between TCGA GSCs and AWGs, and across samples as the pipelines get better over time
- In some tumor types, and for some subsets of samples, mutations called from the first pass of sequencing (usually exome-seq) are targeted by custom capture arrays and re-sequenced. This yields much higher read depths, which helps validate first-pass calls, estimate variant allele fractions (VAFs) more accurately, and find calls that were missed
- The GSC-generated or AWG-curated MAFs are uploaded to the DCC under folders nested as
Finding the correct MAF to use
The Ding lab at TGI, WashU tracks the best available MAFs in this spreadsheet for use in MuSiC analyses. TCGA tumor-types COAD (Colon) and READ (Rectum) are treated as the same tumor type "COADREAD" (Colorectal), resulting in 19 MAFs across 20 tumor-types. At this link, folks at Broad Institute maintain a list of MAFs to feed into Firehose
- The 16th column in a MAF lists tumor sample barcodes in the form "TCGA-02-0021-01A-01D-0002-04". Different barcodes may be used for the same sample, if they were sequenced several times or by several GSCs. Be careful not to treat their variants as observed recurrence across patients
- Reducing the tumor IDs to the form "TCGA-XX-XXXX-XX" is useful to enumerate the real number of tumors studied, or to de-duplicate variants from the same tumor. And reducing IDs to the form "TCGA-XX-XXXX" helps to enumerate the distinct patients in the cohort. But note that MAF files will not list IDs of samples with zero reported mutations. So it helps to have a separate list of all samples in the cohort
- The breakdown of a TCGA barcode is here and their designations are tabulated here
- To re-annotate variants in a MAF (e.g. using maf2maf.pl), you'll only need column 4 for the reference genome used, columns 5-7 for genomic loci, and columns 11-13 for reference and variant alleles
- The variant alleles in columns 12 and 13 are not always reliable to determine zygosity (some variant callers or curators make bad assumptions). Whichever allele differs from column 11 (the reference allele) is the variant allele, and it may either be heterozygous or homozygous
- Almost everything in the TCGA MAFs is from targeted exome sequencing. 50 of the 200 LAML tumors were whole-genome sequenced, and the putative calls were targeted with custom capture arrays
- Per-sample coverage data can be found here. Genomic loci with sufficient depth to detect variants are listed in BED files, which can be used to correct for variant caller sensitivity. Exome-seq coverage can differ quite a bit between samples, and across the exome, as it has improved over time. But more importantly, several proprietary exome-capture technologies have been used across TCGA, with different targeted regions. And the 3 TCGA GSCs are not very transparent about their internal protocols either! So please make use of this coverage data if you can. It was not easy to collect/clean/liftover!
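The barcode and column notes above can be sketched in a few lines of Python. This is only an illustration, not the logic of maf2maf.pl: the function names are made up, the column positions are the ones quoted in the list, and the code assumes a tab-delimited MAF whose comment lines start with "#" and whose header row starts with "Hugo_Symbol".

```python
import csv
import io

# 0-based indexes for the 1-based MAF columns named in the notes above.
COL_BUILD = 3                              # column 4: reference genome build
COL_CHROM, COL_START, COL_END = 4, 5, 6    # columns 5-7: genomic locus
COL_REF, COL_ALT1, COL_ALT2 = 10, 11, 12   # columns 11-13: alleles
COL_TUMOR_BARCODE = 15                     # column 16: tumor sample barcode


def reduce_barcode(barcode, level="sample"):
    """Trim a barcode to TCGA-XX-XXXX-XX (sample) or TCGA-XX-XXXX (patient)."""
    parts = barcode.split("-")
    if level == "patient":
        return "-".join(parts[:3])
    # The 4th field is a 2-digit sample type plus a vial letter, e.g. "01A".
    return "-".join(parts[:3] + [parts[3][:2]])


def variant_alleles(ref, allele1, allele2):
    """Return the distinct alleles in columns 12-13 that differ from column 11."""
    found = []
    for allele in (allele1, allele2):
        if allele != ref and allele not in found:
            found.append(allele)
    return found


def minimal_records(maf_text):
    """Yield only the fields needed for re-annotation and de-duplication."""
    reader = csv.reader(io.StringIO(maf_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumption: "#" marks comments, "Hugo_Symbol" marks the header row.
        if not row or row[0].startswith("#") or row[0] == "Hugo_Symbol":
            continue
        yield {
            "build": row[COL_BUILD],
            "chrom": row[COL_CHROM],
            "start": int(row[COL_START]),
            "end": int(row[COL_END]),
            "ref": row[COL_REF],
            "alts": variant_alleles(row[COL_REF], row[COL_ALT1], row[COL_ALT2]),
            "tumor": reduce_barcode(row[COL_TUMOR_BARCODE]),
        }
```

Note that `variant_alleles` deliberately says nothing about zygosity; it only reports which alleles differ from the reference, in line with the caveat about columns 12 and 13.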
These caveats may change over time, but I'm listing what I know as of today, August 9th, 2014:
- The latest MAFs for PAAD, PRAD, THCA, and UCS are auto-generated GSC MAFs, and have not yet gone through AWG curation. Avoid these if your analysis is too sensitive to false positives
- Four of the tumors in the BRCA MAF are actually metastases (TCGA-XX-XXXX-06) of four other primary tumors (TCGA-XX-XXXX-01) in the same MAF. Be careful not to treat their variants as observed recurrence between different patients
- Similarly, the SKCM MAF has 2 metastases for 2 other primary tumors. But note that most samples in SKCM (melanoma) are from metastases, because primary SKCM tumors are hard to pinpoint
- Of the 200 sequenced LAML tumors, 3 had no reported exonic mutations. So you won't find them listed in the MAF. This is important to note, when measuring something like mutation frequency across samples. Of all the TCGA tumor types, LAML (Acute Myeloid Leukemia) has the lowest overall mutation frequency, and several known/suspected drivers are in the regulome. This was also the motive behind doing whole-genome sequencing for 50 of the 200 LAML tumors, while everything else in TCGA went through exome-seq
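To make the denominator issue concrete, here is a minimal sketch of a per-gene mutation frequency that counts patients rather than barcodes and takes the cohort size from a separate sample list instead of from the MAF. The function names and the `(gene, barcode)` input shape are assumptions for illustration.

```python
def patient_id(barcode):
    """Reduce any TCGA barcode to the patient-level form TCGA-XX-XXXX."""
    return "-".join(barcode.split("-")[:3])


def mutation_frequency(maf_rows, cohort_barcodes, gene):
    """Fraction of patients in the cohort with at least one mutation in `gene`.

    maf_rows: iterable of (gene_symbol, tumor_barcode) pairs taken from a MAF.
    cohort_barcodes: every sequenced sample, including those with zero
    reported mutations, which never appear in the MAF itself.
    """
    cohort = {patient_id(b) for b in cohort_barcodes}
    # Collapsing to patients keeps a primary and its metastasis from being
    # counted as recurrence across two different patients.
    mutated = {patient_id(b) for g, b in maf_rows if g == gene}
    return len(mutated & cohort) / len(cohort)
```

With a cohort of 200 tumors of which 3 are absent from the MAF, dividing by the number of distinct IDs in the MAF would use 197 instead of 200 as the denominator.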
This is fantastically useful - thanks for posting! Do you have anything similar for the copy-number data?
No. But there are definitely fewer caveats to the copy-number data because it was all generated by the same folks, and analyzed by them too. You can find more at Broad Institute's copy number portal, which pulls data from Firehose runs.
Thanks Cyriac. I was interested in your point #5 in "MAF-format caveats" about not always being able to call zygosity based on the variant alleles in columns 12 and 13. Would you be able to give a bit more detail about why this is the case please?
It's often assumed that somatic point mutations or small indels in cancer are infrequent enough, that they almost always result in a homozygous site becoming heterozygous. Combine that imperfect assumption with someone's good old fashioned indifference, and you get MAF columns 12 and 13. :)
Thanks so much for this! Very very useful.
Thanks for this write-up, it has been very helpful. Regarding the CNV data, do you happen to know what the naming convention is for samples? I can't figure out what the relationship is between mutation sample ids and CNV sample ids.
See again the breakdown of sample barcodes over here. CNV data will be generated from a different portion of the same sample... meaning a different barcode altogether. To match mutation data to CNV data for the same sample, reduce barcodes to the form TCGA-XX-XXXX-XX.
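A rough sketch of that matching step, assuming you already have the two lists of full barcodes in hand. The function names are illustrative and the CNV barcode used in the usage note is invented.

```python
from collections import defaultdict


def sample_id(barcode):
    """Reduce a TCGA barcode to the sample-level form TCGA-XX-XXXX-XX."""
    parts = barcode.split("-")
    return "-".join(parts[:3] + [parts[3][:2]])


def match_by_sample(mutation_barcodes, cnv_barcodes):
    """Group full barcodes by reduced sample ID; keep IDs seen in both sets."""
    index = defaultdict(lambda: {"mutation": [], "cnv": []})
    for barcode in mutation_barcodes:
        index[sample_id(barcode)]["mutation"].append(barcode)
    for barcode in cnv_barcodes:
        index[sample_id(barcode)]["cnv"].append(barcode)
    # Drop samples that only have one of the two data types.
    return {s: v for s, v in index.items() if v["mutation"] and v["cnv"]}
```

For example, a mutation barcode "TCGA-02-0021-01A-01D-0002-04" and a (made-up) CNV barcode "TCGA-02-0021-01A-01D-0185-01" both reduce to "TCGA-02-0021-01" and so end up paired.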
Has anyone else run into this problem? I went to the TCGA SNP data download center and found that all of the VCF files are controlled access.
Thanks for this wonderful post Cyriac!
I have been learning about the MAF Files very recently and I was wondering what is the correct MAF File to work with when I download the data from TCGA. I have downloaded the somatic mutation data for KIRP and see that there are different MAFs from Broad Institute - 1)
BI__IlluminaGA_DNASeq_curated. Why are these different, and which do you think is the best one to use for the analysis?
You're very welcome! See my answer below to Charles explaining why there are different MAFs. And see the spreadsheet from Ding lab in my post above for the "subjective best" MAF to use per tissue type, or the Broad Institute's MAF Dashboard for the "automated best" MAF to use. In a separate tab in the Ding lab spreadsheet, you'll also find the MAF curator's contact info. But in the case of KIRP, they'll likely point you to the HGSC MAF. If you want to choose one of the Broad MAFs, I'd go with one that says it's curated. They have a DESCRIPTION.txt that explains the differences between their uploaded MAFs.
Hi @Cyriac Kandoth
I got a bit confused after reading the entire thread. I want to do some analysis with mutation data, so I think this is the best thread for my query. I recently downloaded the MAFs you linked here, and I see that there are different MAF files for the different diseases, which is obviously fine. But I want to know: is there any file that catalogs all somatic mutations across the major cancer types? Or do I have to take all the MAF files and build a combined MAF or mutation file covering all cancer types myself? What I actually want is a somatic mutation file across most cancer types, which I will use to map the mutations for my samples. Is there any such file in TCGA? I could not find one. I would appreciate it if you could shed some light on this.
Several papers here have merged together TCGA MAFs for downstream analyses, but I worry that you might be using it wrong. Most cancer driving mutations do not share the same genomic locus between samples. Start a new post on Biostars detailing your project goals, and we can help. Be sure to tag it with keywords
Maybe I am missing something but it seems to me to be non-trivial to convert a MAF file to a VCF one. In my MAF files, frameshift insertions are represented for instance like this:
Instead, in vcf-files the annotation would include the information of the actual base at position 107602665 like for instance this
Annovar will not digest vcf files created straight from MAF file format because of this discrepancy. The only thing I would like to do is to annotate the variants (preferably using Annovar). Does anyone know how to achieve this?
Thank you very much,
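The discrepancy described above comes down to the anchor base: a MAF writes "-" for the missing side of an indel, while a VCF needs the preceding reference base on both alleles. A minimal sketch of the conversion follows, under two assumptions that should be checked against your own files: that an insertion's start position in the MAF is the base just before the inserted sequence, and that a deletion's start position is the first deleted base. The `fetch_base` callable is hypothetical; in practice it would be backed by a lookup into the reference FASTA for the build named in column 4.

```python
def maf_to_vcf_alleles(chrom, start, ref, alt, fetch_base):
    """Convert MAF-style (start, ref, alt) to VCF-style (pos, ref, alt).

    fetch_base(chrom, pos) must return the reference base at a 1-based
    position; it is a placeholder for a real reference genome lookup.
    """
    if ref == "-":
        # Insertion: assumed to be anchored on the base at `start`.
        anchor = fetch_base(chrom, start)
        return start, anchor, anchor + alt
    if alt == "-":
        # Deletion: anchor on the base just before the first deleted base.
        pos = start - 1
        anchor = fetch_base(chrom, pos)
        return pos, anchor + ref, anchor
    # SNVs and other substitutions need no anchor base.
    return start, ref, alt
```

The returned tuple can then be written into the POS, REF, and ALT fields of a VCF line before handing the file to an annotator.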
Hello, I am sorry to disturb you. I am using genome music calc-covg, and when I read the source code of calcovg.c, I ran into some problems.
can you tell me the meaning of these parameters of this function? thank u
who can tell me the meaning of these parameters of this function? thank u
Please post a new question on biostars. The point of the forum is to help other people who have the same question as yours.
Hi Cyriac, thanks for this post. It is now 2016. Do you have an updated version of the MAF files? I believe TCGA has more samples now.