Question

Tutorial: Working with MAF files (Mutation Annotation Format) from the TCGA (The Cancer Genome Atlas)

131

11.3 years ago

Cyriac Kandoth 6.1k

Update: (2/8/2017) This tutorial applies to TCGA MAFs in the GDC Legacy Archive. Most of this tutorial is still valid, but I'll need to update some notes and broken links.

Purpose

For folks familiar with the VCF format, TCGA's MAF files can be quite a pain to work with. You might just download the latest MAFs, pull loci and alleles for each variant, and redo annotations with ANNOVAR, snpEff, or Ensembl's VEP. Problem solved, right? Nope. You don't know the half of it! There are lots of caveats you should know about, and I try to document them below. Most of these caveats are handled with safe solutions in the MAFs at this page, and the specificity of variant calls is made more comparable across MAFs at this page.

How TCGA MAFs are made

1. Tumor-specific Analysis Working Groups (AWGs) take the auto-generated variant calls from the Genome Sequencing Centers (GSCs), and remove false-positive variants, or recover those missed by the GSCs. This is either done manually by experts, or by automated scripts or filters based on the extensive domain knowledge of said experts. So the quality of variant calls may differ wildly between TCGA GSCs and AWGs, and across samples as the pipelines get better over time.
2. In some tumor-types, and for some subsets of samples, mutations called from the first-pass of sequencing (usually exome-seq) are targeted by custom capture arrays and re-sequenced. This results in much higher read-depths for the purpose of validating first-pass calls, for more accurate variant allele fractions (VAFs), or for finding calls that were missed.
3. The GSC-generated or AWG-curated MAFs are uploaded to the DCC under folders nested as */gsc/*/*/mutations/

Finding the correct MAF to use

The Ding lab at TGI, WashU tracks the best available MAFs in this spreadsheet for use in MuSiC analyses. TCGA tumor-types COAD (Colon) and READ (Rectum) are treated as the same tumor type "COADREAD" (Colorectal), resulting in 19 MAFs across 20 tumor-types. At this link, folks at the Broad Institute maintain a list of MAFs to feed into Firehose.

MAF-format caveats

1. The 16th column in a MAF lists tumor sample barcodes in the form "TCGA-02-0021-01A-01D-0002-04". Different barcodes may be used for the same sample, if they were sequenced several times or by several GSCs. Be careful not to treat their variants as observed recurrence across patients.
2. Reducing the tumor IDs to the form "TCGA-XX-XXXX-XX" is useful to enumerate the real number of tumors studied, or to de-duplicate variants from the same tumor. And reducing IDs to the form "TCGA-XX-XXXX" helps to enumerate the distinct patients in the cohort. But note that MAF files will not list IDs of samples with zero reported mutations. So it helps to have a separate list of all samples in the cohort.
3. The breakdown of a TCGA barcode is here and their designations are tabulated here.
4. To re-annotate variants in a MAF (e.g. using maf2maf.pl), you'll only need column 4 for the reference genome used, columns 5-7 for genomic loci, and columns 11-13 for reference and variant alleles.
5. The variant alleles in columns 12 and 13 are not always reliable to determine zygosity (some variant callers or curators make bad assumptions). Whichever allele differs from column 11 (the reference allele) is the variant allele, and it may either be heterozygous or homozygous.
6. Almost everything in the TCGA MAFs is from targeted exome sequencing. 50 of the 200 LAML tumors were whole-genome sequenced, and the putative calls were targeted with custom capture arrays.
7. Per-sample coverage data can be found here. Genomic loci with sufficient depth to detect variants are listed in BED files, which can be used to correct for variant caller sensitivity. Exome-seq coverage can differ quite a bit between samples, and across the exome, as it has improved over time. But more importantly, several proprietary exome-capture technologies have been used across TCGA, with different targeted regions. And the 3 TCGA GSCs are not very transparent about their internal protocols either! So please make use of this coverage data if you can. It was not easy to collect/clean/liftover!
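
The column positions and barcode forms in the caveats above can be sketched in Python. This is only a minimal sketch under stated assumptions, not how any TCGA MAF was built: it assumes a tab-delimited MAF with the standard column order, and the helper names (reduce_barcode, variant_allele, unique_variants, toy_row), the toy gene and locus, and the second barcode are all made up for illustration.

```python
import csv
import io

# 0-based indices for the 1-based MAF columns named in the caveats above.
COL_BUILD = 3                             # column 4: reference genome build
COL_CHROM, COL_START, COL_END = 4, 5, 6   # columns 5-7: genomic locus
COL_REF, COL_ALT1, COL_ALT2 = 10, 11, 12  # columns 11-13: reference and tumor alleles
COL_TUMOR = 15                            # column 16: tumor sample barcode


def reduce_barcode(barcode, level="tumor"):
    """Trim a barcode to TCGA-XX-XXXX-XX (tumor) or TCGA-XX-XXXX (patient)."""
    parts = barcode.split("-")
    if level == "patient":
        return "-".join(parts[:3])
    # The 4th field is sample type plus vial letter (e.g. "01A"); keep the 2 digits.
    return "-".join(parts[:3] + [parts[3][:2]])


def variant_allele(ref, allele1, allele2):
    """Return whichever tumor allele differs from the reference (no zygosity claim)."""
    return allele1 if allele1 != ref else allele2


def unique_variants(maf_text):
    """Collect variants keyed by reduced tumor ID, so repeat aliquots collapse."""
    seen = set()
    reader = csv.reader(io.StringIO(maf_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip short lines, comment lines, and the header row.
        if len(row) <= COL_TUMOR or row[0].startswith("#") or row[0] == "Hugo_Symbol":
            continue
        alt = variant_allele(row[COL_REF], row[COL_ALT1], row[COL_ALT2])
        seen.add((reduce_barcode(row[COL_TUMOR]), row[COL_BUILD], row[COL_CHROM],
                  row[COL_START], row[COL_END], row[COL_REF], alt))
    return seen


def toy_row(barcode):
    """Build a made-up 16-column MAF row; only the columns used above are filled."""
    row = [""] * 16
    row[0] = "GENE1"
    row[COL_BUILD], row[COL_CHROM] = "37", "1"
    row[COL_START] = row[COL_END] = "100"
    row[COL_REF], row[COL_ALT1], row[COL_ALT2] = "C", "C", "T"
    row[COL_TUMOR] = barcode
    return "\t".join(row)


# Same tumor reported under two barcodes (the second one is hypothetical).
maf_text = "\n".join([
    toy_row("TCGA-02-0021-01A-01D-0002-04"),
    toy_row("TCGA-02-0021-01A-01W-0188-10"),
])
variants = unique_variants(maf_text)
print(len(variants), "unique variant(s) after collapsing barcodes")
```

Keying on the reduced tumor ID rather than the full barcode is what stops the same variant, sequenced twice from one tumor, from being counted as recurrence across patients.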

Tumor-type caveats

These caveats may change over time, but I'm listing what I know as of today, August 9th, 2014:

1. The latest MAFs for PAAD, PRAD, THCA, and UCS are auto-generated GSC MAFs, and have not yet gone through AWG curation. Avoid these if your analysis is too sensitive to false positives.
2. Four of the tumors in the BRCA MAF are actually metastases (TCGA-XX-XXXX-06) of four other primary tumors (TCGA-XX-XXXX-01) in the same MAF. Be careful not to treat their variants as observed recurrence between different patients.
3. Similarly, the SKCM MAF has 2 metastases for 2 other primary tumors. But note that most samples in SKCM (melanoma) are from metastases, because primary SKCM tumors are hard to pinpoint.
4. Of the 200 sequenced LAML tumors, 3 had no reported exonic mutations. So you won't find them listed in the MAF. This is important to note, when measuring something like mutation frequency across samples. Of all the TCGA tumor types, LAML (Acute Myeloid Leukemia) has the lowest overall mutation frequency, and several known/suspected drivers are in the regulome. This was also the motive behind doing whole-genome sequencing for 50 of the 200 LAML tumors, while everything else in TCGA went through exome-seq.
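
To see why the missing samples matter for mutation frequency, here is a small Python sketch using the LAML numbers above (200 sequenced, 197 listed in the MAF). The sample IDs, the function name mutated_fraction, and the count of 50 mutated samples are invented for illustration only.

```python
def mutated_fraction(mutated_samples, cohort_samples):
    """Fraction of the full cohort carrying a mutation.

    The denominator comes from a separate cohort list, because samples with
    zero reported mutations never appear in the MAF itself.
    """
    cohort = set(cohort_samples)
    mutated = set(mutated_samples) & cohort
    return len(mutated) / len(cohort)


# Hypothetical IDs: 200 sequenced samples, of which only 197 appear in the MAF.
cohort = [f"SAMPLE-{i:04d}" for i in range(200)]
samples_in_maf = cohort[:197]
# Suppose (made-up number) that 50 samples carry a mutation in some gene.
mutated = cohort[:50]

naive = len(mutated) / len(samples_in_maf)   # denominator taken from the MAF
corrected = mutated_fraction(mutated, cohort)  # denominator from the cohort list
print(f"naive={naive:.4f} corrected={corrected:.4f}")
```

With only 3 samples missing the difference is small, but it grows in cohorts where more samples have no reported mutations.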

music cancer tcga mutation maf • 55k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 11.3 years ago by Cyriac Kandoth 6.1k

2

This is fantastically useful - thanks for posting! Do you have anything similar for the copy-number data?

ADD REPLY • link 11.3 years ago by david.wettmann ▴ 30

0

No. But there are definitely fewer caveats to the copy-number data because it was all generated by the same folks, and analyzed by them too. You can find more at Broad Institute's copy number portal, which pulls data from Firehose runs.

ADD REPLY • link 9.7 years ago by Cyriac Kandoth 6.1k

1

Thanks Cyriac. I was interested in your point #5 in "MAF-format caveats" about not always being able to call zygosity based on the variant alleles in columns 12 and 13. Would you be able to give a bit more detail about why this is the case please?

Thanks, Dave

ADD REPLY • link 11.3 years ago by david.wettmann ▴ 30

0

It's often assumed that somatic point mutations or small indels in cancer are infrequent enough, that they almost always result in a homozygous site becoming heterozygous. Combine that imperfect assumption with someone's good old fashioned indifference, and you get MAF columns 12 and 13. :)

ADD REPLY • link 11.3 years ago by Cyriac Kandoth 6.1k

0

Thanks so much for this! Very very useful.

ADD REPLY • link 10.5 years ago by Danielk ▴ 640

1

Thanks for this write-up, it has been very helpful. Regarding the CNV data, do you happen to know what the naming convention is for samples? I can't figure out what the relationship is between mutation sample ids and CNV sample ids.

ADD REPLY • link updated 4.6 years ago by Ram 44k • written 10.2 years ago by gregorymcinnes ▴ 10

0

See again the breakdown of sample barcodes over here. CNV data will be generated from a different portion of the same sample... meaning a different barcode altogether. To match mutation data to CNV data for the same sample, reduce barcodes to the form TCGA-XX-XXXX-XX.
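
As a rough Python sketch of that matching step: the function names and every barcode below are hypothetical, and the only assumption is that both data types use standard TCGA barcodes.

```python
from collections import defaultdict


def tumor_id(barcode):
    """Reduce a full TCGA barcode to the form TCGA-XX-XXXX-XX."""
    parts = barcode.split("-")
    return "-".join(parts[:3] + [parts[3][:2]])


def pair_by_tumor(mutation_barcodes, cnv_barcodes):
    """Map each mutation barcode to the CNV barcodes from the same tumor sample."""
    cnv_by_tumor = defaultdict(list)
    for barcode in cnv_barcodes:
        cnv_by_tumor[tumor_id(barcode)].append(barcode)
    return {bc: cnv_by_tumor.get(tumor_id(bc), []) for bc in mutation_barcodes}


# Hypothetical barcodes: same tumor, different portions/analytes/plates.
mutation_barcodes = ["TCGA-XX-0001-01A-11D-0001-09", "TCGA-XX-0002-01A-11D-0001-09"]
cnv_barcodes = ["TCGA-XX-0001-01A-21D-0002-01", "TCGA-XX-0003-01A-21D-0002-01"]

pairs = pair_by_tumor(mutation_barcodes, cnv_barcodes)
for mutation_barcode, matches in pairs.items():
    print(mutation_barcode, "->", matches or "no CNV data found")
```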

ADD REPLY • link updated 4.6 years ago by Ram 44k • written 10.2 years ago by Cyriac Kandoth 6.1k

2

Has anyone else run into this problem? I went to the TCGA SNP data download center and found that all of these VCF files are controlled-access.

https://portal.gdc.cancer.gov/legacy-archive/search/f?filters=%7B%22op%22:%22and%22,%22content%22:%5B%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22cases.project.primary_site%22,%22value%22:%5B%22Liver%22%5D%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22cases.project.disease_type%22,%22value%22:%5B%22Liver%20Hepatocellular%20Carcinoma%22%5D%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22files.data_category%22,%22value%22:%5B%22Simple%20nucleotide%20variation%22%5D%7D%7D%5D%7D&pagination=%7B%22files%22:%7B%22from%22:81,%22size%22:20,%22sort%22:%22cases.project.project_id:asc%22%7D%7D

ADD REPLY • link 6.4 years ago by Shicheng Guo ★ 9.5k

0

Thanks for this wonderful post Cyriac!

I have been learning about MAF files very recently, and I was wondering which is the correct MAF file to work with when I download data from TCGA. I have downloaded the somatic mutation data for KIRP and see that there are different MAFs from Broad Institute - 1) BI__IlluminaGA_DNASeq, 2) BI__IlluminaGA_DNASeq_automated and 3) BI__IlluminaGA_DNASeq_curated. Why are these different, and which do you think is the best one to use for the analysis?

Thanks, Yaseswini

ADD REPLY • link updated 4.6 years ago by Ram 44k • written 9.7 years ago by yaseswini.neelamraju • 0

0

You're very welcome! See my answer below to Charles explaining why there are different MAFs. And see the spreadsheet from Ding lab in my post above for the "subjective best" MAF to use per tissue type, or the Broad Institute's MAF Dashboard for the "automated best" MAF to use. In a separate tab in the Ding lab spreadsheet, you'll also find the MAF curator's contact info. But in the case of KIRP, they'll likely point you to the HGSC MAF. If you want to choose one of the Broad MAFs, I'd go with one that says it's curated. They have a DESCRIPTION.txt that explains the differences between their uploaded MAFs.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Cyriac Kandoth 6.1k

0

Hi @Cyriac Kandoth

I got a bit confused after reading the entire thread. I want to do some analysis with mutation data, so I think this is the best thread for my query. I recently downloaded the MAFs you linked here, and I see there are different MAF files for the different diseases, which is obviously fine. But is there any file that catalogs all somatic mutations across the major types of cancer? Or do I have to take all the MAF files and build from them a single MAF or mutation file that contains mutations across all cancer types? What I actually want is a somatic mutation file covering most cancer types, which I will use to map the mutations in my samples. Is there any such file in TCGA? I could not find one. Could you shed some light on this?

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by ivivek_ngs ★ 5.2k

0

Several papers here have merged together TCGA MAFs for downstream analyses, but I worry that you might be using it wrong. Most cancer driving mutations do not share the same genomic locus between samples. Start a new post on Biostars detailing your project goals, and we can help. Be sure to tag it with keywords tcga and maf.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Cyriac Kandoth 6.1k

0

Hi guys,

Maybe I am missing something but it seems to me to be non-trivial to convert a MAF file to a VCF one. In my MAF files, frameshift insertions are represented for instance like this:

9       107602665       Frame_Shift_Ins .       C       .       PASS    5d11fd9e-86e8-4e08-96cb-9ca66a7fa283

Instead, in VCF files the record would include the actual base at position 107602665, for instance like this:

9       107602665  Frame_Shift_Ins      A       AC       .       PASS    5d11fd9e-86e8-4e08-96cb-9ca66a7fa283

Annovar will not digest vcf files created straight from MAF file format because of this discrepancy. The only thing I would like to do is to annotate the variants (preferably using Annovar). Does anyone know how to achieve this?

Thank you very much,
Thomas

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 9.1 years ago by thomaskuilman ▴ 850

0

Please create a new question on Biostars, because comments go unnoticed. Anyway, use the maf2vcf script at http://github.com/ckandoth/vcf2maf
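
For anyone curious what such a conversion has to do for indels, here is a minimal Python sketch of the anchor-base step. It is not the code from that repo. It assumes the usual MAF convention where the missing allele is written as "-", an insertion's start position is the base just before the inserted sequence, and a deletion's start position is the first deleted base. The function name maf_indel_to_vcf, the fetch_base callback, and the toy reference dictionary are hypothetical; in practice the base would come from a reference FASTA matching the MAF's genome build.

```python
def maf_indel_to_vcf(chrom, start, ref, alt, fetch_base):
    """Convert MAF-style alleles to VCF-style (chrom, pos, ref, alt).

    fetch_base(chrom, pos) must return the reference base at a 1-based position.
    """
    if ref == "-":
        # Insertion: anchor on the base just before the inserted sequence.
        anchor = fetch_base(chrom, start)
        return chrom, start, anchor, anchor + alt
    if alt == "-":
        # Deletion: anchor on the base just before the first deleted base.
        anchor = fetch_base(chrom, start - 1)
        return chrom, start - 1, anchor + ref, anchor
    # Substitutions need no anchor base.
    return chrom, start, ref, alt


# Toy reference lookup. The "A" at 9:107602665 is taken from the example in the
# question above; the other entry is made up.
toy_reference = {("9", 107602665): "A", ("9", 99): "G"}


def fetch_base(chrom, pos):
    return toy_reference[(chrom, pos)]


insertion = maf_indel_to_vcf("9", 107602665, "-", "C", fetch_base)
deletion = maf_indel_to_vcf("9", 100, "TT", "-", fetch_base)
print(insertion)
print(deletion)
```

The insertion comes out with REF "A" and ALT "AC", matching the VCF-style line in the question.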

ADD REPLY • link 9.1 years ago by Cyriac Kandoth 6.1k

0

Hello, sorry to disturb you. I am using genome music calc-covg, and I have some questions after reading the source code of calcovg.c.

Can you tell me the meaning of the parameters of this function? Thank you.

uint32_t tid, uint32_t pos, int n, const bam_pileup1_t *pl, void *data

static int pileup_func_1( uint32_t tid, uint32_t pos, int n, const bam_pileup1_t *pl, void *data )
{
  pileup_data_t *tmp = (pileup_data_t*)data;
  if( pos >= tmp->beg && pos < tmp->end )
  {
    // Count the number of reads that pass the mapping quality threshold across this base
    int i, mapq_n = 0;
    for( i = 0; i < n; ++i )
    {
      const bam_pileup1_t *base = pl + i;
      if( !base->is_del && base->b->core.qual >= tmp->min_mapq )
        mapq_n++;
    }
    tmp->bam1_cvg[pos - tmp->beg] = ( mapq_n >= tmp->min_depth_bam1 );  // ??
  }
  return 0;
}

ADD REPLY • link updated 4.6 years ago by Ram 44k • written 9.1 years ago by qwertyuiop201320142015 • 0

2

Please post a new question on biostars. The point of the forum is to help other people who have the same question as yours.

ADD REPLY • link 9.1 years ago by Cyriac Kandoth 6.1k

0

Hi Cyriac, thanks for this post. It is now 2016. Do you have an updated version of the MAF files? I believe TCGA has more samples now.

ADD REPLY • link 8.0 years ago by Ming Tommy Tang ★ 4.1k

Ram · Answer 1 · 2013-12-19

4

10.6 years ago

Charles Warden 8.2k

This is a big help.

Also, I've noticed that TCGA seems to have done a better job with presenting the public data for a given disease. For example, check out this site for colorectal data:

https://tcga-data.nci.nih.gov/docs/publications/coadread_2012/

That said, I'm wondering what is the difference between the Broad and WashU MAF files? Why would you use one over the other? For example, there are two MAF files available to download for endometrial cancer (in addition to the somatic MAF file):

https://tcga-data.nci.nih.gov/docs/publications/ucec_2013/

FYI, I am accessing these links from this main page:

https://tcga-data.nci.nih.gov/docs/publications/

ADD COMMENT • link 10.6 years ago by Charles Warden 8.2k

7

Some backstory... TCGA has 3 DNA-sequencing centers (GSCs): WashU, Broad, and Baylor. And there are at least 4 centers that do automated mutation calling: UCSC, Broad, WashU, and Baylor. The tumor-specific AWGs are varied multi-institutional teams with experience in transcriptomics, proteomics, epigenetics, tumor-type experts, etc... but the somatic mutation analyses of each tumor-type are always led by one of the 3 GSCs. See the "GSC Analysis Leads" in this spreadsheet.

Originally, the plan was that 1 or more GSCs would do the sequencing for a tumor-type, and at least 3 centers would generate mutation calls in VCF format and pass those to the AWGs. The AWGs would then consolidate/curate these VCFs to shortlist the high confidence mutations, and annotate them to gene names in a final MAF that is used for subsequent analyses. As it turns out, that plan was all too idealistic!

None of the centers had much experience with VCFs, and inevitably wrote their pipelines around MAFs or some internal tab-delimited format. Only recently did they modify their mutation calling pipelines to use VCFs at the core. VCFs have useful information straight from the variant callers that is important during curation... like read-depths, variant allele fractions, sequence context, reasons for filtering, genotyping info, etc. They may also contain all candidate calls, making it easier to recover false negatives. All of this is lost when you convert to a MAF. Some GSCs added these as additional columns to the tab-delimited MAF format, but there was no standard for column headers... which made it more of a headache when writing parsers.

As of today, only a lead GSC does the sequencing and mutation calling for each tumor-type. Other centers are invited to participate in "network mutation calling", but that doesn't always happen, and if it does, their calls may not make it in time to be incorporated into the final manuscript-ready MAF.

So Broad's MAF for UCEC that you pointed to here is an auto-generated MAF straight out of their pipeline, which prioritizes sensitivity over specificity, while WashU's UCEC MAF went through a ton of manual and automated curation. UCSC also did mutation calling on this dataset, but they generated VCFs which are stored in protected access. WashU was the GSC analysis lead in UCEC, and made the decision to stick with the strictly curated WashU MAF for the final manuscript. Considering the high mutation rate of UCEC, it would have taken too much effort to curate the tens of thousands of additional calls from Broad/UCSC... a majority of which were at allele fractions <10%, or were indels near homopolymers. Some of those indels were important, and WashU restored them manually. Indels near homopolymers are associated with MLH1 loss, common in UCEC.
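
To make the "allele fractions <10%" cutoff concrete, here is a tiny Python sketch. It is not the curation procedure WashU actually ran. It assumes you already have per-call read counts for the variant and reference alleles (MAFs had no standard columns for these), and the call names and counts are invented.

```python
def variant_allele_fraction(alt_reads, ref_reads):
    """VAF = reads supporting the variant / reads supporting variant or reference."""
    depth = alt_reads + ref_reads
    return alt_reads / depth if depth else 0.0


# Invented calls: (name, variant-supporting reads, reference-supporting reads).
calls = [("call_1", 4, 96), ("call_2", 30, 70), ("call_3", 0, 0)]

LOW_VAF = 0.10
low_vaf_calls = [name for name, alt, ref in calls
                 if variant_allele_fraction(alt, ref) < LOW_VAF]
print(low_vaf_calls)
```

A zero-depth call is treated as VAF 0 here, so it lands in the low-VAF pile for review rather than raising an error.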

Another important difference between Broad/WashU MAFs is in the transcript annotation database and the gene names used, and in the selection of an isoform to map a variant's effect onto. Broad uses Oncotator on GAF while WashU uses their annotator on Gencode. As of June 2014, TCGA standardized to Gencode, and Broad has an internal version of Oncotator using Gencode. If you're looking for TCGA MAFs with standardized annotations and gene names try these. Or run your downloaded TCGA MAFs through the maf2maf.pl script in the vcf2maf repo.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 10.6 years ago by Cyriac Kandoth 6.1k

0

@ Charles: Great question.

@all: this video from this year's TCGA Symposium about Multi-Center Mutation Calling in TCGA might be very informative for those who are wondering about the multiple MAF files and how they are produced. BTW: it seems that this is a great meeting. Great to have all talks online.

@ Cyriac: you mentioned that you "track the best available MAFs in this spreadsheet". I guess David Wheeler in the above video very briefly mentions that on slide 20 (15:15), but doesn't say anything more about it. My question is: how are your "best available MAFs" produced?

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by Sebastian Boegel ▴ 520

1

Unfortunately, there is no good automated way to choose the best available TCGA MAF for a tumor type. Broad Institute has an automated way to fetch the latest curated MAF for use in their Firehose analyses. But the latest is not always the best. I keep the spreadsheet updated by staying posted on what the AWGs are up to, but I've been out of the loop recently. So the safest way is to email the AWG MAF curators listed in my spreadsheet, and wait for a reply. Most of them are quite responsive if you keep your email really short and specific.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by Cyriac Kandoth 6.1k

0

Thanks for answering so fast and keeping track of all the information. Last question: are the MAFs in your spreadsheet the ones you used in your Nature publication?

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by Sebastian Boegel ▴ 520

0

Yea, but from an older version of the spreadsheet, and only 12 of the 20+ tumor types could be used in our paper because of a TCGA publication embargo. A data freeze of the MAFs used in the paper, and subsequent cleanup/filtering steps can be found here.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by Cyriac Kandoth 6.1k

Ram · Answer 2 · 2014-05-12

1

10.2 years ago

Sebastian Boegel ▴ 520

Thank you for the tutorial. Maybe this question is a bit off-topic, but does anyone know an NCI wiki page (or similar) that describes how the somatic mutations were called and subsequently validated?

All I found is: https://wiki.nci.nih.gov/display/TCGA/Genome+Sequencing+Center, which doesn't help much.

Thank you

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 10.2 years ago by Sebastian Boegel ▴ 520

1

The somatic calling methods used by the 3 TCGA GSCs have evolved a lot over the years. Sensitivity/specificity of the calls will differ between tumor-types, but is generally consistent across samples of the same tumor-type. Two known exceptions are OV and COAD/READ, where multiple GSCs performed sequencing and/or variant calling.

WashU makes its code public, so skimming through the POD of this module might give you some insight into their latest workflow. But in general, I would recommend reading the methods section of each marker paper listed here.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.2 years ago by Cyriac Kandoth 6.1k

0

Thanks a lot for your helpful answer.

ADD REPLY • link 10.2 years ago by Sebastian Boegel ▴ 520

Ram · Answer 3 · 2014-10-12

1

9.8 years ago

Chip ▴ 130

I've got a question concerning caveat #2:

Since samples with no reported mutations are not listed, where can I find the list of the entire cohort used?

Currently I'm using this list but I haven't found anything saying that all those samples have been analysed to detect mutations.

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.8 years ago by Chip ▴ 130

1

Unfortunately, there's no simple way, but you shouldn't break your head over it, like I did! If a sample has zero mutations reported, it was most likely due to poor exome-sequencing coverage.

The closest to an ideal solution would be to query or browse CGHub for all TCGA samples with WXS/WGS, but then you have to find out which samples were kicked out later by the AWG, because of a QC failure or low-level contamination.

If you just need something quick, you could use sample IDs from the per-sample coverage data collected here.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.8 years ago by Cyriac Kandoth 6.1k

0

OK, thank you very much for the quick and comprehensive reply.

ADD REPLY • link 9.8 years ago by Chip ▴ 130

0

No worries. I also just remembered this nice GUI to look up TCGA sequence data on CGHub. Here's a URL query that returns all WGS/WXS BAMs for TCGA AML. Deduplicate the sample IDs in the form TCGA-XX-XXXX-XX, and you should find a few more than just the 197 IDs seen in the TCGA AML MAF.

ADD REPLY • link 9.8 years ago by Cyriac Kandoth 6.1k

0

To clarify: if a sample is listed as a bed file, it can be assumed that it was then used for variant calling?

ADD REPLY • link 3.2 years ago by rebeliscu ▴ 60

0

Correct. The BED files in https://www.synapse.org/#!Synapse:syn1695394 are all samples that underwent DNA-sequencing and variant calling.

ADD REPLY • link 3.2 years ago by Cyriac Kandoth 6.1k

Ram · Answer 4 · 2015-03-06

0

9.4 years ago

sviatoslav.kendall ▴ 880

Can anyone explain what exactly is meant by "Nonstop_Mutation", a value seen in the "Variant_Classification" column of an MAF file? I don't understand how it would be different from a missense mutation or perhaps a silent mutation...

ADD COMMENT • link 9.4 years ago by sviatoslav.kendall ▴ 880

1

Always better to open a new thread for questions as specific as yours. Makes it easier for others to find it. Nonstop_Mutation is the same as stop_lost explained here.
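
To show how a nonstop (stop_lost) change differs from a missense or silent one, here is a short Python sketch that classifies a single-codon substitution using the standard genetic code. The function name classify_codon_change is made up, and real annotators handle far more cases (indels, splice sites, transcript choice) than this does.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_change(ref_codon, alt_codon):
    """Classify a substitution within one codon, using MAF-style labels."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "Silent"             # same amino acid, or stop stays a stop
    if ref_aa == "*":
        return "Nonstop_Mutation"   # stop codon lost, translation reads through
    if alt_aa == "*":
        return "Nonsense_Mutation"  # new premature stop codon
    return "Missense_Mutation"      # one amino acid replaced by another


for ref_codon, alt_codon in [("TGA", "TGG"), ("TGG", "TGA"), ("TAA", "TAG"), ("GAA", "GTA")]:
    print(ref_codon, "->", alt_codon, classify_codon_change(ref_codon, alt_codon))
```

So a nonstop mutation is specifically a change that turns a stop codon into a coding one, which a missense label (amino acid to amino acid) would not capture.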

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.4 years ago by Cyriac Kandoth 6.1k