Tutorial:Annotating TCGA MAFs with the latest Ensembl/Gencode transcripts
Entering edit mode
10.1 years ago

Update 12/8/2020: Uniform realignment on GRCh38, somatic variant calling, and effect prediction is now done by NCI's GDC in MAFs downloadable from https://portal.gdc.cancer.gov. Use the instructions below only if you want to use the GRCh37 MAFs that most closely match what was published in the first batch of marker papers.


Input: MAF files on TCGA's Firehose, that list somatic variants annotated using Broad's Oncotator

Output: MAF files re-annotated using Ensembl's VEP on GRCh37 transcripts in Gencode v19, using maf2maf

Fetch MAFs from Firehose

Firehose has a nice UI at firebrowse.org, but we'll use their command-line tool instead. They automatically select a "best MAF" per tumor type at this page, that ends up in a tarball named Mutation_Packager_Oncotated_Raw_Calls in Firehose. But some cancer types only have calls in tarballs named Mutation_Packager_Oncotated_Calls, where the selection of the "best MAF" was non trivial.

Fetch the firehose_get tool, and start downloading Mutation_Packager_Oncotated_Raw_Calls tarballs:

wget -P bin http://gdac.broadinstitute.org/runs/code/firehose_get_latest.zip
unzip bin/firehose_get_latest.zip -d bin
./bin/firehose_get -b -only Mutation_Packager_Oncotated_Raw_Calls data latest

That creates a folder structure with tarballs in separate tumor-type subfolders. Find the tumor types with empty subfolders, and download the Mutation_Packager_Oncotated_Calls for those:

find stddata__* -type d -empty | cut -f2 -d/ | xargs ./bin/firehose_get -b -only Mutation_Packager_Oncotated_Calls data latest

Unpack all the tarballs, and rename the resulting subfolders to just the tumor type codes:

mkdir mafs
cat stddata__*/*/*/*Level_3*.tar.gz | tar -izxf - -C mafs
ls -d mafs/* | perl -ne 'chomp; ($t)=m/gdac.broadinstitute.org_(\w+)/; print "mv $_ mafs/$t\n"' | bash

Let's toss the COAD and READ MAFs because these samples are already accounted for under COADREAD. In retrospect, TCGA should have given 1 tumor type code (e.g. CRC) for these molecularly identical tumor types:

rm -rf mafs/{COAD,READ}

Conversely, let's toss the combined cohorts for GBM+LGG, Pan-Kidney, Pan-GI. We'd rather deal with the same samples in their individual tumor types:

rm -rf mafs/{GBMLGG,KIPAN,PANGI}

Set up VEP and maf2maf

Follow these instructions to set up Ensembl's VEP v86 with a local transcript and annotation cache for GRCh37. In this tutorial, I'll assume that both VEP and its cache are set up at /opt/common/CentOS_6-dev/vep/v86.

Download and unpack vcf2maf v1.6.11, which includes the maf2maf tool that we'll be using:

wget -P bin https://github.com/mskcc/vcf2maf/archive/v1.6.11.tar.gz
tar -zxf bin/v1.6.11.tar.gz

Run maf2maf and merge tumor-type MAFs

For each per-sample MAF, we'll run maf2maf with the following tweaks:

  • Use arguments --vep-path and --vep-data to point to /opt/common/CentOS_6-dev/vep/v86
  • Simplify the names of the resulting MAFs in the format sample_barcode.vep.maf
  • For some genes, map variants to MSKCC's canonical isoform, rather than what VEP picks
  • Retain some useful Oncotator columns from the input MAF like sequence context and dbNSFP scores

Use some in-line perl to generate maf2maf commands for each per-sample MAF, and either submit each of those as parallel jobs on a cluster, or pipe them into bash for sequential execution like shown below:

ls mafs/*/*.maf.txt | perl -ne 'chomp; ($s)=m/^(.*TCGA-\w\w-\w{4}-\d\d)/; print "perl vcf2maf-1.6.11/maf2maf.pl --input-maf $_ --output-maf $s.vep.maf --custom-enst vcf2maf-1.6.11/data/isoform_overrides_at_mskcc --vep-path /opt/common/CentOS_6-dev/vep/v86 --vep-data /opt/common/CentOS_6-dev/vep/v86 --ref-fasta /opt/common/CentOS_6-dev/vep/v86/homo_sapiens/86_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --retain-cols Center,Verification_Status,Validation_Status,Mutation_Status,Sequencing_Phase,Sequence_Source,Validation_Method,Score,BAM_file,Sequencer,Tumor_Sample_UUID,Matched_Norm_Sample_UUID,ref_context,i_dbNSFP_MutationAssessor_pred,i_dbNSFP_MutationAssessor_rankscore,i_dbNSFP_MutationAssessor_score,i_dbNSFP_MutationTaster_converted_rankscore,i_dbNSFP_MutationTaster_pred,i_dbNSFP_MutationTaster_score\n" unless(-s "$s.vep.maf")' | bash
ensembl maf tcga vcf • 12k views
Entering edit mode

Hi Cyriac,

Thank you for the excellent tutorial(s). I was following the VEP build tutorial but somehow keep running into an issue when trying to install the Ensembl API and reference FASTAs (Something to do with my perl build - will figure that another day)


Anyway I figured for the purpose of the tutorial above, I only need the reference fasta file really. So I downloaded that directly from http://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz which is where INSTALL.pl is also sourcing from. However, when trying to run the maf2maf.pl, I run into an error like -

>$ ERROR: Cannot retreive bp at 8:42932460! Please check that you chose the correct reference genome with --ref-fasta: vep/v75/homo_sapiens/75/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

>$ ERROR: Failed to run maf2vcf!

>$ Command: /usr/bin/perl vcf2maf-1.6.7/maf2vcf.pl --input-maf mafs/STAD/$MAF --output-dir /tmp/NAFcX2u_G5 --ref-fasta vep/v75/homo_sapiens/75/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --tum-depth-col t_depth --tum-rad-col t_ref_count --tum-vad-col t_alt_count --nrm-depth-col n_depth --nrm-rad-col n_ref_count --nrm-vad-col n_alt_count --per-tn-vcfs*

Most obvious guess seems that the chromosome naming is not the same in the fasta and the maf files ( chr1 vs 1). Do I need to change that first in the maf files ? Or am I doing something incorrect?

2nd question -- Why should we prefer to use v84 VEP instead of v75 specially if we are working with Grch37 instead of Grch38? I initially was trying to use the release 75 of Ensembl VEP (just because all my work is using release 75 files). I amended the scripts similarly, however the v75 INSTALL.pl script does not support plugin download. Seems like the script has been heavily amended between release 75 and 84. I then proceeded to use v84 only to realise later that at the end of the day its still downloading the v75 fasta file and we wouldn't really need the plugin for this tutorial. Am I correct ?

Thanks for your time and help!

Entering edit mode

You'll need to unpack the gzipped FASTA file with

$ gzip -d vep/v75/homo_sapiens/75/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

Ensembl annotation on GRCh37 was frozen at release 75; while newer software releases still support using GRCh37, most data is frozen at 75, hence why the v75 FASTA gets downloaded.

Entering edit mode
10.1 years ago
DG 7.3k

Great tutorial. Is there a particular rational for working with per-sample VCFs and MAFs in this case? I know my preference would be to have it as a multi-sample VCF all of the way through the process.

Entering edit mode

The vcf2maf script cannot handle a multi-sample VCF as input. Also note that in this particular case, a merged multi-sample VCF will be inefficient and enormous because somatic variants are rarely recurrent across samples. If needed, you can make one using VCFtools' vcf-merge.

Entering edit mode

It seems we are doing similar things in parallel universes Dr. Dan Gaston....

Entering edit mode

Welcome to BioStar, Javier! Where we try to consolidate duplication of effort. :)


Login before adding your answer.

Traffic: 2070 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6