Variant allele frequency in TCGA and ICGC
6.7 years ago

Hi all

For a cancer research project that some labmates and I are conducting, I would like to obtain lists of cancer mutations along with their variant allelic frequency (AF) for a wide range of malignancies. Looking at the TCGA, I noticed that the freely accessible MAF file format does not include this data field, but the TCGA VCF standard does list AF as an optional key in the INFO field. The VCF files are all under the restricted access tier, however.

Could someone with access to this data give me an indication of the percentage of TCGA malignancies that include AF in their mutation call files? Any pointers on other ways to find such data would also be highly appreciated!

Best,
Maarten

For future reference in this rather daunting task of finding pan-cancer MAFs: there is no standard column that stores VAF in TCGA MAFs (see this post). Instead, the column names vary between the GDACs that created the MAFs.

HGSC-generated files include columns named: TTotCov TVarCov NTotCov NVarCov

Broad Institute-generated files include columns named: t_alt_count t_ref_count

The Broad's list above does not include Sanger MAFs, of which at least one (example here) includes the fields n_ref_count and a_ref_count.

Between those, almost all malignancies should be covered with a MAF that includes VAF. For those that seem outdated on the Broad's list (e.g. COAD, revised in 2013), the reference on the TCGA page is just as old, but I didn't check all malignancies here.

Do you have any idea about "i_TVarCov"? Sometimes there are two numbers like 19|18. What does this mean?

Thanks! I've been looking in Firehose and the TCGA data portal for MAFs that contain additional fields, but no luck yet in finding any that include REF and ALT allele counts. I will get back to you as soon as I'm successful.

Thanks once again! The UCSC-produced MAF files I looked at do indeed include REF and ALT allele counts. Is there a 1-to-1 correspondence between the presence of these fields in the MAF file and their presence in the corresponding protected VCF file? In other words, would it still be useful to apply for access to the protected VCF files?

And I should have mentioned that ICGC maintains a much cleaner DCC than TCGA here: https://dcc.icgc.org/repository/icgc/current/

Try their .TSV files of somatic mutations. I believe they have allele counts for at least a subset of tumor types.

6.7 years ago

You're on the right track with those column name aliases. I have seen: TVarCov, t_alt_count, tumor_var_reads, TumorVarReads_WU, i_t_alt_count for tumor variant allele counts.

Most of the MAF files listed here have the necessary allele depths to measure Variant Allele Fractions (VAFs). Note that we call it "Fractions" to avoid confusion with population "Frequencies", commonly used in the germline world. I already collected, fixed, liftOvered, re-annotated, and filtered TCGA MAFs into a standardized format downloadable here. Unpack the tarball and see the readme for proper provenance on what was done. The columns are standardized to this format, which uses t_alt_count and t_ref_count for your purposes. If you or anyone uses this data for their papers, then please cite Mutational landscape and significance across 12 major cancer types.

You should be aware that these allele counts are based on different callers that use different mapping/base quality cutoffs. They should be comparable within a cohort (tumor type), but not necessarily across cohorts.
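To make the column-name issue concrete, here is a minimal Python sketch that computes a tumor VAF from a single MAF row, covering only the two conventions whose denominators are named in this thread (t_alt_count/t_ref_count and TVarCov/TTotCov). The function names are hypothetical, and treating TTotCov as total tumor depth (reference plus variant reads) is an assumption based on the column name, so check it against the provider's documentation before relying on it.

```python
from typing import Mapping, Optional

# Column conventions named in this thread. Treating TTotCov as total tumor
# depth (ref + alt reads) is an ASSUMPTION based on the column name.
ALT_REF_PAIRS = [("t_alt_count", "t_ref_count")]
VAR_TOTAL_PAIRS = [("TVarCov", "TTotCov")]


def _to_int(value: object) -> Optional[int]:
    """Return value as a non-negative int, or None if missing or malformed."""
    try:
        number = int(str(value).strip())
    except ValueError:
        return None
    return number if number >= 0 else None


def tumor_vaf(row: Mapping[str, str]) -> Optional[float]:
    """Compute the tumor variant allele fraction for one MAF row, if possible."""
    # Convention 1: separate alt and ref read counts.
    for alt_col, ref_col in ALT_REF_PAIRS:
        if alt_col in row and ref_col in row:
            alt, ref = _to_int(row[alt_col]), _to_int(row[ref_col])
            if alt is None or ref is None or alt + ref == 0:
                return None
            return alt / (alt + ref)
    # Convention 2: variant read count plus (assumed) total coverage.
    for var_col, total_col in VAR_TOTAL_PAIRS:
        if var_col in row and total_col in row:
            var, total = _to_int(row[var_col]), _to_int(row[total_col])
            if var is None or total is None or total == 0:
                return None
            return var / total
    # No recognized column pair in this row.
    return None
```

Values that are not plain integers (for example the "19|18" style entries mentioned earlier in the thread) come back as None rather than being guessed at, so they can be inspected separately.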

Thanks a lot Cyriac! That's an impressive number of columns you've added, and it indeed makes this data very attractive. We're striving to conduct a pan-cancer analysis and would strongly prefer to restrict ourselves to donors for which RNAseq is also available, provided that this restriction leaves us with a sufficiently large body of donors. The ICGC data portal seems attractive, as it has both data types for a large number of patients (~4500) and also incorporates other projects besides TCGA. Do you perhaps already know how many of the donors included in your MAFs (8804) also have RNAseq available? I'm not sure whether the MAFs in the ICGC have been subjected to a filtering procedure as rigorous as yours.

Good point on the variant calling differences between the cohorts. I think that is something we will have to live with but account for in our analyses, as it seems impractical to do the variant calling on all the raw sequencing files ourselves (we're a small team). I'm looking forward to the PCAWG for that purpose!

I don't have a recent intersection between RNA-seq and exome-seq from TCGA. But it shouldn't be much trouble to grab all available RSEM matrices from Firehose...

# Use firehose_get to download the latest RNAseqv2 data across TCGA tumor types:
unzip firehose_get_latest.zip
./firehose_get -b -only Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data data latest
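
Once both data types are downloaded, the donor intersection could be sketched roughly as follows in Python. This assumes you have already collected the tumor barcodes from the MAFs (the Tumor_Sample_Barcode column) and the sample barcodes from the RSEM matrix headers; the function names and the example barcodes in the usage note are made up for illustration. It relies on the first three dash-separated fields of a TCGA barcode identifying the patient.

```python
from typing import Iterable, Set


def patient_id(barcode: str) -> str:
    """Reduce a TCGA sample/aliquot barcode to its patient-level prefix.

    TCGA barcodes start with project-TSS-participant (e.g. TCGA-AA-0001),
    i.e. the first three dash-separated fields.
    """
    return "-".join(barcode.strip().upper().split("-")[:3])


def shared_patients(maf_barcodes: Iterable[str],
                    rnaseq_barcodes: Iterable[str]) -> Set[str]:
    """Return patient IDs present in both the mutation and RNA-seq data."""
    maf_patients = {patient_id(b) for b in maf_barcodes}
    rna_patients = {patient_id(b) for b in rnaseq_barcodes}
    return maf_patients & rna_patients
```

For example, shared_patients(["TCGA-AA-0001-01A-01D-0001-08"], ["TCGA-AA-0001-01A-11R-0002-07"]) returns the single patient ID TCGA-AA-0001. Matching at patient level ignores sample type, so if you need the RNA-seq and exome data to come from the same tumor sample, compare a longer barcode prefix instead.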

Thanks once again, this is very useful. I found the oncotated MAFs offered by the Broad to be very complete as well, although they don't seem to offer complete MAFs for THYM, MESO and ESCA.

I noticed that the column names deviate from what the readme says should be there. In the STAD and SKCM files, for instance, I find the following:

1-34 Standard TCGA MAF V2.2 column names
35 Genome_Change
36 Annotation_Transcript
37 Transcript_Strand
38 Transcript_Exon
39 Transcript_Position
40 cDNA_Change
41 Codon_Change
42 Protein_Change
43 Other_Transcripts
44 Refseq_mRNA_Id
45 Refseq_prot_Id
46 SwissProt_acc_Id
47 SwissProt_entry_Id
48 Description
49 UniProt_AApos
50 UniProt_Region
51 UniProt_Site
52 UniProt_Natural_Variations
53 UniProt_Experimental_Info
54 GO_Biological_Process
55 GO_Cellular_Component
56 GO_Molecular_Function
57 COSMIC_overlapping_mutations
58 COSMIC_fusion_genes
59 COSMIC_tissue_types_affected
60 COSMIC_total_alterations_in_gene
61 Tumorscape_Amplification_Peaks
62 Tumorscape_Deletion_Peaks
63 TCGAscape_Amplification_Peaks
64 TCGAscape_Deletion_Peaks
65 DrugBank
66 ref_context
67 gc_content
68 CCLE_ONCOMAP_overlapping_mutations
69 CCLE_ONCOMAP_total_mutations_in_gene
70 CGC_Mutation_Type
71 CGC_Translocation_Partner
72 CGC_Tumor_Types_Somatic
73 CGC_Tumor_Types_Germline
74 CGC_Other_Diseases
75 DNARepairGenes_Role
76 FamilialCancerDatabase_Syndromes
77 MUTSIG_Published_Results
78 OREGANNO_ID
79 OREGANNO_Values
80 t_alt_count
81 t_ref_count
82 validation_alt_allele
83 validation_method
84 validation_status
85 validation_tumor_sample
86 pox
87 qox
88 pox_cutoff
89 isArtifactMode
90 oxoGCut
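
Since column positions like these differ between files, it is probably safer to address MAF columns by header name than by index. A minimal Python sketch, assuming a tab-separated MAF whose optional leading comment lines start with "#" (the read_maf name and the example rows are hypothetical):

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def read_maf(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield MAF rows as dicts keyed by header name, skipping '#' comment lines."""
    data_lines = (
        line for line in handle
        if line.strip() and not line.startswith("#")
    )
    # QUOTE_NONE: treat quote characters inside annotation fields literally.
    yield from csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)


# Made-up two-line example standing in for a real MAF file.
example = io.StringIO(
    "#version 2.2\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\tt_alt_count\tt_ref_count\n"
    "TP53\tTCGA-AA-0001-01A\t18\t42\n"
)
rows = list(read_maf(example))
print(rows[0]["t_alt_count"], rows[0]["t_ref_count"])
```

With this, t_alt_count and t_ref_count are found whether they sit in columns 80 and 81 or anywhere else.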


Also, do you already offer this dataset on Synapse, or could you? That would be great for reference purposes, thanks!

The tarball and readme I provided are not officially from TCGA. They're just my effort to standardize column names across TCGA MAFs... plus the gene names, transcripts, variant effect annotation, etc. using the maf2maf tool. Broad has a more portable version of oncotator now, and they'd normally run it on all the Firehose MAFs and make it available at this link. If you want all these extra columns that oncotator generates, I'd recommend downloading oncotator and running it on all the MAFs listed here.

6.7 years ago

In the VCF spec, AF is the allele frequency segregating in the population, which is a different concept than the Variant Allele Frequency (VAF). The VAF is what I think you are referring to in your question. The MAF files for some of the TCGA diseases from some MAF providers do provide the information to calculate the VAF. For example, look for MAFs produced by UCSC, as they produce MAF files with REF and ALT allele counts in tumor and normal.