Where to get somatic mutations from WGS data?!
5.7 years ago
dankoc ▴ 20

What is the best way to download somatic mutations called from whole-genome sequencing (WGS) data in The Cancer Genome Atlas (TCGA) project? Somatic mutations in VCF or MAF format would be preferable, but I will download BAMs and call mutations myself if necessary.

I have spent a lot of time browsing the Genomic Data Commons (GDC) data portal (https://gdc-portal.nci.nih.gov/) today and can find only whole exome sequencing data.

cancer genomics data • 4.5k views

Some of the MAF files contain somatic variants from WGS; check the Sequence_Source column in the MAF file. It's also worth mentioning that TCGA provides only somatic variants and no germline SNPs, for privacy reasons. I'm not sure whether one can get WGS BAM files, for the same reason.
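A minimal sketch of that check in Python, assuming a tab-delimited MAF with optional '#' comment lines and a Sequence_Source column. The sample rows and barcodes are made up for illustration and are not real TCGA data.

```python
import csv
import io

# Made-up MAF excerpt for illustration only; real files have many more columns.
SAMPLE_MAF = (
    "#version 2.4\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Classification\tSequence_Source\n"
    "TP53\tSAMPLE-A\tMissense_Mutation\tWXS\n"
    "PTEN\tSAMPLE-B\t3'UTR\tWGS\n"
    "KRAS\tSAMPLE-B\tIGR\tWGS\n"
)


def read_maf(handle):
    """Yield one dict per variant row, skipping '#' comment lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    yield from csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)


def wgs_rows(handle):
    """Return only the rows whose Sequence_Source is WGS."""
    return [
        row for row in read_maf(handle)
        if row.get("Sequence_Source", "").strip().upper() == "WGS"
    ]


wgs = wgs_rows(io.StringIO(SAMPLE_MAF))
wgs_genes = [row["Hugo_Symbol"] for row in wgs]
```

For a real file, replace the `io.StringIO(...)` handle with `open(path, newline="")`.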


Hi Charles, I am reviving this old post because I am facing the same issue right now. Did you ever find somewhere to download cancer somatic mutations from WGS? Thanks.


I have posted an answer below.

4.4 years ago
sacha ★ 2.3k

If you need cancer genomics data, take a look at ICGC (https://dcc.icgc.org/) and the PCAWG project (https://dcc.icgc.org/pcawg).


Unfortunately, I have access to controlled data on TCGA but not on ICGC, so I cannot get TCGA data from ICGC.

4.4 years ago

As mentioned, you can obtain the TCGA somatic mutations in Mutation Annotation Format (MAF) from the GDC Legacy Archive. This data is open access, i.e., anyone can download it. To convert MAF to VCF, take a look here: Vcf To Maf (Mutation Annotation Format) Conversion ?

The BAM and VCF files are protected, and access requires permission from the TCGA data access committee. For details of this process, take a look here: Application Process

The caveats:

• the MAF files were produced at different sites (1 file per site) and using different variant callers
• the same sample may have been processed at more than 1 site
• a single MAF file contains a merged list of variant calls from a group of patients
• the MAF files each have a filter for 'panel of normals', which indicates that the somatic variant call was additionally encountered in a panel of healthy controls sequenced by The Broad Institute of Harvard and MIT.
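Because the same sample may appear in MAF files from more than one site, it can help to collapse the calls before counting mutations. The following is a minimal sketch, assuming each row has already been parsed into a dict with the standard MAF columns Center, Tumor_Sample_Barcode, Chromosome, Start_Position, Reference_Allele and Tumor_Seq_Allele2. The rows, center names and barcodes are hypothetical.

```python
from collections import defaultdict

# Hypothetical parsed MAF rows standing in for files from two sequencing centers.
maf_rows = [
    {"Center": "center_1", "Tumor_Sample_Barcode": "SAMPLE-A", "Chromosome": "17",
     "Start_Position": "7577120", "Reference_Allele": "C", "Tumor_Seq_Allele2": "T"},
    {"Center": "center_2", "Tumor_Sample_Barcode": "SAMPLE-A", "Chromosome": "17",
     "Start_Position": "7577120", "Reference_Allele": "C", "Tumor_Seq_Allele2": "T"},
    {"Center": "center_2", "Tumor_Sample_Barcode": "SAMPLE-B", "Chromosome": "12",
     "Start_Position": "25398284", "Reference_Allele": "C", "Tumor_Seq_Allele2": "A"},
]


def variant_key(row):
    """Identify a call by sample, position and alleles, ignoring which center made it."""
    return (
        row["Tumor_Sample_Barcode"],
        row["Chromosome"],
        int(row["Start_Position"]),
        row["Reference_Allele"],
        row["Tumor_Seq_Allele2"],
    )


def merge_calls(rows):
    """Map each unique variant to the set of centers that reported it."""
    centers_by_variant = defaultdict(set)
    for row in rows:
        centers_by_variant[variant_key(row)].add(row["Center"])
    return dict(centers_by_variant)


merged = merge_calls(maf_rows)
# Variants reported independently by more than one center.
multi_center = {key: sorted(centers) for key, centers in merged.items() if len(centers) > 1}
```

Keeping the set of reporting centers, rather than simply dropping duplicates, lets you later require agreement between callers if you want a more conservative call set.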

Kevin


Thanks for the info. Unfortunately, it seems to be restricted to WXS and not WGS data. Indeed, if I select for instance breast TCGA WGS data, I only see BAM files. So I would have to call variants myself.


Yes, they only have 248 whole genome sequencing samples, and in BAM format:

portal.gdc.cancer.gov/legacy-archive/search/f?filters=%7B%22op%22...


@Anthony Mathelier, could you please confirm whether there are indeed no mutation calls (in either MAF or VCF) for WGS data? I would really rather not call them myself, as other people have probably done it already.


You already got an answer from Chris, A: Where are mutation files (MAF) for TCGA normal samples on Firebrowse

My findings (above) also corroborate those of Chris.


I meant to ask whether mutation calls for WGS (whole-genome sequencing) data are available. I've only found them for WXS (whole-exome sequencing) data.

From what I have read, it seems they are not available. A related question: does whole-exome sequencing in the context of TCGA include 3' UTRs or not?


For some studies, there are WGS samples; however, the sequencing in TCGA was mostly WXS (more cost-effective, obviously). If you go to GDC or GDC Legacy, you can configure the filters to look at only WGS. The data is likely protected, too, i.e., you require approval to access it.
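The same filtering can in principle be expressed as a query against the GDC API rather than through the portal. The sketch below only builds the query URL and does not send it. The endpoint, the field names (`cases.project.project_id`, `experimental_strategy`, `data_format`) and the `TCGA-BRCA` project ID are assumptions based on the public GDC API conventions, so check them against the current GDC documentation before relying on them.

```python
import json
from urllib.parse import urlencode

# Assumed GDC files endpoint; verify against the current GDC API documentation.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_wgs_bam_query(project_id, size=10):
    """Build a GDC files query URL for WGS BAM files in one project (no request is sent)."""
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id", "value": [project_id]}},
            {"op": "in", "content": {"field": "experimental_strategy", "value": ["WGS"]}},
            {"op": "in", "content": {"field": "data_format", "value": ["BAM"]}},
        ],
    }
    params = {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,access",
        "format": "JSON",
        "size": str(size),
    }
    return GDC_FILES_ENDPOINT + "?" + urlencode(params)


query_url = build_wgs_bam_query("TCGA-BRCA")
```

The `access` field in the response should indicate whether each file is open or controlled, which is how you would confirm that the WGS BAMs need approval.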

Regarding 3' UTRs, that will depend on the kit that was used for exome capture. In all cases that I've seen (even outside the TCGA), the UTRs are sequenced in WXS to some extent. Indeed, for UCEC, at least, there are definitely UTR variants in the open access MAF data. Here's a line of my own code where I list the variant types that I observed from the open access MAF data:

variants[grep("Nonsense_Mutation|RNA|Intron|IGR|5'Flank|3'Flank|5'UTR|3'UTR|Missense_Mutation|Silent|Translation_Start_Site|Splice_Site|Nonstop_Mutation|Frame_Shift_Del|In_Frame_Del|In_Frame_Ins|Frame_Shift_Ins", variants$Variant_Classification),]
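To check for UTR variants directly, one could tally the Variant_Classification column. The following is a rough Python sketch of the same idea as the R line above, with two differences: it uses exact matching against a set of classes, whereas `grep` matches substrings, and it operates on hypothetical rows rather than real UCEC data.

```python
from collections import Counter

# Variant classes taken from the R pattern above.
KEPT_CLASSES = {
    "Nonsense_Mutation", "RNA", "Intron", "IGR", "5'Flank", "3'Flank",
    "5'UTR", "3'UTR", "Missense_Mutation", "Silent", "Translation_Start_Site",
    "Splice_Site", "Nonstop_Mutation", "Frame_Shift_Del", "In_Frame_Del",
    "In_Frame_Ins", "Frame_Shift_Ins",
}

# Hypothetical parsed MAF rows, for illustration only.
variants = [
    {"Hugo_Symbol": "PTEN", "Variant_Classification": "3'UTR"},
    {"Hugo_Symbol": "PTEN", "Variant_Classification": "Nonsense_Mutation"},
    {"Hugo_Symbol": "TP53", "Variant_Classification": "5'UTR"},
    {"Hugo_Symbol": "TP53", "Variant_Classification": "Unrecognized_Class"},
]

# Keep rows whose class is in the set, then count each class.
kept = [row for row in variants if row["Variant_Classification"] in KEPT_CLASSES]
counts = Counter(row["Variant_Classification"] for row in kept)
utr_total = counts["5'UTR"] + counts["3'UTR"]
```

A non-zero `utr_total` on a real open access MAF would confirm that the exome capture covered at least some UTR sequence.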