Question: Where to get somatic mutations from WGS data?!
2
dankoc20 wrote:

What is the best way to download somatic mutations called from whole genome sequencing data in The Cancer Genome Atlas (TCGA) project? Downloading somatic mutations in VCF or MAF format is preferable, but I will also download BAMs and call mutations myself if necessary.

I have spent a lot of time browsing the Genomic Data Commons (GDC) data portal (https://gdc-portal.nci.nih.gov/) today and can find only whole exome sequencing data.

cancer genomics data • 2.8k views
modified 22 months ago by sacha (1.8k) • written 3.1 years ago by dankoc (20)
1

Some of the MAF files have somatic variants from WGS; check the Sequence_Source column in the MAF file. It's also worth mentioning that TCGA only provides somatic variants and no germline SNPs, for privacy reasons. Not sure if one can get WGS BAM files, for the same reason.
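
For anyone who wants to script that check, here is a minimal Python sketch. It assumes a tab-delimited MAF with a header row (optionally preceded by '#' comment lines) and that WGS-derived calls are labelled WGS in Sequence_Source. The function name and the example rows are made up for illustration.

```python
import csv
import io


def rows_by_sequence_source(maf_lines, wanted="WGS"):
    """Yield MAF records whose Sequence_Source column equals `wanted`.

    Assumes tab-delimited text with a header row; lines starting with
    '#' (e.g. a version line) are skipped. The "WGS" label is an
    assumption, so inspect the values present in your own file first.
    """
    data_lines = (line for line in maf_lines if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for record in reader:
        source = (record.get("Sequence_Source") or "").strip()
        if source.upper() == wanted.upper():
            yield record


# Tiny in-memory example with made-up rows, for illustration only.
example = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\tSequence_Source\n"
    "GENE_A\tSAMPLE_1\tWXS\n"
    "GENE_B\tSAMPLE_2\tWGS\n"
)
wgs_rows = list(rows_by_sequence_source(example))
print([row["Hugo_Symbol"] for row in wgs_rows])
```

With a real file, pass an open file handle instead of the in-memory example.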

written 3.1 years ago by poisonAlien (2.8k)

Hi Charles, I am reviving this old post since I am facing the same issue right now. Did you eventually find where to download cancer somatic mutations from WGS? Thanks

written 22 months ago by Anthony Mathelier (870)

I have posted an answer below.

written 22 months ago by Kevin Blighe (52k)
1
sacha1.8k wrote:

If you need cancer genomics data, take a look at ICGC (https://dcc.icgc.org/) and the PCAWG project (https://dcc.icgc.org/pcawg).

written 22 months ago by sacha (1.8k)

Unfortunately, I have access to controlled data on TCGA but not on ICGC, so I cannot get TCGA data from ICGC.

written 22 months ago by Anthony Mathelier (870)
0
Kevin Blighe52k wrote:

As mentioned, you can obtain the TCGA somatic mutations in Mutation Annotation Format (MAF) from the GDC Legacy Archive - this data is open access, i.e., available to the public. Anyone can download it. To convert MAF to VCF, take a look here: Vcf To Maf (Mutation Annotation Format) Conversion ?

The BAM and VCF files are protected, and access requires permission granted by the TCGA data access committee. For details of this process, take a look here: Application Process

The caveats:

  • the MAF files were produced at different sites (1 file per site) and using different variant callers
  • the same sample may have been processed at more than 1 site
  • a single MAF file contains a merged list of variant calls from a group of patients
  • the MAF files each have a 'panel of normals' filter, which flags somatic variant calls that were also encountered in a panel of healthy controls sequenced by The Broad Institute of Harvard and MIT.
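
To make the caveats above concrete, here is a rough Python sketch of how one might collapse calls reported by more than one site for the same sample and set aside calls flagged by the panel of normals. The key columns are standard MAF column names. The filter column name (FILTER) and the flag text (panel_of_normals) are assumptions, so check the header of your own files. Dropping flagged calls is just one possible choice, and the example records are made up.

```python
from collections import defaultdict

# Columns that together identify one variant call in one sample.
KEY_COLUMNS = (
    "Tumor_Sample_Barcode", "Chromosome", "Start_Position",
    "End_Position", "Reference_Allele", "Tumor_Seq_Allele2",
)


def merge_calls(records, filter_column="FILTER", pon_flag="panel_of_normals"):
    """Collapse identical calls reported by several centers.

    Returns {variant_key: sorted list of centers}. Records whose filter
    column mentions `pon_flag` are dropped. Both `filter_column` and
    `pon_flag` are assumptions about the file layout.
    """
    centers_by_variant = defaultdict(set)
    for rec in records:
        if pon_flag in (rec.get(filter_column) or ""):
            continue
        key = tuple(rec[col] for col in KEY_COLUMNS)
        centers_by_variant[key].add(rec.get("Center") or "unknown")
    return {key: sorted(centers) for key, centers in centers_by_variant.items()}


def make_call(center, flt="PASS", start="100"):
    """Build a made-up MAF record for the example below."""
    return {
        "Tumor_Sample_Barcode": "SAMPLE_1", "Chromosome": "1",
        "Start_Position": start, "End_Position": start,
        "Reference_Allele": "A", "Tumor_Seq_Allele2": "T",
        "Center": center, "FILTER": flt,
    }


records = [
    make_call("center_a"),
    make_call("center_b"),  # same call, second site
    make_call("center_a", flt="panel_of_normals", start="200"),
]
merged = merge_calls(records)
print(merged)
```

Keeping the list of centers per variant, rather than simply de-duplicating, lets you see which calls were reproduced across sites.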

Kevin

modified 22 months ago • written 22 months ago by Kevin Blighe (52k)
1

Thanks for the info. Unfortunately, the MAF files seem to be restricted to WXS and do not cover WGS data. Indeed, if I select, for instance, breast TCGA WGS data, I only see BAM files. So I would have to call variants myself.

written 22 months ago by Anthony Mathelier (870)

Yes, they only have 248 whole genome sequencing samples, and in BAM format:

portal.gdc.cancer.gov/legacy-archive/search/f?filters=%7B%22op%22...

written 22 months ago by Kevin Blighe (52k)

That's unfortunately what I thought. Thanks for your reply.

written 21 months ago by Anthony Mathelier (870)

@Anthony Mathelier, could you please confirm whether there are indeed no mutation calls (either in MAF or VCF) for WGS data? I would really prefer not to call them myself if possible, as other people have probably done it already.

written 14 months ago by -_- (840)

You already got an answer from Chris, A: Where are mutation files (MAF) for TCGA normal samples on Firebrowse

My findings (above) also corroborate those of Chris.

written 14 months ago by Kevin Blighe (52k)

I meant to ask whether mutation calls for WGS (whole-genome sequencing) data are available. I've only found them for WXS (whole-exome sequencing) data.

From what I have read, it seems they are not available. A related question: does whole-exome sequencing in the context of TCGA include 3' UTRs or not?

modified 14 months ago • written 14 months ago by -_- (840)
1

For some studies, there are WGS samples; however, the sequencing in TCGA was mostly WXS (more cost-effective, obviously). If you go to GDC or GDC Legacy, you can configure the filters to look at only WGS. The data is likely protected, too, i.e., you require approval to access it.
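
The same WGS-only filter can be expressed programmatically. Below is a minimal Python sketch that builds a query URL for the GDC files endpoint without sending it. The endpoint and the field names (cases.project.project_id, experimental_strategy, data_format, access) are as I recall them from the GDC API documentation, so treat them as assumptions and verify them before relying on this. TCGA-BRCA is used purely as an example project, and the function name is hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed GDC files endpoint; confirm against the current API documentation.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def wgs_file_query(project_id, size=10):
    """Return a URL that lists WGS files for one project.

    Only builds the URL; no network request is made here.
    """
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {
                "field": "cases.project.project_id", "value": [project_id]}},
            {"op": "in", "content": {
                "field": "experimental_strategy", "value": ["WGS"]}},
        ],
    }
    params = {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,data_format,access",
        "size": str(size),
    }
    return GDC_FILES_ENDPOINT + "?" + urlencode(params)


url = wgs_file_query("TCGA-BRCA")
# Decode the URL again to show the filter that would be sent.
decoded = json.loads(parse_qs(urlparse(url).query)["filters"][0])
print(url)
print(decoded["content"][1]["content"])
```

Requesting the access field alongside each file should show directly whether a given file is open or controlled.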

Regarding 3' UTRs, that will depend on the kit that was used for exome capture. In all cases that I've seen (even outside the TCGA), the UTRs are sequenced in WXS to some extent. Indeed, for UCEC, at least, there are definitely UTR variants in the open access MAF data. Here's a line of my own code where I list the variant types that I observed from the open access MAF data:

# Keep only the rows whose Variant_Classification matches one of the listed variant types
variants[grep("Nonsense_Mutation|RNA|Intron|IGR|5'Flank|3'Flank|5'UTR|3'UTR|Missense_Mutation|Silent|Translation_Start_Site|Splice_Site|Nonstop_Mutation|Frame_Shift_Del|In_Frame_Del|In_Frame_Ins|Frame_Shift_Ins", variants$Variant_Classification),]
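
A quick way to check whether UTR variants are present in a given MAF is to tally the Variant_Classification column. The following is a minimal Python sketch, assuming a tab-delimited MAF with a header row; the function name and the example rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def tally_variant_classes(maf_lines):
    """Count how often each Variant_Classification value occurs.

    Assumes tab-delimited MAF text with a header row; lines starting
    with '#' are skipped.
    """
    data_lines = (line for line in maf_lines if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(rec["Variant_Classification"] for rec in reader)


# Made-up rows, for illustration only.
example = io.StringIO(
    "Hugo_Symbol\tVariant_Classification\n"
    "GENE_A\tMissense_Mutation\n"
    "GENE_B\t3'UTR\n"
    "GENE_C\t3'UTR\n"
    "GENE_D\tSilent\n"
)
counts = tally_variant_classes(example)
# A Counter returns 0 for classes that never occur, so this sum is safe.
utr_total = counts["3'UTR"] + counts["5'UTR"]
print(counts.most_common())
print("UTR variants:", utr_total)
```

A non-zero UTR count in an exome-derived MAF would suggest the capture kit covered at least part of the UTRs.
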
written 14 months ago by Kevin Blighe (52k)