Question: Publicly available somatic variant calls for kidney cancer using WGS
4.5 years ago by
South Africa
tralynca40 wrote:

Good day,

Does anyone know where I can find published somatic mutation calls for kidney cancer from whole genome sequencing and NOT whole exome sequencing? I need them for the non-coding portion of the genome. Preferably not TCGA, because they have controlled-access data and the somatic variants are mixed with germline mutations.

Thank you in advance,






ADD COMMENTlink modified 6 weeks ago by Biostar ♦♦ 20 • written 4.5 years ago by tralynca40

Just to clarify for readers down the road, the TCGA somatic variants are not controlled-access.  The BAM files, of course, are controlled-access, as will be the case for pretty much all human data.  ALL studies using NGS will have somatic variants that are "contaminated" with germline variants, unfortunately; the extent will vary, of course, based on technical details. 

ADD REPLYlink written 4.5 years ago by Sean Davis25k

Hi Sean,

Maybe I misunderstood, but the Data Levels and Data Types tab shows that the mutation files (whole genome and whole exome data) that are vcf and maf files (Level 2 data) are Controlled Access data (


ADD REPLYlink written 4.5 years ago by tralynca40

You did ask about whole genome somatic variants.  The exome somatic variants are available as somatic MAF files (but not the genomic somatic variants).  That said, it is relatively straightforward to get access to the controlled-access data, so that really shouldn't stop your analysis.  

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by Sean Davis25k

Thanks for the feedback Sean. My supervisor is processing the request for the data. I was just hoping there was something else out there.

ADD REPLYlink written 4.5 years ago by tralynca40
4.5 years ago by
Manu Prestat3.9k
Marseille, France
Manu Prestat3.9k wrote:

Hi Tracey, sorry for being very pessimistic. I think it would be difficult (if not impossible), because the recommended depth of coverage (DC) is around 500x to be able to call variants at low allele frequencies, as is often the case for somatic mutations. It is therefore very unlikely that a dataset of whole genomes sequenced at this depth for these kinds of tumour samples can be found nowadays. Let's assume a 1000x average is needed to expect 500x DC over most of the genome (which is surely an underestimate of the sequencing effort needed):

That means you need to sequence:
1000 x 3.4x10^9 bp = 3.4x10^12 bp = 3400 Gb
and you have (for instance):
MiSeq output ~ 15 Gb max
HiSeq 4000 output ~ 1500 Gb max

=> ~227 MiSeq runs / sample
=> 3 HiSeq 4000 runs / sample

I can't imagine the effort if you needed a set of several samples (roughly at least 15, i.e. 45 HiSeq 4000 runs) to have enough variant calls to tell whether a variant is specific to kidney cancer.
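
The back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is only a sketch using the figures quoted in the post (a 3.4x10^9 bp genome, 1000x mean depth, and the stated maximum outputs per run); the function names `required_gb` and `runs_needed` are made up for illustration, and partial runs are rounded up.

```python
import math

# Figures taken from the post; they are rough assumptions, not platform specs.
GENOME_SIZE_BP = 3.4e9
TARGET_MEAN_DEPTH = 1000
RUN_OUTPUT_GB = {"MiSeq": 15, "HiSeq 4000": 1500}


def required_gb(genome_size_bp, mean_depth):
    """Total sequence needed, in gigabases, for a given mean depth."""
    return genome_size_bp * mean_depth / 1e9


def runs_needed(total_gb, output_gb_per_run):
    """Number of full runs needed; a partial run counts as a whole run."""
    return math.ceil(total_gb / output_gb_per_run)


total_gb = required_gb(GENOME_SIZE_BP, TARGET_MEAN_DEPTH)
for platform, output_gb in RUN_OUTPUT_GB.items():
    per_sample = runs_needed(total_gb, output_gb)
    # 15 samples is the rough cohort size mentioned in the post.
    print(f"{platform}: {per_sample} runs/sample, {per_sample * 15} runs for 15 samples")
```

Changing `TARGET_MEAN_DEPTH` to 30 or 60 shows how much smaller the effort is at the depths typically used for WGS.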


ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Manu Prestat3.9k

Thank you for your response Manu. Is that supposed to be 1000X or 100X because most articles state that 30-60X is sufficient for DC of WGS data?

ADD REPLYlink written 4.5 years ago by tralynca40

I think Manu is just pointing out that, while 30-60x is what is typically done, a much higher depth than that is needed for low allele frequency variants. Studies using 30-60x for somatic variant calling are very likely underpowered to detect somatic variants.
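
To make the depth argument concrete, here is a minimal sketch of a binomial power calculation. It assumes that the number of variant-supporting reads at a site follows Binomial(depth, allele frequency) and that a caller needs some minimum number of supporting reads; the threshold of 5 reads and the 5% allele frequency are illustrative assumptions, not values from any particular caller. It ignores sequencing error, depth variability, and tumour purity.

```python
from math import comb


def detection_probability(depth, vaf, min_alt_reads):
    """P(at least min_alt_reads variant reads) under Binomial(depth, vaf)."""
    p_below = sum(
        comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i)
        for i in range(min_alt_reads)
    )
    return 1.0 - p_below


# Hypothetical scenario: a subclonal variant at 5% allele frequency,
# with at least 5 supporting reads required to make a call.
for depth in (30, 60, 500):
    p = detection_probability(depth, vaf=0.05, min_alt_reads=5)
    print(f"depth {depth}x: P(detect) = {p:.3f}")
```

Under these assumptions the detection probability is low at 30-60x and close to 1 at 500x, which is the point being made about low allele frequency variants.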

ADD REPLYlink written 4.5 years ago by Sean Davis25k

Makes sense. Thank you again Manu and Sean.

ADD REPLYlink written 4.5 years ago by tralynca40
4.5 years ago by
Cambridge, UK
Julian Gehring20 wrote:

The ICGC has two whole-genome sequencing studies for renal cancer and renal cell cancer:

The data repository contains the somatic variant calls (SNVs and InDels, called simple somatic variants by ICGC) for the two studies. Note that the studies may have used different processing and variant calling pipelines. In general, the calls are saved as tab-delimited files, with additional meta-information about calling and genomic annotation. If you are only interested in non-coding variants, you can filter for variants with the respective attributes (e.g. those in intergenic regions).

Whether these two studies are enough depends very much on the question you want to address, but they should hopefully provide a good basis for your analysis.
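
The filtering step described above could look roughly like the sketch below. The column name `consequence_type`, the set of labels treated as non-coding, and the toy rows are all assumptions for illustration; check the header and annotation vocabulary of the file you actually download and adjust accordingly.

```python
import csv
import io

# Assumed annotation labels to treat as non-coding; adapt to the real file.
NON_CODING = {
    "intergenic_region",
    "intron_variant",
    "upstream_gene_variant",
    "downstream_gene_variant",
}


def filter_non_coding(handle, column="consequence_type", keep=NON_CODING):
    """Return rows of a tab-delimited file whose annotation is in `keep`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row[column] in keep]


# Made-up toy rows standing in for a downloaded tab-delimited file.
TOY_TSV = (
    "mutation_id\tchromosome\tposition\tconsequence_type\n"
    "toy1\t3\t10100000\tmissense_variant\n"
    "toy2\t3\t10200000\tintergenic_region\n"
    "toy3\t5\t1300000\tintron_variant\n"
)

rows = filter_non_coding(io.StringIO(TOY_TSV))
for row in rows:
    print(row["mutation_id"], row["chromosome"], row["position"], row["consequence_type"])
```

For a real file, replace `io.StringIO(TOY_TSV)` with an opened file handle, e.g. `open(path, newline="")`, or `gzip.open(path, "rt", newline="")` if the download is compressed.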

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Julian Gehring20

Hi Julian,

I meant to get back to you and thank you for your suggestion. I ended up using the ICGC data for my project.


ADD REPLYlink written 4.4 years ago by tralynca40

I'm having a look at it now. Thank you Julian.

ADD REPLYlink written 4.5 years ago by tralynca40