Question: TCGA germline and somatic SNVs
2.1 years ago, susibing wrote:

Dear all,

I am looking for already called and annotated germline and somatic SNVs from several TCGA projects, and my data access has already been approved through dbGaP.

Unfortunately, it seems that in the harmonized data portal all germline variants are filtered out during variant calling, even in the controlled-access files. Please correct me if I am wrong.

Therefore, as recommended in the post "How do I obtain germline mutation for TCGA samples?", I am planning to switch to the legacy data.

There, it seems that most patients have been analyzed on two platforms, Illumina HiSeq and Illumina GA. Do you know which platform gives higher-quality data, and why both were used? Also, has anyone found good documentation on how the data in the legacy archive were processed (e.g. which variant caller was used)?

Any help would be much appreciated!

2.1 years ago, regmkbl wrote:

There are indeed separate tumour and normal VCF files in the GDC Legacy Archive. Indel and SNV calls appear to be split across different files.

Once you have downloaded these, you can look up the TCGA barcode via the UUID or filename using functions that were posted on Biostars relatively recently.

Regarding the Genome Analyzer versus the HiSeq: this reflects the fact that the samples were sequenced at different institutions. The VCFs are not large, so why not download both sets separately and then determine which samples they have in common? For TCGA DNA-seq generally, I believe the GA was used more than the HiSeq. You could also merge all of the files together once you have curated the two groups of VCFs separately. After all, the final data point is just a boolean: whether the variant is present or not.
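As a rough sketch of that merge (not a tested pipeline): the snippet below reads plain-text VCF lines, groups calls by patient using the first three fields of the TCGA barcode, and records which platform reported each variant. The function names, the platform labels and the toy records are all made up for illustration, and it assumes both sets of VCFs use the same reference build and contig naming.

```python
from collections import defaultdict

def read_variants(vcf_lines):
    """Yield (chrom, pos, ref, alt) for every ALT allele in VCF text lines."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            if alt != ".":
                yield (chrom, int(pos), ref, alt)

def patient_id(barcode):
    """First three dash-separated barcode fields, e.g. TCGA-02-0001."""
    return "-".join(barcode.split("-")[:3])

def merge_calls(calls):
    """calls: iterable of (barcode, platform, vcf_lines).

    Returns {patient: {variant: set of platforms that reported it}}.
    A variant is 'present' for a patient if it appears as a key at all.
    """
    merged = defaultdict(lambda: defaultdict(set))
    for barcode, platform, vcf_lines in calls:
        for variant in read_variants(vcf_lines):
            merged[patient_id(barcode)][variant].add(platform)
    return merged

# Toy records (hypothetical), one normal sample sequenced on both platforms.
ga_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\t.",
    "2\t200\t.\tC\tT,G\t.\tPASS\t.",
]
hiseq_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\t.",
    "3\t300\t.\tG\tA\t.\tPASS\t.",
]
merged = merge_calls([
    ("TCGA-02-0001-10A-01D-0182-01", "GA", ga_vcf),
    ("TCGA-02-0001-10A-01D-0193-08", "HiSeq", hiseq_vcf),
])
for variant, platforms in sorted(merged["TCGA-02-0001"].items()):
    print(variant, sorted(platforms))
```

Keeping the set of platforms rather than a bare boolean costs nothing and lets you check later how often the GA and HiSeq calls agree.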

You can also just obtain all of the BAMs, which I am currently doing for one of the TCGA cancers. However, that is another minefield because of the data volume and the fact that the BAMs were seemingly aligned to different versions of the reference within the same genome release.
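One way to see which BAMs share a reference is to compare the @SQ lines of their headers (for example the text printed by `samtools view -H file.bam`). The sketch below only groups files by contig names and lengths; the file names and helper names are hypothetical, and identical names and lengths do not strictly guarantee identical sequence.

```python
def reference_signature(header_lines):
    """Map contig name -> length from the @SQ lines of a SAM/BAM header."""
    contigs = {}
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        fields = line.rstrip("\n").split("\t")[1:]
        tags = dict(field.split(":", 1) for field in fields)
        contigs[tags["SN"]] = int(tags["LN"])
    return contigs

def group_by_reference(headers):
    """headers: {bam_name: header_lines}.

    Returns lists of BAM names whose headers declare identical contigs.
    """
    groups = {}
    for name, lines in headers.items():
        key = tuple(sorted(reference_signature(lines).items()))
        groups.setdefault(key, []).append(name)
    return list(groups.values())

# Toy headers (hypothetical file names): same chromosome 1 length,
# but one file uses "chr1" and the others use "1".
headers = {
    "a.bam": ["@HD\tVN:1.0\tSO:coordinate", "@SQ\tSN:1\tLN:249250621"],
    "b.bam": ["@SQ\tSN:chr1\tLN:249250621"],
    "c.bam": ["@SQ\tSN:1\tLN:249250621"],
}
groups = group_by_reference(headers)
print(groups)
```

Files that end up in different groups would need to be handled separately (or realigned) before their calls are compared.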



Thank you very much!

Reply written 2.1 years ago by susibing

23 months ago, susibing wrote:

Just as an addition: after contacting NCI GDC support, it turns out that germline SNVs are present in the controlled-access aggregated somatic mutation files; they are just not annotated as germline. I was told to overlap the open-access and controlled-access files to retrieve the germline SNVs. However, I believe that some of these variants had already been filtered out by the pipeline because of poor quality.
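In case it helps someone, here is a minimal sketch of that overlap, assuming both files are tab-delimited MAFs with the standard MAF column names used below. The function names and the toy rows are my own, and the result is only a list of candidates: whatever is in the controlled-access file but not in the open-access one, which may include calls removed for reasons other than germline status.

```python
import csv

# Standard MAF columns used here to identify a variant in a sample.
KEY_COLUMNS = (
    "Tumor_Sample_Barcode",
    "Chromosome",
    "Start_Position",
    "Reference_Allele",
    "Tumor_Seq_Allele2",
)

def maf_keys(lines):
    """Return the set of variant keys found in MAF text lines."""
    data = (line for line in lines if not line.startswith("#"))
    rows = csv.DictReader(data, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {tuple(row[col] for col in KEY_COLUMNS) for row in rows}

def controlled_only(controlled_lines, open_lines):
    """Variants present in the controlled-access MAF but not the open one."""
    return maf_keys(controlled_lines) - maf_keys(open_lines)

# Toy rows (hypothetical) with only the key columns present.
header = "\t".join(KEY_COLUMNS)
controlled = [
    "#version 2.4",
    header,
    "TCGA-02-0001-01C-01D-0182-01\t1\t100\tA\tG",
    "TCGA-02-0001-01C-01D-0182-01\t2\t200\tC\tT",
]
open_access = [
    header,
    "TCGA-02-0001-01C-01D-0182-01\t1\t100\tA\tG",
]
candidates = controlled_only(controlled, open_access)
print(sorted(candidates))
```

With real files you would pass open file handles instead of the toy lists, and then inspect the candidates further before treating any of them as germline.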
