Question: TCGA germline and somatic SNVs
18 months ago by
susibing20 wrote:

Dear all,

I am looking for already called and annotated germline and somatic SNVs from different TCGA projects, and my data access request has already been approved through dbGaP.

Unfortunately, it seems that in the harmonized data portal all germline mutations are filtered out during variant calling (even in the controlled-access files). Please correct me if I am wrong.

Therefore, as recommended in the post "How do I obtain germline mutation for TCGA samples?", I am planning to switch to the legacy data.

There, it seems that most patients have been analyzed on two platforms, Illumina HiSeq and Illumina GA. Do you know which platform's data are of higher quality, and why both platforms were used? Moreover, has anyone found good documentation on how the data in the legacy archive were processed (e.g. which variant caller was used)?

Any help would be much appreciated!

modified 16 months ago • written 18 months ago by susibing20
18 months ago by
Kevin Blighe56k wrote:

There are indeed separate tumour and normal VCF files in the GDC Legacy Archive. Indel and SNV calls appear to be split across different files.

Once you download these, you can look up the TCGA barcode via the UUID or filename using these functions, which were posted on Biostars relatively recently:
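For readers who do not have those functions to hand, here is a minimal, independent sketch (not the functions referred to above) of one way to map file UUIDs to TCGA barcodes through the GDC API. It assumes the public `https://api.gdc.cancer.gov/files` endpoint and the `associated_entities.entity_submitter_id` field; the names `build_payload`, `parse_hits` and `lookup_barcodes` are made up for illustration, and files from the legacy archive may need a different endpoint.

```python
import json
import urllib.request

# Assumption: the public GDC files endpoint. Legacy archive files may have
# been served from a different endpoint, so adjust the URL as needed.
GDC_FILES_URL = "https://api.gdc.cancer.gov/files"


def build_payload(file_uuids):
    """Build the JSON body that asks for the barcodes linked to each UUID."""
    uuids = list(file_uuids)
    return {
        "filters": {
            "op": "in",
            "content": {"field": "files.file_id", "value": uuids},
        },
        "fields": "file_id,file_name,associated_entities.entity_submitter_id",
        "format": "JSON",
        "size": str(len(uuids)),
    }


def parse_hits(response):
    """Map file UUID -> sorted list of barcodes from a decoded JSON response."""
    mapping = {}
    for hit in response.get("data", {}).get("hits", []):
        barcodes = {
            entity["entity_submitter_id"]
            for entity in hit.get("associated_entities", [])
            if "entity_submitter_id" in entity
        }
        mapping[hit["file_id"]] = sorted(barcodes)
    return mapping


def lookup_barcodes(file_uuids, url=GDC_FILES_URL):
    """POST the query and return {uuid: [barcodes]} (needs network access)."""
    body = json.dumps(build_payload(file_uuids)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as handle:
        return parse_hits(json.load(handle))
```

Calling `lookup_barcodes(["<uuid-1>", "<uuid-2>"])` with your own UUIDs should give one entry per file; a tumour/normal VCF would typically list more than one barcode.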

Regarding the Genome Analyser versus the HiSeq, the use of both platforms reflects the fact that the samples were sequenced at different institutions. The VCFs are not large, so why not just download both sets separately and then determine which samples they have in common? For TCGA DNA-seq generally, I believe the GA was used more than the HiSeq. You could also merge all of the files together once you have curated the two groups of VCFs separately. After all, the final data point is just a boolean: whether the variant is present or not.
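A rough sketch of that idea, assuming you already have a TCGA barcode for each VCF: group the files by sample and take the union of their calls, so that a variant counts as present if either platform reported it. The helper names are hypothetical, only the first five VCF columns are read, and the FILTER and genotype fields are ignored, so treat this as a starting point rather than a finished pipeline.

```python
def sample_key(barcode):
    """Participant plus two-digit sample type, e.g. 'TCGA-02-0001-10'."""
    fields = barcode.split("-")
    return "-".join(fields[:3] + [fields[3][:2]])


def is_normal(barcode):
    """TCGA sample-type codes 10-19 denote normals, 01-09 tumours."""
    return 10 <= int(barcode.split("-")[3][:2]) <= 19


def read_variants(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines."""
    variants = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):  # one key per alternate allele
            variants.add((chrom, int(pos), ref, alt))
    return variants


def merge_by_sample(calls):
    """calls: iterable of (barcode, vcf_lines) pairs from any platform.

    Returns {sample_key: set of variant keys}; membership in the set is the
    present/absent boolean.
    """
    merged = {}
    for barcode, vcf_lines in calls:
        merged.setdefault(sample_key(barcode), set()).update(
            read_variants(vcf_lines)
        )
    return merged
```

In practice `vcf_lines` can simply be an open file handle, and `is_normal` lets you keep the normal (germline) and tumour call sets apart before merging.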

You can also just obtain all of the BAMs, which I am currently doing for one of the TCGA cancers. However, that is another minefield, both because of the data volume and because the BAMs were seemingly aligned to different reference genomes within the same genome release.



Thank you very much!

Reply written 18 months ago by susibing20
16 months ago by
susibing20 wrote:

Just as an addition: after being in contact with NCI GDC support, it turns out that germline SNVs are hidden in the controlled-access aggregated-somatic_mutation files; they are just not annotated as germline. I was told to overlap the open-access and controlled-access files to retrieve the germline SNVs. However, I believe that some of the mutations had already been filtered out by the pipeline because of bad quality.
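In case it helps anyone, here is a minimal sketch of how such an overlap could be done, assuming both files are tab-separated MAFs with the standard columns `Tumor_Sample_Barcode`, `Chromosome`, `Start_Position`, `Reference_Allele` and `Tumor_Seq_Allele2`. The function names are my own, and the result is only a list of candidates: anything present in the controlled-access file but missing from the open-access one, which may include calls removed for reasons other than germline status.

```python
import csv

# Assumption: these standard MAF columns together identify one call.
KEY_COLUMNS = (
    "Tumor_Sample_Barcode",
    "Chromosome",
    "Start_Position",
    "Reference_Allele",
    "Tumor_Seq_Allele2",
)


def read_maf_keys(lines):
    """Return the set of variant keys from an iterable of MAF text lines."""
    # Drop leading comment lines such as '#version ...' before the header.
    rows = (line for line in lines if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    return {tuple(row[col] for col in KEY_COLUMNS) for row in reader}


def candidate_germline(controlled_lines, open_lines):
    """Calls in the controlled-access MAF that are absent from the open one."""
    return read_maf_keys(controlled_lines) - read_maf_keys(open_lines)
```

Both arguments can be open file handles (use `gzip.open(path, "rt")` for compressed MAFs). It is worth checking the candidates against the filter annotations in the controlled-access file before treating them as germline.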
