Question: Where to find variant genes in normal people as a control set for TCGA gene mutations in a given cancer type
Miao Yu wrote (4.3 years ago):

Hi everyone,

I want to analyze the mutated genes in Pan-TCGA, provided in "Mutational landscape and significance across 12 major cancer types". To better understand the mutated genes in a given cancer type, I would like to know where I can get control-set data to compare with these cancer samples.

That is, I have organized the data from the link above into this format:

Cancer Sample Gene_1 Gene_2 Gene... Gene_20000
TCGA-02-001... NA Mis-sense Nonsense
TCGA-02-002... Indel NA Mis-sense
TCGA-02-003... NA Mis-sense   NA
TCGA-02-584... Nonsense NA NA

For comparison, I need data for normal samples (people without any cancer), also listing mutated genes across the whole genome, which can be organized in the same format as above. Does anyone know where I can get this kind of data for analysis?

To make sure my idea is clear to everyone, I will illustrate it in more detail.

The source data from the paper are listed in this format (only part of it is shown):

Tumor_Sample         Gene     Start_Position  Variant_Class    Ref_Allele  Var_Allele  amino_change
TCGA-02-0003-01A...  HRH2     175110351       Missense         G           A           p.V39I
TCGA-02-0003-01A...  NR1I3    161206281       Silent           C           T           p.A25
TCGA-02-0003-01A...  ALMS1    73680365        Silent           A           G           p.E2236
TCGA-02-0047-01A...  GPR132   105518226       Missense         C           T           p.C74Y
TCGA-02-0047-01A...  BUB1B    40512942        Silent           G           A           p.G1059
TCGA-02-0047-01A...  PLEKHG1  151152163       Missense         G           A           p.G639E
TCGA-02-0047-01A...  SPACA3   31322643        Frame_Shift_Del  C           0           p.S16fs

As we can see from the table above, the samples 'TCGA-02-0003-01A-01D-1490-08' and 'TCGA-02-0047-01A-01D-1490-08' each have more than one variant gene across the whole genome. Both samples belong to glioblastoma multiforme (a type of cancer).
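The reshaping described above (from one row per mutation to one row per sample and one column per gene) can be sketched roughly as follows. This is only a minimal illustration using the rows shown in the excerpt; the function name `pivot`, the "NA" placeholder, and the choice to join multiple variants in one gene with ";" are assumptions made for the example, not something defined by the paper's data release.

```python
from collections import defaultdict

# Rows copied from the excerpt above: (sample, gene, variant class).
# Sample IDs are kept truncated exactly as they appear in the post.
records = [
    ("TCGA-02-0003-01A...", "HRH2", "Missense"),
    ("TCGA-02-0003-01A...", "NR1I3", "Silent"),
    ("TCGA-02-0003-01A...", "ALMS1", "Silent"),
    ("TCGA-02-0047-01A...", "GPR132", "Missense"),
    ("TCGA-02-0047-01A...", "BUB1B", "Silent"),
    ("TCGA-02-0047-01A...", "PLEKHG1", "Missense"),
    ("TCGA-02-0047-01A...", "SPACA3", "Frame_Shift_Del"),
]


def pivot(records, missing="NA"):
    """Return (genes, matrix) with matrix[sample][gene] = variant class."""
    by_sample = defaultdict(dict)
    for sample, gene, variant_class in records:
        previous = by_sample[sample].get(gene)
        # Assumption: several variants in one gene are joined with ';'.
        by_sample[sample][gene] = (
            variant_class if previous is None else previous + ";" + variant_class
        )
    genes = sorted({gene for _, gene, _ in records})
    matrix = {
        sample: {gene: calls.get(gene, missing) for gene in genes}
        for sample, calls in by_sample.items()
    }
    return genes, matrix


genes, matrix = pivot(records)

# Print a tab-separated sample-by-gene table.
print("Sample", *genes, sep="\t")
for sample, row in matrix.items():
    print(sample, *(row[gene] for gene in genes), sep="\t")
```

Genes without a listed mutation in a sample come out as "NA", matching the layout of the first table in this post.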

So the data I'm looking for are similar to the data above. The only difference is that they should come from normal people (without any type of cancer) or from normal samples of cancer patients (with a TCGA barcode like 'TCGA-02-0047-11A-01D-1490-08', which denotes normal tissue from the patient).

I think the TCGA group probably has the data I want for both normal people and normal samples; at least I think the normal samples from patients are publicly accessible. But I don't know where to find them :( ...

I hope I have described my question clearly.

Cyriac Kandoth (Memorial Sloan Kettering, New York, USA) wrote (4.3 years ago):

I understand the data you are trying to find, but I am worried that you are looking for the wrong thing. There are important differences between somatic and germline mutations that you do not mention in your question. These differences might make your analysis difficult or impossible. So which one of the following is closest to the kind of analysis you are trying to do?

1. Find mutations unique to TCGA tumors, but not seen in the matched normal samples of the same patients:
This is actually the definition of a somatic mutation. It is the only kind of data that TCGA makes publicly available, and what was studied in that publication. A mutation seen in both tumor and matched normal sample is (usually) germline. These are not made publicly available, and you will have to request access through dbGaP.

2. Compare the effects of somatic mutations on a protein (e.g. Missense, Silent, Frame-shift), to the effects of germline mutations seen in the same protein:
This is an interesting analysis, if done carefully. The selective pressure on tumor cells (somatic mutations) is quite different from evolutionary pressure in population genetics (germline mutations).

3. Find germline mutations more common in cancer patients than in people with no history of cancer:
This is the essence of population genetics, and it can be done with some of the data that Sean linked you to. For example, ExAC lists germline mutations from 61,486 cases in various disease-specific studies. You might be able to create separate cohorts for cancer-related and cancer-unrelated diseases.
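To make the distinction in option 1 concrete, here is a toy sketch that splits calls from one patient into somatic and germline candidates with simple set operations. The variant tuples are hypothetical, invented for illustration, and real somatic callers use statistical models on read-level data rather than a plain set difference, so treat this only as a picture of the definition.

```python
def split_calls(tumor_calls, normal_calls):
    """Split variant calls from one patient by where they were seen.

    Each call is a (chrom, pos, ref, alt) tuple.
    """
    tumor, normal = set(tumor_calls), set(normal_calls)
    return {
        # Seen only in the tumor: candidates for somatic mutations.
        "somatic_candidates": tumor - normal,
        # Seen in both: usually germline (or a shared artifact).
        "germline_candidates": tumor & normal,
        # Seen only in the normal: often coverage gaps or artifacts.
        "normal_only": normal - tumor,
    }


# Hypothetical calls, not taken from any real TCGA sample.
tumor_calls = [("1", 1000, "G", "A"), ("2", 2000, "C", "T"), ("3", 3000, "A", "G")]
normal_calls = [("2", 2000, "C", "T"), ("4", 4000, "T", "C")]

result = split_calls(tumor_calls, normal_calls)
for label, calls in result.items():
    print(label, sorted(calls))
```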


It is very nice of you to answer my question so specifically.

I think my work is related to the first one. After reading your answer, I still have two questions about it.

1. According to your answer, mutations seen in both the tumor and the matched normal sample are usually germline. So I wonder: are the germline mutations, which can be detected in the patient's normal tissue, removed from the somatic mutations listed in that publication? (I guess that if we only search for mutations in a patient's tumor tissue, it is impossible to tell whether a given mutation is somatic or a germline mutation inherited from the parents. Only by running the same analysis on the patient's normal tissue and finding his or her germline mutations can we tell which mutations are somatic and which are germline.)

2. I just want to make sure I understand what you said. Most of the mutations in samples like 'TCGA-02-0047-11A-01D-1490-08' are germline mutations, which are hard to get unless I am authorized by TCGA/dbGaP. Is that right?


Reply written 4.3 years ago by Miao Yu

1. Correct. All tools that detect somatic mutations from tumor-normal pairs will identify germline mutations as an intermediate result, but they do not distinguish them from recurrent sequencing artifacts seen in both tumor and normal. See "Germline variant calling and filtering" on TCGA exomes in this paper for simple ways to deal with these problems.

2. In the sample field of a TCGA barcode, a code starting with 0 (TCGA-xx-xxxx-0x) indicates tumor tissue, and a code starting with 1 (TCGA-xx-xxxx-1x) indicates normal tissue. The breakdown of a TCGA barcode is here and their designations are tabulated here. You should also read "Working with MAF files (Mutation Annotation Format) from the TCGA (The Cancer Genome Atlas)".
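As a rough illustration of point 2, the sample field of a barcode can be checked in a few lines. This sketch assumes the usual TCGA convention that the two-digit sample code is 01-09 for tumor, 10-19 for normal, and 20-29 for control samples; the function name `parse_barcode` and the returned dictionary layout are made up for the example.

```python
def parse_barcode(barcode):
    """Extract the patient ID and tissue class from a TCGA barcode."""
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError("not a TCGA sample barcode: " + barcode)
    # The fourth field is the sample code plus a vial letter, e.g. '11A'.
    sample_code = parts[3][:2]
    code = int(sample_code)
    if 1 <= code <= 9:
        tissue = "tumor"
    elif 10 <= code <= 19:
        tissue = "normal"
    elif 20 <= code <= 29:
        tissue = "control"
    else:
        tissue = "unknown"
    return {
        "patient": "-".join(parts[:3]),
        "sample_code": sample_code,
        "tissue": tissue,
    }


# The two barcodes mentioned in the question.
tumor_info = parse_barcode("TCGA-02-0047-01A-01D-1490-08")
normal_info = parse_barcode("TCGA-02-0047-11A-01D-1490-08")
print(tumor_info)
print(normal_info)
```

Both barcodes share the same patient prefix, which is how a tumor sample is paired with its matched normal.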

Reply written 4.3 years ago by Cyriac Kandoth (modified 4.3 years ago)
Sean Davis (National Institutes of Health, Bethesda, MD) wrote (4.3 years ago):

You will not be able to get the normal genotypes (mutations) without doing a Data Access Request through dbGaP.  However, there are databases of variants that include variant allele frequencies.  I have an incomplete list here:



Thanks for your reply, Sean Davis.

I checked dbGaP and the list of links you provided, but I could not find what I need :( . Maybe that is because I'm a novice ...

Those databases all provide an interface by gene ID, chromosomal location, dbSNP rs ID, or cancer type, rather than the variation of each normal individual. Although I found that some dbSNP data come from normal people, those variants are listed in isolation from each other, and it is hard to tell which of them come from the same sample. So they are not suitable as my input data.

I would be grateful if you could provide more information. :)

Reply written 4.3 years ago by Miao Yu

The 1000Genomes dataset is, as far as I know, the only dataset that will fit your criteria.
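The reason this fits is that 1000 Genomes releases multi-sample VCF files with one genotype column per individual, so variants can be grouped by sample. Below is a minimal sketch of that grouping using only the standard library. The sample names and variant records in `VCF_TEXT` are hypothetical, invented for the example, and the parser assumes GT is the first FORMAT key (as the VCF specification requires when GT is present) and does not split multi-allelic ALT fields.

```python
import io
from collections import defaultdict

# Hypothetical two-sample VCF; not real 1000 Genomes records.
VCF_TEXT = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE_A\tSAMPLE_B
1\t1000\t.\tG\tA\t.\tPASS\t.\tGT\t0|1\t0|0
2\t2000\t.\tC\tT\t.\tPASS\t.\tGT\t1|1\t0|1
"""


def variants_per_sample(handle):
    """Map each sample name to the variants where it carries a non-reference allele."""
    samples = []
    calls = defaultdict(list)
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow FORMAT
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        for sample, genotype in zip(samples, fields[9:]):
            gt = genotype.split(":")[0]
            alleles = gt.replace("|", "/").split("/")
            # Keep the variant if any allele is neither reference nor missing.
            if any(allele not in ("0", ".") for allele in alleles):
                calls[sample].append((chrom, int(pos), ref, alt))
    return dict(calls)


per_sample = variants_per_sample(io.StringIO(VCF_TEXT))
for sample, variants in per_sample.items():
    print(sample, variants)
```

The per-sample lists could then be annotated with gene names and reshaped into the same sample-by-gene layout as the tumor data.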

Reply written 4.3 years ago by Sean Davis