Question

Variant data from normal people as a control set for comparison with TCGA gene mutations in a given cancer type

1

10.7 years ago

Miao Yu ▴ 80

Hi everyone,

I want to analyze the mutated genes in Pan-TCGA, provided in "Mutational landscape and significance across 12 major cancer types". To better understand the mutated genes in a given cancer type, I want to know where I could get control-set data to compare with these cancer samples.

That is, I organized the data from the link above into this format:

Cancer Sample      Gene_1     Gene_2        Gene...     Gene_20000
TCGA-02-001...     NA         Mis-sense     …           Nonsense
TCGA-02-002...     Indel      NA            …           Mis-sense
TCGA-02-003...     NA         Mis-sense                 NA
…                  …          …             …           …
TCGA-02-584...     Nonsense   NA            …           NA

As a comparison, I need data from normal samples (people without any cancer), also with mutated genes across the whole genome, which can be organized in the same format as above. Does anyone know where I could get this kind of data for analysis?

Thanks!

PS:

To make sure my idea is clear to everyone, I will illustrate it in more detail.

The source data from the paper is in this format (excerpt):

Tumor_Sample            Gene       Start_Position     Variant_Class     Ref_Allele     Var_Allele     amino_change.
TCGA-02-0003-01A...     HRH2       175110351          Missense          G              A              p.V39I
TCGA-02-0003-01A...     NR1I3      161206281          Silent            C              T              p.A25
TCGA-02-0003-01A...     ALMS1      73680365           Silent            A              G              p.E2236
…                       …          …                  …                 …              …              …
TCGA-02-0047-01A...     GPR132     105518226          Missense          C              T              p.C74Y
TCGA-02-0047-01A...     BUB1B      40512942           Silent            G              A              p.G1059
TCGA-02-0047-01A...     PLEKHG1    151152163          Missense          G              A              p.G639E
TCGA-02-0047-01A...     SPACA3     31322643           Frame_Shift_Del   C              0              p.S16fs
…                       …          …                  …                 …              …              …
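
To make the reshaping concrete, here is a minimal sketch of how long-format rows like the ones above could be pivoted into the sample-by-gene layout shown earlier. The `records` list and the `to_matrix` helper are illustrative assumptions, not part of the paper's data release; only three columns are used, and genes with no listed variant are reported as "NA".

```python
from collections import defaultdict

# Illustrative rows (sample, gene, variant class) copied from the excerpt above.
records = [
    ("TCGA-02-0003-01A-01D-1490-08", "HRH2", "Missense"),
    ("TCGA-02-0003-01A-01D-1490-08", "NR1I3", "Silent"),
    ("TCGA-02-0047-01A-01D-1490-08", "GPR132", "Missense"),
    ("TCGA-02-0047-01A-01D-1490-08", "SPACA3", "Frame_Shift_Del"),
]


def to_matrix(rows):
    """Pivot (sample, gene, variant_class) rows into {sample: {gene: class}}."""
    genes = sorted({gene for _, gene, _ in rows})
    by_sample = defaultdict(dict)
    for sample, gene, vclass in rows:
        # Several variants in one gene are joined with ';' instead of overwritten.
        previous = by_sample[sample].get(gene)
        by_sample[sample][gene] = vclass if previous is None else f"{previous};{vclass}"
    # Genes with no variant in a sample are filled with "NA".
    matrix = {s: {g: row.get(g, "NA") for g in genes} for s, row in by_sample.items()}
    return genes, matrix


genes, matrix = to_matrix(records)
for sample, row in matrix.items():
    print(sample, *(row[g] for g in genes), sep="\t")
```

Joining multiple classes with ";" is just one possible choice; a real analysis might instead keep the most damaging class per gene.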

As we can see from the table above, samples 'TCGA-02-0003-01A-01D-1490-08' and 'TCGA-02-0047-01A-01D-1490-08' have no more than one variant per gene across the whole genome. Both samples belong to glioblastoma multiforme (one type of cancer).

So the data I'm looking for is similar to the data above. The only difference is that it should come from normal people (without any type of cancer) or from normal samples of cancer patients (with a TCGA barcode like 'TCGA-02-0047-11A-01D-1490-08', which denotes normal tissue from the patient).

I think the TCGA group probably has both kinds of data, from normal people and from normal samples; at least I think the normal samples from patients are publicly accessible. But I don't know where to find them :( ...

I hope I have described my question clearly.

mutation TCGA gene variance control-set • 5.0k views

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 10.7 years ago by Miao Yu ▴ 80

Ram · Answer 1 · 2014-11-08

2

10.7 years ago

Cyriac Kandoth 6.1k

I understand the data you are trying to find, but I am worried that you are looking for the wrong thing. There are important differences between somatic and germline mutations that you do not mention in your question. These differences might make your analysis difficult or impossible. So which one of the following is closest to the kind of analysis you are trying to do?

1. Find mutations unique to TCGA tumors, but not seen in the matched normal samples of the same patients: This is actually the definition of a somatic mutation. It is the only kind of mutation data that TCGA makes publicly available, and what was studied in that publication. A mutation seen in both tumor and matched normal sample is (usually) germline. These are not made publicly available, and you will have to request access through dbGaP.
2. Compare the effects of somatic mutations on a protein (e.g. Missense, Silent, Frame-shift) to the effects of germline mutations seen in the same protein: This is an interesting analysis, if done carefully. The selective pressure on tumor cells (somatic mutations) is quite different from evolutionary pressure in population genetics (germline mutations).
3. Find germline mutations more common in cancer patients than in people with no history of cancer: This is the essence of population genetics, and it can be done with some of the data that Sean linked you to. For example, ExAC lists germline mutations from 61,486 cases in various disease-specific studies. You might be able to create separate cohorts for cancer-related and cancer-unrelated diseases.
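
As a rough illustration of the tumor-versus-matched-normal distinction described in the first option, the sketch below splits one patient's calls by simple set logic. The variant keys (chromosome, position, ref, alt) and the `split_calls` helper are hypothetical, and real somatic callers work from read-level evidence and statistics rather than plain set differences.

```python
def split_calls(tumor_calls, normal_calls):
    """Split one patient's calls into somatic candidates and shared calls.

    Both arguments are sets of (chrom, pos, ref, alt) tuples.
    """
    somatic = tumor_calls - normal_calls  # seen only in the tumor
    shared = tumor_calls & normal_calls   # usually germline, or a recurrent artifact
    return somatic, shared


# Hypothetical calls for a single patient (coordinates are made up).
tumor = {("chr1", 1000, "G", "A"), ("chr2", 2000, "C", "T"), ("chr3", 3000, "A", "G")}
normal = {("chr2", 2000, "C", "T")}

somatic, shared = split_calls(tumor, normal)
print("somatic candidates:", sorted(somatic))
print("shared with normal:", sorted(shared))
```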

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 10.7 years ago by Cyriac Kandoth 6.1k

0

It is very kind of you to answer my question so specifically.

I think my work is related to the first one. After reading your answer, I still have two questions about it.

1. According to your answer, mutations seen in both tumor and matched normal samples are usually germline. So I wonder: are the germline mutations, which can be detected in the patient's normal tissue, already removed from the somatic mutations listed in that publication? (I guess that if we only look for mutations in a patient's tumor tissue, it is impossible to tell whether a mutation is somatic or germline, i.e. inherited from the parents. Only by running the same analysis on the patient's normal tissue and finding his or her germline mutations can we tell which mutations are somatic and which are germline.)
2. Just to make sure I understand what you said: most of the mutations in samples like 'TCGA-02-0047-11A-01D-1490-08' are germline mutations, which are hard to get unless I am authorized by TCGA/dbGaP. Is that right?

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.7 years ago by Miao Yu ▴ 80

0

1. Correct. All tools that detect somatic mutations from tumor-normal pairs identify germline mutations as an intermediate result, but they do not distinguish them from recurrent sequencing artifacts seen in both tumor and normal. See "Germline variant calling and filtering" on TCGA exomes in this paper for simple ways to deal with these problems.
2. In the fourth field of the barcode, a sample code starting with 0 (TCGA-xx-xxxx-0x, codes 01-09) indicates tumor tissue, and one starting with 1 (TCGA-xx-xxxx-1x, codes 10-19) indicates normal tissue. The breakdown of a TCGA barcode is here and their designations are tabulated here. You should also read this.
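
The barcode rule can be sketched as a small helper. The function name `sample_class` is an assumption for illustration; the code ranges used (01-09 tumor, 10-19 normal, 20-29 control) follow the TCGA sample-type convention.

```python
def sample_class(barcode):
    """Classify a TCGA barcode by the two-digit sample-type code in its 4th field."""
    fields = barcode.split("-")
    if len(fields) < 4 or fields[0] != "TCGA":
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    code = int(fields[3][:2])  # e.g. '11A' -> 11 (the trailing letter is the vial)
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    if 20 <= code <= 29:
        return "control"
    return "unknown"


for barcode in ("TCGA-02-0047-01A-01D-1490-08", "TCGA-02-0047-11A-01D-1490-08"):
    print(barcode, sample_class(barcode))
```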

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.7 years ago by Cyriac Kandoth 6.1k

Ram · Answer 2 · 2014-11-05

1

10.7 years ago

Sean Davis 27k

You will not be able to get the normal genotypes (mutations) without doing a Data Access Request through dbGaP. However, there are databases of variants that include variant allele frequencies. I have an incomplete list here.

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 10.7 years ago by Sean Davis 27k

0

Thanks for your reply, Sean Davis.

I checked dbGaP and the list of links you provided, but I found nothing :( . Maybe it is because I'm a novice ...

Those databases are all queried by gene ID, chromosomal location, dbSNP rs ID, or cancer type, rather than by normal individual. I did find some dbSNP data that come from normal people, but those variants are isolated from each other, and it is hard to tell which of them come from the same sample. So they are not suitable as my input data.

I would be grateful if you could provide more information. :)

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.7 years ago by Miao Yu ▴ 80

0

The 1000Genomes dataset is, as far as I know, the only dataset that will fit your criteria.
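
Since 1000 Genomes genotypes are distributed as multi-sample VCF files, per-sample variant lists can be pulled out of the genotype columns. Below is a minimal sketch of that idea with a plain-text parser; the helper name `carriers_per_sample`, the sample IDs S1 and S2, and the toy records are all hypothetical, and a real analysis would more likely use an established VCF library and handle multi-allelic sites explicitly.

```python
def carriers_per_sample(vcf_lines):
    """Return {sample: [(chrom, pos, ref, alt), ...]} for non-reference genotypes."""
    samples, calls = [], {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]  # sample IDs follow the FORMAT column
            calls = {s: [] for s in samples}
            continue
        chrom, pos, _id, ref, alt = cols[:5]
        gt_index = cols[8].split(":").index("GT")
        for sample, field in zip(samples, cols[9:]):
            genotype = field.split(":")[gt_index]
            alleles = genotype.replace("|", "/").split("/")
            # Any allele other than reference ('0') or missing ('.') counts as a variant.
            if any(a not in ("0", ".") for a in alleles):
                calls[sample].append((chrom, int(pos), ref, alt))
    return calls


# Toy VCF content with made-up samples and sites.
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "1\t100\t.\tG\tA\t.\tPASS\t.\tGT\t0|1\t0|0",
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0|0\t1|1",
]
print(carriers_per_sample(toy_vcf))
```

The per-sample lists would still need gene and variant-class annotation before they could be laid out like the tumor table in the question.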

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.7 years ago by Sean Davis 27k