TCGA: Do TCGA cancer studies have mRNA expression data for Control/Normal samples?
7.0 years ago
komal.rathi ★ 3.8k

Hi everyone,

I am using the TCGA portal to get mRNA expression data for various cancer studies (e.g. lung, liver, thyroid, etc.). We have been on the lookout for control/normal samples for the cancer studies on TCGA. On the website we could find case/tumor samples but couldn't find any control samples.

Does anyone know of, or has anyone used, control/normal samples from TCGA and can point me to them? Or do you know of a good resource (preferably using RNASeq V2 RSEM normalized expression values or z-scores) for control/normal samples in tissues like lung, liver, thyroid, etc. (basically all the foregut tissues)?

Thanks!

RSEM controls TCGA normals RNASeq • 29k views
You can use TCGA-Assembler for that. There is a Nature Methods paper describing it (see the reference on the link).

When you download the data using the DownloadRNASeqData function, you can specify whether you want normal, primary tumor, recurrent tumor or metastatic samples. This downloads RNASeq V1 or V2 level 3 data (RSEM normalized or not). You will have to transform it into z-scores yourself, though.

You can do it by following this thread in Google groups: match the sample names (for matched samples), or take the average of the normal controls for the non-matched data.
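For the non-matched case, one possible way to do the transformation is sketched below. It is only an illustration under stated assumptions, not necessarily the method from the Google groups thread: each tumor value for a gene is scaled by the mean and standard deviation of the normal samples for that gene. The function name `zscores_vs_normals` and the expression values are made up for the example.

```python
from statistics import mean, stdev

def zscores_vs_normals(tumor_values, normal_values):
    """Return z-scores of tumor values relative to the normal samples of one gene."""
    mu = mean(normal_values)
    sd = stdev(normal_values)  # sample standard deviation; needs at least 2 normals
    if sd == 0:
        raise ValueError("normal samples have zero variance for this gene")
    return [(x - mu) / sd for x in tumor_values]

# Hypothetical (already log-transformed) expression values for a single gene.
normals = [8.0, 10.0, 12.0]
tumors = [10.0, 14.0, 6.0]
z = zscores_vs_normals(tumors, normals)
print(z)
```

Whether to log-transform the RSEM values first, and whether to use the normals or some other reference population, is a choice you have to make for your own analysis.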

Thanks, what russhh said worked for me, but I will definitely give this a try. Looks promising!

TCGA-Assembler out of service, any good alternative?

Hi,

Since TCGA data are now on the NCI website, how can I download gene expression data (FPKM) for breast cancer and associated normal tissue? I cannot find any "normal tissue" option (maybe I missed it).

Since this is a separate query, you might consider starting a new question.

There's certainly RNASeq data from matched normal samples (i.e., normal lung tissue from a lung cancer patient) for the lung samples, e.g. TCGA-44-2655-11 here.

So, there are a lot more TN (tumor samples that have matched normals) than NT (normal samples that have matched tumors). How is this possible? Shouldn't the number of TN be the same as the number of NT?

I don't know what you mean, that's certainly not what I thought I'd said - apologies.

There are very few control samples (i.e., normal lung tissue from individuals who do not have cancer), but for around 20-25% of the lung tumour samples, there is an associated matched-normal lung sample.

Hence, there are more tumour samples without a matched-normal sample than tumour samples with one.

What I meant is: I referred to this & this; sample names ending in 01 are Tumor and those ending in 11 are Normal. When I went to the data matrix on TCGA for LUAD, there are options like Tumor-matched & Normal-matched. Also, according to this:

• TN (Tumor, matched normal) - Data for a tumor tissue for which matched normal tissue exists.
• NT (Normal, matched tumor) - Data for normal tissue for which matched tumor tissue exists.

So I am a bit confused: shouldn't there be an equal number of TN & NT when you check the data matrix?

Hi komal.rathi, if I want to analyze the TCGA data discussed above with a differential expression test (for paired data), is the TN set too small compared with the NT set for a given cancer type? That might bias the result.

Maybe it would be better to use RNASeq data from normal samples (from individuals without any cancer) as the control set for the differential analysis against a given cancer? Could you point me to where I could get an RNASeq dataset that is comparable with TCGA?

Thanks!

ivivek_ngs

I am assuming you have the barcodes, e.g. TCGA-09-0364-01, for each of your samples. This is the code table I referred to. The last two digits tell you if it is a tumor or normal sample. I used TCGA-Assembler to first download everything and then extracted the matched Tumor and Normal samples. When you download from the data matrix, blue is for Matched Tumor samples and yellow is for Matched Normal samples.
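As a rough sketch of the "download everything, then extract the matched samples" step, the snippet below groups barcodes by patient (the first 12 characters) and keeps tumor samples whose patient also has a normal sample, and vice versa. It assumes sample-type codes 01-09 mean tumor and 10-19 mean normal. The function name `split_matched` and the barcodes are hypothetical, and this is not how TCGA-Assembler itself does it.

```python
from collections import defaultdict

def split_matched(barcodes):
    """Return (tn, nt): tumors with a matched normal, normals with a matched tumor."""
    by_patient = defaultdict(lambda: {"tumor": [], "normal": []})
    for bc in barcodes:
        code = bc[13:15]  # characters 14-15 hold the sample-type code
        if len(bc) < 15 or not code.isdigit():
            continue  # skip patient-level or malformed identifiers
        n = int(code)
        if 1 <= n <= 9:
            by_patient[bc[:12]]["tumor"].append(bc)
        elif 10 <= n <= 19:
            by_patient[bc[:12]]["normal"].append(bc)
    tn, nt = [], []
    for samples in by_patient.values():
        if samples["tumor"] and samples["normal"]:
            tn.extend(samples["tumor"])
            nt.extend(samples["normal"])
    return sorted(tn), sorted(nt)

# Hypothetical barcodes, made up for illustration.
barcodes = [
    "TCGA-AA-0001-01", "TCGA-AA-0001-11",                      # one tumor, one normal
    "TCGA-AA-0002-01",                                         # tumor only
    "TCGA-AA-0003-01", "TCGA-AA-0003-02", "TCGA-AA-0003-11",   # two tumors, one normal
]
tn, nt = split_matched(barcodes)
print(len(tn), len(nt))
```

Note that with this grouping the two lists need not have the same length whenever a patient contributes more than one sample of either type.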

But I just checked, there is no matched normal sample available for download for Ovarian serous cystadenocarcinoma in TCGA. I went to the data matrix portal, selected RNASeq and RNASeqV2 in Data Type, Level 3 in Data Level, and Tumor - matched & Normal - matched in Tumor/Normal section. It returned only Matched Tumor samples but no matched Normal samples. I guess they are not available for download yet.

Yes, I could not find the matched normal samples either, for both RNASeq and RNASeqV2 as the data type at Level 3. It also returned only blue codes, which are for matched tumor samples. So I guess it will not be possible for me to get even a small patient cohort with matched tumor and normal RNA-Seq data. Would it be helpful to download the clinical data from any other repositories? Any input on that? I have asked a question in another link, if you would like to answer.

ivivek_ngs I am not aware of any other repository but I will try to find it.

Oh, alright! Thanks!

Refer to the section "ExtractTissueSpecificSamples" on page 27.

5.0 years ago
JJ ▴ 570

Hi,

If you then look at one of the merged_only_clinical files, e.g. KIRC.merged_only_clinical_clin_format.txt, then look at the barcodes: https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode

The two digits at positions 14-15 of the barcode indicate the sample type.

Tumor types range from 01 to 09, normal types from 10 to 19, and control samples from 20 to 29.

So a first digit of 0 means tumor and a first digit of 1 means normal; e.g., 01 is a primary tumour.
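The rule above can be sketched in a few lines of Python. This is only an illustration of the position 14-15 convention; the function name `sample_category` is made up, and short patient-level identifiers are treated as "unknown" since they carry no sample-type code.

```python
def sample_category(barcode):
    """Classify a TCGA barcode as 'tumor', 'normal' or 'control' from characters 14-15."""
    if len(barcode) < 15:
        return None  # e.g. TCGA-08-0531: patient-level ID, no sample-type code
    code = barcode[13:15]  # positions 14-15, counting from 1
    if not code.isdigit():
        return None
    n = int(code)
    if 1 <= n <= 9:
        return "tumor"
    if 10 <= n <= 19:
        return "normal"
    if 20 <= n <= 29:
        return "control"
    return None

for bc in ["TCGA-09-0364-01", "TCGA-44-2655-11", "TCGA-08-0531"]:
    print(bc, sample_category(bc))
```

Longer barcodes (with vial, portion, analyte and so on after the sample code) work the same way, because only characters 14-15 are inspected.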

Some datasets will contain normals, some only cancer samples.

EDIT: RNASeq V2 RSEM normalized expression values are available over http://firebrowse.org as well.

Best,
Julia

OK, thanks. They should add this option to their search tool... It's a bit of a pain in the a#* ;)

For filenames that don't have positions 14-15, are positions 6-7 equivalent?

e.g.

TCGA-08-0531 -> Tumor ;
TCGA-12-0615 -> Control ;
TCGA-26-1438 -> Normal ;


Thanks for the link to firebrowse Julia. Great resource!

Nope, that is not the same.

Hi Julia,

As bann13 pointed out, I don't see the format that you mentioned in the KIRC.merged_only_clinical_clin_format.txt file; instead I see "tcga-3z-a93z", which is missing positions 14-15. I am looking for gene expression data from normal and cancer samples for lung cancer (LUAD). I have also checked the LUAD file and found the same format, e.g. tcga-05-4244.

Help will be appreciated.

In the clinical data you (mostly) won't have information about normal or tumor, i.e. positions 14-15, simply because the samples come from the same patient, so the clinical file does not duplicate that information.
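In practice this means the clinical IDs have to be linked to the sample barcodes through the shared patient part. A minimal sketch, assuming the patient part is the first 12 characters and that clinical IDs may be lower-case as in the files mentioned above; the function names and the clinical record are hypothetical:

```python
def patient_id(identifier):
    """Return the upper-case 12-character patient part of a TCGA identifier."""
    return identifier[:12].upper()

def attach_clinical(sample_barcodes, clinical_by_patient):
    """Map each sample barcode to its patient's clinical record (None if absent)."""
    lookup = {patient_id(k): v for k, v in clinical_by_patient.items()}
    return {bc: lookup.get(patient_id(bc)) for bc in sample_barcodes}

# Hypothetical clinical record and sample barcodes, for illustration only.
clinical = {"tcga-05-4244": {"gender": "male"}}
samples = ["TCGA-05-4244-01", "TCGA-05-4244-11", "TCGA-44-2655-11"]
linked = attach_clinical(samples, clinical)
for bc, record in linked.items():
    print(bc, record)
```

Both the tumor (01) and the normal (11) sample of a patient end up with the same clinical record; the tumor/normal distinction still comes from the sample barcode.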

