Question

How to retrieve paired samples of RNASeq data from TCGA? (Normal vs Tumor)

1

6.9 years ago

antmantras ▴ 80

Hello, everyone.

I want to download paired RNA-Seq samples (tumor and matched normal from the same patient) from TCGA. From what I've read, the data can be downloaded from https://gdac.broadinstitute.org/. Specifically, I'm looking for breast cancer samples, so I looked in the mRNASeq section.

Inside this section there are several files to download (I want the raw counts) but I don't know what the differences are between the files:

illuminahiseq_rnaseqv2-RSEM_genes (MD5)
illuminahiseq_rnaseq-gene_expression (MD5)

I would also like to know how to filter these files to keep the paired samples. I found something about the sample codes in a previous post, but I couldn't access the link to the explanation (https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode).

Could you help me with this problem please?

PS: Suggestions about other databases are welcome!

rna-seq tcga cancer • 3.7k views

ADD COMMENT • link updated 6.9 years ago by Kevin Blighe 89k • written 6.9 years ago by antmantras ▴ 80

score 2 · Accepted Answer · 2018-08-05

6.9 years ago

Kevin Blighe 89k

The file manifest that you used to download the data contains a UUID for each file. Use that to look up the TCGA barcode for each file via the function given in this Biostars comment: Sample names for TCGA data from GDC-legacy archive
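For readers who cannot reach the linked function, here is a minimal sketch of a UUID-to-barcode lookup. It is not the function from the linked comment. It assumes the current GDC API `files` endpoint (not the legacy archive) and assumes that the `associated_entities.entity_submitter_id` field carries the barcode; both should be checked against the GDC API documentation. The helper names and the placeholder UUID are hypothetical.

```python
import json
import urllib.request

# Assumption: the public GDC API files endpoint (not the legacy archive).
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_query(file_uuids):
    """Build a POST payload asking for the barcodes attached to each file UUID."""
    uuids = list(file_uuids)
    return {
        "filters": {
            "op": "in",
            "content": {"field": "files.file_id", "value": uuids},
        },
        # Assumption: this field holds the TCGA barcode of the profiled aliquot.
        "fields": "file_id,associated_entities.entity_submitter_id",
        "format": "JSON",
        "size": str(len(uuids)),
    }


def extract_barcodes(response):
    """Map file UUID -> list of barcodes from a decoded JSON response."""
    mapping = {}
    for hit in response["data"]["hits"]:
        entities = hit.get("associated_entities", [])
        mapping[hit["file_id"]] = [e["entity_submitter_id"] for e in entities]
    return mapping


def uuid_to_barcode(file_uuids):
    """Query the GDC API (needs network access) and return UUID -> barcodes."""
    payload = json.dumps(build_query(file_uuids)).encode("utf-8")
    request = urllib.request.Request(
        GDC_FILES_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return extract_barcodes(json.load(reply))
```

Calling `uuid_to_barcode(list_of_uuids_from_manifest)` would then give one or more barcodes per file, which can be matched as described next.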

You can then infer Tumour-Normal pairings by matching on the TCGA barcode. Yet more information on barcodes:

I have already analysed the TCGA BRCA data many times and there are ~111 Tumour-Normal pairs for the RNA-seq data.
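To illustrate the matching step, here is a small sketch in Python. It relies on the barcode layout `TCGA-<site>-<participant>-<sample type><vial>-...`, where sample type `01` is a primary solid tumour and `11` is solid tissue normal. The function names are made up for this example, and the third barcode in the usage note is a hypothetical unpaired sample.

```python
import re
from collections import defaultdict

# Patient ID = first three fields; the two digits after it are the sample type.
BARCODE_RE = re.compile(r"^(TCGA-[A-Z0-9]{2}-[A-Z0-9]{4})-(\d{2})")


def parse_barcode(barcode):
    """Return (patient_id, sample_type_code), or None if the format is unexpected."""
    match = BARCODE_RE.match(barcode.upper())
    if match is None:
        return None
    return match.group(1), match.group(2)


def find_pairs(barcodes, tumour_codes=("01",), normal_codes=("11",)):
    """Return {patient_id: (tumour_barcode, normal_barcode)} for paired patients.

    If a patient has several tumour or normal samples, the alphabetically
    first barcode is kept; that tie-break is an arbitrary choice for this sketch.
    """
    by_patient = defaultdict(lambda: {"tumour": [], "normal": []})
    for barcode in barcodes:
        parsed = parse_barcode(barcode)
        if parsed is None:
            continue
        patient, code = parsed
        if code in tumour_codes:
            by_patient[patient]["tumour"].append(barcode)
        elif code in normal_codes:
            by_patient[patient]["normal"].append(barcode)
    return {
        patient: (sorted(samples["tumour"])[0], sorted(samples["normal"])[0])
        for patient, samples in sorted(by_patient.items())
        if samples["tumour"] and samples["normal"]
    }
```

For example, `find_pairs(["TCGA-A6-2675-11A-01R-1723-07", "TCGA-A6-2675-01A-02R-1723-07", "TCGA-XX-0001-01A-01R-0000-07"])` keeps only patient `TCGA-A6-2675`, because the hypothetical `TCGA-XX-0001` has no normal sample. Changing `tumour_codes` (for instance, to also accept `06`, metastatic) changes how many pairs are found.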

Kevin

ADD COMMENT • link 6.9 years ago by Kevin Blighe 89k

0

Hey, Kevin, thanks for your answer.

Following your advice, I took the data from the illuminahiseq_rnaseq-gene_expression (MD5) file of the TCGA database (for breast cancer) and found 97 paired samples. To do this, I kept only the patients that have more than one sample, for example: TCGA.A6.2675.11A.01R.1723.07 and TCGA.A6.2675.01A.02R.1723.07. In this case, the first (sample type code 11) would be healthy tissue and the second (sample type code 01) tumor tissue.

Am I doing something wrong? I'm saying this because of your previous answer where you said there were about 111 paired samples.

I have also analyzed the available COAD data and found only 26 paired samples with this method.
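The filtering described above can be sketched as follows, assuming a tab-separated expression table whose column names are dotted barcodes like the ones quoted. The toy table, its values and the `TCGA.XX.0001...` column are made up purely for illustration, and the function names are hypothetical. Only sample types 01 (tumor) and 11 (normal) are considered, and the first column found per patient and type is kept.

```python
import csv
import io
import re

# Accept both "." and "-" as field separators in the barcode.
SAMPLE_RE = re.compile(r"^TCGA[.-]([A-Z0-9]{2})[.-]([A-Z0-9]{4})[.-](\d{2})")


def split_sample(column_name):
    """Return (patient_id, sample_type_code), or None for non-sample columns."""
    match = SAMPLE_RE.match(column_name)
    if match is None:
        return None
    return "TCGA-{}-{}".format(match.group(1), match.group(2)), match.group(3)


def paired_columns(header):
    """Return [(patient_id, tumor_column_index, normal_column_index), ...]."""
    tumor, normal = {}, {}
    for index, name in enumerate(header):
        parsed = split_sample(name)
        if parsed is None:
            continue  # e.g. the gene identifier column
        patient, code = parsed
        if code == "01":
            tumor.setdefault(patient, index)   # keep the first tumor column
        elif code == "11":
            normal.setdefault(patient, index)  # keep the first normal column
    return [(p, tumor[p], normal[p]) for p in sorted(tumor.keys() & normal.keys())]


# Made-up toy table: one gene, one paired patient, one unpaired patient.
TOY_TABLE = (
    "gene\tTCGA.A6.2675.11A.01R.1723.07\tTCGA.A6.2675.01A.02R.1723.07"
    "\tTCGA.XX.0001.01A.01R.0000.07\n"
    "GENE_A\t10\t25\t7\n"
)

rows = list(csv.reader(io.StringIO(TOY_TABLE), delimiter="\t"))
pairs = paired_columns(rows[0])
for patient, tumor_idx, normal_idx in pairs:
    for row in rows[1:]:
        print(patient, row[0], "tumor:", row[tumor_idx], "normal:", row[normal_idx])
```

Keeping only the first column per patient and sample type is one of several possible choices; a different rule for patients with several vials or aliquots will give a different number of pairs.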

ADD REPLY • link 6.9 years ago by antmantras ▴ 80

0

Hey, yes, the first sample is healthy tissue, whilst the second is tumour.

The number of matched Tumour-Normal pairs will vary depending on the exact data that you obtain and on the filtering that's applied to the samples. 97 is a number that I've seen, too!

ADD REPLY • link 6.9 years ago by Kevin Blighe 89k