Question: How to retrieve paired samples of RNASeq data from TCGA? (Normal vs Tumor)
antmantras wrote (21 months ago):

Hello, everyone.

I want to download paired samples from the TCGA RNA-Seq experiments. I've been looking for information on how to download this data, and it looks like it's available from [link]. Specifically, I'm looking for breast cancer samples, so I looked in the mRNASeq section.

Inside this section there are several files to download (I want the raw counts), but I don't know what the differences are between these two:

illuminahiseq_rnaseqv2-RSEM_genes (MD5)
illuminahiseq_rnaseq-gene_expression (MD5)

I would also like to know how to filter these files to keep only the paired samples. I found something about the sample codes in a previous post, but I couldn't access the link to the explanation.

Could you help me with this problem please?

PS: Suggestions about other databases are welcome!

Tags: cancer, rna-seq, tcga

Kevin Blighe wrote (21 months ago):

In the file manifest that you used to download the data, each file has a UUID. Use that to look up the TCGA barcode for each file via the function in this post: "Sample names for TCGA data from GDC-legacy archive".
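As a rough illustration of the lookup step, here is a minimal Python sketch that builds a request body for the GDC API `files` endpoint, asking for the barcodes attached to a list of file UUIDs. This is not the function linked above. The helper names are hypothetical, the field names (`files.file_id`, `associated_entities.entity_submitter_id`) and the response shape are assumptions about the GDC API that should be checked against its documentation, and the UUID is a made-up placeholder. The code only builds and parses JSON; no request is sent.

```python
import json


def build_uuid_query(uuids, size=None):
    """Build a JSON body for a GDC `files` query (field names are assumptions)."""
    uuids = list(uuids)
    return {
        "filters": {
            "op": "in",
            "content": {"field": "files.file_id", "value": uuids},
        },
        # Ask only for the file UUID and the barcode(s) associated with it.
        "fields": "file_id,associated_entities.entity_submitter_id",
        "format": "JSON",
        "size": size if size is not None else len(uuids),
    }


def barcodes_from_hits(response):
    """Map file UUID -> list of barcodes from a parsed response (assumed shape)."""
    mapping = {}
    for hit in response.get("data", {}).get("hits", []):
        entities = hit.get("associated_entities", [])
        mapping[hit["file_id"]] = [e["entity_submitter_id"] for e in entities]
    return mapping


# Placeholder UUID, for illustration only.
payload = build_uuid_query(["00000000-0000-0000-0000-000000000000"])
print(json.dumps(payload, indent=2))
```

The resulting body could then be sent as a POST to the `files` endpoint with whichever HTTP client you prefer, and the parsed JSON reply passed to `barcodes_from_hits`.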

You can then infer Tumour-Normal pairings by matching on the TCGA barcode. Yet more information on barcodes:

I have already analysed the TCGA BRCA data many times and there are ~111 Tumour-Normal pairs for the RNA-seq data.



Hey, Kevin, thanks for your answer.

Following your advice, I took the data from the illuminahiseq_rnaseq-gene_expression (MD5) file of the TCGA database (for breast cancer) and found 97 paired samples. To get them, I kept only the patients that appear more than once among the samples, for example: TCGA.A6.2675.11A.01R.1723.07 and TCGA.A6.2675.01A.02R.1723.07. In this case, the first would be healthy tissue and the second tumor tissue.
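The pairing described here could be sketched in Python roughly as follows. It relies on the TCGA barcode layout: the first three fields identify the patient, and the first two digits of the fourth field are the sample-type code (01 to 09 tumour, 10 to 19 normal). The function names are hypothetical, and the third barcode in the example is invented to show an unpaired patient. Both `-` and `.` separators are accepted, since R-style column names replace dashes with dots.

```python
import re
from collections import defaultdict


def parse_barcode(barcode):
    """Return (patient_id, sample_type_code) from a TCGA barcode."""
    parts = re.split(r"[-.]", barcode)
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA sample barcode: {barcode!r}")
    patient = "-".join(parts[:3])      # e.g. TCGA-A6-2675
    sample_type = int(parts[3][:2])    # e.g. 11 from '11A'
    return patient, sample_type


def find_pairs(barcodes):
    """Group barcodes by patient; keep patients with both tumour and normal."""
    by_patient = defaultdict(lambda: {"tumour": [], "normal": []})
    for bc in barcodes:
        patient, code = parse_barcode(bc)
        if 1 <= code <= 9:
            by_patient[patient]["tumour"].append(bc)
        elif 10 <= code <= 19:
            by_patient[patient]["normal"].append(bc)
        # codes 20 and above (controls) are ignored
    return {p: g for p, g in by_patient.items() if g["tumour"] and g["normal"]}


columns = [
    "TCGA.A6.2675.11A.01R.1723.07",
    "TCGA.A6.2675.01A.02R.1723.07",
    "TCGA.XX.0000.01A.01R.0000.07",  # hypothetical tumour-only patient
]
pairs = find_pairs(columns)
print(len(pairs), "paired patient(s):", sorted(pairs))
```

Note that counting duplicated patient IDs alone can also pick up patients with two tumour samples and no normal, so checking the sample-type code explicitly is the safer filter.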

Am I doing something wrong? I ask because your previous answer mentioned about 111 paired samples.

I have also analyzed the available COAD data and only found 26 paired samples with this method.

(Reply written 21 months ago by antmantras)

Hey, yes, the first sample is healthy tissue, whilst the second is tumour.

The number of matched Tumour-Normal pairs will vary depending on the exact data that you obtain, and also on the filtering that is applied to the samples. 97 is a number that I've seen, too! It varies.
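To illustrate how the filter choice alone can change the count, here is a small Python sketch that counts paired patients under two definitions: a strict one (primary solid tumour, code 01, against solid tissue normal, code 11) and a broader one (any tumour code 01 to 09 against any normal code 10 to 19). The function names are hypothetical, and every barcode except the TCGA-A6-2675 pair is invented for the example.

```python
import re


def patient_and_code(barcode):
    """Return (patient_id, two-digit sample-type code as a string)."""
    parts = re.split(r"[-.]", barcode)
    return "-".join(parts[:3]), parts[3][:2]


def count_pairs(barcodes, tumour_codes, normal_codes):
    """Count patients with at least one tumour and one normal sample."""
    tumour, normal = set(), set()
    for bc in barcodes:
        patient, code = patient_and_code(bc)
        if code in tumour_codes:
            tumour.add(patient)
        elif code in normal_codes:
            normal.add(patient)
    return len(tumour & normal)


samples = [
    "TCGA-A6-2675-01A-02R-1723-07",  # primary tumour
    "TCGA-A6-2675-11A-01R-1723-07",  # solid tissue normal
    "TCGA-XX-0001-06A-01R-0000-07",  # hypothetical: metastatic tumour only
    "TCGA-XX-0001-11A-01R-0000-07",  # hypothetical: its matched normal
]

strict = count_pairs(samples, {"01"}, {"11"})
broad = count_pairs(
    samples,
    {f"{i:02d}" for i in range(1, 10)},   # 01-09: tumour types
    {f"{i:02d}" for i in range(10, 20)},  # 10-19: normal types
)
print("strict:", strict, "broad:", broad)
```

Other choices, such as which data release is used or whether low-quality samples are dropped first, would shift the number in the same way.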

(Reply written 21 months ago by Kevin Blighe)