Question: TCGA: What are mRNA expression z-scores? Does TCGA have mRNA expression from controls?
6.6 years ago by
Children's Hospital of Philadelphia, Philadelphia, PA
komal.rathi3.7k wrote:

Hi everyone,

I am using the TCGA portal to get mRNA expression data for various cancer studies (e.g. lung, liver, thyroid etc). I have two questions about the data:

1. Some cancer studies on TCGA have "mRNA expression RNASeq V2 RSEM" values and corresponding "z-scores". I am confused about what the "mRNA expression z-Scores (RNA Seq V2 RSEM)" data consist of. How are the z-scores calculated and what do they represent?

2. We have been on the lookout for control datasets for the cancer studies on TCGA. Does anyone know of a good place to find control datasets for tissues like lung, liver, thyroid etc. (basically all the foregut tissues)? We are working with control data from GTEx, but they have RPKM values and TCGA has RSEM/RSEM z-scored values, so we have to do a lot of scaling/normalization/transformation to compare these disparate datasets. We would like to know if there is any mRNA expression data (obtained via RNASeq V2 RSEM) for controls.


UPDATE: I have posted the second part as a separate question TCGA: Does TCGA cancer studies have mRNA expression data for Control/Normal samples? .

rsem z-scores tcga • 41k views
6.6 years ago by
David Fredman1.1k
University of Bergen, Norway
David Fredman1.1k wrote:

I will attempt to answer part 1 of your question:

A z-score for a sample indicates how many standard deviations its expression lies from the mean expression in the reference population. The formula is:

z = (expression in tumor sample - mean expression in reference samples) / standard deviation of expression in reference samples

TCGA states: "For mRNA and microRNA expression data, we typically compute the relative expression of an individual gene and tumor to the gene's expression distribution in a reference population. That reference population is either all tumors that are diploid for the gene in question, or, when available, normal adjacent tissue." (It is not always clear what the cell of origin of a tumor is, so the mRNA expression in normal adjacent tissue can sometimes be misleading, which is why expression is sometimes compared within the set of tumors only).
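
To make the formula concrete, here is a minimal sketch in Python with NumPy. It is not the portal's actual pipeline: the function name, the genes-by-samples layout, and the use of the sample standard deviation (`ddof=1`) are all assumptions for illustration. It also uses one reference mask for every gene, whereas a "diploid for the gene in question" reference would need a separate mask per gene.

```python
import numpy as np

def zscores_vs_reference(expr, is_reference):
    """Z-score every sample against a reference subset of samples.

    expr         : genes x samples array of expression values
    is_reference : boolean mask over samples marking the reference
                   population (e.g. normals or diploid tumors)
    Both names are hypothetical; ddof=1 is an assumption.
    """
    expr = np.asarray(expr, dtype=float)
    ref = expr[:, np.asarray(is_reference, dtype=bool)]
    mean = ref.mean(axis=1, keepdims=True)
    sd = ref.std(axis=1, ddof=1, keepdims=True)
    sd[sd == 0] = np.nan  # gene is constant in the reference: z is undefined
    return (expr - mean) / sd

# Toy data: three reference samples followed by one tumor sample.
expr = [[1, 2, 3, 10],
        [5, 5, 5, 7]]
z = zscores_vs_reference(expr, [True, True, True, False])
print(z)
```

In the toy data the first gene has a reference mean of 2 and a standard deviation of 1, so the tumor value of 10 gets z = 8. The second gene does not vary in the reference, so its z-scores are left as NaN rather than dividing by zero.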

As for part 2, CPM (counts per million) data for each gene and sample would be ideal for cross-sample comparisons, but I am not sure where you could get such data. Maybe you should post this as a separate question?


Thanks! I will create another question for the second part!

written 6.6 years ago by komal.rathi3.7k

Thanks for the information. From reading the literature and the comments, my understanding of the z-score calculation is:

  1. Convert the RPKM values of each gene into log values.

  2. Calculate the mean and standard deviation of gene X's log values across the 20 lung tissue samples (suppose I have data for 20 samples).

  3. For the first lung tissue sample: (gene X log value - mean of log values of the 20 lung tissues) / standard deviation of log values of the 20 lung tissues.

  4. Now I have the z-score for gene X in the first lung tissue sample. Using the above protocol, I can convert all genes' log values into z-scores.
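
The four steps above can be sketched as follows. This is only an illustration of the protocol as described: the function name, the log base (`log2`), and the pseudocount added to avoid `log(0)` are assumptions that the steps do not specify.

```python
import numpy as np

def log_zscores(rpkm, pseudocount=1.0):
    """Per-gene z-scores of log-transformed RPKM within one cohort.

    rpkm : genes x samples array of RPKM values
    log2 and the pseudocount are assumptions, not part of the
    protocol described above.
    """
    rpkm = np.asarray(rpkm, dtype=float)
    logged = np.log2(rpkm + pseudocount)                 # step 1
    mean = logged.mean(axis=1, keepdims=True)            # step 2
    sd = logged.std(axis=1, ddof=1, keepdims=True)       # step 2
    sd[sd == 0] = np.nan  # gene does not vary: z is undefined
    return (logged - mean) / sd                          # steps 3 and 4

# One gene measured in four samples.
z = log_zscores([[0, 1, 3, 7]])
print(z)
```

Note that the mean and standard deviation here come from the same samples that are being scored, so each gene's z-scores average to zero across the cohort. That differs from scoring tumors against a separate reference population, so values computed this way need not match z-scores published by a portal.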

Is the above protocol correct? Please advise.

Do these z-scores really have meaning? The z-scores COSMIC provides look like this:

1337808    TCGA-02-2483-01    SFMBT1    over           2.416    329
1337808    TCGA-02-2483-01    SGCE    normal          -0.274    329

If I calculate the z-score using the above approach, should I be able to tell whether a gene is over-regulated or normally regulated?
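
Turning a z-score into a label like those in the two rows above comes down to a cutoff. The sketch below assumes a symmetric cutoff of 2, which agrees with both rows (2.416 is "over", -0.274 is "normal") but is an assumption here and should be checked against the COSMIC documentation. The function name and the "under" label are also hypothetical.

```python
def label_expression(z, threshold=2.0):
    """Label a z-score as 'over', 'under' or 'normal'.

    threshold=2.0 is an assumed cutoff, not taken from the data source.
    """
    if z > threshold:
        return "over"
    if z < -threshold:
        return "under"
    return "normal"

for gene, z in [("SFMBT1", 2.416), ("SGCE", -0.274)]:
    print(gene, z, label_expression(z))
```

The labels will only agree with a published set if the z-scores were computed against the same reference population and with the same transformation.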


modified 16 months ago by Ram32k • written 5.8 years ago by vyom8430

Dear sir: Thank you for your information. I don't know how to download the diploid information from TCGA. However, I checked the TCGA wiki and found that the diploid information is in each VCF file. I guess I must download the VCF files to get the diploid information. Is that right?

written 4.4 years ago by biofuturecom0