TCGA: What are mRNA expression z-scores? Does TCGA have mRNA expression from controls?
8.7 years ago
komal.rathi ★ 4.1k

I am using the TCGA portal to get mRNA expression data for various cancer studies (e.g. lung, liver, thyroid). I have two questions about the data:

1. Some cancer studies on TCGA have "mRNA expression RNASeq V2 RSEM" values and corresponding "z-scores". I am confused about what the "mRNA expression z-Scores (RNA Seq V2 RSEM)" data consists of. How are the z-scores calculated, and what do they represent?
2. We have been looking for control datasets for the cancer studies on TCGA. Does anyone know of a good place to find control datasets for tissues like lung, liver, thyroid etc. (basically all the foregut tissues)? We are working with control data from GTEx, but GTEx provides RPKM values while TCGA provides RSEM values and RSEM z-scores, so we have to do a lot of scaling, normalization and transformation to compare these disparate datasets. We would like to know if there is any mRNA expression data (obtained via RNASeq V2 RSEM) for controls.

23
8.7 years ago
David Fredman ★ 1.1k

A z-score for a sample indicates how many standard deviations its expression lies from the mean expression in the reference population. The formula is:

z = (expression in tumor sample - mean expression in reference samples) / (standard deviation of expression in reference samples)
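As a minimal sketch of the formula above, assuming the reference expression values for one gene are already available as a plain list (the function name `z_score` and the example numbers are made up for illustration, and the use of the sample standard deviation is an assumption, since the formula does not say which estimator is used):

```python
import statistics


def z_score(tumor_value, reference_values):
    """Return how many reference SDs tumor_value lies from the reference mean."""
    mean = statistics.mean(reference_values)
    # Assumption: sample standard deviation (n - 1 denominator).
    sd = statistics.stdev(reference_values)
    return (tumor_value - mean) / sd


# Hypothetical expression values for one gene in the reference population.
reference = [10.0, 12.0, 14.0, 16.0, 18.0]
print(round(z_score(20.0, reference), 3))
```

A positive result means the tumor sample expresses the gene above the reference mean, a negative one below it.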

TCGA states:

For mRNA and microRNA expression data, we typically compute the relative expression of an individual gene and tumor to the gene's expression distribution in a reference population. That reference population is either all tumors that are diploid for the gene in question, or, when available, normal adjacent tissue.

(It is not always clear what the cell of origin of a tumor is, so the mRNA expression in normal adjacent tissue can sometimes be misleading, which is why expression is sometimes compared within the set of tumors only).

As for part 2, CPM (counts per million) data for each gene and sample would be ideal for cross-sample comparisons, but I am not sure where you could get such data. Maybe you should post this as a separate question?
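For reference, CPM simply rescales each gene's raw read count by the sample's total count. A small sketch, assuming raw counts for one sample are held in a dict keyed by gene (the function name `counts_per_million` and the gene names and counts are invented for illustration):

```python
def counts_per_million(counts):
    """Scale raw read counts for one sample so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Hypothetical raw counts for a single sample.
sample_counts = {"GENE_A": 500, "GENE_B": 1500, "GENE_C": 8000}
print(counts_per_million(sample_counts))
```

This only needs raw counts per gene and sample, which is why it is convenient when the same calculation can be applied to both datasets.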

0
0
Thanks for the information. From reading the literature and the comments, my understanding of how the z-score is calculated is:

1. Convert the RPKM values of each gene into log values.
2. Calculate the mean and standard deviation of the log values of gene X across 20 lung tissue samples (suppose I have data for 20 samples).
3. For the first lung tissue sample: (log value of gene X - mean of the log values across the 20 lung tissues) / standard deviation of the log values across the 20 lung tissues.
4. Now I have the z-score for gene X in the first lung tissue sample. Using the same protocol, I can convert the log values of all genes into z-scores.
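The steps above can be sketched as follows. This follows the protocol exactly as described, so the cohort serves as its own reference. The log base (2), the pseudocount of 1 added to avoid taking the log of zero, and the sample standard deviation are all assumptions, and the function name and RPKM values are hypothetical:

```python
import math
import statistics


def log_z_scores(values, pseudocount=1.0):
    """Log-transform one gene's values across samples, then z-score them."""
    # Step 1: log-transform (assumed: log2 with a pseudocount).
    logs = [math.log2(v + pseudocount) for v in values]
    # Step 2: mean and standard deviation across all samples.
    mean = statistics.mean(logs)
    sd = statistics.stdev(logs)
    # Steps 3-4: z-score of every sample for this gene.
    return [(x - mean) / sd for x in logs]


# Hypothetical RPKM values of gene X in four samples.
rpkm_gene_x = [0.0, 1.0, 3.0, 7.0]
print([round(z, 3) for z in log_z_scores(rpkm_gene_x)])
```

Because the mean and standard deviation come from the same samples being scored, the z-scores for each gene always average to zero across the cohort.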

My question is whether the above protocol is correct. Please advise.

Do these z-scores really have meaning? These are the z-scores that COSMIC provides:

ID_SAMPLE    SAMPLE_NAME        GENE_NAME    REGULATION    Z_SCORE    ID_STUDY
1337808      TCGA-02-2483-01    SFMBT1       over          2.416      329
1337808      TCGA-02-2483-01    SGCE         normal        -0.274     329


If I calculate the z-scores using the above approach, should I be able to tell whether a gene is over-expressed or normally expressed?
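Turning a z-score into a label like the REGULATION column comes down to applying a cutoff. A minimal sketch, where the cutoff of 2, the "under" label, and the function name are assumptions made for illustration (the cutoff is merely consistent with the two rows shown, not a confirmed COSMIC rule):

```python
def classify_regulation(z, cutoff=2.0):
    """Label a z-score as over, under, or normal using a symmetric cutoff."""
    # Assumption: |z| above the cutoff counts as dysregulated.
    if z > cutoff:
        return "over"
    if z < -cutoff:
        return "under"
    return "normal"


for gene, z in [("SFMBT1", 2.416), ("SGCE", -0.274)]:
    print(gene, classify_regulation(z))
```

The labels are only as meaningful as the reference population behind the z-scores, so scores computed against a different reference need not reproduce the published labels.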

0
Thank you for the information. I don't know how to download the diploid information from TCGA. I checked the TCGA wiki and found that the diploid information is in each VCF file, so I guess I must download the VCF files to get it. Is that right?