Question: Calculate z-score from RPKM values
5.4 years ago by
United States
vyom8430 wrote:

Hello everyone,

The RNA-seq data from TCGA contain z-scores instead of RPKM values. I want to analyze TCGA data together with GTEx data, but the GTEx data contain RPKM and count values. There is a tutorial explaining the z-score calculation, but I am not sure whether it applies to RNA-seq or whether it is the right tutorial.

Can someone please guide me on how to calculate gene z-scores from GTEx RNA-seq data (RPKM and count data)? If possible, please also verify whether the code in the tutorial mentioned above is correct.


gtex rna-seq tcga z-score • 9.8k views
modified 5.4 years ago • written 5.4 years ago by vyom8430

Where are you downloading your TCGA data from: the TCGA Data Portal or elsewhere? TCGA provides two types of RNA-seq data. The older pipeline is just called RNAseq, while the newer pipeline is referred to as RNAseqV2. Which one are you using? Both versions provide raw counts, along with RPKM for the old pipeline and RSEM scaled estimate counts for the RNAseqV2 data. These data are not z-scores. Please see this site for more details on RNAseqV2 data:

written 5.4 years ago by alolex910

Thanks for the reply and the link. 

I downloaded the data directly from COSMIC. Below is the head of the expression file:

1337808    TCGA-02-2483-01    SFMBT1      over       2.416    329
1337808    TCGA-02-2483-01    SGCE        normal    -0.274    329
1337808    TCGA-02-2483-01    RCAN2       normal    -0.577    329
1337808    TCGA-02-2483-01    KIAA0895    normal    -0.333    329
1337808    TCGA-02-2483-01    LIN7C       normal     0.428    329
1337808    TCGA-02-2483-01    NDUFV2      normal     0.790    329
1337808    TCGA-02-2483-01    AKAP14      normal    -0.275    329
1337808    TCGA-02-2483-01    ST20        normal     0.808    329
1337808    TCGA-02-2483-01    CLCF1       normal    -0.640    329

I think the z-score is easy to understand and explain, so I want to convert the GTEx data (counts or RPKM values) into z-scores. Please advise how to do this; I have both the RPKM values and the read counts. Below is a sample of the GTEx data:

TargetID    Gene_Symbol    Chr    Coord    GTEX-N7MS-0007-SM-2D7W1    GTEX-N7MS-0008-SM-4E3JI    GTEX-N7MS-0011-R10A-SM-2HMJK    GTEX-N7MS-0011-R11A-SM-2HMJS    GTEX-N7MS-0011-R1a-SM-2HMJG    GTEX-N7MS-0011-R2a-SM-2HML6
ENST00000390859.1    ENSG00000212161.1    1    159821762    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
ENST00000290551.4    ENSG00000159388.5    1    203274619    4356.000000    2685.999512    6367.449707    9156.000000    4144.000000    5143.111328
ENST00000475157.1    ENSG00000159388.5    1    203274619    0.000000    18.000416    8.550518    0.000000    0.000000    32.888611
ENST00000429660.1    ENSG00000223683.1    1    63727784    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
ENST00000473120.1    ENSG00000092850.7    1    36549676    0.000000    0.000000    19.124006    18.559824    0.000000    23.603249
ENST00000207457.3    ENSG00000092850.7    1    36549676    0.000000    0.000000    112.875992    53.440178    207.776093    112.396751
ENST00000469024.1    ENSG00000092850.7    1    36549676    0.000000    0.000000    0.000000    0.000000    52.223907    0.000000
ENST00000489568.1    ENSG00000162461.7    1    16062900    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
ENST00000294454.5    ENSG00000162461.7    1    16062900    68.000000    132.000000    696.000000    5004.000000    512.000000    312.000000



written 5.4 years ago by vyom8430

I don't use COSMIC much, so I researched this a little. My thought would be to calculate the z-score the way COSMIC does, but the help file they link to for more information is broken. You may want to contact them about this. Then I found this Biostars answer by @David Fredman, "TCGA: What are mRNA expression z-scores? Does TCGA have mRNA expression from controls?", which I think answers your question. It looks like COSMIC uses the z-score so that gene expression is comparable across the various platforms in TCGA; however, it is not clear whether COSMIC calculated the z-scores or took them straight from TCGA. I think this is a good way to go, since you are comparing yet another platform. However, note in the response by @David Fredman that TCGA uses the distribution of the normal tissues (sometimes!) as the control distribution. Be careful when comparing these z-scores to your data if you do not have a comparable control, because you will get false positive results. My suggestions would be to 1) contact COSMIC to ask how they calculated these z-scores and 2) if that doesn't work out, download the original TCGA count data and calculate the z-scores yourself so that you know what is being used as the control distribution.
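To make the "control distribution" point concrete, here is a minimal Python sketch of z-scores computed against a separate reference group rather than against the query samples themselves. The function name `zscore_vs_reference` and the layout (genes in rows, samples in columns, values already log-transformed) are assumptions for illustration; this is not COSMIC's or TCGA's documented procedure.

```python
import numpy as np

def zscore_vs_reference(query_log, reference_log):
    """Z-score each query sample against a reference (control) group.

    Both inputs are 2-D, genes in rows and samples in columns, and are
    assumed to be log-transformed and normalized the same way.
    """
    query_log = np.asarray(query_log, dtype=float)
    reference_log = np.asarray(reference_log, dtype=float)

    # Mean and sample standard deviation come from the reference group only.
    ref_mean = reference_log.mean(axis=1, keepdims=True)
    ref_sd = reference_log.std(axis=1, ddof=1, keepdims=True)

    # A gene with no variation in the reference cannot be scored.
    ref_sd[ref_sd == 0] = np.nan
    return (query_log - ref_mean) / ref_sd
```

Changing which samples go into `reference_log` changes every z-score, which is why scores built on different control groups should not be compared directly.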

written 5.4 years ago by alolex910

Just out of curiosity, are you trying to compare GTEx & TCGA? If yes, then why not use only TCGA as it already has data for matched Normal-Tumor samples. 

modified 5.4 years ago • written 5.4 years ago by komal.rathi3.7k
TCGA's "tumor adjacent" normals for rna-seq are not perfectly normal, due to lymphocyte infiltration or some other tumor microenvironment or epigenetic effects that need further investigation.
modified 5.4 years ago • written 5.4 years ago by Cyriac Kandoth5.5k

Never gave it a thought. Thanks!

written 5.4 years ago by komal.rathi3.7k
5.4 years ago by
Sean Davis26k
National Institutes of Health, Bethesda, MD
Sean Davis26k wrote:

You can calculate a z-score by subtracting the mean and dividing by the standard deviation for each gene.  You may want to transform your data before doing this (log-transform or other).  
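A minimal sketch of that calculation in Python with NumPy, assuming a matrix with genes in rows and samples in columns. The function name `gene_zscores`, the log2 transform, and the pseudocount of 1 are illustrative choices, not something the thread prescribes.

```python
import numpy as np

def gene_zscores(expr, pseudocount=1.0):
    """Per-gene z-scores across samples (genes in rows, samples in columns)."""
    # Log-transform first; the pseudocount keeps zero values finite.
    logged = np.log2(np.asarray(expr, dtype=float) + pseudocount)

    # Mean and sample standard deviation of each gene across all samples.
    mean = logged.mean(axis=1, keepdims=True)
    sd = logged.std(axis=1, ddof=1, keepdims=True)

    # Genes with zero variance give 0/0; report them as NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (logged - mean) / sd
    z[~np.isfinite(z)] = np.nan
    return z
```

Genes that are zero in every sample (common in the GTEx table above) have no variance, so they come back as NaN rather than as a score.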

written 5.4 years ago by Sean Davis26k

It should probably be pointed out that while this will yield z-scores, the scores themselves may or may not be very meaningful. In fact, unless all of the samples have been normalized in a meaningful way ahead of time the resulting z-scores won't be comparable.
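As one simple illustration of normalizing ahead of time, the sketch below rescales raw counts to counts per million (CPM) so that samples sequenced at different depths are on the same scale before any z-score is taken. CPM is an assumption chosen for brevity here; it is not stated anywhere in this thread to be what TCGA, GTEx, or COSMIC use, and it does not correct for composition or batch effects. The function name `counts_to_log_cpm` is hypothetical.

```python
import numpy as np

def counts_to_log_cpm(counts, pseudocount=1.0):
    """Library-size normalize raw counts (genes x samples) and log-transform."""
    counts = np.asarray(counts, dtype=float)

    # Total reads per sample (column sums).
    lib_sizes = counts.sum(axis=0, keepdims=True)

    # Counts per million: each sample is rescaled to the same total.
    cpm = counts / lib_sizes * 1e6
    return np.log2(cpm + pseudocount)
```

With this scaling, a sample sequenced twice as deeply as another gives the same log-CPM values if the underlying proportions are identical.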

written 5.4 years ago by Devon Ryan97k

I totally agree.  While the calculation can be done, the utility of doing so remains unknown.

written 5.4 years ago by Sean Davis26k

After reading the literature and the comments, here is my understanding of how to calculate the z-score:

1. Convert the count/RPKM values of each gene into log values.

2. Calculate the mean and standard deviation of the log values of gene X across 20 lung tissue samples (suppose I have data for 20 samples).

3. For the first lung tissue sample: (log value of gene X - mean of the log values across the 20 lung tissues) / standard deviation of the log values across the 20 lung tissues.

4. Now I have the z-score for gene X in the first lung tissue sample. Using the same protocol, I can convert the log values of all genes into z-scores.
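Written out, the steps above amount to the following, where x_{g,s} is the count or RPKM value of gene g in sample s and n is the number of samples (20 in the lung example). The base-2 logarithm, the pseudocount of 1 (so that zero values stay finite), and the use of the sample standard deviation (n - 1 in the denominator) are assumptions added here; the steps themselves do not specify them.

```latex
\begin{align}
y_{g,s}  &= \log_2\left(x_{g,s} + 1\right) \\
\mu_g    &= \frac{1}{n}\sum_{s=1}^{n} y_{g,s} \\
\sigma_g &= \sqrt{\frac{1}{n-1}\sum_{s=1}^{n}\left(y_{g,s} - \mu_g\right)^2} \\
z_{g,s}  &= \frac{y_{g,s} - \mu_g}{\sigma_g}
\end{align}
```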

Is the above protocol correct? Please advise.

Should I calculate the z-score using read counts or RPKM values?

Do these z-scores really have any meaning? These are the z-scores that COSMIC provides:

1337808    TCGA-02-2483-01    SFMBT1      over       2.416    329
1337808    TCGA-02-2483-01    SGCE        normal    -0.274    329

If I calculate the z-score using the above approach, will I be able to tell whether a gene is over-expressed or normally expressed?
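For illustration only, labelling a z-score could look like the Python sketch below. The cutoff of 2 is an assumption: it is consistent with the two COSMIC rows shown (2.416 is "over", -0.274 is "normal"), but the rule COSMIC actually applies is not confirmed in this thread. The function name `classify_regulation` and the "under" and "unknown" labels are hypothetical.

```python
import math

def classify_regulation(z, threshold=2.0):
    """Label a z-score as 'over', 'under', or 'normal' (threshold is assumed)."""
    if math.isnan(z):
        # No z-score could be computed, e.g. a gene with zero variance.
        return "unknown"
    if z > threshold:
        return "over"
    if z < -threshold:
        return "under"
    return "normal"
```

The labels only mean something relative to the samples used for the mean and standard deviation, so a gene called "over" against one reference set may be "normal" against another.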

Please advise how to proceed.


written 5.4 years ago by vyom8430