Hello everyone,
The RNA-seq data from TCGA contain z-scores instead of RPKM values. I want to perform an analysis of TCGA data and GTEx data, but the problem is that the GTEx data contain RPKM and count values. There is a tutorial explaining the z-score calculation (http://dept.stat.lsa.umich.edu/~kshedden/Python-Workshop/gene_expression_comparison.html), but I am not sure whether it works for RNA-seq or whether it is the correct tutorial.
Can someone please guide me on how to calculate gene z-scores from GTEx RNA-seq data (RPKM and count data)? If possible, please also verify whether the code at the above link is correct.
Thanks
Where are you downloading your TCGA data from? The TCGA Data Portal or elsewhere? TCGA provides two types of RNA-seq data. The older pipeline is just called RNAseq, while the newer pipeline is referred to as RNAseqV2. Which one are you using? Both versions provide raw counts, along with RPKM for the old pipeline and RSEM scaled estimate counts for the RNAseqV2 data. These data are not z-scores. Please see this site for more details on RNAseqV2 data: https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2
Thanks for the reply and the link.
I downloaded the data directly from COSMIC. Below is the head of the expression file:
ID_SAMPLE SAMPLE_NAME GENE_NAME REGULATION Z_SCORE ID_STUDY
1337808 TCGA-02-2483-01 SFMBT1 over 2.416 329
1337808 TCGA-02-2483-01 SGCE normal -0.274 329
1337808 TCGA-02-2483-01 RCAN2 normal -0.577 329
1337808 TCGA-02-2483-01 KIAA0895 normal -0.333 329
1337808 TCGA-02-2483-01 LIN7C normal 0.428 329
1337808 TCGA-02-2483-01 NDUFV2 normal 0.790 329
1337808 TCGA-02-2483-01 AKAP14 normal -0.275 329
1337808 TCGA-02-2483-01 ST20 normal 0.808 329
1337808 TCGA-02-2483-01 CLCF1 normal -0.640 329
I think the z-score is easy to understand and explain, so I want to convert the GTEx values (counts or RPKM) into z-scores. Please advise how to convert the GTEx data into z-scores; I have both the RPKM and count data. Below is a sample of the GTEx data:
TargetID Gene_Symbol Chr Coord GTEX-N7MS-0007-SM-2D7W1 GTEX-N7MS-0008-SM-4E3JI GTEX-N7MS-0011-R10A-SM-2HMJK GTEX-N7MS-0011-R11A-SM-2HMJS GTEX-N7MS-0011-R1a-SM-2HMJG GTEX-N7MS-0011-R2a-SM-2HML6
ENST00000390859.1 ENSG00000212161.1 1 159821762 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENST00000290551.4 ENSG00000159388.5 1 203274619 4356.000000 2685.999512 6367.449707 9156.000000 4144.000000 5143.111328
ENST00000475157.1 ENSG00000159388.5 1 203274619 0.000000 18.000416 8.550518 0.000000 0.000000 32.888611
ENST00000429660.1 ENSG00000223683.1 1 63727784 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENST00000473120.1 ENSG00000092850.7 1 36549676 0.000000 0.000000 19.124006 18.559824 0.000000 23.603249
ENST00000207457.3 ENSG00000092850.7 1 36549676 0.000000 0.000000 112.875992 53.440178 207.776093 112.396751
ENST00000469024.1 ENSG00000092850.7 1 36549676 0.000000 0.000000 0.000000 0.000000 52.223907 0.000000
ENST00000489568.1 ENSG00000162461.7 1 16062900 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENST00000294454.5 ENSG00000162461.7 1 16062900 68.000000 132.000000 696.000000 5004.000000 512.000000 312.000000
Thanks
I don't use COSMIC much, so I researched this a little. My thought would be to calculate the z-score the way COSMIC does, but the help file they link to for more information is broken. You may want to contact them about this. I then found this Biostars answer by @David Fredman, "TCGA: What are mRNA expression z-scores? Does TCGA have mRNA expression from controls?", which I think answers your question. It looks like COSMIC uses z-scores so that gene expression is comparable across the various platforms in TCGA; however, it is not clear whether COSMIC calculated them or took them straight from TCGA. I think this is a good way to go, since you are comparing yet another platform. However, note in the response by @David Fredman that TCGA uses the distribution of the normal tissues (sometimes!) as the control distribution. You should be careful comparing these z-scores to your data if you do not have a comparable control to use, because a different control distribution can give you false-positive results. My suggestions would be to 1) contact COSMIC to find out how they calculated these z-scores and 2) if that doesn't work out, download the original TCGA count data and calculate the z-scores yourself, so that you know what is being used as the control distribution.
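To make suggestion 2 concrete, here is a minimal sketch of a per-gene z-score computed against a reference (control) distribution that you choose yourself. This is not COSMIC's or TCGA's actual procedure, which I could not confirm. The function names, the log2 transform, and the pseudocount of 1 are my own assumptions for illustration. The toy values are copied from two rows of the GTEx excerpt in the question, and the same samples are used as their own reference only to keep the example small.

```python
import math
import statistics


def log2_transform(values, pseudocount=1.0):
    """Return log2(x + pseudocount) for each expression value.

    The log transform and the pseudocount are assumptions made for this
    sketch; they are a common choice, not a documented COSMIC/TCGA step.
    """
    return [math.log2(v + pseudocount) for v in values]


def zscores_against_reference(sample_values, reference_values, pseudocount=1.0):
    """Z-score one gene's sample values against a reference distribution.

    z = (x - mean(reference)) / sd(reference), computed on the log2 scale.
    Returns None for every sample when the reference has zero spread
    (for example, a gene with no expression in any reference sample).
    """
    ref = log2_transform(reference_values, pseudocount)
    mean = statistics.mean(ref)
    sd = statistics.stdev(ref)  # sample standard deviation (n - 1)
    if sd == 0:
        return [None] * len(sample_values)
    samples = log2_transform(sample_values, pseudocount)
    return [(x - mean) / sd for x in samples]


# Hypothetical input: identifier -> one value per sample.
expression = {
    "ENSG00000159388.5": [4356.0, 2685.999512, 6367.449707,
                          9156.0, 4144.0, 5143.111328],
    "ENSG00000212161.1": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}

# Here each gene is scored against its own samples; in a real comparison
# the second argument would be the control samples you decided on.
z = {gene: zscores_against_reference(vals, vals)
     for gene, vals in expression.items()}

for gene, scores in z.items():
    print(gene, scores)
```

The important design choice is the second argument: whatever you pass as the reference defines what "over" or "under" expressed means, so z-scores are only comparable between datasets if they were computed against comparable controls.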
Just out of curiosity, are you trying to compare GTEx and TCGA? If so, why not use only TCGA, since it already has data for matched normal-tumor samples?
Never gave it a thought. Thanks!