I want to identify differentially expressed genes in TCGA. My analysis was as follows:
Taking the BRCA dataset as an example, I downloaded the RNA-seq data (RPKM version) from TCGA.
Then I used the samples marked “Primary Tumor” as cancer samples and the samples marked “Solid Tissue Normal” as normal samples.
I calculated a Z-score for each gene in each tumor sample as follows: z = [(value of gene X in tumor Y) - (mean of gene X in normal samples)] / (standard deviation of gene X in normal samples)
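A minimal NumPy sketch of this Z-score step. The function name, the array layout (genes in rows, samples in columns) and the toy values are hypothetical, and using the sample standard deviation (ddof=1) is an assumption, since the formula above does not say which estimator is meant:

```python
import numpy as np

def z_scores_vs_normal(tumor, normal):
    """Z-score of every tumor value relative to the normal samples.

    tumor, normal: 2-D arrays of shape (genes, samples) holding
    expression values (e.g. RPKM) for the same genes in the same order.
    """
    normal_mean = normal.mean(axis=1, keepdims=True)
    # Assumption: sample standard deviation (ddof=1).
    normal_sd = normal.std(axis=1, ddof=1, keepdims=True)
    # A gene with zero variance in normal tissue has no defined Z-score,
    # so its standard deviation is set to NaN instead of dividing by zero.
    normal_sd = np.where(normal_sd == 0, np.nan, normal_sd)
    return (tumor - normal_mean) / normal_sd

# Toy values (hypothetical): 2 genes, 3 normal samples, 2 tumor samples.
normal = np.array([[1.0, 2.0, 3.0],
                   [5.0, 5.0, 5.0]])
tumor = np.array([[4.0, 2.0],
                  [6.0, 5.0]])
z = z_scores_vs_normal(tumor, normal)
```

The second toy gene is constant across the normal samples, so its Z-scores come out as NaN; genes like this need to be filtered or handled explicitly before any thresholding.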
Within a single sample, I define a gene as up-regulated if its Z-score >= 2 and as down-regulated if its Z-score <= -2. If a gene is up-regulated (or down-regulated) by this rule in more than 10% of the samples in the whole dataset, I define it as an up-regulated (or down-regulated) gene in the BRCA dataset.
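A minimal sketch of this calling rule, assuming the Z-scores are stored as a (genes x tumor samples) NumPy array. The function name, parameter names and toy values are hypothetical:

```python
import numpy as np

def call_genes(z, z_cutoff=2.0, min_fraction=0.10):
    """Call dataset-level up/down-regulated genes from per-sample Z-scores.

    z: 2-D array of shape (genes, tumor samples).
    Returns two boolean arrays (up, down), one entry per gene.
    """
    # Fraction of samples in which each gene passes the per-sample cutoff.
    frac_up = np.mean(z >= z_cutoff, axis=1)
    frac_down = np.mean(z <= -z_cutoff, axis=1)
    # "More than 10% of samples" is read as a strict inequality.
    return frac_up > min_fraction, frac_down > min_fraction

# Toy values (hypothetical): 3 genes, 10 tumor samples.
z = np.zeros((3, 10))
z[0, :2] = [2.5, 3.0]    # gene 0: Z >= 2 in 2 of 10 samples
z[1, 0] = -2.5           # gene 1: Z <= -2 in exactly 1 of 10 samples
z[2, :2] = [-3.0, -2.0]  # gene 2: Z <= -2 in 2 of 10 samples
up, down = call_genes(z)
```

With this rule the two calls are made independently, so the same gene can be flagged as both up-regulated and down-regulated if it passes each cutoff in more than 10% of samples.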
I then found that the percentage of up-regulated genes is much higher than the percentage of down-regulated genes. About 20% of protein-coding genes (number of up-regulated protein-coding genes / total number of protein-coding genes) are identified as up-regulated, but only 5% of protein-coding genes are identified as down-regulated.
Is this normal?
Is there something wrong with my workflow?
Can you give me some suggestions?