Question: question about identifying differential expressed genes in TCGA
gravatar for tujuchuanli
7 months ago by
tujuchuanli30 wrote:

I want to identify differential expressed genes in TCGA. I did my anlysis as following way:

  1. Taking BRCA dataset as an example, I downloaded RNA-seq data (RPKM version) from TCGA.

  2. Then I used samples marked by “Primary Tumor” as cancer sample and samples marked by “Solid Tissue Normal” as normal sample.

  3. Calculating Z score for each gene in each sample as following: z = [(value gene X in tumor Y)-(mean gene X in normal)]/(standard deviation X in normal)

  4. Here I define genes with Zscore>=2 or Zscore<=-2 as up-regulated or down-regulated genes in one sample. If more than 10% samples in the whole dataset contain genes with Zscore>=2 or Zscore<=-2, I denfined these genes as up-regulated or down-regualted genes in BRCA dataset.

Then I find that the percentage of up-regualted genes is much higher than down-regulated genes. There are about 20% protein coding genes (number of up-regulated protein coding genes/total number of protein coding genes) which are identified as up-regulated genes but only 5% protein coding genes as down-regulated genes.

Is it normal?

Is there something wrong in my workflow?

Can you give me some sugesstions?


tcga • 367 views
ADD COMMENTlink modified 7 months ago by Shicheng Guo7.4k • written 7 months ago by tujuchuanli30

Don't use RPKM (use raw counts if possible or rsem counts) and don't use Z-score for differential expression analysis (use solid statistical tools such as DESeq2/edgeR).

ADD REPLYlink written 7 months ago by WouterDeCoster36k

@OP, the point with using arbitrary thresholding like you do is that every NGS experiment has something called a mean-variance relationship. That means that genes (or regions or whatever you measure) may have high variation/enrichments because of small counts (here lowly expressed genes). These guys then often come out as differentially expressed if only looking at raw fold changes, but they rarely come out as statistically significant. Therefore, as WDC said, use a proper tool such as DESeq2 or edgeR, start with raw counts and use only genes that are significant.

ADD REPLYlink modified 7 months ago • written 7 months ago by ATpoint13k

Thanks, WouterDeCoster and ATpoint. I will check DESeq2/edgeR.

ADD REPLYlink written 7 months ago by tujuchuanli30

All I can add is that Z-scaling RPKM data does not seem like a great idea to me. Are you sure that it's not FPKM that you have obtained? There is a R package call zFPKM, which claims to be able to convert FPKM data to the Z-scale.

I would much prefer that you obtain the RSEM counts from TCGA (GDC Data Portal), normalise those in DESeq2 or EdgeR, and then Z-scale the regularised log (DESeq2) or lofCPM (EdgeR) counts.

ADD REPLYlink written 7 months ago by Kevin Blighe37k
gravatar for Shicheng Guo
7 months ago by
Shicheng Guo7.4k
Shicheng Guo7.4k wrote:

In the step 4, for certain gene, they will be assigned to 'Up-regulation' and 'Down-regulation' at the same time.

ADD COMMENTlink written 7 months ago by Shicheng Guo7.4k

Thanks, Shicheng Guo. Do you have some suggestions to these genes? I prefer to retain these genes. I consider a gene as up-regulated gene if this gene is up-regulated in more than 10% samples. Even if the same gene with down-regulation in more than 10% sample. Under this condition, I prefer to consider it as down-regulated gene in more than 10% sample. Is it reseasonable?

ADD REPLYlink written 7 months ago by tujuchuanli30
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 872 users visited in the last hour