Question: Identifying differentially expressed genes in TCGA
tujuchuanli wrote (11 months ago):

I want to identify differentially expressed genes in TCGA. I did my analysis in the following way:

  1. Taking the BRCA dataset as an example, I downloaded RNA-seq data (RPKM version) from TCGA.

  2. Then I used samples marked “Primary Tumor” as cancer samples and samples marked “Solid Tissue Normal” as normal samples.

  3. I calculated a Z score for each gene in each tumor sample as follows: z = [(value of gene X in tumor Y) - (mean of gene X in normal samples)] / (standard deviation of gene X in normal samples)

  4. I define a gene as up-regulated or down-regulated in one sample if its Z score is >= 2 or <= -2, respectively. If a gene has a Z score >= 2 (or <= -2) in more than 10% of the samples in the whole dataset, I define it as an up-regulated (or down-regulated) gene in the BRCA dataset.
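For readers who want to follow steps 3 and 4 concretely, here is a minimal sketch in Python with NumPy. It is not the original poster's code: the function name `zscore_calls`, the toy matrices, and the choice of the sample standard deviation (`ddof=1`) are all assumptions made for illustration.

```python
import numpy as np

def zscore_calls(tumor, normal, z_cut=2.0, min_frac=0.10):
    """Steps 3-4 of the workflow above (hypothetical helper).

    tumor, normal: genes x samples expression matrices (e.g. RPKM).
    Returns the Z matrix plus per-gene up/down flags.
    """
    mu = normal.mean(axis=1, keepdims=True)
    # Sample SD of the normals; a gene with zero SD would divide by zero
    # and needs filtering beforehand (not handled in this sketch).
    sd = normal.std(axis=1, ddof=1, keepdims=True)
    z = (tumor - mu) / sd                      # step 3
    frac_up = (z >= z_cut).mean(axis=1)        # fraction of tumors with z >= 2
    frac_down = (z <= -z_cut).mean(axis=1)     # fraction of tumors with z <= -2
    return z, frac_up > min_frac, frac_down > min_frac  # step 4

# Toy data (made up): 3 genes, 4 normal samples, 5 tumor samples
normal = np.array([[10.0, 12.0, 11.0, 9.0],
                   [5.0, 5.5, 4.5, 5.0],
                   [100.0, 110.0, 90.0, 100.0]])
tumor = np.array([[10.0, 11.0, 30.0, 9.0, 12.0],
                  [5.0, 5.0, 5.0, 5.0, 5.0],
                  [60.0, 140.0, 100.0, 100.0, 100.0]])

z, up, down = zscore_calls(tumor, normal)
print(up)    # per-gene "up-regulated in the dataset" flags
print(down)  # per-gene "down-regulated in the dataset" flags
```

In this toy input the third gene is flagged in both directions, because one tumor is far above the normal mean and another is far below it.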

Then I find that the percentage of up-regulated genes is much higher than that of down-regulated genes. About 20% of protein-coding genes (number of up-regulated protein-coding genes / total number of protein-coding genes) are identified as up-regulated, but only 5% as down-regulated.

Is it normal?

Is there something wrong in my workflow?

Can you give me some suggestions?

Thanks


Don't use RPKM (use raw counts if possible, or RSEM counts) and don't use Z-scores for differential expression analysis (use solid statistical tools such as DESeq2/edgeR).

written 11 months ago by WouterDeCoster

@OP, the problem with arbitrary thresholding like yours is that every NGS experiment has something called a mean-variance relationship. That means that genes (or regions, or whatever you measure) may show high variation or apparent enrichment simply because of small counts (here, lowly expressed genes). These guys then often come out as differentially expressed if you only look at raw fold changes, but they rarely come out as statistically significant. Therefore, as WDC said, use a proper tool such as DESeq2 or edgeR, start with raw counts, and keep only genes that are significant.

modified 11 months ago • written 11 months ago by ATpoint
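The mean-variance point in the comment above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: counts are drawn from a Poisson distribution (real RNA-seq data are usually overdispersed, which DESeq2 and edgeR model with a negative binomial), and the gene number, replicate number, mean counts and the 0.5 pseudocount are arbitrary choices. Both groups are drawn from the same mean, so every fold change seen here is noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 2000  # arbitrary number of simulated genes

def null_log2fc(mean_count, n_rep=3):
    """log2 fold change between two groups with NO true difference."""
    a = rng.poisson(mean_count, size=(n_genes, n_rep)).mean(axis=1)
    b = rng.poisson(mean_count, size=(n_genes, n_rep)).mean(axis=1)
    # 0.5 pseudocount (an assumption) avoids taking the log of zero
    return np.log2((a + 0.5) / (b + 0.5))

low = null_log2fc(5)      # lowly expressed genes
high = null_log2fc(500)   # highly expressed genes

# Fraction of genes passing a raw 2-fold cutoff purely by chance
frac_low = (np.abs(low) >= 1).mean()
frac_high = (np.abs(high) >= 1).mean()
print(f"spread of log2FC: low={low.std():.3f}, high={high.std():.3f}")
print(f"|log2FC| >= 1 by chance: low={frac_low:.3f}, high={frac_high:.3f}")
```

The low-count genes show a much wider spread of fold changes than the high-count genes, which is why a fixed fold-change or Z-score cutoff tends to pick them up.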

Thanks, WouterDeCoster and ATpoint. I will check DESeq2/edgeR.

written 11 months ago by tujuchuanli

All I can add is that Z-scaling RPKM data does not seem like a great idea to me. Are you sure that it's not FPKM that you have obtained? There is an R package called zFPKM, which claims to be able to convert FPKM data to the Z-scale.

I would much prefer that you obtain the RSEM counts from TCGA (GDC Data Portal), normalise those in DESeq2 or edgeR, and then Z-scale the regularised log (DESeq2) or logCPM (edgeR) counts.

written 11 months ago by Kevin Blighe
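To make the "normalise, log-transform, then Z-scale" idea above concrete, here is a rough Python sketch. It is not what edgeR or DESeq2 actually compute: there is no TMM or size-factor normalisation, the prior count is added in a simplified way, and the names `log_cpm` and `row_zscale` and the toy counts are assumptions for illustration only. In practice the packages' own functions should be used.

```python
import numpy as np

def log_cpm(counts, prior_count=2.0):
    """Simplified log2 counts-per-million (hypothetical helper).

    counts: genes x samples matrix of raw counts.
    The prior count keeps zero counts finite after the log.
    """
    lib_size = counts.sum(axis=0, keepdims=True)  # per-sample library size
    return np.log2((counts + prior_count) / (lib_size + 2 * prior_count) * 1e6)

def row_zscale(x):
    """Z-scale each gene (row) across all samples."""
    mu = x.mean(axis=1, keepdims=True)
    sd = x.std(axis=1, ddof=1, keepdims=True)  # rows with zero SD must be filtered first
    return (x - mu) / sd

# Toy raw counts (made up): 3 genes x 4 samples
counts = np.array([[10.0, 20.0, 15.0, 40.0],
                   [500.0, 480.0, 510.0, 900.0],
                   [0.0, 3.0, 1.0, 2.0]])

z = row_zscale(log_cpm(counts))
print(np.round(z, 2))
```

Z-scaling on the log scale, after accounting for library size, makes samples comparable, but it is still meant for visualisation or clustering rather than as a replacement for a differential expression test.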

Shicheng Guo wrote (11 months ago):

In step 4, certain genes will be assigned to both 'Up-regulation' and 'Down-regulation' at the same time.


Thanks, Shicheng Guo. Do you have any suggestions for these genes? I would prefer to retain them. I consider a gene up-regulated if it is up-regulated in more than 10% of samples, even if the same gene is also down-regulated in more than 10% of samples. In that case, I would also consider it a down-regulated gene. Is that reasonable?

written 11 months ago by tujuchuanli
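One way to keep such genes while making the overlap explicit is to give them their own label instead of counting them silently in both lists. The sketch below assumes per-gene fractions of samples with Z >= 2 and Z <= -2 have already been computed; the function name `label_genes`, the label strings and the example fractions are all hypothetical.

```python
import numpy as np

def label_genes(frac_up, frac_down, min_frac=0.10):
    """Assign one label per gene from the fractions of samples
    in which it is called up or down (hypothetical helper)."""
    up = np.asarray(frac_up) > min_frac
    down = np.asarray(frac_down) > min_frac
    labels = np.full(up.shape, "unchanged", dtype=object)
    labels[up & ~down] = "up"
    labels[~up & down] = "down"
    labels[up & down] = "both"   # passes the 10% rule in both directions
    return labels

# Made-up fractions for four genes
frac_up = [0.25, 0.02, 0.15, 0.00]
frac_down = [0.01, 0.30, 0.12, 0.05]

labels = label_genes(frac_up, frac_down)
print(labels.tolist())
```

With a separate "both" category, the reported percentages of up- and down-regulated genes can be given with or without the overlapping genes.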