Question

How to define whether a gene is expressed in a TCGA dataset

5.9 years ago

tujuchuanli ▴ 100

Hi,

I have two groups of genes, and I want to know what percentage of the genes in group A and in group B are expressed in the TCGA breast cancer dataset. For example, suppose there are 100 genes in group A and 200 genes in group B. If 45 genes in A and 100 genes in B turn out to be expressed, the result would be 45% vs 50%.

The core question is how to define whether a gene is expressed in a TCGA dataset. I define a gene as expressed if its median RPKM value is >= 0.1 and its expression is 0 in fewer than one fourth of patients; otherwise it is not expressed. This cutoff comes from a paper (https://peerj.com/articles/1499/), in which the authors used it to decide which genes to carry forward into survival analysis.
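To make the rule concrete, here is a minimal sketch of the filter and the per-group percentage. It assumes RPKM values are already available as one list per gene (one value per patient); the function names, gene identifiers, and toy values are hypothetical and only illustrate the cutoff described above.

```python
from statistics import median

def is_expressed(rpkm_values, min_median=0.1, max_zero_fraction=0.25):
    """Apply the cutoff described above to one gene.

    A gene passes if its median RPKM is >= min_median and the fraction
    of patients with exactly 0 expression is below max_zero_fraction.
    """
    zero_fraction = sum(1 for v in rpkm_values if v == 0) / len(rpkm_values)
    return median(rpkm_values) >= min_median and zero_fraction < max_zero_fraction

def percent_expressed(gene_ids, rpkm_by_gene):
    """Percentage of genes in gene_ids that pass is_expressed.

    Assumes every gene in gene_ids is present in rpkm_by_gene.
    """
    n_expressed = sum(is_expressed(rpkm_by_gene[g]) for g in gene_ids)
    return 100.0 * n_expressed / len(gene_ids)

# Toy data, made up for illustration: five patients per gene.
rpkm_by_gene = {
    "geneA1": [0.0, 0.5, 1.2, 3.0, 2.0],       # passes both conditions
    "geneA2": [0.0, 0.0, 4.0, 5.0, 6.0],       # too many zero-expression patients
    "geneB1": [0.05, 0.02, 0.08, 0.01, 0.03],  # median below 0.1
    "geneB2": [2.0, 3.0, 1.0, 4.0, 5.0],       # passes both conditions
    "geneB3": [0.2, 0.3, 0.1, 0.4, 0.5],       # passes both conditions
}
group_a = ["geneA1", "geneA2"]
group_b = ["geneB1", "geneB2", "geneB3"]

print(percent_expressed(group_a, rpkm_by_gene))
print(percent_expressed(group_b, rpkm_by_gene))
```

Both thresholds are keyword arguments, so a different cutoff can be tried without changing the rest of the calculation.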

My question is whether this approach is suitable. Do you have a better method for defining expression? It would be best if you could provide some references.

Thanks!

gene expression TCGA • 1.7k views

score 1 · Answer 1 · 2018-06-11

There are different ways of viewing what is expressed and what is not. Transcription in the cell is 'pervasive' and is constantly occurring, even in regions that we do not know to have any function. Transcription factors bind wherever there is accessible chromatin and where there is an electromagnetic potential to bind, mediated via different motifs in the DNA sequence of the accessible chromatin. Most transcripts are in fact non-coding, as you probably know. Most transcripts are also expressed at very low levels, but they are nevertheless still expressed.

Do you want to simply gauge anything that is expressed or do you want to gauge things that are more expressed in one group over another?

After you normalise your data and filter for missingness, you can more or less assume that everything that has a value has exhibited some form of expression. If you have FPKM or RPKM, setting your threshold at 10 is a reasonable idea. Nobody can really argue against 10 as a cut-off; neither could one argue with 5, or 15.

If you instead want to determine expression in a particular group, first transform your data to the Z-scale and then choose those genes whose absolute Z-scores are greater than 2 or 3.
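As a rough sketch of the Z-scale idea, the snippet below standardises one gene's values across all samples and reports the samples whose absolute Z-score exceeds a threshold. This is only one possible reading of the suggestion; the function names and toy values are hypothetical, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardise one gene's expression values across samples."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD; an assumption for this sketch
    if sigma == 0:
        # A gene with constant expression has no meaningful Z-scores.
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]

def samples_beyond_threshold(values, threshold=2.0):
    """Indices of samples whose absolute Z-score exceeds the threshold."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > threshold]

# Toy data, made up for illustration: nine low samples and one high sample.
gene_values = [1.0] * 9 + [10.0]
print(samples_beyond_threshold(gene_values))
```

Genes could then be kept for a group if enough of that group's samples appear in the returned indices; how many samples count as "enough" is a separate choice that the thread does not specify.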

cBioPortal most likely can already do what you want.

Kevin