I have two groups of genes, A and B. I want to know what percentage of the genes in each group is expressed in the TCGA breast cancer dataset. For example, if group A has 100 genes and group B has 200 genes, and 45 and 100 of them respectively are expressed, the result would be 45% vs 50%.
The core question is how to decide whether a gene is expressed in the TCGA dataset. I call a gene expressed if its median RPKM across patients is >= 0.1 and it has zero expression in fewer than one fourth of patients; otherwise I call it not expressed. This cutoff comes from a paper (https://peerj.com/articles/1499/), in which the authors used it to decide which genes to carry forward into survival analysis.
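To make the rule concrete, here is a minimal Python sketch of how it could be applied. The gene names, RPKM values, and the function names `is_expressed` and `percent_expressed` are all made up for illustration, and treating a gene that is missing from the expression table as not expressed is my own assumption, not something taken from the paper.

```python
from statistics import median

def is_expressed(rpkm_values, min_median=0.1, max_zero_fraction=0.25):
    """Apply both cutoffs to one gene's RPKM values across patients."""
    zero_fraction = sum(1 for v in rpkm_values if v == 0) / len(rpkm_values)
    # Expressed only if the median is high enough AND zeros are rare enough.
    return median(rpkm_values) >= min_median and zero_fraction < max_zero_fraction

def percent_expressed(gene_group, rpkm_by_gene):
    """Percentage of genes in gene_group that pass is_expressed.

    Assumption: a gene absent from rpkm_by_gene counts as not expressed.
    """
    n_expressed = sum(
        1 for g in gene_group
        if g in rpkm_by_gene and is_expressed(rpkm_by_gene[g])
    )
    return 100.0 * n_expressed / len(gene_group)

# Toy data: hypothetical genes, five hypothetical patients each.
rpkm_by_gene = {
    "GENE1": [5.2, 3.1, 0.0, 8.4, 2.0],       # passes both cutoffs
    "GENE2": [0.0, 0.0, 0.2, 0.3, 0.0],       # too many zeros
    "GENE3": [1.0, 2.0, 3.0, 4.0, 5.0],       # passes both cutoffs
    "GENE4": [0.05, 0.02, 0.08, 0.01, 0.03],  # median below 0.1
    "GENE5": [0.4, 0.6, 0.9, 1.2, 0.7],       # passes both cutoffs
    "GENE6": [2.5, 0.0, 1.1, 3.3, 4.0],       # passes both cutoffs
}
group_a = ["GENE1", "GENE2"]
group_b = ["GENE3", "GENE4", "GENE5", "GENE6"]

pct_a = percent_expressed(group_a, rpkm_by_gene)
pct_b = percent_expressed(group_b, rpkm_by_gene)
print(f"group A: {pct_a:.1f}% expressed, group B: {pct_b:.1f}% expressed")
```

Both thresholds are keyword arguments, so it is easy to check how sensitive the two percentages are to the choice of cutoff.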
My question is whether this approach is suitable. Is there a better way to define an expressed gene? References would be much appreciated.