I am working on differential gene expression in an Affy dataset (2 different cancers). I’ve done the analysis using the limma package, with multiple testing correction across all probes (over 50k). No problem there.

My question: I want to use around 100 genes (from a certain pathway) to cluster the cancers. Can I just pull the expression values for those genes from the Affy set and do the DE analysis on them, OR should I first run the analysis on all probes and check whether the adj.p.value of those 100 genes is significant? What if only some of them have an adj.p.value under 0.05?

Ok, thx! I think I was a little unclear in my question. I have an Affy dataset (HG-U133 Plus 2.0) from GEO with 20 samples (10 for each cancer type). I use normalized data, and I have also mapped the probes to gene IDs.

What I want to know: is there a statistical pitfall in pulling e.g. 100 genes from the full set and doing the DE analysis using only those genes (if, e.g., some of those 100 genes would not reach adj.p.value < 0.05 when the DE analysis is done on the full set)?

That may not be acceptable. With 10 replicates per group, a real signal should survive the p-value adjustment anyway. You can always relax the p.adj threshold (e.g. use 0.1 instead of 0.05) and then try to show other evidence that the genes which do not reach p.adj < 0.05 still matter and that you simply lack power.
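To make the concern concrete, here is a minimal sketch of why the adjusted p-value of a gene depends on how many tests it is adjusted with. It uses a plain Benjamini-Hochberg adjustment written from scratch; it is not limma's code, and all the p-values are made up for illustration (five hypothetical pathway genes plus a hypothetical background of 995 probes with no signal).

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five pathway genes (illustrative only).
pathway_p = [0.0004, 0.003, 0.02, 0.03, 0.6]
# Hypothetical background: 995 other probes spread evenly over [0.05, 1].
background_p = [0.05 + 0.95 * i / 994 for i in range(995)]

# Adjust the pathway genes on their own (5 tests) ...
adj_subset = bh_adjust(pathway_p)
# ... and as part of the full set (1000 tests), keeping the pathway entries.
adj_full = bh_adjust(pathway_p + background_p)[:len(pathway_p)]

for raw, sub, full in zip(pathway_p, adj_subset, adj_full):
    print(f"raw={raw:<7} subset-adjusted={sub:.4f} full-adjusted={full:.4f}")
```

The same raw p-values look much better when adjusted over 5 tests than over 1000. Whether the smaller adjustment is defensible depends on the gene list having been fixed before looking at the results; picking the genes after seeing the full analysis would make the subset-level numbers misleading.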

Clustering does not have anything to do with the DE analysis. You can use the normalized expression values, but they need to be scaled first so that each of your roughly 100 genes has the same weight in the clustering. Of course, you can argue that a gene which is not differentially expressed across samples would not have much effect on the clustering anyway. In that sense, I would expect more or less the same patterns whether you cluster with the full set of roughly 100 genes or with only the statistically significant differentially expressed ones.
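As a rough illustration of the scaling step, the sketch below z-scores each gene across samples (mean 0, standard deviation 1) and compares the distance between two samples before and after scaling. The gene names and expression values are invented, and a real analysis would hand the scaled matrix to a proper clustering routine; this only shows why an unscaled high-variance gene dominates the distances.

```python
from statistics import mean, stdev


def zscore_rows(matrix):
    """Scale each gene (row) to mean 0 and sample standard deviation 1."""
    scaled = []
    for row in matrix:
        mu, sd = mean(row), stdev(row)
        # A gene with no variation carries no information; map it to zeros.
        scaled.append([(x - mu) / sd if sd > 0 else 0.0 for x in row])
    return scaled


def sample_distance(matrix, i, j):
    """Euclidean distance between sample columns i and j."""
    return sum((row[i] - row[j]) ** 2 for row in matrix) ** 0.5


# Hypothetical log2 expression, genes in rows, six samples in columns
# (three per cancer type). Values are illustrative only.
expr = [
    [12.0, 12.4, 11.8, 8.1, 8.5, 7.9],  # GENE_A: large difference
    [5.0, 5.1, 4.9, 5.6, 5.7, 5.5],     # GENE_B: small but consistent difference
]

scaled = zscore_rows(expr)

print("unscaled distance, sample 0 vs 3:", sample_distance(expr, 0, 3))
print("scaled distance,   sample 0 vs 3:", sample_distance(scaled, 0, 3))
```

Before scaling, the distance is driven almost entirely by GENE_A; after scaling, both genes contribute on the same footing.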

As I understand it, you want to show that a pathway is differentially regulated between the cancers, and that the genes of that pathway alone are enough to classify the cancer types.

First you need to show that the differentially expressed genes are enriched in that pathway (using a hypergeometric test or GSEA). Then take the pathway genes that are differentially expressed and show that you can classify the cancers using just those genes. The pathway genes that are not even marginally differentially expressed (let’s say at a p-value of 0.05) do not contribute to the clustering in any case.
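For the hypergeometric option, a minimal sketch of the calculation looks like this. It computes the upper-tail probability of seeing at least k pathway genes among the DE genes by chance. The function name and all the counts in the example are hypothetical placeholders, not values from this dataset, and GSEA would be a different procedure altogether.

```python
from math import comb


def hypergeom_upper_tail(k, pathway_size, n_de, n_total):
    """P(X >= k), where X is the number of pathway genes among n_de DE genes
    drawn without replacement from n_total genes, pathway_size of which
    belong to the pathway."""
    total = comb(n_total, n_de)
    hits = sum(
        comb(pathway_size, i) * comb(n_total - pathway_size, n_de - i)
        for i in range(k, min(pathway_size, n_de) + 1)
    )
    return hits / total


# Hypothetical counts (illustrative only):
n_total = 20000      # genes tested
pathway_size = 100   # genes in the pathway
n_de = 1000          # genes called differentially expressed
k = 15               # pathway genes among the DE genes

# By chance we would expect about pathway_size * n_de / n_total pathway hits.
print("expected by chance:", pathway_size * n_de / n_total)
print("enrichment p-value:", hypergeom_upper_tail(k, pathway_size, n_de, n_total))
```

A small p-value here says the pathway contains more DE genes than a random gene set of the same size would, which is the claim to establish before using the pathway genes for classification.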
