Question: Statistical test for enrichment of a list of pathways after clustering them
2.5 years ago, S_B_P wrote:

I have ~10 biological samples with microarray expression data for ~20,000 probes. However, I am interested only in biological functions related to ~20 metabolic pathways (as defined in KEGG), so I selected only the genes (~1500) annotated to these ~20 pathways. A PCA on these ~1500 genes showed very good clustering of the samples along PC1, PC2 and PC3. I then did a feature extraction and selected the ~500 genes that contributed most to PC1 to PC3. When I plot a heatmap of the expression of these ~500 genes, the genes fall into 4 neat clusters that fit the biological categories of my samples.

Now I would like to see whether each of these gene clusters is enriched in any of the ~20 KEGG pathways I had already selected. When I ran GO or pathway enrichment with tools such as g:Profiler, DAVID and GOrilla, using the genes in a cluster as the input list and the ~500 genes I used for clustering as the background list, I got overly broad results like "metabolism", which is not useful.

Can I test each cluster for enrichment in each pathway, one by one? For example, if cluster 1 has 100 genes and 20 of them belong to Pathway A, and 100 of the 500 genes I originally used for clustering belong to Pathway A, I can conclude that there is no enrichment for Pathway A in cluster 1 (20% in both the cluster and the background). What would be the appropriate statistical test and multiple-testing correction? Can I do a simple Fisher's exact test or hypergeometric test for enrichment and then correct for multiple testing? Can you please point me to resources for doing this in R?


You should change the type of the post. People assume you have created a tool once they read your post title. Please convert "tool" into "Question".

Comment written 2.5 years ago by arta

Thanks for noticing this, I've adapted the post type.

Reply written 2.5 years ago by WouterDeCoster

2.5 years ago, arta wrote:

You can use a hypergeometric test to find significantly enriched pathways. R already has this built in (for example phyper() and fisher.test() in the stats package). I assume you will run around 80 tests (4 clusters × ~20 KEGG pathways), so I would correct the p-values for multiple testing. You can use the p.adjust function in R.
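A minimal sketch of how this could look in R. The gene lists here are made-up toy data, the helper name enrich_p is hypothetical, and Benjamini-Hochberg is only one possible choice for the method argument of p.adjust; substitute your own clusters, pathway annotations and preferred correction.

```r
# Hypothetical toy inputs: replace with your own gene IDs.
set.seed(1)
background <- paste0("gene", 1:500)                  # the ~500 clustered genes
clusters <- split(background, sample(1:4, 500, replace = TRUE))  # 4 toy clusters
pathways <- list(                                    # pathway -> member genes
  pathwayA = sample(background, 100),
  pathwayB = sample(background, 60)
)

# One-sided hypergeometric test: P(overlap >= observed overlap).
enrich_p <- function(cluster_genes, pathway_genes, background) {
  pathway_genes <- intersect(pathway_genes, background)
  k <- length(intersect(cluster_genes, pathway_genes))  # observed overlap
  m <- length(pathway_genes)                            # pathway genes in background
  n <- length(background) - m                           # remaining background genes
  phyper(k - 1, m, n, length(cluster_genes), lower.tail = FALSE)
}

# One test per cluster/pathway pair, then adjust all p-values together.
res <- expand.grid(cluster = names(clusters), pathway = names(pathways),
                   stringsAsFactors = FALSE)
res$p <- mapply(function(cl, pw) enrich_p(clusters[[cl]], pathways[[pw]], background),
                res$cluster, res$pathway, USE.NAMES = FALSE)
res$p_adj <- p.adjust(res$p, method = "BH")  # BH is an assumption; other methods exist

# The numbers from the question: 20 of 100 cluster genes in Pathway A,
# 100 of 500 background genes in Pathway A.
p_example <- phyper(20 - 1, 100, 400, 100, lower.tail = FALSE)

# The same test written as a one-sided Fisher's exact test on a 2x2 table.
p_fisher <- fisher.test(matrix(c(20, 80, 80, 320), nrow = 2),
                        alternative = "greater")$p.value
```

With the question's numbers the observed overlap equals the expected overlap, so p_example is far from significant, and the one-sided Fisher's exact test gives the same p-value as phyper.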

Besides the hypergeometric test, I would also compute an over-representation fold change (observed count divided by expected count).

The ‘expected value’ is the number of genes that would be expected to be present in the test list for a particular PANTHER category on the basis of the reference list. For example, out of a total of 20,000 genes in the human genome, 440 map to the GO term ‘induction of apoptosis’. Therefore, 2.2% (440 divided by 20,000) of the genes in the reference list are involved in the induction of apoptosis. If a test list that contains 500 genes is uploaded to PANTHER, after analysis, 11 genes (500 multiplied by 2.2%) would be expected to be involved in induction of apoptosis. But if the test list (here, one of your clusters) actually has 22 such genes, dividing 22 by 11 gives a 2-fold change. For convenience, I would take the log2 of this ratio, log2(22/11) = log2(2) = 1. Example taken from here.
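The arithmetic above can be sketched in R as follows. The function name fold_enrichment and its argument names are hypothetical, chosen only for illustration; the numbers are the ones from the quoted example.

```r
# Fold enrichment = observed count / expected count, where the expected count
# is the test-list size times the category's share of the reference list.
fold_enrichment <- function(observed, list_size, category_size, reference_size) {
  expected <- list_size * category_size / reference_size
  observed / expected
}

# Quoted example: 440 of 20,000 reference genes in the category,
# a test list of 500 genes, 22 of which fall in the category.
expected <- 500 * 440 / 20000                 # 11 genes expected
fe <- fold_enrichment(22, 500, 440, 20000)    # 22 / 11
log2_fe <- log2(fe)                           # log2 scale: 0 means no enrichment
```

For the clustering setup in the question, list_size would be the cluster size, category_size the number of background genes in the pathway, and reference_size the ~500 background genes.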

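Once I have both p-values and fold changes, I would plot them against each other in 2D (for example, log2 fold change on one axis and -log10 adjusted p-value on the other).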
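One possible way to draw such a plot with base R graphics, as a rough sketch. The data frame below holds hypothetical placeholder values, not real results, and the column names and the 0.05 guide line are assumptions; in practice you would use one row per cluster/pathway pair from your own tests.

```r
# Hypothetical placeholder values for illustration only.
res <- data.frame(
  label   = c("cluster1:pathwayA", "cluster2:pathwayB", "cluster3:pathwayC"),
  log2_fe = c(1.0, 0.2, -0.5),     # log2 fold enrichment
  p_adj   = c(0.001, 0.40, 0.80)   # multiple-testing adjusted p-values
)
res$neg_log10_p <- -log10(res$p_adj)

# x axis: effect size; y axis: significance (higher means smaller p-value).
plot(res$log2_fe, res$neg_log10_p, pch = 19,
     xlab = "log2 fold enrichment", ylab = "-log10 adjusted p-value")
abline(h = -log10(0.05), lty = 2)  # guide line at adjusted p = 0.05 (an assumed cutoff)
abline(v = 0, lty = 3)             # log2 fold change of 0 means no enrichment
text(res$log2_fe, res$neg_log10_p, labels = res$label, pos = 3, cex = 0.7)
```

Points in the upper right are the cluster/pathway pairs that are both strongly and significantly over-represented.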
