Question

Statistical test for enrichment of a list of pathways after clustering them

0

6.3 years ago

S_B_P • 0

I have ~10 biological samples with microarray expression data for ~20000 probes. However, I am only interested in the biological functions related to ~20 metabolic pathways (as defined in KEGG), so I selected only the genes (~1500) annotated to these ~20 pathways. A PCA on these ~1500 genes showed very good clustering of the samples along PC1, PC2 and PC3. I then did a feature extraction and selected the ~500 genes that contributed most to PC1 to PC3. When I plot a heatmap of the expression of these ~500 genes, the genes fall into 4 neat clusters that fit the biological categories of my samples.

Now I would like to see whether each of these gene clusters is enriched in any of the ~20 KEGG pathways I had already selected. When I run a GO or pathway enrichment with software such as g:Profiler, DAVID or GOrilla, using the genes in a cluster as the input list and the ~500 genes I used for clustering as the background list, I get results that are too broad, like "metabolism", which is not useful.

Can I look for enrichment of each cluster in each pathway one by one? For example, if cluster 1 has 100 genes and 20 of them belong to Pathway A, and 100 of the 500 genes I originally used for clustering belong to Pathway A, then the proportion is 20% in both and I can conclude that cluster 1 is not enriched for Pathway A. What would be the appropriate statistical test and multiple-testing correction? Can I do a simple Fisher's exact test or hypergeometric test to look for enrichment and then correct for multiple testing? Can you please point me to resources for doing this in R?

R gene enrichment • 2.4k views

ADD COMMENT • link updated 6.3 years ago by WouterDeCoster 47k • written 6.3 years ago by S_B_P • 0

1

You should change the type of the post. People assume you have created a tool once they read your post title. Please convert "tool" into "Question".

ADD REPLY • link 6.3 years ago by arta ▴ 670

1

Thanks for noticing this, I've adapted the post type.

ADD REPLY • link 6.3 years ago by WouterDeCoster 47k

score 2 · Answer 1 · 2018-01-02

You can apply a hypergeometric test to find the significantly enriched pathways. This is already built into R (the phyper function in the stats package). I assume you will run around 80 tests (4 clusters * ~20 KEGG pathways), so I would correct the p-values for multiple testing. You can use the p.adjust function in R for that.
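A minimal sketch of how that could look in R, using the numbers from the question (a cluster of 100 genes with 20 in Pathway A, and a background of 500 genes with 100 in Pathway A). The helper name enrich_p and the extra overlap counts are made up for illustration, and Benjamini-Hochberg is just one possible choice for p.adjust:

```r
# Hypothetical helper (the name is illustrative): one-sided hypergeometric
# p-value for seeing at least `overlap` pathway genes in a cluster.
enrich_p <- function(overlap, cluster_size, pathway_size, background_size) {
  # phyper(q, ..., lower.tail = FALSE) returns P(X > q), so pass overlap - 1
  phyper(overlap - 1,
         m = pathway_size,                    # pathway genes in the background
         n = background_size - pathway_size,  # non-pathway genes in the background
         k = cluster_size,                    # genes drawn (the cluster)
         lower.tail = FALSE)
}

# Numbers from the question: 20 of 100 cluster genes, 100 of 500 background genes
p_question <- enrich_p(20, 100, 100, 500)

# The same test written as a one-sided Fisher's exact test on the 2x2 table
tab <- matrix(c(20, 80,    # inside the cluster: in pathway, not in pathway
                80, 320),  # outside the cluster: in pathway, not in pathway
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Illustrative overlaps only (not real data), same cluster and pathway sizes
overlaps <- c(20, 30, 40)
p_raw <- sapply(overlaps, enrich_p, cluster_size = 100,
                pathway_size = 100, background_size = 500)
p_adj <- p.adjust(p_raw, method = "BH")  # Benjamini-Hochberg correction
```

In the real analysis the vector passed to p.adjust would hold all ~80 p-values (every cluster against every pathway), so that the correction accounts for every test you ran.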

Besides the hypergeometric test, I would also compute the fold change (observed over expected count) used in overrepresentation tests.

The ‘expected value’ is the number of genes that would be expected to be present in the test list for a particular PANTHER category on the basis of the reference list. For example, out of a total of 20,000 genes in the human genome, 440 map to the GO term ‘induction of apoptosis’. Therefore, 2.2% (440 divided by 20,000) of the genes in the reference list are involved in the induction of apoptosis. If a test list that contains 500 genes is uploaded to PANTHER, 11 genes (500 multiplied by 2.2%) would be expected to be involved in induction of apoptosis. If one of your clusters instead contains 22 such genes, dividing 22 by 11 gives a 2-fold change. For convenience, I would take the log2 of that ratio (log2(22/11) = 1). Example taken here
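The arithmetic of that example can be reproduced in a few lines of R. The variable names are my own, and the numbers are the ones quoted above:

```r
background_total <- 20000  # genes in the reference list
term_total       <- 440    # reference genes annotated with the GO term
test_total       <- 500    # genes in the test list
observed         <- 22     # term genes actually found in the test list

# Expected count if the test list were a random draw from the reference
expected    <- test_total * term_total / background_total
fold_change <- observed / expected
log2_fc     <- log2(fold_change)
```

For your data, the reference list would be the ~500 genes used for clustering and the test list would be one cluster.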

Once I have both the p-values and the fold changes, I could plot them against each other in 2D.
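One way to draw such a plot with base R is sketched below. The values in the data frame are toy numbers for illustration only, and the 0.05 cutoff line is an assumption; replace both with your own results and threshold:

```r
# Toy values for illustration only (not real results)
res <- data.frame(
  log2_fc = c(-0.5, 0.2, 1.0, 1.8),     # log2 fold change per cluster/pathway pair
  p_adj   = c(0.80, 0.40, 0.03, 0.001)  # adjusted p-value for the same pairs
)
res$neg_log10_p <- -log10(res$p_adj)

# Fold change on the x axis, significance on the y axis
plot(res$log2_fc, res$neg_log10_p,
     xlab = "log2 fold change",
     ylab = "-log10 adjusted p-value",
     pch = 19)
abline(h = -log10(0.05), lty = 2)  # dashed line at adjusted p = 0.05
```

Points in the upper right would be the cluster/pathway pairs that are both strongly and significantly enriched.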