Tutorial: Enrichment Analysis, Clustering and Scoring with pathfindR
gravatar for egeulgen
2.1 years ago by
egeulgen980 wrote:

In this tutorial, I'll try to provide a brief overview of the capabilities of our enrichment analysis R package pathfindR. The tool is in CRAN and its introductory vignette can be found here. We also have a detailed wiki.

This tool is designed to improve enrichment analysis by firstly identifying active subnetworks in differential expression/methylation data using a protein-protein interaction network. It then performs over-representation analysis using the identified active subnetworks. By utilizing the interaction information, the tool identifies numerous gene sets relevant to the condition under study.

I'll be using the example data provided in the package: a gene expression data comparing rheumatoid arthritis patients vs. controls.

Active-Subnetwork-Oriented Pathway Enrichment Analysis

As illustrated below, this workflow takes in a data frame of 3 columns containing: Gene Symbols, Change Values (optional) and p values. It then maps these genes onto a protein-protein interaction network and identifies and filters active subnetworks. Using the genes in each active subnetwork, it performs enrichment analyses. The results are returned as a data frame as well as presented in an HTML report.

pathfindR enrichment overview

We'll first load the library and the example data:


  Gene.symbol    logFC  adj.P.Val
1     FAM110A -0.69394 3.4087e-06
2      RNASE2  1.35350 1.0085e-05
3      S100A8  1.54483 3.4664e-05
4      S100A9  1.02809 2.2626e-04
5      TEX261 -0.32360 2.2626e-04
6    ARHGAP17 -0.69193 2.7081e-04

Then we simply run the wrapper function run_pathfindR for active-subnetwork oriented pathway enrichment analysis:

RA_output <- run_pathfindR(RA_input)

By default the gene sets used for enrichment analysis is KEGG pathways. The available gene sets are "KEGG", "Reactome", "BioCarta", "GO-All", "GO-BP", "GO-CC", "GO-MF" and "mmu_KEGG". The gene sets can be specified using the gene_sets argument. Detailed information on all the arguments of run_pathfindR is provided here.

The output of run_pathfindR() is a data frame containing 8 (or 9) columns:

  • __ID:__ ID of the enriched term
  • __Term_Description:__ Description of the enriched term
  • __Fold_Enrichment:__ Fold enrichment value for the enriched term (Calculated using ONLY the input genes)
  • __occurrence:__ the number of iterations that the given term was found to enriched over all iterations
  • __lowest_p:__ the lowest adjusted-p value of the given term over all iterations
  • __highest_p:__ the highest adjusted-p value of the given term over all iterations
  • __non_Signif_Snw_Genes (OPTIONAL):__ the non-significant active subnetwork genes, comma-separated
  • __Up_regulated:__ the up-regulated genes (as determined by ‘change value' > 0, if the 'change column' was provided) in the input involved in the given term’s gene set, comma-separated. If change column not provided, all affected are listed here.
  • __Down_regulated:__ the down-regulated genes (as determined by ‘change value' < 0, if the 'change column' was provided) in the input involved in the given term’s gene set, comma-separated

By default, the function also plots a bubble chart of enrichment results. The example bubble chart of enrichment results for RA_output is presented below:

bubble chart of enrichment results

The function also creates an HTML report named results.html under the directory specified by the argument output_dir (default = 'pathfindR_Results'). The report contains a table of the active subnetwork-oriented pathway enrichment results. This table contains the same information as the returned data frame. If KEGG pathways were chosen, each enriched pathway name is linked to the visualization of that KEGG pathway with the gene nodes colored according to log-fold-change values. The colored human KEGG pathway diagrams are created using the input genes. If another gene set is being used, the graph visualization of interactions of pathway genes (in the PIN) are plotted instead.

Enrichment Term Clustering

Enrichment analysis usually yields a large number of related terms. In order to establish representative terms among similar groups of terms, we provide functionality to cluster the enriched terms, as outlined below:

pathfindR clustering overview

The wrapper function cluster_enriched_terms() is used for clustering the enriched terms. This function first calculates the pairwise kappa statistics between the terms in the data frame obtained from run_pathfindR(). It then clusters the terms and partitions them into relevant clusters. As can be seen in the diagram, there are two options for clustering: (1) hierarchical clustering and (2) fuzzy clustering.

Detailed information on the clustering functionality can be found here.

Hierarchical Clustering

By default, cluster_enriched_terms() performs hierarchical clustering (using 1 - kappa statistic as the distance metric), determines the optimal number of clusters by maximizing the average silhouette width and returns a data frame with cluster assignments. The representative terms are chosen as the ones with the lowest p value in each cluster.

RA_clustered <- cluster_enriched_terms(RA_output)
head(RA_clustered, 2)
        ID                        Term_Description Fold_Enrichment occurrence   lowest_p  highest_p Up_regulated
4 hsa04722 Neurotrophin signaling pathway          24.849          4 5.0626e-05 0.00066481             
7 hsa04218            Cellular senescence          20.060          2 1.6810e-03 0.00168099             
                                                Down_regulated Cluster         Status
4 CRKL, FASLG, SH2B3, ABL1, MAGED1, IRAK2, IKBKB, CALM1, CALM3       1 Representative
7       MTOR, ETS1, RBL2, CALM1, CALM3, NFATC3, SLC25A5, VDAC1       1         Member

An enrichment bubble chart grouped by clusters can be plotted via:

enrichment_chart(RA_clustered, plot_by_cluster = TRUE)

bubble chart by clusters

Heuristic Fuzzy Multiple-linkage Partitioning

Alternatively, the fuzzy clustering method (as described in Huang DW, Sherman BT, Tan Q, et al. The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 2007;8(9):R183.) can be used to obtain fuzzy cluster assignments and representative terms per each cluster:

RA_clustered <- cluster_enriched_terms(RA_output, method = "fuzzy")

Aggregated Term Scores per Sample

The function score_terms() can be used to calculate the agglomerated z score of each enriched term per sample. This allows the user to individually examine the scores and infer how a term is overall altered (activated or repressed) in a given sample or a group of samples. The function uses:

  1. The output data frame obtain of run_pathfindR(),
  2. The expression matrix used during analysis (differential expression/methylation),
  3. (Optional for ordering samples) a vector of "case" IDs.

Example usage is presented below:

## Vector of "Case" IDs cases <- c("GSM389703", "GSM389704", "GSM389706", "GSM389708", 
           "GSM389711", "GSM389714", "GSM389716", "GSM389717", 
           "GSM389719", "GSM389721", "GSM389722", "GSM389724", 
           "GSM389726", "GSM389727", "GSM389730", "GSM389731", 
           "GSM389733", "GSM389735")

## Calculate scores for representative terms 
## and plot heat map using term descriptions
score_matrix <- score_terms(enrichment_table = RA_clustered[RA_clustered$Status == "Representative", ],
                            exp_mat = RA_exp_mat,
                            cases = cases,
                            use_description = TRUE, # default FALSE
                            label_samples = FALSE) # default = TRUE

Pathway scoring

pathfindR Analysis with Custom Gene Sets

As of the latest version, pathfindR offers utility functions for obtaining organism-specific PIN data and organism-specific gene sets data via get_pin_file() and get_gene_sets_list(), respectively.

It is possible to use run_pathfindR() with custom gene sets (including gene sets for non-Homo-sapiens species). Here, we provide an example application of active-subnetwork-oriented enrichment analysis of the target genes of two transcription factors.

We first load and prepare the gene sets:

## CREB target genes
CREB_target_genes <- normalizePath(system.file("extdata/CREB.txt", package = "pathfindR"))
CREB_target_genes <- readLines(CREB_target_genes)[-c(1, 2)] # skip the first two lines

## MYC target genes
MYC_target_genes <- normalizePath(system.file("extdata/MYC.txt", package = "pathfindR"))
MYC_target_genes <- readLines(MYC_target_genes)[-c(1, 2)] # skip the first two lines

## Prep for use
custom_genes <- list(TF1 = CREB_target_genes, TF2 = MYC_target_genes)
custom_descriptions <- c(TF1 = "CREB target genes", TF2 = "MYC target genes")

We next prepare the example input data frame. Because of the way we choose genes, we expect significant enrichment for MYC targets (40 MYC target genes + 10 CREB target genes). Because this is only an example, we also assign each genes random p-values between 0.001 and 0.05.


## Select 40 random genes from MYC gene sets and 10 from CREB gene sets
selected_genes <- sample(MYC_target_genes, 40)
selected_genes <- c(selected_genes, 
                    sample(CREB_target_genes, 10))

## Assign random p value between 0.001 and 0.05 for each selected gene
rand_p_vals <- sample(seq(0.001, 0.05, length.out = 5),
                      size = length(selected_genes),
                      replace = TRUE)

input_df <- data.frame(Gene_symbol = selected_genes,
                       p_val = rand_p_vals)

Finally, we perform active-subnetwork-oriented enrichment analysis via run_pathfindR() using the custom genes as the gene sets:

custom_result <- run_pathfindR(input_df,
                               gene_sets = "Custom",
                               custom_genes = custom_genes,
                               custom_descriptions = custom_descriptions,
                               max_gset_size = Inf, # DO NOT LIMIT GENE SET SIZE
                               iterations = 1, # for demo, setting number of iterations to 1
                               output_dir = "misc/v1.4/CREB_MYC")

ID  Term_Description Fold_Enrichment occurrence   lowest_p  highest_p
1 TF2  MYC target genes          15.133          1 9.3559e-14 9.3559e-14
2 TF1 CREB target genes          17.346          1 5.0187e-05 5.0187e-05
2                                                                                                                                                                                                  BRAF, DIO2, ELAVL1, EPB41, FAM65A, FOXD3, NEUROD6, NOC4L, NUPL2, PPP1R15A, SYNGR3, TIPRL

Closing remarks

In this tutorial I tried to provide an overview of pathfindR. I try to update this tutorial as certain changes and additions to the package are made. To keep this tutorial as brief as possible, I had to omit certain frequently asked issues (such as pathfindR analysis using non-human data). For such questions and any issues, please visit our website or submit an issue to our issues page on GitHub.

Also check out the vignettes included in the package via browseVignettes("pathfindR") or find them here:

  1. Introduction to pathfindR
  2. Step-by-Step Execution of the pathfindR Enrichment Workflow
  3. pathfindR Analysis for non-Homo-sapiens organisms
  4. Visualization of pathfindR Enrichment Results
  5. Comparing Two pathfindR Results
  6. Obtaining PIN and Gene Sets Data
ADD COMMENTlink modified 8 weeks ago • written 2.1 years ago by egeulgen980

Question, some error code like: "java development kit is needed" pumped up when I was running the run_pathfindR. Wondering how can I fix this? I already installed java on my Mac before!

ADD REPLYlink written 2.1 years ago by shenwei13760

Hi, can you paste the error here? Do you have JDK installed? Actually JRE should have been sufficient but please give it a try if not installed.

ADD REPLYlink written 2.1 years ago by ozanytu0

Hello there! I cannot go past this passage: pws_table <- RA_clustered[RA_clustered$Status == "Representative", ] RA_exp_mat <- data.frame(RA_input$logFC, row.names=RA_input$Gene.symbol) score_matrix <- calculate_pw_scores(RA_clustered, RA_exp_mat)

I set up my RA_exp_mat as in the example, but I think that my expression (which is in log2foldchange) is not cut for the job.. Isn't it? My score_Matrix is full of NAs

ADD REPLYlink written 21 months ago by lucatucciarone50


The expression matrix is the matrix you used to perform the differential expression test. Something like:

        GSM389703 GSM389704 GSM389705
FAM110A    9.8591    10.529   11.3828
RNASE2    10.3410    10.400    9.1157
S100A8    14.9554    15.012   12.9907

For the example data this can be loaded by:

data(RA_exp_mat, package = "pathfindR")

You provided the LFC values only therefore the scores could not be calculated.

ADD REPLYlink written 21 months ago by egeulgen980

I'm sorry but the "data(RA_exp_mat, package = "pathfindR")" is an example data, isn't it?

I'm guessing I should use my own expression data, isn't it? Does pathfinderR creates this expression data? Shall I calculate it myself? Could you happen to tell me how?

ADD REPLYlink written 21 months ago by lucatucciarone50

Yes you should use your own expression data. This is the data you used to perform differential expression analysis and obtain the logFC and p values. pathfindR does not (cannot) create the expression data as it only uses the logFC values provided for enrichment. The expression matrix (input of limma, DESeq2 or EdgeR etc.) should be provided if you want to calculate pathway scores.

ADD REPLYlink written 21 months ago by egeulgen980

aaaah! you are talking about the HtseqCounts! Got it, thank you ☺️

ADD REPLYlink written 21 months ago by lucatucciarone50

For some datasets it doesn't find any pathway. I think this is the related error 🤔 WARNING: Unexpected number format in experiment file line 1, discarded

P.S. pathFindeR does work in my hands so I guess it is just not able to find any related pathway to this dataset.

ADD REPLYlink written 21 months ago by lucatucciarone50

In some cases (especially if the number of input genes is low), pathfindR may not find any active subnetworks and therefore no enriched pathways. The message stating "Could not find any interactions for XX (XX%) genes in the PIN." is more relevant for genes not mapping onto the PIN.

ADD REPLYlink written 21 months ago by egeulgen980

Your post is great and I need this information and tutorial. I appreciate.

ADD REPLYlink written 20 months ago by majuang6690

At first place, thank you so much for this tutorial. I find it extremely useful. Would the analysis work with p-values instead of adjusted p-values?

ADD REPLYlink written 19 months ago by Francisco Muñoz 10

thank you. glad you like the tutorial. Performing the analysis with p-values would be wrong, I would suggest instead to change the p value threshold (i.e. to change the argument p_val_threshold) to a higher value (e.g. 0.1).

ADD REPLYlink written 19 months ago by egeulgen980


Thanks for the nice tutorial. I have been using pathfindR successfully and was able to generate nice enrichments from my datasets. I have a question regarding the gene lists that are not found in the interaction – i.e. “Could not find any interactions for 36 (14.63%) genes in the PIN”.

I would like to know what these genes are. I have had a look on all the output .html files including: - “conversion.html”, - “results.html” and - "enriched_pathway.html" but all thee files are empty (they only contain the heading).

Do you know what the issue could be and how extract the 36 genes?

Thanks a lot in advance!

ADD REPLYlink written 12 months ago by Fil0

Hey Fil,

Glad you like the tutorial. I can't immediately think of any reason why the HTML outputs are empty but do open up an issue on our GitHub page if this persists.

For determining the 36 genes without interactions in the PIN, you may use the following:

input_df <- RA_input # your input data frame
p_val_threshold <- 0.05 # the preferred p value threshold
pin_path <- return_pin_path("Biogrid") # the preferred PIN

# process the input data frame 
processed_df <- input_processing(input_df, p_val_threshold, pin_path)
# data frame of genes without any interactions in the PIN
missing_df <- input_df[!input_df[, 1] %in% processed_df$old_GENE, ]

Hope this helps, -E

ADD REPLYlink modified 12 months ago • written 12 months ago by egeulgen980

that's great, Thanks a lot!!!

ADD REPLYlink written 12 months ago by Fil0

Hello @egeulgen Thank you for providing an interesting and easy-to-follow tutorial. I tried the given script for custom gene set analysis and I am getting the following error, could you please guide me if I am doing something wrong? I am copying-and-pasting the exact code that you shared:

## Testing input

The input looks OK

## Processing input. Converting gene symbols, if necessary (and if human gene symbols provied)

Could not find any interactions for 4 (7.27%) genes in the PIN

## Performing Active Subnetwork Search and Enrichment

Sep 17, 2019 5:54:41 PM ActiveSubnetworkSearchMisc.ScoreCalculations fillNodeToPValueMap
WARNING: 0 genes in experiment file does not exist in the network
Warning messages:
1: In pathfindR::input_processing(input, p_val_threshold, pin_path,  :
  The gene column was turned into character from factor.
2: In run_pathfindR(input, gene_sets = "Custom", custom_genes = custom_genes,  :
  Did not find any enriched pathways!
ADD REPLYlink modified 10 months ago • written 10 months ago by Bioinformatist Newbie250

Hello, Glad you found the tutorial useful. The function did not raise an error but rather gives a warning that no enriched gene sets were identified. I cannot tell, given only the above warnings, what might be the issue but I think the custom gene set you use for the analysis might be the problem. If you think that given your input and the custom gene set, you should find enriched term(s), please create an issue on our github page

ADD REPLYlink modified 10 months ago • written 10 months ago by egeulgen980

Hi, thank you for your prompt response. Please note that I have actually tried your given (also above-mentioned) example of custome geneset/pathway creation and finding its enrichment. And I am getting the warning message that I posted in my question, however, I think I should get a table of enrichment for "MYC and CREB target genes".

ADD REPLYlink written 10 months ago by Bioinformatist Newbie250

hello again. Thank you for pointing that out! After the initial tutorial post, we made some changes on how the active subnetworks are filtered (that is more stringent). That's why the example above didn't work. I've updated the above example and made sure it works, should be fine now :)

ADD REPLYlink written 10 months ago by egeulgen980

Hello, really liking the tool so far. What base level is the log in logFC? Do you suggest log10 or log2? Thanks, A

ADD REPLYlink written 10 months ago by Adrian Pelin2.4k

Hey. Great to hear you liked the package. The logFC values are optional and are only used during KEGG pathway diagram visualization via pathview, which scales the values between -1 and 1. So the base should not matter. The active subnetwork search only utilizes the gene symbols and associated p values.

ADD REPLYlink modified 10 months ago • written 10 months ago by egeulgen980

So the logFCs are only used to provide a direction of the expression change e.g. separate upregulated and down-regulated genes? I'm asking since I used spearman rhos instead of the fold changes which works very well by the way ;)

ADD REPLYlink modified 9 months ago • written 9 months ago by Jimbou770

only used to provide a direction of the expression change

That is correct. The change values are only used for visualization purposes (indicating up/down change).

ADD REPLYlink modified 9 months ago • written 9 months ago by egeulgen980
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1087 users visited in the last hour