Tutorial:Enrichment Analysis, Clustering and Scoring with pathfindR
0
77
Entering edit mode
4.5 years ago
egeulgen ★ 1.3k

In this tutorial, I'll try to provide a brief overview of the capabilities of our enrichment analysis R package pathfindR. The tool is in CRAN and its introductory vignette can be found here. We also have a detailed wiki.

This tool is designed to improve enrichment analysis by firstly identifying active subnetworks in differential expression/methylation data using a protein-protein interaction network. It then performs over-representation analysis using the identified active subnetworks. By utilizing the interaction information, the tool identifies numerous gene sets relevant to the condition under study.

I'll be using the example data provided in the package: a gene expression data comparing rheumatoid arthritis patients vs. controls.

# Active-Subnetwork-Oriented Pathway Enrichment Analysis

As illustrated below, this workflow takes in a data frame of 3 columns containing: Gene Symbols, Change Values (optional) and p values. It then maps these genes onto a protein-protein interaction network and identifies and filters active subnetworks. Using the genes in each active subnetwork, it performs enrichment analyses. The results are returned as a data frame as well as presented in an HTML report.

We'll first load the library and the example data:

library(pathfindR)

1     FAM110A -0.69394 3.4087e-06
2      RNASE2  1.35350 1.0085e-05
3      S100A8  1.54483 3.4664e-05
4      S100A9  1.02809 2.2626e-04
5      TEX261 -0.32360 2.2626e-04
6    ARHGAP17 -0.69193 2.7081e-04


Then we simply run the wrapper function run_pathfindR for active-subnetwork oriented pathway enrichment analysis:

RA_output <- run_pathfindR(RA_input)


By default the gene sets used for enrichment analysis is KEGG pathways. The available gene sets are "KEGG", "Reactome", "BioCarta", "GO-All", "GO-BP", "GO-CC", "GO-MF" and "mmu_KEGG". The gene sets can be specified using the gene_sets argument. Detailed information on all the arguments of run_pathfindR is provided here.

The output of run_pathfindR() is a data frame containing 8 (or 9) columns:

• ID: ID of the enriched term
• Term_Description: Description of the enriched term
• Fold_Enrichment: Fold enrichment value for the enriched term (Calculated using ONLY the input genes)
• occurrence: the number of iterations that the given term was found to enriched over all iterations
• lowest_p: the lowest adjusted-p value of the given term over all iterations
• highest_p: the highest adjusted-p value of the given term over all iterations
• non_Signif_Snw_Genes (OPTIONAL): the non-significant active subnetwork genes, comma-separated
• Up_regulated: the up-regulated genes (as determined by ‘change value' > 0, if the 'change column' was provided) in the input involved in the given term’s gene set, comma-separated. If change column not provided, all affected are listed here.
• Down_regulated: the down-regulated genes (as determined by ‘change value' < 0, if the 'change column' was provided) in the input involved in the given term’s gene set, comma-separated

By default, the function also plots a bubble chart of enrichment results. The example bubble chart of enrichment results for RA_output is presented below:

The function also creates an HTML report named results.html under the directory specified by the argument output_dir (default = 'pathfindR_Results'). The report contains a table of the active subnetwork-oriented pathway enrichment results. This table contains the same information as the returned data frame. If KEGG pathways were chosen, each enriched pathway name is linked to the visualization of that KEGG pathway with the gene nodes colored according to log-fold-change values. The colored human KEGG pathway diagrams are created using the input genes. If another gene set is being used, the graph visualization of interactions of pathway genes (in the PIN) are plotted instead.

# Enrichment Term Clustering

Enrichment analysis usually yields a large number of related terms. In order to establish representative terms among similar groups of terms, we provide functionality to cluster the enriched terms, as outlined below:

The wrapper function cluster_enriched_terms() is used for clustering the enriched terms. This function first calculates the pairwise kappa statistics between the terms in the data frame obtained from run_pathfindR(). It then clusters the terms and partitions them into relevant clusters. As can be seen in the diagram, there are two options for clustering: (1) hierarchical clustering and (2) fuzzy clustering.

Detailed information on the clustering functionality can be found here.

### Hierarchical Clustering

By default, cluster_enriched_terms() performs hierarchical clustering (using 1 - kappa statistic as the distance metric), determines the optimal number of clusters by maximizing the average silhouette width and returns a data frame with cluster assignments. The representative terms are chosen as the ones with the lowest p value in each cluster.

RA_clustered <- cluster_enriched_terms(RA_output)
ID                        Term_Description Fold_Enrichment occurrence   lowest_p  highest_p Up_regulated
4 hsa04722 Neurotrophin signaling pathway          24.849          4 5.0626e-05 0.00066481
7 hsa04218            Cellular senescence          20.060          2 1.6810e-03 0.00168099
Down_regulated Cluster         Status
4 CRKL, FASLG, SH2B3, ABL1, MAGED1, IRAK2, IKBKB, CALM1, CALM3       1 Representative
7       MTOR, ETS1, RBL2, CALM1, CALM3, NFATC3, SLC25A5, VDAC1       1         Member


An enrichment bubble chart grouped by clusters can be plotted via:

enrichment_chart(RA_clustered, plot_by_cluster = TRUE)


Alternatively, the fuzzy clustering method (as described in Huang DW, Sherman BT, Tan Q, et al. The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 2007;8(9):R183.) can be used to obtain fuzzy cluster assignments and representative terms per each cluster:

RA_clustered <- cluster_enriched_terms(RA_output, method = "fuzzy")


# Aggregated Term Scores per Sample

The function score_terms() can be used to calculate the agglomerated z score of each enriched term per sample. This allows the user to individually examine the scores and infer how a term is overall altered (activated or repressed) in a given sample or a group of samples. The function uses:

1. The output data frame obtain of run_pathfindR(),
2. The expression matrix used during analysis (differential expression/methylation),
3. (Optional for ordering samples) a vector of "case" IDs.

Example usage is presented below:

## Vector of "Case" IDs cases <- c("GSM389703", "GSM389704", "GSM389706", "GSM389708",
"GSM389711", "GSM389714", "GSM389716", "GSM389717",
"GSM389719", "GSM389721", "GSM389722", "GSM389724",
"GSM389726", "GSM389727", "GSM389730", "GSM389731",
"GSM389733", "GSM389735")

## Calculate scores for representative terms
## and plot heat map using term descriptions
score_matrix <- score_terms(enrichment_table = RA_clustered[RA_clustered$Status == "Representative", ], exp_mat = RA_exp_mat, cases = cases, use_description = TRUE, # default FALSE label_samples = FALSE) # default = TRUE  # pathfindR Analysis with Custom Gene Sets As of the latest version, pathfindR offers utility functions for obtaining organism-specific PIN data and organism-specific gene sets data via get_pin_file() and get_gene_sets_list(), respectively. It is possible to use run_pathfindR() with custom gene sets (including gene sets for non-Homo-sapiens species). Here, we provide an example application of active-subnetwork-oriented enrichment analysis of the target genes of two transcription factors. We first load and prepare the gene sets: ## CREB target genes CREB_target_genes <- normalizePath(system.file("extdata/CREB.txt", package = "pathfindR")) CREB_target_genes <- readLines(CREB_target_genes)[-c(1, 2)] # skip the first two lines ## MYC target genes MYC_target_genes <- normalizePath(system.file("extdata/MYC.txt", package = "pathfindR")) MYC_target_genes <- readLines(MYC_target_genes)[-c(1, 2)] # skip the first two lines ## Prep for use custom_genes <- list(TF1 = CREB_target_genes, TF2 = MYC_target_genes) custom_descriptions <- c(TF1 = "CREB target genes", TF2 = "MYC target genes")  We next prepare the example input data frame. Because of the way we choose genes, we expect significant enrichment for MYC targets (40 MYC target genes + 10 CREB target genes). Because this is only an example, we also assign each genes random p-values between 0.001 and 0.05. set.seed(123) ## Select 40 random genes from MYC gene sets and 10 from CREB gene sets selected_genes <- sample(MYC_target_genes, 40) selected_genes <- c(selected_genes, sample(CREB_target_genes, 10)) ## Assign random p value between 0.001 and 0.05 for each selected gene rand_p_vals <- sample(seq(0.001, 0.05, length.out = 5), size = length(selected_genes), replace = TRUE) input_df <- data.frame(Gene_symbol = selected_genes, p_val = rand_p_vals)  Finally, we perform active-subnetwork-oriented enrichment analysis via run_pathfindR() using the custom genes as the gene sets: custom_result <- run_pathfindR(input_df, gene_sets = "Custom", custom_genes = custom_genes, custom_descriptions = custom_descriptions, max_gset_size = Inf, # DO NOT LIMIT GENE SET SIZE iterations = 1, # for demo, setting number of iterations to 1 output_dir = "misc/v1.4/CREB_MYC") ID Term_Description Fold_Enrichment occurrence lowest_p highest_p 1 TF2 MYC target genes 15.133 1 9.3559e-14 9.3559e-14 2 TF1 CREB target genes 17.346 1 5.0187e-05 5.0187e-05 Up_regulated 1 AGRP, ATP6V1C1, C19orf54, CD3EAP, CNPY3, EPB41, FOXD3, FXR1, GLA, HNRNPD, HOXA7, IL1RAPL1, KDM6A, LONP1, LTBR, MDN1, MICU1, NET1, NEUROD6, NMNAT2, NOL6, NUDC, PEPD, PKN1, PSMB3, RPS19, RPS28, RRS1, SLC9A5, SMC3, STC2, TESK2, TNPO2, TOPORS, TPM2, TSSK3, WBP2, ZBTB8OS, ZFYVE26, ZHX2 2 BRAF, DIO2, ELAVL1, EPB41, FAM65A, FOXD3, NEUROD6, NOC4L, NUPL2, PPP1R15A, SYNGR3, TIPRL Down_regulated 1 2  # Closing remarks In this tutorial I tried to provide an overview of pathfindR. I try to update this tutorial as certain changes and additions to the package are made. To keep this tutorial as brief as possible, I had to omit certain frequently asked issues (such as pathfindR analysis using non-human data). For such questions and any issues, please visit our website or submit an issue to our issues page on GitHub. Also check out the vignettes included in the package via browseVignettes("pathfindR") or find them here: R pathway enrichment active subnetwork Tutorial • 14k views ADD COMMENT 1 Entering edit mode Your post is great and I need this information and tutorial. I appreciate. ADD REPLY 0 Entering edit mode Question, some error code like: "java development kit is needed" pumped up when I was running the run_pathfindR. Wondering how can I fix this? I already installed java on my Mac before! ADD REPLY 0 Entering edit mode Hi, can you paste the error here? Do you have JDK installed? Actually JRE should have been sufficient but please give it a try if not installed. ADD REPLY 0 Entering edit mode Hello there! I cannot go past this passage: pws_table <- RA_clustered[RA_clustered$Status == "Representative", ] RA_exp_mat <- data.frame(RA_input$logFC, row.names=RA_input$Gene.symbol) score_matrix <- calculate_pw_scores(RA_clustered, RA_exp_mat)

I set up my RA_exp_mat as in the example, but I think that my expression (which is in log2foldchange) is not cut for the job.. Isn't it? My score_Matrix is full of NAs

0
Entering edit mode

Hey,

The expression matrix is the matrix you used to perform the differential expression test. Something like:

        GSM389703 GSM389704 GSM389705
FAM110A    9.8591    10.529   11.3828
RNASE2    10.3410    10.400    9.1157
S100A8    14.9554    15.012   12.9907


For the example data this can be loaded by:

data(RA_exp_mat, package = "pathfindR")


You provided the LFC values only therefore the scores could not be calculated.

0
Entering edit mode

I'm sorry but the "data(RA_exp_mat, package = "pathfindR")" is an example data, isn't it?

I'm guessing I should use my own expression data, isn't it? Does pathfinderR creates this expression data? Shall I calculate it myself? Could you happen to tell me how?

0
Entering edit mode

Yes you should use your own expression data. This is the data you used to perform differential expression analysis and obtain the logFC and p values. pathfindR does not (cannot) create the expression data as it only uses the logFC values provided for enrichment. The expression matrix (input of limma, DESeq2 or EdgeR etc.) should be provided if you want to calculate pathway scores.

0
Entering edit mode

aaaah! you are talking about the HtseqCounts! Got it, thank you ☺️

0
Entering edit mode

For some datasets it doesn't find any pathway. I think this is the related error 🤔 WARNING: Unexpected number format in experiment file line 1, discarded

P.S. pathFindeR does work in my hands so I guess it is just not able to find any related pathway to this dataset.

0
Entering edit mode

In some cases (especially if the number of input genes is low), pathfindR may not find any active subnetworks and therefore no enriched pathways. The message stating "Could not find any interactions for XX (XX%) genes in the PIN." is more relevant for genes not mapping onto the PIN.

0
Entering edit mode

At first place, thank you so much for this tutorial. I find it extremely useful. Would the analysis work with p-values instead of adjusted p-values?

2
Entering edit mode

thank you. glad you like the tutorial. Performing the analysis with p-values would be wrong, I would suggest instead to change the p value threshold (i.e. to change the argument p_val_threshold) to a higher value (e.g. 0.1).

0
Entering edit mode

Hello,

Thanks for the nice tutorial. I have been using pathfindR successfully and was able to generate nice enrichments from my datasets. I have a question regarding the gene lists that are not found in the interaction – i.e. “Could not find any interactions for 36 (14.63%) genes in the PIN”.

I would like to know what these genes are. I have had a look on all the output .html files including: - “conversion.html”, - “results.html” and - "enriched_pathway.html" but all thee files are empty (they only contain the heading).

Do you know what the issue could be and how extract the 36 genes?

1
Entering edit mode

Hey Fil,

Glad you like the tutorial. I can't immediately think of any reason why the HTML outputs are empty but do open up an issue on our GitHub page if this persists.

For determining the 36 genes without interactions in the PIN, you may use the following:

input_df <- RA_input # your input data frame
p_val_threshold <- 0.05 # the preferred p value threshold
pin_path <- return_pin_path("Biogrid") # the preferred PIN

# process the input data frame
processed_df <- input_processing(input_df, p_val_threshold, pin_path)
# data frame of genes without any interactions in the PIN
missing_df <- input_df[!input_df[, 1] %in% processed_df$old_GENE, ]  Hope this helps, -E ADD REPLY 0 Entering edit mode that's great, Thanks a lot!!! ADD REPLY 0 Entering edit mode Hello @egeulgen Thank you for providing an interesting and easy-to-follow tutorial. I tried the given script for custom gene set analysis and I am getting the following error, could you please guide me if I am doing something wrong? I am copying-and-pasting the exact code that you shared: ## Testing input The input looks OK ## Processing input. Converting gene symbols, if necessary (and if human gene symbols provied) Could not find any interactions for 4 (7.27%) genes in the PIN ## Performing Active Subnetwork Search and Enrichment Sep 17, 2019 5:54:41 PM ActiveSubnetworkSearchMisc.ScoreCalculations fillNodeToPValueMap WARNING: 0 genes in experiment file does not exist in the network Warning messages: 1: In pathfindR::input_processing(input, p_val_threshold, pin_path, : The gene column was turned into character from factor. 2: In run_pathfindR(input, gene_sets = "Custom", custom_genes = custom_genes, : Did not find any enriched pathways!  ADD REPLY 0 Entering edit mode Hello, Glad you found the tutorial useful. The function did not raise an error but rather gives a warning that no enriched gene sets were identified. I cannot tell, given only the above warnings, what might be the issue but I think the custom gene set you use for the analysis might be the problem. If you think that given your input and the custom gene set, you should find enriched term(s), please create an issue on our github page ADD REPLY 0 Entering edit mode Hi, thank you for your prompt response. Please note that I have actually tried your given (also above-mentioned) example of custome geneset/pathway creation and finding its enrichment. And I am getting the warning message that I posted in my question, however, I think I should get a table of enrichment for "MYC and CREB target genes". ADD REPLY 0 Entering edit mode hello again. Thank you for pointing that out! After the initial tutorial post, we made some changes on how the active subnetworks are filtered (that is more stringent). That's why the example above didn't work. I've updated the above example and made sure it works, should be fine now :) ADD REPLY 0 Entering edit mode Hello, really liking the tool so far. What base level is the log in logFC? Do you suggest log10 or log2? Thanks, A ADD REPLY 0 Entering edit mode Hey. Great to hear you liked the package. The logFC values are optional and are only used during KEGG pathway diagram visualization via pathview, which scales the values between -1 and 1. So the base should not matter. The active subnetwork search only utilizes the gene symbols and associated p values. ADD REPLY 0 Entering edit mode So the logFCs are only used to provide a direction of the expression change e.g. separate upregulated and down-regulated genes? I'm asking since I used spearman rhos instead of the fold changes which works very well by the way ;) ADD REPLY 0 Entering edit mode only used to provide a direction of the expression change That is correct. The change values are only used for visualization purposes (indicating up/down change). ADD REPLY 0 Entering edit mode Hello @egeulgen, many thanks for your great tutorials. I really like PathfindR! I have just one problem, how can I change the colors of the bubbles of the enrichment chart? From white to red is not really applicable for me. Changing colors for score_terms() is super easy. But how is it working for enrichment_chart()? Sorry this question is maybe very basic, but I am also very new in the field of RNA seq and using R. Many thanks in advance. Best, Kamil. ADD REPLY 2 Entering edit mode Hey Kamil, The result of enrichment_chart() is also a ggplot object so you can change the gradient colors a scale_color_gradient(): library(ggplot2) g <- enrichment_chart(output_df) g <- g + scale_color_gradient(low = "green", high = "blue") g  ADD REPLY 0 Entering edit mode Wow that was super easy! Many thanks. ADD REPLY 0 Entering edit mode Dear, thanks a lot for this tutorial. But, If I want to get the p adjusted for the pathways in the output result as you know the output contain lowest p value and highest p value not the p adjusted. How I can do that ? I tried to use p.adjust function in R, is it right way or not ? if not, is there any other way? ADD REPLY 0 Entering edit mode Hi, I have two similar DEG lists, one worked very well with pathfindR, but the 2nd one failed for no apparent reason. Do you have any idea why I am getting this error? > Tuj1p_output <- run_pathfindR(Tuj1p_input) ## Testing input The input looks OK ## Processing input. Converting gene symbols, if necessary (and if human gene symbols provided) Number of genes provided in input: 161 Number of genes in input after p-value filtering: 161 pathfindR cannot handle p values < 1e-13. These were changed to 1e-13 Could not find any interactions for 15 (9.32%) genes in the PIN Final number of genes in input: 146 ## Performing Active Subnetwork Search and Enrichment Feb 25, 2021 4:33:22 PM Network.Network addInteraction WARNING: Self interactions are discarded. Found 200 active subnetworks ## Processing the enrichment results over all iterations ## Annotating involved genes and visualizing enriched terms Error in $<-.data.frame(*tmp*, "KEGG_ID", value = "hsa:") :
replacement has 1 row, data has 0

0
Entering edit mode

This looks like an issue with the code. Would you mind opening an issue here: https://github.com/egeulgen/pathfindR/issues

0
Entering edit mode

Hi, sorry for the late reply. It seems ok so far, now working. I guess pathfindR doesn't like a short list of DEGs (< 150 genes). If anything happens, I'll let you know by opening a case in github. Thank you.

0
Entering edit mode

Hi, Thanks for the useful package and a wonderful tutorial. I have a small doubt. I assume that we only need to provide a list of differentially expressed genes along with Fold change and p values. What if I provide the complete output of DESeq as Input? Will it only consider the differentially expressed genes as there is a p_val_threshold?

0
Entering edit mode

Hey, thank you for your kind words. You're correct, if you provide the complete (not filtered) differential expression results, pathfindR filters the list per the p_val_threshold and will only consider those genes.

0
Entering edit mode

On my system (Ubuntu 18.04) Pathfinder is unable to install due to an error raised by magick

> pak::pkg_install("pathfindR")
✔ Updated metadata database: 2.57 MB in 6 files.
✔ Updating metadata database ... done
> → Will install 3 packages.
> + KEGGgraph   1.56.0 [bld][dl]
> + magick      2.7.3  [bld][cmp][dl] (4.81 MB)
> + pathfindR   1.6.3  [bld][dl] (3.50 MB)
> ℹ Getting 2 pkgs (8.31 MB) and 1 pkg with unknown size
✔ Got KEGGgraph 1.56.0 (source) (1.34 MB)
✔ Got pathfindR 1.6.3 (source) (3.50 MB)
✔ Got magick 2.7.3 (source) (4.81 MB)
ℹ Building magick 2.7.3
ℹ Building KEGGgraph 1.56.0
✖ Failed to build magick 2.7.3
<callr_status_error/callr_error/rlib_error_3_0/rlib_error/error>
Error:
! error in callr subprocess
Caused by error in stop_task_build(state, worker):
! Failed to build source package 'magick'
Failed to build source package 'magick', stdout + stderr:
OE> * installing *source* package ‘magick’ ...
OE> ** package ‘magick’ successfully unpacked and MD5 sums checked
OE> staged installation is only possible with locking
OE> ** using non-staged installation
OE> Perhaps you should add the directory containing Magick++.pc'
OE> to the PKG_CONFIG_PATH environment variable
OE> No package 'Magick++' found
OE> Using PKG_CFLAGS=
OE> Using PKG_LIBS=-lMagick++-6.Q16
OE> --------------------------- [ANTICONF] --------------------------------
OE> Configuration failed to find the Magick++ library. Try installing:
OE>  - deb: libmagick++-dev (Debian, Ubuntu)
OE>  - rpm: ImageMagick-c++-devel (Fedora, CentOS, RHEL)
OE>  - csw: imagemagick_dev (Solaris)
OE>  - brew imagemagick@6 (MacOS)
OE> For Ubuntu versions Trusty (14.04) and Xenial (16.04) use our PPA:
OE>    sudo apt-get update
OE>    sudo apt-get install -y libmagick++-dev
OE> If Magick++ is already installed, check that 'pkg-config' is in your
OE> PATH and PKG_CONFIG_PATH contains a Magick++.pc file. If pkg-config
OE> is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
OE> R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
OE> -------------------------- [ERROR MESSAGE] ---------------------------
OE> <stdin>:1:10:fatal error: Magick++.h: No such file or directory
OE> compilation terminated.
OE> --------------------------------------------------------------------
OE> ERROR: configuration failed for package ‘magick’
OE> * removing ‘/tmp/RtmpYAlKoB/pkg-lib42e43baecf6f/magick’
NULL
Error: ! error in callr subprocess
Caused by error in stop_task_build(state, worker):
! Failed to build source package 'magick'
Type .Last.error.trace to see where the error occurred
`

Do you have advices how to solve this error?

0
Entering edit mode