In this tutorial, I'll try to provide a brief overview of the capabilities of our pathway analysis R package pathfindR. The tool is in CRAN and its vignette can be found here. We also have a detailed wiki.
This tool is designed to improve pathway enrichment analysis by firstly identifying active subnetworks in differential expression/methylation data using a protein-protein interaction network. It then performs over-representation analysis using the identified active subnetworks. By utilizing the interaction information, the tool identifies numerous pathways relevant to the condition under study.
I'll be using the example data provided in the package: rheumatoid arthritis gene expression data.
Active-Subnetwork-Oriented Pathway Enrichment Analysis
As illustrated below, this workflow takes in a data frame of 3 columns: Gene Symbol, log-fold-change and adjusted p value. It then maps these genes onto a protein-protein interaction network and identifies active subnetworks. Using the genes in the active subnetwork, it performs pathway enrichment analyses. The results are returned as a data frame as well as presented in an HTML report.
We'll first load the library and the example data:
library(pathfindR) data("RA_input") head(RA_input) Gene.symbol logFC adj.P.Val 1 FAM110A -0.69394 3.4087e-06 2 RNASE2 1.35350 1.0085e-05 3 S100A8 1.54483 3.4664e-05 4 S100A9 1.02809 2.2626e-04 5 TEX261 -0.32360 2.2626e-04 6 ARHGAP17 -0.69193 2.7081e-04
Then we simply run the wrapper function
run_pathfindR for active-subnetwork oriented pathway enrichment analysis:
RA_output <- run_pathfindR(RA_input)
By default the gene sets used for enrichment analysis is KEGG pathways. The available gene sets are KEGG, Reactome, BioCarta, GO-BP, GO-CC and GO-MF. The gene sets can be specified using the
gene_sets argument. Detailed information on all the arguments of
run_pathfindR is provided here.
The output data frame
RA_output consists of 8 columns:
- ID: KEGG ID of enriched pathway
- Pathway: Description the pathway
- Fold_Enrichment: Fold enrichment value for the pathway.
- occurrence: The number of times the pathway was found to be enriched over all iterations
- lowest_p: the lowest adjusted-p value of the pathway over all iterations
- highest_p: the highest adjusted-p value of the pathway over all iterations
- Up_regulated: the up-regulated genes involved in the pathway
- Down_regulated: the down-regulated genes involved in the pathway
By default, the function also plots a bubble chart of enrichment results. The example bubble chart of enrichment results for
RA_output is presented below:
The function also creates an HTML report named
results.html under the directory specified by the argument
output_dir (default = 'pathfindR_Results'). The report contains a table of the active subnetwork-oriented pathway enrichment results. This table contains the same information as the returned data frame. If KEGG pathways were chosen, each enriched pathway name is linked to the visualization of that KEGG pathway with the gene nodes colored according to log-fold-change values. The KEGG pathway diagrams are created using the R package
Enrichment analysis usually yields a large number of related pathways. In order to establish representative pathways among similar groups of pathways, we provide functionality to cluster the enriched pathways, as outlined below.
choose_clusters is used for clustering the pathways. This function first calculates the pairwise distances between the pathways in the data frame obtained from
run_pathfindR. As can be seen in the diagram, there are two options for partitioning the hierarchical clustering dendrogram into clusters: (1) automatic (default), (2) manual.
Detailed information on the clustering functionality can be found here.
choose_clusters automatically determines the optimal number of clusters, by maximizing the average silhouette width and returns a data frame with cluster assignments. The representative pathways are chosen as the ones with the lowest p value in each cluster.
RA_clustered <- choose_clusters(RA_output) head(RA_clustered,2) ID Pathway Fold_Enrichment occurrence lowest_p highest_p Up_regulated 4 hsa04722 Neurotrophin signaling pathway 24.849 4 5.0626e-05 0.00066481 7 hsa04218 Cellular senescence 20.060 2 1.6810e-03 0.00168099 Down_regulated Cluster Status 4 CRKL, FASLG, SH2B3, ABL1, MAGED1, IRAK2, IKBKB, CALM1, CALM3 1 Representative 7 MTOR, ETS1, RBL2, CALM1, CALM3, NFATC3, SLC25A5, VDAC1 1 Member
An enrichment bubble chart grouped by clusters can be plotted via:
enrichment_chart(RA_clustered, plot_by_cluster = TRUE)
Alternatively, manual selection of the height at which to cut the dendrogram can be performed. For this, the user should set the
auto argument of
FALSE. Via a shiny app, presented as an HTML document, the hierarchical clustering dendrogram is visualized. In this HTML document, the user can select the agglomeration method and the distance value at which to cut the tree. The dendrogram with the cut-off value marked with a red line is dynamically visualized and the resulting cluster assignments of the pathways along with annotation of representative pathways (chosen by smallest lowest p value) are presented as a table. This table can be saved as a csv file via pressing the button
Get Pathways w\ Cluster Info.
choose_clusters(RA_output, auto = FALSE)
calculate_pw_scores can be used to calculate the pathway scores per sample. This allows the user to individually examine the scores and infer whether a pathway is activated or repressed in a given sample. The function uses:
- The pathway data frame obtain from
- The expression matrix used during differential expression analysis,
- (Optional for ordering samples) a vector of "case" IDs.
Example usage is presented below:
## selecting "Representative" pathways for clear visualization pws_table <- RA_clustered[RA_clustered$Status == "Representative", ] ## Expression matrix data("RA_exp_mat") ## Vector of "Case" IDs cases <- c("GSM389703", "GSM389704", "GSM389706", "GSM389708", "GSM389711", "GSM389714", "GSM389716", "GSM389717", "GSM389719", "GSM389721", "GSM389722", "GSM389724", "GSM389726", "GSM389727", "GSM389730", "GSM389731", "GSM389733", "GSM389735") ## Calculate pathway scores and plot heatmap score_matrix <- calculate_pw_scores(pws_table, RA_exp_mat, cases)
pathfindR Analysis with Custom Gene Sets
It is possible to use
run_pathfindR() with custom gene sets. Here we provide an example application with a gene set consisting of targets of two transcription factors.
We first load and prepare the gene sets:
## CREB target genes CREB_targets <- normalizePath(system.file("extdata/CREB.txt", package = "pathfindR")) CREB_targets <- read.delim(CREB_targets, skip = 2, header = FALSE, stringsAsFactors = FALSE) CREB_targets <- CREB_targets$V1 ## MYC target genes MYC_targets <- normalizePath(system.file("extdata/MYC.txt", package = "pathfindR")) MYC_targets <- read.delim(MYC_targets, skip = 2, header = FALSE, stringsAsFactors = FALSE) MYC_targets <- MYC_targets$V1 ## Prep for use custom_genes <- list(CREB_targets, MYC_targets) names(custom_genes) <- c("TF1", "TF2") custom_pathways <- c("CREB target genes", "MYC target genes") names(custom_pathways) <- c("TF1", "TF2")
We next prepare an example input, because of the way we choose genes we expect significant enrichment for MYC targets (50 MYC target genes + 5 CREB target genes). Because this is only an example, we set the logFC values to 1.5 and the adj.P.Vals to 0.001:
set.seed(123) selected_genes <- sample(MYC_targets, 50) selected_genes <- c(selected_genes, sample(CREB_targets, 5)) input <- data.frame(Gene.symbol = selected_genes, logFC = 1.5, adj.P.Val = 0.001)
Finally, we run the wrapper script
run_pathfindR using the custom genes as the gene sets:
custom_result <- run_pathfindR(input, gene_sets = "Custom", custom_genes = custom_genes, custom_pathways = custom_pathways, iterations = 1)
The most significantly enriched gene set is, unsurprisingly, "MYC target genes". Interestingly, through interaction information, pathfindR was also able to identify significant enrichment for "CREB target genes" as well.
ID Pathway Fold_Enrichment occurrence lowest_p highest_p 1 TF2 MYC target genes 14.550 1 4.0284e-09 4.0284e-09 2 TF1 CREB target genes 35.353 1 3.7408e-03 3.7408e-03 Up_regulated 1 AGMAT, ANGPT2, ANKRD13B, BATF3, C19orf26, C2CD2L, COMMD8, CTSF, CUL5, DOPEY1, ELAVL3, ENO3, FCHSD2, GATA5, HMGN2, HMHA1, HOXA1, IGF2R, IL1RAPL1, IPO4, KAT5, MAFF, MCM8, MEPCE, MPV17, NCOA6, NIT1, OSR1, PDP2, PICALM, PPP1R3B, PPP1R3C, PRMT1, PUS1, RNF43, SEC11C, SLC12A5, SMC3, TCF4, TEF, TFB2M, THUMPD2, TIAL1, UBR4, UBXN1, UTY, YPEL5, ZNF771 2 AREG, CCBL1, MAFF, OSR1, SPAG9 Down_regulated 1 2