Tutorial: Pathway Enrichment Analysis, Clustering and Scoring with pathfindR
39
gravatar for egeulgen
8 months ago by
egeulgen670
Istanbul
egeulgen670 wrote:

In this tutorial, I'll try to provide a brief overview of the capabilities of our pathway analysis R package pathfindR. The tool is in CRAN and its introductory vignette can be found here. We also have a detailed wiki.

This tool is designed to improve pathway enrichment analysis by firstly identifying active subnetworks in differential expression/methylation data using a protein-protein interaction network. It then performs over-representation analysis using the identified active subnetworks. By utilizing the interaction information, the tool identifies numerous pathways relevant to the condition under study.

I'll be using the example data provided in the package: a gene expression data comparing rheumatoid arthritis patients vs. controls.


Active-Subnetwork-Oriented Pathway Enrichment Analysis

As illustrated below, this workflow takes in a data frame of 3 columns: Gene Symbol, log-fold-change and adjusted p value. It then maps these genes onto a protein-protein interaction network and identifies active subnetworks. Using the genes in the active subnetwork, it performs pathway enrichment analyses. The results are returned as a data frame as well as presented in an HTML report.

pathfindR enrichment overview

We'll first load the library and the example data:

library(pathfindR)
data("RA_input")
head(RA_input)
  Gene.symbol    logFC  adj.P.Val
1     FAM110A -0.69394 3.4087e-06
2      RNASE2  1.35350 1.0085e-05
3      S100A8  1.54483 3.4664e-05
4      S100A9  1.02809 2.2626e-04
5      TEX261 -0.32360 2.2626e-04
6    ARHGAP17 -0.69193 2.7081e-04

Then we simply run the wrapper function run_pathfindR for active-subnetwork oriented pathway enrichment analysis:

RA_output <- run_pathfindR(RA_input)

By default the gene sets used for enrichment analysis is KEGG pathways. The available gene sets are KEGG, Reactome, BioCarta, GO-BP, GO-CC and GO-MF. The gene sets can be specified using the gene_sets argument. Detailed information on all the arguments of run_pathfindR is provided here.

The output data frame RA_output consists of 8 columns:

  1. ID: KEGG ID of enriched pathway
  2. Pathway: Description the pathway
  3. Fold_Enrichment: Fold enrichment value for the pathway.
  4. occurrence: The number of times the pathway was found to be enriched over all iterations
  5. lowest_p: the lowest adjusted-p value of the pathway over all iterations
  6. highest_p: the highest adjusted-p value of the pathway over all iterations
  7. Up_regulated: the up-regulated genes involved in the pathway
  8. Down_regulated: the down-regulated genes involved in the pathway

By default, the function also plots a bubble chart of enrichment results. The example bubble chart of enrichment results for RA_output is presented below:

bubble chart of enrichment results

The function also creates an HTML report named results.html under the directory specified by the argument output_dir (default = 'pathfindR_Results'). The report contains a table of the active subnetwork-oriented pathway enrichment results. This table contains the same information as the returned data frame. If KEGG pathways were chosen, each enriched pathway name is linked to the visualization of that KEGG pathway with the gene nodes colored according to log-fold-change values. The human KEGG pathway diagrams are created using the R package pathview. If another gene set is being used, the graph visualization of interactions of pathway genes (in the PIN) are plotted using igraph.


Enrichment Term Clustering

Enrichment analysis usually yields a large number of related terms. In order to establish representative terms among similar groups of terms, we provide functionality to cluster the enriched terms, as outlined below.

pathfindR clustering overview

The function cluster_pathways is used for clustering the pathways. This function first calculates the pairwise kappa statistics between the terms in the data frame obtained from run_pathfindR. As can be seen in the diagram, there are two options for clustering: (1) hierarchical clustering and (2) fuzzy clustering.

Detailed information on the clustering functionality can be found here.

Hierarchical Clustering

By default, cluster_pathways performs hierarchical clustering (using -kappa statistic as distance), determines the optimal number of clusters by maximizing the average silhouette width and returns a data frame with cluster assignments. The representative terms are chosen as the ones with the lowest p value in each cluster.

RA_clustered <- choose_clusters(RA_output)
head(RA_clustered,2)
        ID                        Pathway Fold_Enrichment occurrence   lowest_p  highest_p Up_regulated
4 hsa04722 Neurotrophin signaling pathway          24.849          4 5.0626e-05 0.00066481             
7 hsa04218            Cellular senescence          20.060          2 1.6810e-03 0.00168099             
                                                Down_regulated Cluster         Status
4 CRKL, FASLG, SH2B3, ABL1, MAGED1, IRAK2, IKBKB, CALM1, CALM3       1 Representative
7       MTOR, ETS1, RBL2, CALM1, CALM3, NFATC3, SLC25A5, VDAC1       1         Member

An enrichment bubble chart grouped by clusters can be plotted via:

enrichment_chart(RA_clustered, plot_by_cluster = TRUE)

bubble chart by clusters

Manual Clustering

Alternatively, the fuzzy clustering method (as described in Huang DW, Sherman BT, Tan Q, et al. The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 2007;8(9):R183.) can be used to obtain fuzzy cluster assignments and representative terms per each cluster:

RA_clustered <- cluster_pathways(RA_output, method = "fuzzy")

Pathway Scoring

The function calculate_pw_scores can be used to calculate the pathway scores per sample. This allows the user to individually examine the scores and infer whether a pathway is activated or repressed in a given sample. The function uses:

  1. The pathway data frame obtain from run_pathfindR,
  2. The expression matrix used during differential expression analysis,
  3. (Optional for ordering samples) a vector of "case" IDs.

Example usage is presented below:

## selecting "Representative" pathways for clear visualization
pws_table <- RA_clustered[RA_clustered$Status == "Representative", ]

## Expression matrix
data("RA_exp_mat")

## Vector of "Case" IDs
cases <- c("GSM389703", "GSM389704", "GSM389706", "GSM389708", 
               "GSM389711", "GSM389714", "GSM389716", "GSM389717", 
               "GSM389719", "GSM389721", "GSM389722", "GSM389724", 
               "GSM389726", "GSM389727", "GSM389730", "GSM389731", 
               "GSM389733", "GSM389735")

## Calculate pathway scores and plot heatmap
score_matrix <- calculate_pw_scores(pws_table, RA_exp_mat, cases)

Pathway scoring


pathfindR Analysis with Custom Gene Sets

It is possible to use run_pathfindR() with custom gene sets. Here we provide an example application with a gene set consisting of targets of two transcription factors. The data for these gene sets are provided in the package.

We first load and prepare the gene sets:

## CREB target genes
CREB_targets <- normalizePath(system.file("extdata/CREB.txt",
                                      package = "pathfindR"))
CREB_targets <- read.delim(CREB_targets, skip = 2, header = FALSE, stringsAsFactors = FALSE)
CREB_targets <- CREB_targets$V1

## MYC target genes
MYC_targets <- normalizePath(system.file("extdata/MYC.txt",
                                      package = "pathfindR"))
MYC_targets <- read.delim(MYC_targets, skip = 2, header = FALSE, stringsAsFactors = FALSE)
MYC_targets <- MYC_targets$V1

## Prep for use
custom_genes <- list(CREB_targets, MYC_targets)
names(custom_genes) <- c("TF1", "TF2")

custom_pathways <- c("CREB target genes", "MYC target genes")
names(custom_pathways) <- c("TF1", "TF2")

We next prepare an example input, because of the way we choose genes we expect significant enrichment for MYC targets (50 MYC target genes + 5 CREB target genes). Because this is only an example, we set the logFC values to 1.5 and the adj.P.Vals to 0.001:

set.seed(123)
selected_genes <- sample(MYC_targets, 50)
selected_genes <- c(selected_genes,
                    sample(CREB_targets, 5))
input <- data.frame(Gene.symbol = selected_genes,
                    logFC = 1.5,
                    adj.P.Val = 0.001)

Finally, we run the wrapper script run_pathfindR using the custom genes as the gene sets, setting the iteration number to 1 (as this is for only example purposes):

custom_result <- run_pathfindR(input, 
                               gene_sets = "Custom", 
                               custom_genes = custom_genes, 
                               custom_pathways = custom_pathways, 
                               iterations = 1)

The most significantly enriched gene set is, unsurprisingly, "MYC target genes". Interestingly, through interaction information, pathfindR was also able to identify significant enrichment for "CREB target genes" as well.

   ID           Pathway Fold_Enrichment occurrence   lowest_p  highest_p
1 TF2  MYC target genes          14.550          1 4.0284e-09 4.0284e-09
2 TF1 CREB target genes          35.353          1 3.7408e-03 3.7408e-03
                                                                                                                                                                                                                                                                                                                                            Up_regulated
1 AGMAT, ANGPT2, ANKRD13B, BATF3, C19orf26, C2CD2L, COMMD8, CTSF, CUL5, DOPEY1, ELAVL3, ENO3, FCHSD2, GATA5, HMGN2, HMHA1, HOXA1, IGF2R, IL1RAPL1, IPO4, KAT5, MAFF, MCM8, MEPCE, MPV17, NCOA6, NIT1, OSR1, PDP2, PICALM, PPP1R3B, PPP1R3C, PRMT1, PUS1, RNF43, SEC11C, SLC12A5, SMC3, TCF4, TEF, TFB2M, THUMPD2, TIAL1, UBR4, UBXN1, UTY, YPEL5, ZNF771
2                                                                                                                                                                                                                                                                                                                         AREG, CCBL1, MAFF, OSR1, SPAG9
  Down_regulated
1               
2

Closing remarks

I hope I was successful in providing an overview of pathfindR. I try to update this tutorial as certain changes and additions to the package are made. To keep this tutorial as brief as possible, I had to omit certain frequently asked issues (such as pathfindR analysis using non-human data). For such questions and any issues, please visit our wiki or submit an issue to our issues page on GitHub.

ADD COMMENTlink modified 27 days ago • written 8 months ago by egeulgen670

Question, some error code like: "java development kit is needed" pumped up when I was running the run_pathfindR. Wondering how can I fix this? I already installed java on my Mac before!

ADD REPLYlink written 8 months ago by shenwei13760

Hi, can you paste the error here? Do you have JDK installed? Actually JRE should have been sufficient but please give it a try if not installed.

ADD REPLYlink written 8 months ago by ozanytu0

Hello there! I cannot go past this passage: pws_table <- RA_clustered[RA_clustered$Status == "Representative", ] RA_exp_mat <- data.frame(RA_input$logFC, row.names=RA_input$Gene.symbol) score_matrix <- calculate_pw_scores(RA_clustered, RA_exp_mat)

I set up my RA_exp_mat as in the example, but I think that my expression (which is in log2foldchange) is not cut for the job.. Isn't it? My score_Matrix is full of NAs

ADD REPLYlink written 4 months ago by lucatucciarone50

Hey,

The expression matrix is the matrix you used to perform the differential expression test. Something like:

        GSM389703 GSM389704 GSM389705
FAM110A    9.8591    10.529   11.3828
RNASE2    10.3410    10.400    9.1157
S100A8    14.9554    15.012   12.9907

For the example data this can be loaded by:

data(RA_exp_mat, package = "pathfindR")

You provided the LFC values only therefore the scores could not be calculated.

ADD REPLYlink written 4 months ago by egeulgen670

I'm sorry but the "data(RA_exp_mat, package = "pathfindR")" is an example data, isn't it?

I'm guessing I should use my own expression data, isn't it? Does pathfinderR creates this expression data? Shall I calculate it myself? Could you happen to tell me how?

ADD REPLYlink written 4 months ago by lucatucciarone50

Yes you should use your own expression data. This is the data you used to perform differential expression analysis and obtain the logFC and p values. pathfindR does not (cannot) create the expression data as it only uses the logFC values provided for enrichment. The expression matrix (input of limma, DESeq2 or EdgeR etc.) should be provided if you want to calculate pathway scores.

ADD REPLYlink written 4 months ago by egeulgen670

aaaah! you are talking about the HtseqCounts! Got it, thank you ☺️

ADD REPLYlink written 4 months ago by lucatucciarone50

For some datasets it doesn't find any pathway. I think this is the related error 🤔 WARNING: Unexpected number format in experiment file line 1, discarded

P.S. pathFindeR does work in my hands so I guess it is just not able to find any related pathway to this dataset.

ADD REPLYlink written 4 months ago by lucatucciarone50

In some cases (especially if the number of input genes is low), pathfindR may not find any active subnetworks and therefore no enriched pathways. The message stating "Could not find any interactions for XX (XX%) genes in the PIN." is more relevant for genes not mapping onto the PIN.

ADD REPLYlink written 4 months ago by egeulgen670

Your post is great and I need this information and tutorial. I appreciate.

ADD REPLYlink written 3 months ago by majuang6680

At first place, thank you so much for this tutorial. I find it extremely useful. Would the analysis work with p-values instead of adjusted p-values?

ADD REPLYlink written 3 months ago by Francisco Muñoz 0
1

thank you. glad you like the tutorial. Performing the analysis with p-values would be wrong, I would suggest instead to change the p value threshold (i.e. to change the argument p_val_threshold) to a higher value (e.g. 0.1).

ADD REPLYlink written 3 months ago by egeulgen670
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1947 users visited in the last hour