Hi there,
You're spot on—ChEA3 (the tool behind the "ChEA2024" interface at https://maayanlab.cloud/ChEA2024/) is fantastic for transcription factor (TF) enrichment analysis, but the web UI is geared toward batch-submitting lists of genes to find enriched TFs across the set. For your use case—getting the full list of candidate TFs per individual gene (i.e., reverse lookup: which TFs likely regulate each of your 500 genes), along with p-values for filtering—you're right that you'd otherwise have to submit one gene at a time manually. That would take forever!
The good news is ChEA3 exposes a public REST API that we can hit programmatically from R. This lets us loop over your gene list, submit each gene as a singleton "query set," fetch the results (which include raw p-values from Fisher's Exact Test across multiple libraries like ChIP-seq, co-expression, etc.), and then compute adjusted p-values (FDR via Benjamini-Hochberg) per gene for easy filtering. The API returns results from ~8 libraries + 2 integrated summaries (MeanRank and TopRank), which aggregate ranks across libraries for a consensus view.
I'll provide a complete, self-contained R script below. It:
- Reads your gene list from a text file (one HGNC symbol per line).
- Submits each gene via POST to the API.
- Parses the JSON response into data frames.
- Extracts results from the MeanRank integration (a robust combined ranking; you can swap to others if preferred).
- Computes adjusted p-values per gene.
- Compiles everything into a single output data frame (genes × TFs, with columns for TF, raw p-value, adj p-value, rank, etc.).
- Saves to CSV for inspection/filtering (e.g., keep adj p < 0.05).
- Includes progress tracking and a 1-sec delay between queries to be kind to their server (500 queries should take ~10-15 mins total).
Assumptions/Notes:
- Genes are human HGNC symbols (case-insensitive; API handles mouse too via orthologs).
- For a single-gene query, p-values reflect how strongly each TF "targets" that gene based on pre-built libraries (background ~20k genes). Not all TFs will hit p < 1, but you'll get the top candidates.
- API returns top ~100 TFs per library by default—plenty for most genes.
- No auth needed; it's open.
- If a gene fails (e.g., invalid symbol), it's skipped with a warning.
- Test it first with a tiny gene list (e.g., c("TP53", "BRCA1")) to verify.
- Libraries needed:
httr, jsonlite, dplyr (for tidying). Install if missing: BiocManager::install(c("httr", "jsonlite", "dplyr")).
Here's the script—copy-paste into R/RStudio and run:
# Clear workspace and load libraries
rm(list = ls())
library(httr)
library(jsonlite)
library(dplyr)
# Step 1: Read your gene list from a file (one symbol per line, e.g., "genes.txt")
# Or hardcode: genes <- c("TP53", "BRCA1", "MYC") # for testing
genes <- readLines("path/to/your/genes.txt") # replace with your file path
genes <- unique(genes[genes != ""]) # clean: unique non-empty
# Step 2: Function to query API for a single gene and return tidy DF
query_chea <- function(gene) {
url <- "https://maayanlab.cloud/chea3/api/enrich/"
payload <- list(query_name = gene, gene_set = gene)
response <- POST(url = url,
body = payload,
encode = "json",
config = timeout(30)) # 30s timeout per query
if (status_code(response) != 200) {
warning(paste("API error for", gene, "- skipping"))
return(NULL)
}
json_text <- content(response, "text")
results <- fromJSON(json_text)
# Extract MeanRank integration (change to "TopRank" or a specific library like "ENCODE_ChIP-seq" if preferred)
# Results is a list of DFs, one per library/integration
meanrank_idx <- which(sapply(results, function(x) "MeanRank" %in% names(x) || grepl("MeanRank", names(x))))
if (length(meanrank_idx) == 0) {
warning(paste("No MeanRank for", gene, "- trying first result"))
meanrank_df <- results[[1]]
} else {
meanrank_df <- results[[meanrank_idx[1]]] # take first match
}
# Assume DF columns: Term (TF), PValue, Combined Score (rank-ish), Overlapping Genes, etc.
# Adapt column names if needed after testing (print(colnames(meanrank_df)) )
tf_df <- as.data.frame(meanrank_df) %>%
select(Term, PValue, `Combined Score`, `Overlap %`, `Overlap`) %>% # adjust cols as per actual output
mutate(Gene = gene,
adjPValue = p.adjust(PValue, method = "BH")) %>% # FDR per gene
filter(PValue < 1) %>% # drop non-significant (p=1 means no overlap)
arrange(PValue)
return(tf_df)
}
# Step 3: Loop over genes, collect results
cat("Querying", length(genes), "genes...\n")
all_results <- lapply(genes, query_chea)
names(all_results) <- genes
# Remove NULLs (failed queries)
all_results <- all_results[!sapply(all_results, is.null)]
# Step 4: Bind into one big DF
final_df <- bind_rows(all_results)
cat("Collected", nrow(final_df), "TF-gene pairs\n")
# Step 5: Save and preview
write.csv(final_df, "chea_tfs_per_gene.csv", row.names = FALSE)
head(final_df)
# Optional: Filter example (adj p < 0.05)
filtered <- final_df %>% filter(adjPValue < 0.05)
write.csv(filtered, "chea_tfs_filtered.csv", row.names = FALSE)
cat("Filtered to", nrow(filtered), "pairs (adj p < 0.05)\n")
Quick tweaks if needed:
- Column names: Run a test query first and
print(colnames(results[[1]])) to confirm (API might vary slightly; common ones are Term for TF, PValue, Adjusted PValue—hey, if it already has adj p, skip the p.adjust step!).
- Which library? Swap
"MeanRank" to "ENCODE_ChIP-seq" for ChIP-focused results, etc. Print names(results) in the function to see options.
- All libraries? If you want to merge across all (with duplicates de-duped by TF), change the extraction to
lapply(results, as.data.frame) then bind_rows and distinct(Gene, Term, .keep_all=TRUE).
- Speed: If 500 is too slow, parallelize with
parallel::mclapply (but watch for rate limits—API is generous but don't hammer it).
- Errors: If genes don't map (e.g., non-HGNC), API discards silently—cross-check your input.
This should give you a clean table like:
| Gene |
Term |
PValue |
adjPValue |
Combined Score |
Overlap % |
| TP53 |
MDM2 |
1.2e-05 |
0.003 |
15.3 |
85 |
| TP53 |
p53 |
3.4e-04 |
0.012 |
22.1 |
92 |
| ... |
... |
... |
... |
... |
... |
Filter on adjPValue as you like, then maybe feed into network viz (e.g., igraph). If you hit snags or need tweaks (e.g., mouse genes, full libraries), drop a comment.
Kevin