R code for getting TFs from ChEA2024 by gene list
1
0
Entering edit mode
16 days ago
mastoreh • 0

i have a gene list with like 500 genes that i want to get each related TFs for each one but i have a problem, in ChEA2024 site, i can upload just one gene in there and if i enter my genes one by one, it has many time i need R code that it use ChEA2024 site that give my gene list and return it with all TFs for each gene also it show me adjust p value of them that i can filter them with it

TF ChEA2024 Rprograming Genelist • 479 views
ADD COMMENT
0
Entering edit mode
1 day ago

Hi there,

You're spot on—ChEA3 (the tool behind the "ChEA2024" interface at https://maayanlab.cloud/ChEA2024/) is fantastic for transcription factor (TF) enrichment analysis, but the web UI is geared toward batch-submitting lists of genes to find enriched TFs across the set. For your use case—getting the full list of candidate TFs per individual gene (i.e., reverse lookup: which TFs likely regulate each of your 500 genes), along with p-values for filtering—you're right that you'd otherwise have to submit one gene at a time manually. That would take forever!

The good news is ChEA3 exposes a public REST API that we can hit programmatically from R. This lets us loop over your gene list, submit each gene as a singleton "query set," fetch the results (which include raw p-values from Fisher's Exact Test across multiple libraries like ChIP-seq, co-expression, etc.), and then compute adjusted p-values (FDR via Benjamini-Hochberg) per gene for easy filtering. The API returns results from ~8 libraries + 2 integrated summaries (MeanRank and TopRank), which aggregate ranks across libraries for a consensus view.

I'll provide a complete, self-contained R script below. It:

  • Reads your gene list from a text file (one HGNC symbol per line).
  • Submits each gene via POST to the API.
  • Parses the JSON response into data frames.
  • Extracts results from the MeanRank integration (a robust combined ranking; you can swap to others if preferred).
  • Computes adjusted p-values per gene.
  • Compiles everything into a single output data frame (genes × TFs, with columns for TF, raw p-value, adj p-value, rank, etc.).
  • Saves to CSV for inspection/filtering (e.g., keep adj p < 0.05).
  • Includes progress tracking and a 1-sec delay between queries to be kind to their server (500 queries should take ~10-15 mins total).

Assumptions/Notes:

  • Genes are human HGNC symbols (case-insensitive; API handles mouse too via orthologs).
  • For a single-gene query, p-values reflect how strongly each TF "targets" that gene based on pre-built libraries (background ~20k genes). Not all TFs will hit p < 1, but you'll get the top candidates.
  • API returns top ~100 TFs per library by default—plenty for most genes.
  • No auth needed; it's open.
  • If a gene fails (e.g., invalid symbol), it's skipped with a warning.
  • Test it first with a tiny gene list (e.g., c("TP53", "BRCA1")) to verify.
  • Libraries needed: httr, jsonlite, dplyr (for tidying). Install if missing: BiocManager::install(c("httr", "jsonlite", "dplyr")).

Here's the script—copy-paste into R/RStudio and run:

# Clear workspace and load libraries
rm(list = ls())
library(httr)
library(jsonlite)
library(dplyr)

# Step 1: Read your gene list from a file (one symbol per line, e.g., "genes.txt")
# Or hardcode: genes <- c("TP53", "BRCA1", "MYC")  # for testing
genes <- readLines("path/to/your/genes.txt")  # replace with your file path
genes <- unique(genes[genes != ""])  # clean: unique non-empty

# Step 2: Function to query API for a single gene and return tidy DF
query_chea <- function(gene) {
  url <- "https://maayanlab.cloud/chea3/api/enrich/"
  payload <- list(query_name = gene, gene_set = gene)

  response <- POST(url = url, 
                   body = payload, 
                   encode = "json",
                   config = timeout(30))  # 30s timeout per query

  if (status_code(response) != 200) {
    warning(paste("API error for", gene, "- skipping"))
    return(NULL)
  }

  json_text <- content(response, "text")
  results <- fromJSON(json_text)

  # Extract MeanRank integration (change to "TopRank" or a specific library like "ENCODE_ChIP-seq" if preferred)
  # Results is a list of DFs, one per library/integration
  meanrank_idx <- which(sapply(results, function(x) "MeanRank" %in% names(x) || grepl("MeanRank", names(x))))
  if (length(meanrank_idx) == 0) {
    warning(paste("No MeanRank for", gene, "- trying first result"))
    meanrank_df <- results[[1]]
  } else {
    meanrank_df <- results[[meanrank_idx[1]]]  # take first match
  }

  # Assume DF columns: Term (TF), PValue, Combined Score (rank-ish), Overlapping Genes, etc.
  # Adapt column names if needed after testing (print(colnames(meanrank_df)) )
  tf_df <- as.data.frame(meanrank_df) %>%
    select(Term, PValue, `Combined Score`, `Overlap %`, `Overlap`) %>%  # adjust cols as per actual output
    mutate(Gene = gene,
           adjPValue = p.adjust(PValue, method = "BH")) %>%  # FDR per gene
    filter(PValue < 1) %>%  # drop non-significant (p=1 means no overlap)
    arrange(PValue)

  return(tf_df)
}

# Step 3: Loop over genes, collect results
cat("Querying", length(genes), "genes...\n")
all_results <- lapply(genes, query_chea)
names(all_results) <- genes

# Remove NULLs (failed queries)
all_results <- all_results[!sapply(all_results, is.null)]

# Step 4: Bind into one big DF
final_df <- bind_rows(all_results)
cat("Collected", nrow(final_df), "TF-gene pairs\n")

# Step 5: Save and preview
write.csv(final_df, "chea_tfs_per_gene.csv", row.names = FALSE)
head(final_df)

# Optional: Filter example (adj p < 0.05)
filtered <- final_df %>% filter(adjPValue < 0.05)
write.csv(filtered, "chea_tfs_filtered.csv", row.names = FALSE)
cat("Filtered to", nrow(filtered), "pairs (adj p < 0.05)\n")

Quick tweaks if needed:

  • Column names: Run a test query first and print(colnames(results[[1]])) to confirm (API might vary slightly; common ones are Term for TF, PValue, Adjusted PValue—hey, if it already has adj p, skip the p.adjust step!).
  • Which library? Swap "MeanRank" to "ENCODE_ChIP-seq" for ChIP-focused results, etc. Print names(results) in the function to see options.
  • All libraries? If you want to merge across all (with duplicates de-duped by TF), change the extraction to lapply(results, as.data.frame) then bind_rows and distinct(Gene, Term, .keep_all=TRUE).
  • Speed: If 500 is too slow, parallelize with parallel::mclapply (but watch for rate limits—API is generous but don't hammer it).
  • Errors: If genes don't map (e.g., non-HGNC), API discards silently—cross-check your input.

This should give you a clean table like:

Gene Term PValue adjPValue Combined Score Overlap %
TP53 MDM2 1.2e-05 0.003 15.3 85
TP53 p53 3.4e-04 0.012 22.1 92
... ... ... ... ... ...

Filter on adjPValue as you like, then maybe feed into network viz (e.g., igraph). If you hit snags or need tweaks (e.g., mouse genes, full libraries), drop a comment.

Kevin

ADD COMMENT

Login before adding your answer.

Traffic: 5219 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6