Hi,
I see you're trying to batch query for related lncRNAs from a gene list using the lncHUB2 site, but it only supports single queries via the web interface, and there's no public API for batch processing. Scraping the site for 500 queries isn't ideal (it could be slow and might violate terms of use), so a better approach is to use the underlying data source, ARCHS4, which lncHUB2 is based on for co-expression correlations.
ARCHS4 provides an API for gene correlations, and we can filter the results to lncRNAs. The API returns the top 100 correlated genes with Pearson correlation coefficients (PCC), but not p-values. We can calculate p-values and adjusted p-values in R using an approximate number of samples (ARCHS4 uses ~300,000 human RNA-seq samples for humans; I'll use n = 200000 as a conservative estimate for effective samples where both genes are expressed). With such large n, even moderate correlations will have very small p-values, so filtering by adj p < 0.05 will likely keep most top correlates.
Here's an R code to do this. It assumes your gene list is protein-coding genes, and you're looking for co-expressed lncRNAs. If your "genes" are actually lncRNAs, you can adjust accordingly. The code:
Uses biomaRt to get a list of human lncRNA symbols.
Loops over your gene list, queries the ARCHS4 API.
Filters to lncRNAs.
Calculates p-value and adj p-value for the correlations.
Outputs a data frame for each gene with the results, and you can filter by adj p.
library(httr)
library(jsonlite)
library(dplyr)
library(biomaRt)
# Your gene list (replace with your 500 genes)
gene_list <- c("TP53", "MYC", "EGFR") # example
# Get list of human lncRNA symbols using biomaRt
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
lncRNAs <- getBM(attributes = c("hgnc_symbol"),
filters = "biotype",
values = "lncRNA",
mart = mart) %>%
pull(hgnc_symbol) %>%
unique() %>%
na.omit()
# Function to calculate p-value from PCC and n
calc_pvalue <- function(r, n) {
t <- abs(r) * sqrt(n - 2) / sqrt(1 - r^2)
p <- 2 * pt(-t, df = n - 2)
return(p)
}
# Approximate n from ARCHS4 (conservative estimate for effective samples)
approx_n <- 200000
# List to store results
results <- list()
for (gene in gene_list) {
# Query ARCHS4 API for top 100 correlated genes
url <- paste0("https://amp.pharm.mssm.edu/archs4/gene/", gene, "?species=human")
response <- GET(url)
if (status_code(response) == 200) {
data <- fromJSON(content(response, "text"))$results
# Filter to lncRNAs, add p and adj p
lnc_data <- data %>%
filter(gene %in% lncRNAs) %>%
mutate(pvalue = calc_pvalue(correlation, approx_n),
adj_pvalue = p.adjust(pvalue, method = "fdr")) %>%
select(gene, correlation, pvalue, adj_pvalue) %>%
arrange(adj_pvalue)
results[[gene]] <- lnc_data
} else {
warning(paste("Query failed for", gene))
}
}
# Example: view results for first gene
results[[1]]
# To save all to CSV
all_results <- bind_rows(results, .id = "input_gene")
write.csv(all_results, "gene_lncRNA_correlations.csv", row.names = FALSE)
This will give you a CSV with columns: input_gene, gene (lncRNA), correlation, pvalue, adj_pvalue.
You can filter in the code or after, e.g., filter(adj_pvalue < 0.05).
Note: The API is for human by default; change to ?species=mouse if needed. If you need more than top 100, you'd need to contact the ARCHS4 team or download the full matrix (huge, ~100GB), but for now, this is efficient for batch.
Kevin