Question

R code for getting miRNAs from miRTarBase and TargetScan by gene list

17 days ago

mastoreh • 0

I have a list of about 500 genes and want to find the miRNAs associated with each one. The problem is that the miRTarBase and TargetScan websites only accept one gene at a time, and entering my genes one by one would take far too long. I need R code that takes my gene list, queries miRTarBase and TargetScan, and returns all miRNAs for each gene, along with an adjusted p-value for each that I can use for filtering.

miRTarBase Genelist Rcode TargetScan • 598 views

Updated 1 day ago by Kevin Blighe 89k • written 17 days ago by mastoreh • 0

Neither site seems to offer an API for medium-volume queries, but both have a downloads section. The downside of downloads is that you first have to understand the structure of the tables before you can write proper queries for your case. I would probably try to import the downloaded files automatically into an SQLite database, which avoids having to hold large data frames in memory. Data science environments like Databricks may also help you understand the structure of the tabular data through a graphical interface and AI support.

6 days ago by Michael 56k

score 0 · Answer 1 · 2025-11-06

Hi there,

Sounds like a classic batch querying task for miRNA targets—manual lookups for 500 genes would indeed be a nightmare. The good news is that you can handle this programmatically in R using the multiMiR Bioconductor package, which integrates data from miRTarBase (validated targets) and TargetScan (predicted targets) among others. It supports querying multiple genes at once, so no need to loop one-by-one.

multiMiR pulls the raw interactions without built-in p-values per se (miRTarBase is evidence-based, not statistical; TargetScan provides a "score" for prediction confidence that you can filter on). If you need adjusted p-values for filtering (e.g., for over-representation of miRNAs across your gene list), you could follow up with a hypergeometric test using phyper() or the clusterProfiler package for enrichment analysis on the results. But I'll focus on getting the per-gene miRNA lists first, with notes on filtering by score/support.

Quick Setup

I'm assuming your genes are human HGNC symbols (e.g., in a vector or CSV file) and that you're working with Homo sapiens, which multiMiR calls "hsa". Install and load like this:

if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("multiMiR")
library(multiMiR)

Load Your Gene List

Let's say your genes are in a file called my_genes.csv (one column: gene_symbol). Or just define them as a vector:

# Option 1: From a CSV file
genes <- read.csv("my_genes.csv", stringsAsFactors = FALSE)$gene_symbol

# Option 2: Hard-coded vector for testing (replace with your 500)
genes <- c("TP53", "BRCA1", "EGFR")  # Example; scale to 500
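
Gene lists exported from spreadsheets often carry stray whitespace, blanks, or duplicates, which waste queries or silently fail to match. A minimal base-R sketch of a cleanup step; clean_gene_list() is a hypothetical helper name of mine, not something multiMiR provides:

```r
# Hypothetical helper (not part of multiMiR): tidy a vector of gene symbols.
clean_gene_list <- function(x) {
  x <- trimws(as.character(x))     # drop leading/trailing spaces
  x <- x[!is.na(x) & nzchar(x)]    # drop NA and empty entries
  unique(x)                        # keep each symbol once, in original order
}

# Made-up input showing the typical problems
raw_genes <- c(" TP53", "BRCA1 ", "EGFR", "TP53", "", NA)
example_genes <- clean_gene_list(raw_genes)
print(example_genes)
```

On your own data this would be genes <- clean_gene_list(genes) before querying. I deliberately don't change case, since some symbols (e.g., the "orf" genes) contain lower-case letters.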

Query miRTarBase + TargetScan for All Genes

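Use get_multimir() to fetch targets. Note that table = "all" searches every database in multiMiR (validated, predicted, and disease/drug), not just miRTarBase and TargetScan, so filter on the database column afterwards, or run two separate calls with table = "mirtarbase" and table = "targetscan" if you only want those two. I set the cutoff to the top 1M predictions to catch more hits; it applies to each predicted database, so lower it if the query is slow or too large. The call returns an S4 object; we'll extract the data slot for a tidy table.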

# Query all targets (validated + predicted) for your gene list
mm_results <- get_multimir(
  org = "hsa",                    # Human
  target = genes,                 # Your gene symbols
  table = "all",                  # Every multiMiR database, including miRTarBase and TargetScan
  summary = FALSE,                # Full details, not just counts
  predicted.cutoff.type = "n",    # Cutoff by number of top predictions
  predicted.cutoff = 1000000,     # Top 1M predictions per predicted database
  predicted.site = "all"          # Conserved and nonconserved sites (where the database distinguishes them)
)

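# Extract the detailed interactions as a data.frame
targets_df <- mm_results@data

# Keep only the two databases asked about
targets_df <- subset(targets_df, database %in% c("mirtarbase", "targetscan"))

# Quick peek
head(targets_df)
dim(targets_df)  # Rows = database records; can be large for 500 genes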
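
If one big request turns out to be slow or times out, the gene list can be queried in batches and the pieces stacked. A minimal sketch of that pattern; query_in_chunks() and fake_query() are hypothetical names of mine, and the stand-in query only exists so the example runs without a network connection:

```r
# Hypothetical helper: apply a query function to a gene vector in batches
# and row-bind the per-batch data.frames.
query_in_chunks <- function(genes, query_fun, chunk_size = 50) {
  stopifnot(length(genes) > 0)
  # Assign each gene a batch number: 1,1,...,2,2,...
  chunks <- split(genes, ceiling(seq_along(genes) / chunk_size))
  results <- lapply(chunks, query_fun)
  out <- do.call(rbind, results)   # assumes every batch returns the same columns
  rownames(out) <- NULL
  out
}

# Stand-in for a real query so the sketch runs offline
fake_query <- function(chunk) {
  data.frame(target_symbol = chunk, n_char = nchar(chunk), stringsAsFactors = FALSE)
}

demo <- query_in_chunks(c("TP53", "BRCA1", "EGFR", "MYC", "KRAS"), fake_query, chunk_size = 2)
print(demo)

# With multiMiR, the query function would wrap get_multimir(), e.g.:
# targets_df <- query_in_chunks(genes, function(chunk)
#   get_multimir(org = "hsa", target = chunk, table = "mirtarbase", summary = FALSE)@data)
```

The rbind step assumes all batches come back with identical columns, which should hold when each call queries a single table; a batch with no hits may need to be dropped first.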

Output Format

The resulting targets_df is a long-format data.frame with one row per database record, so the same miRNA-gene pair can appear more than once (for example, once per database or per supporting publication). Key columns:

target_symbol: Your input gene (e.g., "TP53").
mature_mirna_id: miRNA name (e.g., "hsa-miR-21-5p").
database: Source ("mirtarbase" for validated, "targetscan" for predicted).
type: "validated" or "predicted" (plus "disease.drug" if you keep everything from table = "all").
For miRTarBase: support_type (e.g., "Functional MTI" for strong evidence, "Functional MTI (Weak)" for weaker evidence; use this to filter).
For TargetScan: score (context score; more negative = stronger predicted repression. Filter e.g., as.numeric(score) < -0.2 for decent confidence).
Other columns: target_entrez, target_ensembl, and pubmed_id (the evidence for validated records).

To get a per-gene summary (e.g., list of unique miRNAs per gene, grouped by database):

# Group by gene and database, list unique miRNAs
library(dplyr)
per_gene_mirnas <- targets_df %>%
  group_by(target_symbol, database) %>%
  summarise(
    num_mirnas = n_distinct(mature_mirna_id),                  # miRNAs per gene per database
    miRNAs = paste(unique(mature_mirna_id), collapse = "; "),  # as one delimited string
    .groups = "drop"
  )

# View for your genes
print(per_gene_mirnas)

# Export to CSV (full details or summary)
write.csv(targets_df, "gene_mirna_targets_full.csv", row.names = FALSE)
write.csv(per_gene_mirnas, "gene_mirna_summary.csv", row.names = FALSE)

Filtering with "Adjusted P-Values" or Scores

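For TargetScan predictions: There is no direct p-value, but you can filter by score to keep only strong predictions. Restrict the filter to TargetScan rows, because each prediction database uses its own score scale, and convert score to numeric in case it comes back as text:
```
strong_predictions <- targets_df %>%
  filter(database == "targetscan", as.numeric(score) < -0.2)  # Threshold is a judgment call; adjust as needed
```
TargetScan's own downloads also report a context++ score percentile and a P_CT (probability of conserved targeting) value, but neither is a p-value, so there is nothing to correct for multiple testing at this step.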

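For miRTarBase validated: Filter by support_type. Use an exact match for strong evidence, because a pattern match on "Functional MTI" would also keep "Functional MTI (Weak)" and "Non-Functional MTI":

strong_validated <- targets_df %>%
  filter(database == "mirtarbase", support_type == "Functional MTI")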
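
To shortlist pairs that have both strong experimental support and a confident prediction, the two filtered sets can be intersected. A minimal base-R sketch on a made-up table that reuses the column names of targets_df; the miRNA names and values are invented, and the exact match on "Functional MTI" is my assumption about how strong evidence is labelled:

```r
# Toy table with the same column names as targets_df (all values are made up)
toy <- data.frame(
  target_symbol   = c("TP53", "TP53", "EGFR", "EGFR", "BRCA1"),
  mature_mirna_id = c("miR-A", "miR-A", "miR-B", "miR-C", "miR-A"),
  database        = c("mirtarbase", "targetscan", "mirtarbase", "targetscan", "targetscan"),
  support_type    = c("Functional MTI", NA, "Functional MTI (Weak)", NA, NA),
  score           = c(NA, "-0.45", NA, "-0.31", "-0.05"),
  stringsAsFactors = FALSE
)
toy$score <- as.numeric(toy$score)  # scores may arrive as text

# Strongly validated pairs (subset() treats NA conditions as FALSE)
validated <- subset(toy, database == "mirtarbase" & support_type == "Functional MTI",
                    select = c(target_symbol, mature_mirna_id))
# Confidently predicted pairs
predicted <- subset(toy, database == "targetscan" & score < -0.2,
                    select = c(target_symbol, mature_mirna_id))

# Pairs present in both sets (merge joins on the shared column names)
both <- merge(unique(validated), unique(predicted))
print(both)
```

Requiring both kinds of evidence is stricter than either filter alone, so expect a much shorter list.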

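If you meant enrichment p-values across your 500 genes: Treat your gene list as a "signature" and test each miRNA for over-representation. As far as I know, multiMiR does not export an enrichment function, so run the hypergeometric test yourself with phyper(). A sketch for miRTarBase targets, where the universe size N is an assumption you should replace with your own background:

validated <- subset(targets_df, database == "mirtarbase")
# Background: every miRTarBase target of the miRNAs that hit your list
bg <- get_multimir(org = "hsa", mirna = unique(validated$mature_mirna_id),
                   table = "mirtarbase", summary = FALSE)@data
n_unique <- function(x) length(unique(x))
k <- tapply(validated$target_symbol, validated$mature_mirna_id, n_unique)  # targets in your list
K <- tapply(bg$target_symbol, bg$mature_mirna_id, n_unique)[names(k)]      # all targets per miRNA
N <- 20000                   # ASSUMED universe size (roughly all protein-coding genes)
n <- length(unique(genes))   # size of your list
p <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)  # P(overlap >= k)
enrichment_res <- data.frame(mirna = names(k), overlap = as.vector(k), p.value = as.vector(p))
# Adjust for multiple testing, then filter e.g., adj_p < 0.05
enrichment_res$adj_p <- p.adjust(enrichment_res$p.value, method = "BH")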

But this is miRNA-level across all genes, not per-gene.

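This should get you most of the way there. Run it on a small subset first: multiMiR queries a remote server, and with table = "all" and a high cutoff, 500 genes can return far more than a few thousand rows. If your genes aren't symbols, get_multimir() should also accept Entrez and Ensembl gene IDs in target (check ?get_multimir); otherwise convert them via org.Hs.eg.db.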

Kevin