Question

biomaRt: Timeout on getBM().

22 months ago

jon.klonowski ▴ 150

My biomaRt getBM() call is timing out, and I do not know why.

Failed <- getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id", "transcript_tsl"), mart = ensembl)
Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was reached: [dec2021.archive.ensembl.org:443] Operation timed out after 300000 milliseconds with 9960752 bytes received

For comparison,

rawr <- getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id", "transcript_tsl"), mart = ensembl)

works fine.

How I get my mart:

library(biomaRt)
mart=useMart("ensembl", host = "https://dec2021.archive.ensembl.org")
ensembl = useDataset("hsapiens_gene_ensembl", mart = mart)

ensembl biomart • 1.9k views

ADD COMMENT • link 21 months ago by jon.klonowski ▴ 150

Any reason in particular you are using https://dec2021.archive.ensembl.org?

ADD REPLY • link 22 months ago by rpolicastro 13k

My genomic variant annotation was done with Ensembl v105, so I am keeping all my versions consistent.

ADD REPLY • link 22 months ago by jon.klonowski ▴ 150

You can specify the Ensembl version directly with the version argument of useEnsembl().

ensembl <- useEnsembl("genes", "hsapiens_gene_ensembl", version=105)

ADD REPLY • link 22 months ago by rpolicastro 13k

score 3 · Accepted Answer · 2022-07-02

The issue here is that you're essentially doing a bulk download of annotation for the entire genome. The Ensembl BioMart service isn't really designed for that; it's aimed more at fetching additional data points for a "small" set of genes or transcripts. Hence you hit the timeout limit when asking for too much information. I can't see a difference between the query that works and the one that fails, but I guess the working one was querying the current version of Ensembl rather than an archive. I suspect it works because the main site performs slightly better and manages to return a result before the 5-minute limit (the 300000 milliseconds in your error message) is reached.

If you really want whole-genome data, you're probably better off downloading the annotation from the Ensembl FTP site (http://www.ensembl.org/info/data/ftp/index.html) and working with those files locally, or using a genome annotation package such as ensembldb.
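As a rough illustration of the local-file route, here is a minimal base-R sketch that pulls the same three columns out of an Ensembl GTF file. The helper names gtf_attr() and read_tx_table() are hypothetical, and the release-105 URL in the usage comment is an assumption about where the matching human GTF lives, so check it against the FTP index first. Note also that the GTF stores the support level as a plain value (for example "1"), which may not be formatted the same way as BioMart's transcript_tsl.

```r
# Hypothetical helper: extract one key from GTF attribute strings
# of the form: gene_id "ENSG..."; transcript_id "ENST..."; ...
gtf_attr <- function(attributes, key) {
  pattern <- paste0('(^|; )', key, ' "([^"]*)"')
  m <- regmatches(attributes, regexec(pattern, attributes))
  # Each match is c(full match, separator, value); no match means NA
  vapply(m, function(x) if (length(x) == 3) x[3] else NA_character_,
         character(1))
}

# Hypothetical helper: one row per transcript from a local GTF (plain or .gz)
read_tx_table <- function(gtf_file) {
  gtf <- read.delim(gtf_file, header = FALSE, quote = "", comment.char = "",
                    col.names = paste0("V", 1:9), colClasses = "character")
  # Column 3 is the feature type; header lines never match "transcript"
  tx <- gtf[which(gtf$V3 == "transcript"), ]
  data.frame(
    ensembl_transcript_id = gtf_attr(tx$V9, "transcript_id"),
    ensembl_gene_id       = gtf_attr(tx$V9, "gene_id"),
    transcript_tsl        = gtf_attr(tx$V9, "transcript_support_level"),
    stringsAsFactors = FALSE
  )
}

# Usage (needs network access; the URL is an assumption, verify before use):
# url <- "https://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.gtf.gz"
# gtf_file <- file.path(tempdir(), basename(url))
# download.file(url, gtf_file, mode = "wb")
# tx_table <- read_tx_table(gtf_file)
```

Parsing the file yourself avoids extra dependencies; if you already use Bioconductor, a dedicated GTF importer is likely the more robust choice.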

That said, you can "trick" biomaRt into helping with this by first asking for all possible gene IDs. If you then provide these as a filter, biomaRt will break your query down into several smaller parts, each of which completes within the time limit, and stitch the results back into a single table for you, for example:

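library(biomaRt)
ensembl <- useEnsembl("genes", "hsapiens_gene_ensembl", version = 105)

# Step 1: fetch every gene ID (a small, single-column query)
gene_ids <- getBM(attributes = "ensembl_gene_id", mart = ensembl)

# Step 2: pass the IDs as filter values so biomaRt splits the query into batches
all_data <- getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id", "transcript_tsl"),
                  filters = "ensembl_gene_id",
                  values = gene_ids$ensembl_gene_id,
                  mart = ensembl)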

head(all_data)
#>   ensembl_transcript_id ensembl_gene_id                        transcript_tsl
#> 1       ENST00000469599 ENSG00000012817                                  tsl2
#> 2       ENST00000317961 ENSG00000012817 tsl1 (assigned to previous version 8)
#> 3       ENST00000541639 ENSG00000012817                                  tsl1
#> 4       ENST00000382806 ENSG00000012817                                  tsl1
#> 5       ENST00000492117 ENSG00000012817                                  tsl2
#> 6       ENST00000440077 ENSG00000012817                                  tsl5
dim(all_data)
#> [1] 266615      3
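
If you would rather control the batch size yourself, for instance to retry a single failed batch, the same idea can be sketched by hand. The helpers split_into_chunks() and fetch_in_chunks() below are hypothetical names, not part of biomaRt, and the default chunk size of 500 is an arbitrary assumption rather than a documented limit.

```r
# Hypothetical helper: split a vector of IDs into consecutive batches
split_into_chunks <- function(ids, chunk_size = 500) {
  split(ids, ceiling(seq_along(ids) / chunk_size))
}

# Hypothetical helper: call `fetch` once per batch and row-bind the results.
# `fetch` is any function that takes a vector of IDs and returns a data frame.
fetch_in_chunks <- function(ids, fetch, chunk_size = 500) {
  stopifnot(length(ids) > 0)
  chunks <- split_into_chunks(ids, chunk_size)
  results <- lapply(chunks, fetch)
  out <- do.call(rbind, results)
  rownames(out) <- NULL
  out
}

# Usage with biomaRt (needs network access and the `ensembl` mart from above):
# all_data <- fetch_in_chunks(gene_ids$ensembl_gene_id, function(chunk) {
#   getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id", "transcript_tsl"),
#         filters = "ensembl_gene_id", values = chunk, mart = ensembl)
# })
```

Passing the query in as a function keeps the batching logic separate from biomaRt, so it can be tried out with a dummy function before any real query is sent to the server.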