Question: Biomart Very Slow
Chris Miller (Washington University in St. Louis, MO), 9.1 years ago, wrote:

I'm using the biomaRt package in R to pull down ensembl annotations for a set of Affy probeIDs. The features seem nice, but the service is very slow, taking 10 minutes or so for a single query (of no more than a dozen probes), and frequently times out completely:

library('biomaRt')

# connect to the Ensembl BioMart and select the human gene dataset
ensembl = useMart("ensembl")
ensembl = useDataset("hsapiens_gene_ensembl", mart=ensembl)

# probeList is a character vector of Affy Human Exon 1.0 ST probe IDs
probePos = getBM(attributes=c("affy_huex_1_0_st_v2",
                              "hgnc_symbol",
                              "chromosome_name",
                              "start_position",
                              "end_position",
                              "strand"),
                 filters="affy_huex_1_0_st_v2",
                 values=probeList,
                 mart=ensembl)

The slow returns have persisted over a week or two, and don't seem dependent on time of day or anything like that. If I add a few more attributes, I get timeouts:

Request to BioMart web service failed. Verify if you are still connected to the
internet.  Alternatively the BioMart web service is temporarily down.

So, my questions are:

  • Is this a common problem for others?
  • Is it likely that the server is overloaded, or are my queries just too big? If the former, are there any mirrors available?
  • If the latter, the next step is setting up a local server, I guess. Anyone have experience (good or bad) with that? Is it worth the hassle?
Tags: biomart • R

Argh, just in case anyone is reading today: I'm getting serious timeouts as well! Yesterday it seemed really flaky, too. It's a shame, because it is such a good resource!

— Mike Dewar, 9.0 years ago

As I posted below, I sort of solved the problem by breaking my queries up into very small chunks, especially on the fields that return a lot of data. It's a pain to set up the first time, but once you automate it, doing it piecemeal isn't so bad.
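
Roughly, the chunked approach looks like this (a minimal sketch; the chunk size of 50 and the do.call/rbind plumbing are just illustrative choices, not anything special):

library('biomaRt')
ensembl = useMart("ensembl")
ensembl = useDataset("hsapiens_gene_ensembl", mart=ensembl)

# split the probe IDs into small batches, query each batch separately,
# then stack the per-batch results back into one data frame
chunkSize = 50
chunks = split(probeList, ceiling(seq_along(probeList) / chunkSize))

probePos = do.call(rbind, lapply(chunks, function(ids) {
    getBM(attributes=c("affy_huex_1_0_st_v2", "hgnc_symbol",
                       "chromosome_name", "start_position",
                       "end_position", "strand"),
          filters="affy_huex_1_0_st_v2",
          values=ids,
          mart=ensembl)
}))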

— Chris Miller, 9.0 years ago
Neilfws (Sydney, Australia), 9.1 years ago, wrote:
  1. I'd say it is an occasional, not a common problem. I've experienced it a couple of times in the several months that I've been using biomaRt regularly.
  2. Queries of a dozen probes or so are unlikely to be too big. I recently retrieved Affymetrix exon probesets for every HGNC symbol (~30,000 queries) in well under 10 minutes; a sketch of that kind of query follows this list.
  3. I'm unsure whether there are mirrors for the latest release. I know that it is possible to specify different servers as arguments to useMart(). For example, to use the NCBI36 build, according to this mailing list thread:

    mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl",
        host = "may2009.archive.ensembl.org", path = "/biomart/martservice",
        archive = FALSE)
    
  4. I have no experience with setting up a local server - and I don't know anyone else who does. It involves many, many Perl modules and, I suspect, is rather difficult.
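
A query of the shape described in point 2 looks roughly like this (a sketch only, using the standard biomaRt attribute names for HGNC symbols and the Human Exon array):

mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# all HGNC symbols in the dataset, then the exon-array probesets mapped to them
symbols <- getBM(attributes = "hgnc_symbol", mart = mart)$hgnc_symbol
probesets <- getBM(attributes = c("hgnc_symbol", "affy_huex_1_0_st_v2"),
                   filters = "hgnc_symbol",
                   values = symbols,
                   mart = mart)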


Hrmmm - thanks for the info. After some more mucking around this afternoon, I'm finding that many short queries seem to work better than single large queries. This is completely contrary to what I'd expect, but I can make it work for now.

— Chris Miller, 9.1 years ago
Chris Fields (University of Illinois Urbana-Champaign), 9.1 years ago, wrote:

I remember something about very large queries taking a long time, so you may be running into timeouts on the server. The workaround, which seems reasonable, is to simply install the mart locally if possible (faster because it is local, no timeout issues, etc.). Ensembl, UniProt, and others make their mart data freely available; see the link above.

This is of course assuming biomaRt works with a local database, but I would be very surprised if it doesn't.
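
If it does work, pointing biomaRt at a locally hosted mart should just be a matter of passing the local host/path/port to useMart(), along the lines of the archive-host example in the other answer. A sketch only, not tested — the hostname, port, and mart/dataset names below are placeholders for however your local instance is configured:

library('biomaRt')

# "localhost", port 9000, and the mart/dataset names are placeholders
localMart = useMart("ENSEMBL_MART_ENSEMBL",
                    dataset = "hsapiens_gene_ensembl",
                    host = "localhost",
                    port = 9000,
                    path = "/biomart/martservice")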

dvitsios (AstraZeneca, Cambridge, UK), 7 months ago, wrote:

The biomaRt R package is indeed really slow, especially for large queries.

Try using the Ensembl REST API instead, which is much faster than biomaRt and more robust:

https://rest.ensembl.org/

You can batch your requests over chunks of data or submit individual IDs one at a time.

For example, to retrieve the full annotation for a list of GO IDs using Python 3, you can iterate over each GO ID:

import requests, sys

server = "https://rest.ensembl.org"
ext_prefix = "/ontology/id/"

def get_go_term_by_id(go_id):
    # fetch the full record for a single GO term from the Ensembl REST API
    ext = ext_prefix + go_id + "?content-type=application/json"
    r = requests.get(server + ext, headers={"Content-Type": "application/json"})

    if not r.ok:
        r.raise_for_status()
        sys.exit()

    decoded = r.json()
    print(repr(decoded))

go_ids = ['GO:0006958', 'GO:0031902', 'GO:0050776']
for go_id in go_ids:
    get_go_term_by_id(go_id)

Each individual call is served in under a second, or at most a couple of seconds, so you won't run into timeout issues.

In addition, you can wrap each call in a try/except block so that if one call fails, the rest continue to be processed as normal.

Simple multi-threading with map and Pool in Python 3

For real speed, you can submit multiple calls to the REST API in parallel using multiple threads in Python (e.g. 50 threads):

import requests
from multiprocessing.dummy import Pool as ThreadPool

num_threads = 50
pool = ThreadPool(num_threads)

server = "https://rest.ensembl.org"
ext_prefix = "/ontology/id/"

go_id_terms_dict = {}

def get_go_term_by_id(go_id):
    # fetch a single GO term name; on any failure, store an empty string
    # and carry on so one bad call doesn't stop the whole batch
    try:
        ext = ext_prefix + go_id + "?content-type=application/json"
        r = requests.get(server + ext, headers={"Content-Type": "application/json"})

        if not r.ok:
            r.raise_for_status()

        decoded = r.json()
        go_term = repr(decoded['name'])
        go_term = go_term.replace("'", "")
        go_term = go_term.replace("\"", "")

    except Exception:
        go_term = ''
        print('[Warning] Could not fetch GO term for ID:', go_id)

    go_id_terms_dict[go_id] = go_term
    result = go_id + '||' + go_term
    print(result)

# all_human_go_ids: your list of GO IDs (example values shown here)
all_human_go_ids = ['GO:0006958', 'GO:0031902', 'GO:0050776']
pool.map(get_go_term_by_id, all_human_go_ids)
pool.close()
pool.join()