Question

Tool: genomepy: download genomes the easy way

2

6.7 years ago

simon.vanheeringen ▴ 270

I would like to share a little utility that I wrote to make downloading genome sequences less of a hassle.

Genomepy is a simple software package for downloading genome sequences. It provides both command-line tools and a Python application programming interface (API). It supports several genome providers, currently UCSC, NCBI and Ensembl. Downloaded genome sequences can be soft- or hard-masked, and specific chromosomes or scaffolds can be included or excluded based on regular expressions. Genomepy is free and open-source software and can be installed through standard package managers (bioconda, pip).
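To make the masking and regular-expression filtering concrete, here is a minimal sketch in plain Python. It is not genomepy's own code: the function names `keep_sequence` and `hard_mask` are hypothetical, and the sketch only assumes the common convention that soft-masked sequences mark repeats in lowercase while hard-masked sequences replace them with `N`.

```python
import re


def keep_sequence(name, include=None, exclude=None):
    """Return True if a sequence name passes the optional regex filters.

    Hypothetical helper, not part of genomepy.
    """
    if include is not None and not re.search(include, name):
        return False
    if exclude is not None and re.search(exclude, name):
        return False
    return True


def hard_mask(sequence):
    """Turn a soft-masked sequence (lowercase repeats) into a hard-masked one."""
    return re.sub(r"[a-z]", "N", sequence)


# Illustrative sequence names in the UCSC naming style
names = ["chr1", "chr2", "chrX", "chrUn_gl000220", "chr6_apd_hap1"]

# Drop every name that contains an underscore (unplaced scaffolds, haplotypes)
kept = [name for name in names if keep_sequence(name, exclude=r"_")]
print(kept)

print(hard_mask("ACGTacgtACGT"))
```

The same two filters can be combined: `include` narrows the set first, and `exclude` then removes anything that still matches.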

The GitHub repository, including documentation, is here: https://github.com/simonvh/genomepy

The JOSS publication is here: http://dx.doi.org/10.21105/joss.00320

Hope you find it useful.

next-gen-sequencing genome alignment ChIP-Seq • 1.8k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 6.7 years ago by simon.vanheeringen ▴ 270

0

I just happened to install it. This is taking too long; is this expected?

 genomepy search rattus

Also, is this search case-sensitive? Last question: do I need to provide the complete name, i.e. Rattus norvegicus, instead of just rattus?

EDIT 1: And finally this error, any clues?

Traceback (most recent call last):
  File "/usr/local/bin/genomepy", line 56, in <module>
    cli()
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/bin/genomepy", line 19, in search
    for row in genomepy.search(term, provider):
  File "/usr/local/lib/python2.7/site-packages/genomepy/functions.py", line 120, in search
    for row in p.search(term):
  File "/usr/local/lib/python2.7/site-packages/genomepy/provider.py", line 415, in search
    for name,description in self.list_available_genomes():
  File "/usr/local/lib/python2.7/site-packages/genomepy/provider.py", line 381, in list_available_genomes
    self.genomes = self._get_genomes()
  File "<decorator-gen-2>", line 2, in _get_genomes
  File "/usr/local/lib/python2.7/site-packages/bucketcache/utilities.py", line 177, in wrapper
    ret, called = load_or_call(f, key_hash, args, kwargs, varargs, callargs)
  File "/usr/local/lib/python2.7/site-packages/bucketcache/utilities.py", line 129, in load_or_call
    result = call_and_cache()
  File "/usr/local/lib/python2.7/site-packages/bucketcache/utilities.py", line 115, in call_and_cache
    res = f(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/genomepy/provider.py", line 387, in _get_genomes
    response = urlopen(self.das_url)
  File "/usr/local/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/local/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/local/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/local/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/local/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 110] Connection timed out>

ADD REPLY • link 6.7 years ago by lakhujanivijay 5.8k

0

The first time you run the search, it will take a long time, because downloading the genome information from some providers is slow. This list is cached locally, so subsequent queries should be faster. The cached list expires after a week, so once in a while a search will take longer again.
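The caching behaviour described here can be sketched as follows. This is only an illustration using the standard library, not the caching code genomepy actually uses (the traceback above shows it going through `bucketcache`); `load_or_fetch`, `ONE_WEEK` and the JSON file format are assumptions made for the example.

```python
import json
import os
import time

ONE_WEEK = 7 * 24 * 60 * 60  # cache lifetime in seconds


def load_or_fetch(cache_path, fetch, max_age=ONE_WEEK, now=None):
    """Return cached data if it is fresh, otherwise call fetch() and cache it.

    Hypothetical helper: fetch is any zero-argument function that returns
    JSON-serialisable data, e.g. a slow download of a genome list.
    """
    now = time.time() if now is None else now
    if os.path.exists(cache_path):
        age = now - os.path.getmtime(cache_path)
        if age < max_age:
            with open(cache_path) as handle:
                return json.load(handle)
    # Cache is missing or expired: do the slow call and store the result
    data = fetch()
    with open(cache_path, "w") as handle:
        json.dump(data, handle)
    return data
```

With this pattern only the first call, and the first call after the cache has expired, pays the cost of the download.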

The search is not case-sensitive and you can be as (un)specific as you want. It will do a full-text search of all fields. For instance, for rattus I get the following:

NCBI    Rnor_6.0    Rattus norvegicus; Rat Genome Sequencing Consortium
NCBI    Rn_Celera   Rattus norvegicus; Celera Genomics
NCBI    ViralProj290310 Rattus norvegicus polyomavirus 1; NCBI RefSeq Genome Project
NCBI    ViralProj304315 Rattus norvegicus papillomavirus 3; NCBI RefSeq Genome Project
NCBI    ViralProj356437 Rattus norvegicus polyomavirus 2; NCBI RefSeq Genome Project
NCBI    RGSC_v3.4   Rattus norvegicus; 
NCBI    RGSC_v3.4   Rattus norvegicus; 
NCBI    Rnor_5.0    Rattus norvegicus; Rat Genome Sequencing Consortium
NCBI    Rn_Celera   Rattus norvegicus; Celera Genomics
Ensembl Rnor_6.0    Rat
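As a rough illustration of a case-insensitive full-text search over all fields, here is a small sketch. It is not genomepy's implementation: `search_rows` is a hypothetical function, it only looks at the fields given to it, and the UCSC row is an illustrative entry added so that something does not match.

```python
def search_rows(term, rows):
    """Yield every row in which any field contains term, ignoring case."""
    needle = term.lower()
    for row in rows:
        if any(needle in field.lower() for field in row):
            yield row


rows = [
    ("NCBI", "Rnor_6.0", "Rattus norvegicus; Rat Genome Sequencing Consortium"),
    ("Ensembl", "Rnor_6.0", "Rat"),
    ("UCSC", "hg38", "Human"),  # illustrative non-matching entry
]

# Upper- and lowercase queries give the same result
for row in search_rows("RNOR", rows):
    print("\t".join(row))
```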

With regard to the error, can you reach this URL in your web browser: http://genome.ucsc.edu/cgi-bin/das/dsn ?

ADD REPLY • link 6.7 years ago by simon.vanheeringen ▴ 270

0

I could access that link, and here is what I see:

Click Here for screenshot

ADD REPLY • link 6.7 years ago by lakhujanivijay 5.8k

0

Do you know if you are behind a proxy by any chance?
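One way to check this from Python is sketched below, using only the standard library (Python 3). The helper names `proxy_settings` and `can_reach` are made up for this example. A browser may use system proxy settings that a command-line Python process does not see, which could explain why the URL works in the browser but times out in the script.

```python
import urllib.request

PROXY_VARS = ("http_proxy", "https_proxy", "no_proxy")


def proxy_settings(environ):
    """Return the non-empty proxy variables from an environment mapping.

    Keys are returned in lowercase, since both HTTP_PROXY and http_proxy
    spellings are in common use.
    """
    found = {}
    for key, value in environ.items():
        if key.lower() in PROXY_VARS and value:
            found[key.lower()] = value
    return found


def can_reach(url, timeout=10):
    """Try to open url; return (True, status) or (False, error message)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return True, response.status
    except OSError as err:  # URLError and socket timeouts are OSError subclasses
        return False, str(err)


# Example usage (needs network access):
#   import os
#   print(proxy_settings(os.environ))
#   print(can_reach("http://genome.ucsc.edu/cgi-bin/das/dsn"))
```

If `proxy_settings` comes back empty while the network requires a proxy, setting `http_proxy` in the shell before running the command is a reasonable next thing to try, since `urlopen` picks up that variable by default.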

ADD REPLY • link 6.7 years ago by simon.vanheeringen ▴ 270