Question: Easiest way to download all Enterobacteria
2.4 years ago by
Joe17k
United Kingdom
Joe17k wrote:

Does anyone have a simple way to download all the RefSeq genomes for a particular taxon?

Using ncbi-genome-download it's possible to specify species or genus TaxIDs and download them, but apparently you can't go higher up the taxonomic ranks (even though Enterobacteria has a TaxID of 543, for instance).

If anyone knows of a way to download all the Enterobacteria I'm all ears.

Alternatively, if there is a method of extracting the species TaxIDs from the Enterobacterial TaxID in NCBI, such that I can pass them all directly to ncbi-genome-download, that would work too.

download refseq genome ncbi • 1.2k views
Modified 2.4 years ago by tdmurphy190 • written 2.4 years ago by Joe17k
2.4 years ago by
Joe17k
United Kingdom
Joe17k wrote:

After speaking with a few other pros, this was the solution in the end (though it is still very rough at the moment):

Use the ete3 toolkit to get a list of IDs:

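from ete3 import NCBITaxa
import sys

# Taxon to expand, taken from the first command-line argument
taxon_name = sys.argv[1]

ncbi = NCBITaxa()
# Refresh the local copy of the NCBI taxonomy (this re-downloads it on every run)
ncbi.update_taxonomy_database()
# TaxIDs of the taxa descending from the requested taxon
ebact = ncbi.get_descendant_taxa(taxon_name)

# Write one TaxID per line
with open('./taxids', 'w') as ofh:
    for i in ebact:
        ofh.write("%s\n" % i)

# At this point, one could import ncbi-genome-download as a Python module and continue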

This gave me a list of IDs (though it includes ALL descendant taxa, even ones without complete genomes, etc.).

I passed these to the latest version of ncbi-genome-download, which accepts a --taxid 12345,65890 format for specifying the IDs.

So I just ran:

for file in *; do
    python ~/bin/ncbi-genome-download/ncbi-genome-download-runner.py -l complete -v -p 10 --taxid "$(paste -s -d ',' "$file")" bacteria
done

I had to run this over many files, having split my taxids file up, because there is a limit to how many IDs can be passed to --taxid at once.
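
For anyone wanting to do the splitting in Python rather than by hand, here is a minimal sketch of the batching step. The function name chunk_taxids and the chunk size are my own placeholders (I don't know the exact limit for --taxid, so pick a size that works for you), and the IDs in the example are arbitrary. Each string it yields has the same comma-separated form that paste -s -d ',' produces in the loop above.

```python
from typing import Iterable, Iterator, List


def chunk_taxids(taxids: Iterable[int], chunk_size: int) -> Iterator[str]:
    """Yield comma-separated strings holding at most chunk_size TaxIDs each.

    chunk_taxids and chunk_size are hypothetical names for this sketch;
    they are not part of ncbi-genome-download or ete3.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    batch: List[str] = []
    for taxid in taxids:
        batch.append(str(taxid))
        if len(batch) == chunk_size:
            yield ",".join(batch)
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield ",".join(batch)


if __name__ == "__main__":
    # Arbitrary example IDs; in practice, read them from the 'taxids' file.
    example_ids = [561, 562, 590, 620, 28901]
    for taxid_arg in chunk_taxids(example_ids, chunk_size=2):
        print(taxid_arg)
```

Each printed line could then be passed as the value of --taxid in a separate ncbi-genome-download run.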

EDIT Sept 2018:

I contributed a script to the ncbi-genome-download repo to make getting the TaxIDs nice and easy. It uses the approach above, but there’s no need to rewrite it for oneself now.

Modified 23 months ago • written 2.4 years ago by Joe17k

Wow, thanks for this answer. I've just learned about this useful ncbi.get_descendant_taxa() function. Funny, I use the same variable name ofh for an output file, and I always read it as "output file handle".

Reply written 2.4 years ago by a.zielezinski9.2k

That’s exactly the way I intend it! It’s quite possible I’ve picked up the habit from some of your answers!

Reply written 2.4 years ago by Joe17k
2.4 years ago by
Sej Modha4.7k
Glasgow, UK
Sej Modha4.7k wrote:

Maybe this would do the trick!

esearch -db genome -query "txid543 [Organism]" | elink -target nuccore | efilter -query "RefSeq" | efetch -format fasta
2.4 years ago by
tdmurphy190
tdmurphy190 wrote:

You can easily do this from NCBI's Assembly resource: https://www.ncbi.nlm.nih.gov/assembly/?term=Enterobacteria%5Borgn%5D+latest_refseq%5Bfilter%5D

Click the blue "Download Assemblies" button, pick "refseq" and the file type you're after (e.g. genomic FASTA), and it should work. It might take a while for that many genomes.


Yeah, I should have specified I was after a command-line tool, but this would work for sure.

Reply written 2.4 years ago by Joe17k