Question: Easiest way to download all Enterobacteria
Joe (United Kingdom) wrote, 18 months ago:

Does anyone have a simple solution for downloading all the RefSeq genomes for a particular taxon?

Using ncbi-genome-download it's possible to specify species or genus TaxIDs and download the matching genomes, but apparently you can't go higher up the taxonomic ranks (even though Enterobacteria has a TaxID of 543, for instance).

If anyone knows of a way to download all the Enterobacteria I'm all ears.

Alternatively, if there is a way of extracting the species TaxIDs from the Enterobacterial TaxID in NCBI, so that I can pass them all directly to ncbi-genome-download, that would work too.

Tags: download, refseq, genome, ncbi
Answer by Joe (United Kingdom), 18 months ago:

After speaking with a few other people, this was the solution I ended up with (though it is only very rough at the moment):

Use the ete3 toolkit to get a list of IDs:

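import sys

from ete3 import NCBITaxa

# Taxon name or TaxID given on the command line, e.g. 543
taxon_name = sys.argv[1]

ncbi = NCBITaxa()
# Refresh the local copy of the NCBI taxonomy database (slow; downloads the full dump)
ncbi.update_taxonomy_database()
# TaxIDs of the taxa descending from the given taxon
ebact = ncbi.get_descendant_taxa(taxon_name)

# Write one TaxID per line
with open('./taxids', 'w') as ofh:
    for i in ebact:
        ofh.write("%s\n" % i)

# At this point, one could import ncbi-genome-download as a Python module and continue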

This gave me a list of IDs (though it includes ALL descendant taxa, even ones without complete genomes etc).

I passed these to the latest version of ncbi-genome-download, which accepts a --taxid 12345,65890 format for specifying the IDs.
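If only one taxonomic rank is wanted (the question asked for species TaxIDs), the list can be narrowed before it is written out. The sketch below is a hypothetical helper, not part of ete3 or ncbi-genome-download: `keep_rank` and its arguments are names made up for illustration, and it assumes a dictionary mapping each TaxID to its rank, such as the one `NCBITaxa.get_rank()` returns. It does not address whether a taxon has a complete genome.

```python
def keep_rank(taxids, rank_of, wanted="species"):
    """Return the TaxIDs whose rank equals `wanted`, preserving input order.

    `rank_of` is assumed to be a dict of TaxID -> rank string.
    """
    return [t for t in taxids if rank_of.get(t) == wanted]


# Possible use with the script above (needs ete3 and its local database):
#   rank_of = ncbi.get_rank(ebact)
#   species_only = keep_rank(ebact, rank_of)
```

Keeping the filter separate from the ete3 calls means it can be checked on a small hand-made dictionary without downloading the taxonomy database.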

So I just ran:

for file in *; do
    python ~/bin/ncbi-genome-download/ncbi-genome-download-runner.py -l complete -v -p 10 --taxid "$(paste -s -d ',' "$file")" bacteria
done

I had to split my taxids file into many smaller files and run this once per file, because there is a limit to how many IDs can be passed to --taxid at once.
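The splitting step can also be done in Python rather than by hand. This is a minimal sketch, assuming the IDs are already in a list; `batch_taxids` and `batch_size` are hypothetical names, the IDs in the example are placeholders, and the batch size is arbitrary because the actual limit is not stated here.

```python
def batch_taxids(taxids, batch_size):
    """Yield comma-separated strings of at most `batch_size` TaxIDs each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(taxids), batch_size):
        yield ",".join(str(t) for t in taxids[start:start + batch_size])


# Placeholder IDs for illustration only; each string could be passed to --taxid
example_batches = list(batch_taxids([11111, 22222, 33333, 44444, 55555], 2))
```

Each string produced has the same comma-separated form that `paste -s -d ','` builds from one of the split files.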

EDIT Sept 2018:

I contributed a script to the ncbi-genome-download repo to make getting the TaxIDs nice and easy. It uses the approach above, but there’s no need to rewrite it for oneself now.

Wow, thanks for this answer. I've just learned about the useful ncbi.get_descendant_taxa() function. Funny, I use the same variable name ofh for an output file, and I always read it as "output file handle".

Comment written 18 months ago by a.zielezinski

That’s exactly the way I intend it! It’s quite possible I’ve picked up the habit from some of your answers!

Reply written 18 months ago by Joe
Answer by Sej Modha (Glasgow, UK), 18 months ago:

Maybe this would do the trick!

esearch -db genome -query "txid543 [Organism]" | elink -target nuccore | efilter -query "RefSeq" | efetch -format fasta
Answer by tdmurphy, 18 months ago:

You can easily do this from NCBI's Assembly resource: https://www.ncbi.nlm.nih.gov/assembly/?term=Enterobacteria%5Borgn%5D+latest_refseq%5Bfilter%5D

Click the blue "Download Assemblies" button, pick "refseq" and the file type you're after (e.g. genomic FASTA), and it should work. It might take a while for that many genomes.


Yeah, I should have specified that I was after a command-line tool, but this would work for sure.

Reply written 18 months ago by Joe