Easiest way to download all Enterobacteria
6.6 years ago
Joe 21k

Does anyone have a simple solution to downloading all the refseq genomes for a particular taxon?

Using ncbi-genome-download, it's possible to specify species or genus TaxIDs and download them, but apparently you can't go higher up the taxonomic ranks (even though Enterobacteria has a TaxID of 543, for instance).

If anyone knows of a way to download all the Enterobacteria I'm all ears.

Alternatively, if there is a method of extracting the species TaxIDs from the Enterobacterial taxid in NCBI such that I can pass them all directly to ncbi-genome-download that would work too.

ncbi genome refseq • 2.9k views
6.6 years ago
Sej Modha 5.3k

Maybe this would do the trick!

esearch -db genome -query "txid543 [Organism]"|elink -target nuccore|efilter -query "RefSeq"|efetch -format fasta
6.6 years ago
Joe 21k

After speaking with a few other pros, this was the solution I settled on in the end (though it is only very rough at the moment):

Use the ete3 toolkit to get a list of IDs:

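from ete3 import NCBITaxa
import sys

# Taxon name or TaxID given on the command line (e.g. 543)
taxon_name = sys.argv[1]

ncbi = NCBITaxa()
# Refreshes the local taxonomy database from NCBI on every run;
# this is slow, so comment it out if the local copy is already recent
ncbi.update_taxonomy_database()
ebact = ncbi.get_descendant_taxa(taxon_name)

# Write one TaxID per line
with open('./taxids', 'w') as ofh:
    for i in ebact:
        ofh.write("%s\n" % i)

# At this point, one could import ncbi-genome-download as a Python module and continue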

This gave me a list of IDs (though it includes ALL descendant taxa, even ones without complete genomes etc).

I passed these to the latest version of ncbi-genome-download, which accepts a --taxid 12345,65890 format for specifying the IDs.

So I just ran:

# Run from a directory that contains only the split TaxID files (one TaxID per line)
for file in * ;
do python ~/bin/ncbi-genome-download/ncbi-genome-download-runner.py -l complete -v -p 10 --taxid "$(paste -s -d ',' "$file")" bacteria ;
done

I had to split my taxids file up and run this iteratively over the resulting files, as there is a limit to how many args can be passed to --taxid at once.
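
The splitting step can also be done in Python rather than with separate files. The following is only a rough sketch: it assumes the ./taxids file written by the script above, that ncbi-genome-download is installed as a command on the PATH (rather than run via the runner script as in my loop), and a batch size of 500, which is an arbitrary guess and not a documented limit. The helper names are made up for illustration.

```python
from pathlib import Path
from typing import Iterator, List


def read_taxids(path: str) -> List[str]:
    """Return the non-empty lines of a one-TaxID-per-line file."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def batch_taxids(taxids: List[str], batch_size: int = 500) -> Iterator[str]:
    """Yield comma-separated strings holding at most batch_size TaxIDs each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(taxids), batch_size):
        yield ",".join(taxids[start:start + batch_size])


def build_command(taxid_arg: str) -> List[str]:
    """Build one download command, reusing the flags from the shell loop above."""
    # Assumption: the tool is available as 'ncbi-genome-download' on the PATH.
    return ["ncbi-genome-download", "-l", "complete", "-v", "-p", "10",
            "--taxid", taxid_arg, "bacteria"]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be checked first.
    for batch in batch_taxids(read_taxids("./taxids")):
        print(" ".join(build_command(batch)))
```

Once the printed commands look right, each list from build_command() could be handed to subprocess.run() to actually start the downloads.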

EDIT Sept 2018:

I contributed a script to the ncbi-genome-download repo to make getting the TaxIDs nice and easy. It uses the approach above, but there’s no need to rewrite it for oneself now.

Wow, thanks for this answer. I've just learned about this useful ncbi.get_descendant_taxa() function. Funny, I use the same variable name ofh for an output file, and I always read it as "output file handle".

That’s exactly the way I intend it! It’s quite possible I’ve picked up the habit from some of your answers!

6.6 years ago
tdmurphy ▴ 230

You can easily do this from NCBI's Assembly resource: https://www.ncbi.nlm.nih.gov/assembly/?term=Enterobacteria%5Borgn%5D+latest_refseq%5Bfilter%5D

Click the blue "Download Assemblies" button, pick "refseq" and the filetype you're after (e.g. genomic FASTA), and it should work. It might take a while for that many genomes.

Yeah, I should have specified I was after a command line tool, but this would work for sure.
