Question

Programmatically Downloading Complete Animal Genomes - Entrez Utils

9.2 years ago

moranr ▴ 290

Hi,

My goal is to download all the complete nucleotide genomes for metazoans.

I can get about half of these very easily from Ensembl Metazoa. For the rest of the species, however, I think I need to use the Entrez Utilities on NCBI with Python.

My problem is selecting only completed genomes. Downloading every assembly for each species would also be OK. I want a single FASTA/GenBank file per genome/assembly.

At the moment I am:

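# Search Entrez and get IDs for each species

import csv
from Bio import Entrez

Entrez.email = 'you@example.com'  # placeholder address; NCBI asks for a contact email

with open('SpeciesList.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for sp in reader:
        # Explicit AND/NOT with surrounding spaces so Entrez parses the operators
        search_term = str(sp[0]) + '[orgn] AND complete genome[title] NOT mitochondria[title]'
        # Search the same database that efetch reads from; UIDs are database-specific
        handle = Entrez.esearch(db='nucleotide', term=search_term)
        genome_ids = Entrez.read(handle)['IdList']

        # Get gb files for the IDs found for this species
        for genome_id in genome_ids:
            record = Entrez.efetch(db='nucleotide', id=genome_id, rettype='gb', retmode='text')
            filename = 'genBankRecord_{}.gb'.format(genome_id)
            print('Writing: {}'.format(filename))
            with open(filename, 'w') as f:
                f.write(record.read())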

##Parse gb files

My problem is only grabbing gb files for completed genomes. Can anyone help with my search query here please?
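
To make the search term easier to inspect and adjust, it could be assembled from its parts before being passed to Entrez.esearch. This is only a sketch: `build_search_term` is a hypothetical helper (not part of Biopython), and the field tags are the ones already used above. Whether a title filter is enough to isolate complete genomes is exactly the open question here.

```python
def build_search_term(species, include_title="complete genome",
                      exclude_title="mitochondria"):
    """Assemble an Entrez query string for one species.

    Hypothetical helper for illustration; it only builds the string and
    does not contact NCBI.
    """
    # Quote multi-word values so each is treated as a single phrase
    parts = [
        '"{}"[orgn]'.format(species.strip()),
        '"{}"[title]'.format(include_title),
    ]
    term = " AND ".join(parts)
    # Boolean operators need surrounding spaces to be recognised
    if exclude_title:
        term += " NOT {}[title]".format(exclude_title)
    return term


if __name__ == "__main__":
    # Print the term so it can be pasted into the NCBI web search for checking
    print(build_search_term("Apis mellifera"))
```

Printing the term and trying it in the NCBI web interface first is a quick way to see which records a given query actually returns before downloading anything.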

sequence Entrez genome Python • 1.8k views

updated 24 months ago by Ram 43k • written 9.2 years ago by moranr ▴ 290