Question

Downloading Gene Annotation protein.faa files from Genome Accession numbers from NCBI via Entrez

8.0 years ago

ijc2 ▴ 10

I have a list of NCBI genome accession numbers of the form NC_#######, and I want to download the protein FASTA files corresponding to the genome annotations for those accessions.

I have tried (using Python 2.7):

from Bio import Entrez, SeqIO

Entrez.email = "email@example.com"
id_list = "NC_004757"

# Look up the nuccore record ID(s) for the accession
handle = Entrez.esearch(db="nuccore", term=id_list)
record = Entrez.read(handle)
handle.close()
gi_list = record["IdList"]
gi_str = ",".join(gi_list)

# Fetch the translated CDS features (protein FASTA) for those IDs
handle = Entrez.efetch(db="nuccore", id=gi_str, rettype="fasta_cds_aa", retmode="text")
records = list(SeqIO.parse(handle, "fasta"))
handle.close()
for item in records:
    print(item.id)

But the runtime is so long that I believe something must be wrong. Any idea how I can download these genome annotation FASTA files in bulk?

entrez ncbi fasta python • 3.1k views

score 0 · Answer 1 · 2016-05-05

Try starting without Python, to make sure the files can actually be found where you expect them.

See my answer inside this post.

A: where can I get environmental bacteria genome in fasta format (as many as possib

The approach works for any organism in NCBI, not only for bacteria.

If NC_004757 is a real accession number, it belongs to a bacterium, so there is no problem.

NCBI's FTP layout has changed a lot, so make sure your files exist where you are looking for them.

Find the name of your bacterium in this file:

ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt

Copy the corresponding URL from that file into any browser, for example:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000009145.1_ASM914v1/

You can download your .faa files from that directory.
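
If you later want to script the lookup described above, here is a minimal Python sketch. It only assumes that the summary file is tab-separated, that its header line names the columns organism_name and ftp_path, and that the protein file inside an assembly directory is named after the directory with a _protein.faa.gz suffix. The function name protein_faa_urls and the SAMPLE text are made up for illustration: the real file has many more columns, and the organism name in the sample row is a placeholder. Not every GenBank assembly directory contains a protein file, so check that the URL exists before relying on it.

```python
# Hypothetical helper: find assemblies by organism name in the text of an
# assembly summary file and build the URL of each protein FASTA file.
def protein_faa_urls(summary_text, organism_substring):
    header = None
    urls = []
    for line in summary_text.splitlines():
        if line.startswith("#"):
            # The header line is a comment line that lists the column names
            fields = line.lstrip("# ").split("\t")
            if "ftp_path" in fields:
                header = fields
            continue
        if header is None or not line.strip():
            continue
        row = dict(zip(header, line.split("\t")))
        name = row.get("organism_name", "")
        ftp_path = row.get("ftp_path", "").rstrip("/")
        if organism_substring.lower() not in name.lower():
            continue
        if not ftp_path or ftp_path == "na":
            continue  # no directory recorded for this assembly
        # Files in the directory are prefixed with the directory name
        dirname = ftp_path.rsplit("/", 1)[-1]
        urls.append(ftp_path + "/" + dirname + "_protein.faa.gz")
    return urls


# Illustrative stand-in for the real summary file (placeholder organism name,
# directory taken from the example URL in this answer)
SAMPLE = (
    "#assembly_accession\torganism_name\tftp_path\n"
    "GCA_000009145.1\tExample bacterium\t"
    "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000009145.1_ASM914v1\n"
)

for url in protein_faa_urls(SAMPLE, "example"):
    print(url)
```

Looking the columns up by name rather than by position keeps the sketch independent of the exact column order. The resulting URLs can be handed to a browser, wget, or any other downloader, one per genome.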