Find and Download GenBank File for Whole Genome using Entrez
4.1 years ago
kmyers2 ▴ 80

I am trying to add an option to my Python program that lets the user search for and download the GenBank file for the genome of an organism, such as Saccharomyces cerevisiae S288C. I have the following code:

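import os
import urllib.request
from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder; NCBI asks for a contact address
handle = Entrez.esearch(db="assembly", term="Saccharomyces cerevisiae S288C", retmax="100000")
record = Entrez.read(handle)
ids = record['IdList']
print(f'found {len(ids)} ids')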

found 2 ids

print(ids)

['285498', '245838']

for each in ids:
    # Fetch the full document summary for this assembly ID
    esummary_handle = Entrez.esummary(db="assembly", id=each, report="full")
    summary = Entrez.read(esummary_handle)
    url = summary['DocumentSummarySet']['DocumentSummary'][0]['FtpPath_GenBank']
    print(url)
    # The GenBank flat file is named after the last component of the FTP path
    label = os.path.basename(url)
    link = f'{url}/{label}_genomic.gbff.gz'
    urllib.request.urlretrieve(link, f'{label}.gbff.gz')

For Saccharomyces cerevisiae S288C, two IDs are found. The first one (285498) has an FtpPath_GenBank and downloads just fine. The other (245838), which is the general and commonly used genome, has no FtpPath_GenBank in the summary result, so the code fails. Manually searching NCBI shows me the FTP site for this genome, complete with the address and the file I want: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/146/045/GCA_000146045.2_R64/

I'm really confused as to why the summary record doesn't show the FtpPath_GenBank, even though it exists at NCBI. Is there an easier way to go about this? Basically, I'd like the user to be able to search for an organism and download its GenBank file for later use in my program. I am very new to the Entrez suite and find it a little confusing, so any help would be greatly appreciated.


There are top-level assembly report files that exist for all genomes in NCBI. Here is an example for GenBank genomes. There is also one for RefSeq genomes. You can parse the FTP paths out of these files directly.
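As a rough sketch of that parsing step, the function below reads such a file as tab-separated text and returns the FTP paths for matching organisms. It assumes the file is laid out like NCBI's assembly_summary_genbank.txt, with a "#"-prefixed header row containing the columns assembly_accession, organism_name and ftp_path; those column names and the function name find_ftp_paths are assumptions for illustration, so check them against the file you actually download.

```python
def find_ftp_paths(summary_text, organism):
    """Return (accession, organism_name, ftp_path) tuples for matching rows.

    Assumes a tab-separated file whose header row starts with '#' and
    includes 'assembly_accession', 'organism_name' and 'ftp_path' columns.
    """
    header = None
    matches = []
    wanted = organism.lower()
    for line in summary_text.splitlines():
        if line.startswith("#"):
            # Comment lines: the one that names the columns becomes the header
            fields = line.lstrip("#").strip().split("\t")
            if "ftp_path" in fields:
                header = fields
            continue
        if header is None or not line.strip():
            continue
        row = dict(zip(header, line.split("\t")))
        ftp_path = row.get("ftp_path", "na")
        # Rows without a usable path are assumed to carry 'na' or an empty field
        if wanted in row.get("organism_name", "").lower() and ftp_path not in ("", "na"):
            matches.append((row.get("assembly_accession", ""), row["organism_name"], ftp_path))
    return matches
```

The summary file is large, so it is probably worth downloading it once and caching it locally rather than fetching it on every search. The substring match on the organism name is deliberately loose; a stricter program might match on taxid instead.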

If you are open to using an existing tool that already does this, ncbi-genome-download is an option. You can find the tool here.
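If you would rather keep the esummary loop from the question, a small helper can stop it from failing when FtpPath_GenBank comes back empty. This is only a sketch: the helper name genbank_flatfile_url is made up, and it assumes the document summary also carries an FtpPath_RefSeq field to fall back on, that the file follows the <directory name>_genomic.gbff.gz naming seen in the question, and that the same path is served over https.

```python
import posixpath


def genbank_flatfile_url(doc_summary, keys=("FtpPath_GenBank", "FtpPath_RefSeq")):
    """Build the URL of the *_genomic.gbff.gz file, or return None.

    doc_summary is one DocumentSummary mapping from Entrez.esummary;
    the keys are tried in order and the first non-empty path is used.
    """
    for key in keys:
        ftp_path = str(doc_summary.get(key, "")).strip()
        if ftp_path:
            break
    else:
        # No key held a usable path, so there is nothing to download
        return None
    ftp_path = ftp_path.rstrip("/")
    # The file name is assumed to start with the directory's own name
    label = posixpath.basename(ftp_path)
    url = f"{ftp_path}/{label}_genomic.gbff.gz"
    # Assumption: the NCBI host serves the same path over https
    if url.startswith("ftp://"):
        url = "https://" + url[len("ftp://"):]
    return url
```

In the loop, you would call this on summary['DocumentSummarySet']['DocumentSummary'][0] and skip the ID (or report it to the user) when the result is None, instead of passing an empty path to urlretrieve.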


Thank you! That file helps a ton! I really appreciate it.
