Question: argparse to pull fasta files from GenBank
5 months ago, mac03pat wrote:

I pulled the code below from an old Biostars post:

import argparse
import sys
import os

import Bio.Entrez

RETMAX = 10**9
GB_EXT = ".gb"

def parse_args(arg_lst):
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", type=str, required=True,
                        help="A file with accessions to download")
    parser.add_argument("-d", "--database", type=str, required=True,
                        help="NCBI database ID")
    parser.add_argument("-e", "--email", type=str, required=False,
                        help="An e-mail address")
    parser.add_argument("-b", "--batch", type=int, required=False, default=100,
                        help="The number of accessions to process per request")
    parser.add_argument("-o", "--output_dir", type=str, required=True,
                        help="The directory to write downloaded files to")

    return parser.parse_args(arg_lst)

def read_accessions(fp):
    with open(fp) as acc_lines:
        return [line.strip() for line in acc_lines]

def accessions_to_gb(accessions, db, batchsize, retmax):
    def batch(sequence, size):
        l = len(sequence)
        for start in range(0, l, size):
            yield sequence[start:min(start + size, l)]

    def extract_records(records_handle):
        buffer = []
        for line in records_handle:
            if line.startswith("LOCUS") and buffer:
                # yield accession number and record
                yield buffer[0].split()[1], "".join(buffer)
                buffer = [line]
            else:
                # keep collecting lines of the current record
                buffer.append(line)
        yield buffer[0].split()[1], "".join(buffer)

    def process_batch(accessions_batch):
        # get GI for query accessions
        query = " ".join(accessions_batch)
        query_handle = Bio.Entrez.esearch(db=db, term=query, retmax=retmax)
        gi_list = Bio.Entrez.read(query_handle)['IdList']

        # get GB files
        search_handle = Bio.Entrez.epost(db=db, id=",".join(gi_list))
        search_results = Bio.Entrez.read(search_handle)
        webenv, query_key = search_results["WebEnv"], search_results["QueryKey"]
        records_handle = Bio.Entrez.efetch(db=db, rettype="gb", retmax=batchsize,
                                           webenv=webenv, query_key=query_key)
        yield from extract_records(records_handle)

    accession_batches = batch(accessions, batchsize)
    for acc_batch in accession_batches:
        yield from process_batch(acc_batch)

def write_record(dir, accession, record):
    with open(os.path.join(dir, accession + GB_EXT), "w") as output:
        print(record, file=output)

def main(argv):
    args = parse_args(argv)
    accessions = read_accessions(os.path.abspath(args.input))
    op_dir = os.path.abspath(args.output_dir)
    if not os.path.exists(op_dir):
        os.makedirs(op_dir)
    dbase = args.database
    Bio.Entrez.email = args.email
    batchsize = args.batch

    for acc, record in accessions_to_gb(accessions, dbase, batchsize, RETMAX):
        write_record(op_dir, acc, record)

if __name__ == "__main__":
    main(sys.argv[1:])

Part of the program I'm writing pulls about 80 FASTA files from GenBank via accession numbers. I saved the code in a file and ran this in my Windows command line:

C:\Users\mac03\AppData\Local\Programs\Python\Python37\MBSProject>python -i HQ823621 -d genbank -o C:\Users\mac03\AppData\Local\Programs\Python\Python37\MBSProject\fastafiles

This error was returned:

Traceback (most recent call last):
  File "", line 90, in <module>
  File "", line 77, in main
    accessions = read_accessions(os.path.abspath(args.input))
  File "", line 30, in read_accessions
    with open(fp) as acc_lines:
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\mac03\\AppData\\Local\\Programs\\Python\\Python37\\MBSProject\\HQ823621'

It seems to be looking for the accession number "HQ823621" as a file in the folder MBSProject. I had thought this program would pull directly from GenBank. I only entered one of the accession numbers as I wasn't sure how to properly use the program and figured I would try with one first.
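
To illustrate what is going on: argparse only stores whatever string follows -i, and the script then hands that string to open(). Below is a minimal sketch of that behaviour, plus a hypothetical helper (load_accessions is my own name, not part of the posted script) that would accept either a file of accessions or accessions typed directly on the command line. It is only one possible way to do this.

```python
import argparse
import os


def parse_args(arg_lst):
    # Same -i option as in the script: argparse just stores the string.
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", type=str, required=True,
                        help="A file with accessions to download")
    return parser.parse_args(arg_lst)


def load_accessions(value):
    # Hypothetical helper, not in the original script.
    # If `value` is an existing file, read one accession per line;
    # otherwise treat `value` as a comma-separated list of accessions.
    if os.path.isfile(value):
        with open(value) as acc_lines:
            return [line.strip() for line in acc_lines if line.strip()]
    return [acc.strip() for acc in value.split(",") if acc.strip()]


args = parse_args(["-i", "HQ823621"])
print(args.input)                   # the raw string passed after -i
print(load_accessions(args.input))  # accessions the script would query
```

The posted read_accessions function has no such fallback, so it always treats the value of -i as a file path, which matches the FileNotFoundError above.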

I've been coding for about 7 weeks and have never used argparse before so help is greatly appreciated!

Thanks, -Mac

5 months ago, SMK wrote:


  1. Put all the accession numbers that you want to query in a file, for example: list.txt;
  2. Change database from genbank to nuccore;
  3. If you need to fetch fasta format, change GB_EXT = ".gb" to GB_EXT = ".fa", and rettype="gb" to rettype="fasta".
$ cat list.txt

$ python -i list.txt -d nuccore -o fastafiles

$ head fastafiles/Cladonia.fa
>HQ823621.1 Cladonia grayi isolate PKS15 putative polyketide synthase gene, complete cds
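
One caveat with rettype="fasta": the extract_records generator in the script splits on lines starting with LOCUS, which FASTA output does not contain, and it names each record after the second token of its first line. That would explain the file name Cladonia.fa above, and it suggests that several accessions would end up in a single file. A rough sketch of splitting on ">" headers instead is below; split_fasta is a hypothetical replacement of my own, and the sample records are made-up placeholders, not real sequences.

```python
def split_fasta(lines):
    """Yield (accession, record_text) pairs from an iterable of FASTA lines."""
    header, body = None, []
    for line in lines:
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header[1:].split()[0], header + "".join(body)
            header, body = line, []
        elif header is not None and line.strip():
            # Sequence line belonging to the current record.
            body.append(line)
    if header is not None:
        yield header[1:].split()[0], header + "".join(body)


# Made-up placeholder records, only to show the splitting.
sample = [
    ">ACC0001.1 placeholder record one\n",
    "ACGTACGT\n",
    ">ACC0002.1 placeholder record two\n",
    "TTGGCCAA\n",
]
for accession, record in split_fasta(sample):
    print(accession, len(record))
```

With this approach the key is the first token of the header including its version suffix (for example HQ823621.1), so the output files would be named after the accession rather than the genus.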

Seems to have worked out great! Thank you.

written 5 months ago by mac03pat

If an answer was helpful, you should upvote it; if it resolved your question, you should mark it as accepted.

written 5 months ago by lieven.sterck