Question: argparse to pull fasta files from GenBank
21 months ago by
mac03pat30 wrote:

I pulled the code below from an old Biostars post:

import argparse
import sys
import os

import Bio.Entrez

RETMAX = 10**9
GB_EXT = ".gb"

def parse_args(arg_lst):
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", type=str, required=True,
                        help="A file with accessions to download")
    parser.add_argument("-d", "--database", type=str, required=True,
                        help="NCBI database ID")
    parser.add_argument("-e", "--email", type=str, required=False,
                        help="An e-mail address")
    parser.add_argument("-b", "--batch", type=int, required=False, default=100,
                        help="The number of accessions to process per request")
    parser.add_argument("-o", "--output_dir", type=str, required=True,
                        help="The directory to write downloaded files to")

    return parser.parse_args(arg_lst)

def read_accessions(fp):
    with open(fp) as acc_lines:
        return [line.strip() for line in acc_lines]

def accessions_to_gb(accessions, db, batchsize, retmax):
    def batch(sequence, size):
        length = len(sequence)
        for start in range(0, length, size):
            yield sequence[start:start + size]

    def extract_records(records_handle):
        buffer = []
        for line in records_handle:
            if line.startswith("LOCUS") and buffer:
                # yield accession number and record
                yield buffer[0].split()[1], "".join(buffer)
                buffer = [line]
            else:
                buffer.append(line)
        if buffer:
            yield buffer[0].split()[1], "".join(buffer)

    def process_batch(accessions_batch):
        # get GI for query accessions
        query = " ".join(accessions_batch)
        query_handle = Bio.Entrez.esearch(db=db, term=query, retmax=retmax)
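        gi_list = Bio.Entrez.read(query_handle)['IdList']

        # post the IDs to the history server, then get GB files
        search_handle = Bio.Entrez.epost(db=db, id=",".join(gi_list))
        search_results = Bio.Entrez.read(search_handle)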
        webenv, query_key = search_results["WebEnv"], search_results["QueryKey"]
        records_handle = Bio.Entrez.efetch(db=db, rettype="gb", retmax=batchsize,
                                           webenv=webenv, query_key=query_key)
        yield from extract_records(records_handle)

    accession_batches = batch(accessions, batchsize)
    for acc_batch in accession_batches:
        yield from process_batch(acc_batch)

def write_record(dir, accession, record):
    with open(os.path.join(dir, accession + GB_EXT), "w") as output:
        print(record, file=output)

def main(argv):
    args = parse_args(argv)
    accessions = read_accessions(os.path.abspath(args.input))
    op_dir = os.path.abspath(args.output_dir)
    if not os.path.exists(op_dir):
        os.makedirs(op_dir)
    dbase = args.database
    Bio.Entrez.email = args.email
    batchsize = args.batch

    for acc, record in accessions_to_gb(accessions, dbase, batchsize, RETMAX):
        write_record(op_dir, acc, record)

if __name__ == "__main__":
    main(sys.argv[1:])

Part of the program I'm writing pulls about 80 FASTA files from GenBank via accession numbers. I saved the code in a file and ran this in my Windows command line:

C:\Users\mac03\AppData\Local\Programs\Python\Python37\MBSProject>python -i HQ823621 -d genbank -o C:\Users\mac03\AppData\Local\Programs\Python\Python37\MBSProject\fastafiles

This error was returned:

Traceback (most recent call last):
  File "", line 90, in <module>
  File "", line 77, in main
    accessions = read_accessions(os.path.abspath(args.input))
  File "", line 30, in read_accessions
    with open(fp) as acc_lines:
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\mac03\\AppData\\Local\\Programs\\Python\\Python37\\MBSProject\\HQ823621'

It seems to be looking for the accession number "HQ823621" as a file in the folder MBSProject. I had thought this program would pull directly from GenBank. I only entered one of the accession numbers as I wasn't sure how to properly use the program and figured I would try with one first.

I've been coding for about 7 weeks and have never used argparse before so help is greatly appreciated!

Thanks, -Mac

21 months ago by
AK2.0k wrote:


  1. Put all the accession numbers that you want to query in a file, for example: list.txt;
  2. Change database from genbank to nuccore;
  3. If you need to fetch fasta format, change GB_EXT = ".gb" to GB_EXT = ".fa", and rettype="gb" to rettype="fasta".
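Regarding step 1: the script passes whatever follows -i straight to open(), which is why a bare accession raised FileNotFoundError. If you would rather have -i accept either form, something like the sketch below might work. Note that resolve_accessions is a hypothetical helper, not part of the original script, and the comma-separated form is my own assumption about what would be convenient.

```python
import os


def resolve_accessions(value):
    """Hypothetical helper (not in the original script).

    If `value` is an existing file, read one accession per line.
    Otherwise treat `value` itself as a comma-separated list of accessions.
    """
    if os.path.isfile(value):
        with open(value) as handle:
            # Skip blank lines so a trailing newline does not yield "".
            return [line.strip() for line in handle if line.strip()]
    return [part.strip() for part in value.split(",") if part.strip()]


if __name__ == "__main__":
    # A bare accession is returned as a one-element list.
    print(resolve_accessions("HQ823621"))
```

In main() you would then call resolve_accessions(args.input) in place of read_accessions(os.path.abspath(args.input)). The rest of the script should not need to change.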
$ cat list.txt

$ python -i list.txt -d nuccore -o fastafiles

$ head fastafiles/Cladonia.fa
>HQ823621.1 Cladonia grayi isolate PKS15 putative polyketide synthase gene, complete cds
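One caveat with the FASTA change: extract_records splits on lines starting with "LOCUS", which FASTA output does not contain, and it names each file after the second word of the first line. That is why the file above is called Cladonia.fa, and with several accessions in one batch they would all land in that one file. If you want one file per accession, a FASTA-aware splitter along these lines could replace extract_records. This is only a rough sketch: split_fasta_records is a name I made up, and it assumes each header starts with ">" followed by the accession.

```python
def split_fasta_records(lines):
    """Yield (accession, record_text) for each FASTA record in `lines`.

    Assumes each header looks like ">ACCESSION description...".
    """
    buffer = []
    for line in lines:
        if not line.strip():
            # Ignore blank lines between records.
            continue
        if line.startswith(">") and buffer:
            # A new header closes the previous record.
            yield buffer[0][1:].split()[0], "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield buffer[0][1:].split()[0], "".join(buffer)


if __name__ == "__main__":
    # Made-up sequence data, only to show the shape of the output.
    demo = [">HQ823621.1 Cladonia grayi isolate PKS15\n", "ACGT\n"]
    for accession, record in split_fasta_records(demo):
        print(accession)
```

With this in place the output files would be named after the versioned accession (for example HQ823621.1.fa) rather than the genus.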

Seems to have worked out great! Thank you.

written 21 months ago by mac03pat

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.

written 21 months ago by lieven.sterck