Question: GBK to CSV -Biopython parsing for locus_tag and seq only
gravatar for mswyers
15 months ago by
mswyers0 wrote:

Hey everyone, I'm very new to bioinformatics and python in general, and haven't had any luck with finding a good way to do this.

My current pipeline outputs a multi-entry .gbk file that contains multiple CDS features per contig annotated. From this, I would like to pull out the locus_tag and the nucleotide sequence for each CDS feature (not contig) and have that output to a simple csv file.

What I've tried is from this github link:

But I'm getting an indexing error that I'm not sure how to correct since I don't know much about the code:

 Traceback (most recent call last):
  File "/home/seq/", line 128, in <module>
  File "/home/seq/", line 50, in ntgenbank
    nm_version = (nm_and_version.split('.')[1]).strip('\n')
IndexError: list index out of range

I'm sure there's a much more concise way to do this, but reading through the cookbook and stumbling around hasn't gotten me anywhere so far.

output of python -V to show it's not due to a py3/py2 difference:

$ python -V
Python 2.7.15

Thanks so much in advance.

question python coding • 567 views
ADD COMMENTlink modified 15 months ago by Joe18k • written 15 months ago by mswyers0
gravatar for Joe
15 months ago by
United Kingdom
Joe18k wrote:

Give this a try?

# python genbank.gbk outfile.csv

from Bio import SeqIO
import sys

with open(sys.argv[2], "w") as nfh:
    for rec in SeqIO.parse(sys.argv[1], "genbank"):
        if rec.features:
            for feature in rec.features:
                if feature.type == "CDS":
                    nfh.write(">%s from %s,%s\n" %

This will extract all the CDS features nucleic acid sequences. If you want the protein sequences its actually easier.

To get this as as a csv or something instead, change the formatting of the write() call to something like: ”%s from %s,%s\n” (essentially swap one of the new lines to a comma.

Code now updated to output based on locus_tag and as a CSV.

ADD COMMENTlink modified 15 months ago • written 15 months ago by Joe18k

This is very close! It just isn't working perfectly for me. I actually searched this forum first and have this script you gave in a previous answer, albeit it was specific to a certain feature type. I wasn't sure how to edit it to make it more broad. I'm very glad you saw my question haha.

My first .gbk file is a cluster containing 11 CDSs, of which 3 are full genes. I want the nucleotide sequence for all of these, not just the 3 genes.

Right now it's just pulling the 3 "gene" tags and their sequences, but is ignoring the entries that have no "gene" but only a "locus_tag". Is there a way to parse them based on the "locus_tag" object?

An example of a gene in my file:

 gene            2975..4210
 CDS             2975..4210
                 /inference="ab initio prediction:Prodigal:2.6"
                 /inference="similar to AA sequence:UniProtKB:Q01578"

An example of CDS without a "gene" tag:

 gene            complement(4231..4512)
 CDS             complement(4231..4512)
                 /inference="ab initio prediction:Prodigal:2.6"
                 /product="hypothetical protein"
ADD REPLYlink modified 15 months ago • written 15 months ago by mswyers0

Yep, try my edited code above. I changed it before you replied, but possible after you initially saw it. I noticed it was still parsing out ”gene” features, so I’ve switched that over now. Try the newer code above.

ADD REPLYlink written 15 months ago by Joe18k

Ah I see it now. This is perfect. Thanks so much!

ADD REPLYlink written 15 months ago by mswyers0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1135 users visited in the last hour