Question

GBK to CSV -Biopython parsing for locus_tag and seq only

0

Entering edit mode

4.8 years ago

mswyers • 0

Hey everyone, I'm very new to bioinformatics and python in general, and haven't had any luck with finding a good way to do this.

My current pipeline outputs a multi-entry .gbk file that contains multiple CDS features per contig annotated. From this, I would like to pull out the locus_tag and the nucleotide sequence for each CDS feature (not contig) and have that output to a simple csv file.

What I've tried is from this github link: https://github.com/dewshr/NCBI-GenBank-file-parser

But I'm getting an indexing error that I'm not sure how to correct since I don't know much about the code:

 Traceback (most recent call last):
  File "/home/seq/gbknuctest.py", line 128, in <module>
    ntgenbank()
  File "/home/seq/gbknuctest.py", line 50, in ntgenbank
    nm_version = (nm_and_version.split('.')[1]).strip('\n')
IndexError: list index out of range

I'm sure there's a much more concise way to do this, but reading through the cookbook and stumbling around hasn't gotten me anywhere so far.

Output of python -V to show it's not due to a py3/py2 difference:

$ python -V
Python 2.7.15

Thanks so much in advance.

python • 2.3k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 4.8 years ago by mswyers • 0

score 3 · Accepted Answer · 2019-07-18

3

Entering edit mode

4.8 years ago

Joe 21k

Give this a try?

Usage:
# python extractNAfeatures.py genbank.gbk outfile.csv

from Bio import SeqIO
import sys

with open(sys.argv[2], "w") as nfh:
    for rec in SeqIO.parse(sys.argv[1], "genbank"):
        if rec.features:
            for feature in rec.features:
                if feature.type == "CDS":
                    nfh.write(">%s from %s,%s\n" %
                             (feature.qualifiers["locus_tag"][0],
                              rec.name,
                              feature.location.extract(rec).seq))

This will extract all the CDS features nucleic acid sequences. If you want the protein sequences its actually easier.

~~To get this as as a csv or something instead, change the formatting of the write() call to something like: ”%s from %s,%s\n” (essentially swap one of the new lines to a comma.~~

Code now updated to output based on locus_tag and as a CSV.

ADD COMMENT • link 4.8 years ago by Joe 21k

0

Entering edit mode

This is very close! It just isn't working perfectly for me. I actually searched this forum first and have this script you gave in a previous answer, albeit it was specific to a certain feature type. I wasn't sure how to edit it to make it more broad. I'm very glad you saw my question haha.

My first .gbk file is a cluster containing 11 CDSs, of which 3 are full genes. I want the nucleotide sequence for all of these, not just the 3 genes.

Right now it's just pulling the 3 "gene" tags and their sequences, but is ignoring the entries that have no "gene" but only a "locus_tag". Is there a way to parse them based on the "locus_tag" object?

An example of a gene in my file:

 gene            2975..4210
                 /gene="gnl_1"
                 /locus_tag="CPIDKGOP_01341"
 CDS             2975..4210
                 /EC_number="3.1.1.17"
                 /codon_start=1
                 /gene="gnl_1"
                 /inference="ab initio prediction:Prodigal:2.6"
                 /inference="similar to AA sequence:UniProtKB:Q01578"
                 /locus_tag="CPIDKGOP_01341"
                 /product="Gluconolactonase"
                 /transl_table=11
                 /translation="MEQGMRDAQVLNGALARRRVLRGVGAVLGSAMLAPQLVRAQAAGA
                 AIAPPSTVTQPPRDFGPNGAPTTYFTDPDVLTVDPAFDGLRQPNAAIQRLWTGALWSEG
              PAWNSVGRFLVWSDIPNNRQLRWSEDDGHVSVFRSPSNNSNGNTFDYQGRQLSCEHLTR
                RVVRYELDGSTTILASTFNGKRLNSPNDVVPHPDGSYWFTDPPYGAQFYEGTVDAAGGP
                 ANKAGRMNPRLGQPPEIGFYKRELPTAVYRLDKSGTLTQVAGEDLVPDPNGLCFSPDFK
               KLYIVSTGQGPGDSIAGGKGDMYAFDVGADNKLSNGKLFSNFMIDGVKCGPDGVRADVD
                 GNLWCSSNAGRSVGYSGVTVWTPQGRLIGRIRLPEICGNVCFGGPKRNRLFMAASQSLY
                 AVYTGTQGAAPG"

An example of CDS without a "gene" tag:

 gene            complement(4231..4512)
                 /locus_tag="CPIDKGOP_01342"
 CDS             complement(4231..4512)
                 /codon_start=1
                 /inference="ab initio prediction:Prodigal:2.6"
                 /locus_tag="CPIDKGOP_01342"
                 /product="hypothetical protein"
                 /transl_table=11
                 /translation="MNTLFQKLVLSRLWLSFIVLGLAFLAFGAGTLNLGLLFIANARLL
                 GAHGWQAVMDGALWQLLELIVTGYLSIAAYVVLKACEHRLSQWLAHER"

ADD REPLY • link 4.8 years ago by mswyers • 0

0

Entering edit mode

Yep, try my edited code above. I changed it before you replied, but possible after you initially saw it. I noticed it was still parsing out ”gene” features, so I’ve switched that over now. Try the newer code above.