GBK to CSV -Biopython parsing for locus_tag and seq only
1
0
Entering edit mode
4.8 years ago
mswyers • 0

Hey everyone, I'm very new to bioinformatics and python in general, and haven't had any luck with finding a good way to do this.

My current pipeline outputs a multi-entry .gbk file that contains multiple CDS features per contig annotated. From this, I would like to pull out the locus_tag and the nucleotide sequence for each CDS feature (not contig) and have that output to a simple csv file.

What I've tried is from this github link: https://github.com/dewshr/NCBI-GenBank-file-parser

But I'm getting an indexing error that I'm not sure how to correct since I don't know much about the code:

 Traceback (most recent call last):
  File "/home/seq/gbknuctest.py", line 128, in <module>
    ntgenbank()
  File "/home/seq/gbknuctest.py", line 50, in ntgenbank
    nm_version = (nm_and_version.split('.')[1]).strip('\n')
IndexError: list index out of range

I'm sure there's a much more concise way to do this, but reading through the cookbook and stumbling around hasn't gotten me anywhere so far.

Output of python -V to show it's not due to a py3/py2 difference:

$ python -V
Python 2.7.15

Thanks so much in advance.

python • 2.3k views
ADD COMMENT
3
Entering edit mode
4.8 years ago
Joe 21k

Give this a try?

Usage:
# python extractNAfeatures.py genbank.gbk outfile.csv

from Bio import SeqIO
import sys

with open(sys.argv[2], "w") as nfh:
    for rec in SeqIO.parse(sys.argv[1], "genbank"):
        if rec.features:
            for feature in rec.features:
                if feature.type == "CDS":
                    nfh.write(">%s from %s,%s\n" %
                             (feature.qualifiers["locus_tag"][0],
                              rec.name,
                              feature.location.extract(rec).seq))

This will extract all the CDS features nucleic acid sequences. If you want the protein sequences its actually easier.

To get this as as a csv or something instead, change the formatting of the write() call to something like: ”%s from %s,%s\n” (essentially swap one of the new lines to a comma.

Code now updated to output based on locus_tag and as a CSV.

ADD COMMENT
0
Entering edit mode

This is very close! It just isn't working perfectly for me. I actually searched this forum first and have this script you gave in a previous answer, albeit it was specific to a certain feature type. I wasn't sure how to edit it to make it more broad. I'm very glad you saw my question haha.

My first .gbk file is a cluster containing 11 CDSs, of which 3 are full genes. I want the nucleotide sequence for all of these, not just the 3 genes.

Right now it's just pulling the 3 "gene" tags and their sequences, but is ignoring the entries that have no "gene" but only a "locus_tag". Is there a way to parse them based on the "locus_tag" object?

An example of a gene in my file:

 gene            2975..4210
                 /gene="gnl_1"
                 /locus_tag="CPIDKGOP_01341"
 CDS             2975..4210
                 /EC_number="3.1.1.17"
                 /codon_start=1
                 /gene="gnl_1"
                 /inference="ab initio prediction:Prodigal:2.6"
                 /inference="similar to AA sequence:UniProtKB:Q01578"
                 /locus_tag="CPIDKGOP_01341"
                 /product="Gluconolactonase"
                 /transl_table=11
                 /translation="MEQGMRDAQVLNGALARRRVLRGVGAVLGSAMLAPQLVRAQAAGA
                 AIAPPSTVTQPPRDFGPNGAPTTYFTDPDVLTVDPAFDGLRQPNAAIQRLWTGALWSEG
              PAWNSVGRFLVWSDIPNNRQLRWSEDDGHVSVFRSPSNNSNGNTFDYQGRQLSCEHLTR
                RVVRYELDGSTTILASTFNGKRLNSPNDVVPHPDGSYWFTDPPYGAQFYEGTVDAAGGP
                 ANKAGRMNPRLGQPPEIGFYKRELPTAVYRLDKSGTLTQVAGEDLVPDPNGLCFSPDFK
               KLYIVSTGQGPGDSIAGGKGDMYAFDVGADNKLSNGKLFSNFMIDGVKCGPDGVRADVD
                 GNLWCSSNAGRSVGYSGVTVWTPQGRLIGRIRLPEICGNVCFGGPKRNRLFMAASQSLY
                 AVYTGTQGAAPG"

An example of CDS without a "gene" tag:

 gene            complement(4231..4512)
                 /locus_tag="CPIDKGOP_01342"
 CDS             complement(4231..4512)
                 /codon_start=1
                 /inference="ab initio prediction:Prodigal:2.6"
                 /locus_tag="CPIDKGOP_01342"
                 /product="hypothetical protein"
                 /transl_table=11
                 /translation="MNTLFQKLVLSRLWLSFIVLGLAFLAFGAGTLNLGLLFIANARLL
                 GAHGWQAVMDGALWQLLELIVTGYLSIAAYVVLKACEHRLSQWLAHER"
ADD REPLY
0
Entering edit mode

Yep, try my edited code above. I changed it before you replied, but possible after you initially saw it. I noticed it was still parsing out ”gene” features, so I’ve switched that over now. Try the newer code above.

ADD REPLY
0
Entering edit mode

Ah I see it now. This is perfect. Thanks so much!

ADD REPLY

Login before adding your answer.

Traffic: 2989 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6