Question

Parsing a genbank file and outputting specific feature information to a csv using BioPython

0

Entering edit mode

3.1 years ago

Zach G • 0

Hi. This question is pertaining to parsing a genbank file. I would like to parse through the file using SeqIO and output specific feature information to a CSV file (one column for each piece of information). My example input looks like the following:

SBxxxxxx.LargeContigs.gbk:

LOCUS       scaffold_31            38809 bp    DNA              UNK 01-JAN-1980
DEFINITION  scaffold_31.
ACCESSION   scaffold_31
VERSION     scaffold_31
KEYWORDS    .
SOURCE      .
      ORGANISM  .
COMMENT     ##antiSMASH-Data-START##
            Version      :: 6.1.1
            Run date     :: 2022-09-21 11:09:55
            ##antiSMASH-Data-END##
FEATURES             Location/Qualifiers
            protocluster    26198..38809
                            /aStool="rule-based-clusters"
                            /category="terpene"
                            /contig_edge="True"
                            /core_location="[36197:37079](-)"
                            /cutoff="20000"
                            /detection_rule="(Terpene_synth or Terpene_synth_C or
                            phytoene_synt or Lycopene_cycl or terpene_cyclase or NapT7
                            or fung_ggpps or fung_ggpps2 or trichodiene_synth or TRI5)"
                            /neighbourhood="10000"
                            /product="terpene"
                            /protocluster_number="1"
                            /tool="antismash"

Now for the output file, I want to create a csv with 3 columns. one column will have the Scaffold information (i.e., scaffold_31), the second column will have the category value in the protocluster feature (i.e., /category = "terpene") and the third column will have the product value in the protocluster feature (i.e., /product="terpene")

This is what I have so far for code. I know it is incomplete but any guidance much appreciated. I am completely new to parsing through gene bank files so have little knowledge in this domain. Thanks!

import Bio
from Bio import SeqIO
import os

input = "/Path to SBxxxxxx.LargeContigs.gbk"
output = open("output.csv", "w")

if not os.path.exists(output):
     for record in SeqIO.parse(input, "genbank")
          for feature in record.features:
              if feature.type == "protocluster" and "category" and "product" in feature.qualifiers:
                  outfile = feature.qualifiers["category"][0] + "," + feature.qualifiers["product"][0] + "\n"
                  output.write(outfile)

Bio.SeqIO genbank biopython • 1.9k views

ADD COMMENT • link updated 3.1 years ago by Andrzej Zielezinski 11k • written 3.1 years ago by Zach G • 0

score 2 · Answer 1 · 2022-09-22

2

Entering edit mode

3.1 years ago

Andrzej Zielezinski 11k

Overall, you're on the right track with your code. I made some minor changes.

import os
from Bio import SeqIO

infile = 'my_scaffold.gb'

oh = open('output.csv', 'w')
for record in SeqIO.parse(infile, 'genbank'):
    for feature in record.features:
        if feature.type == "protocluster":
            category = 'Unknown'
            product = 'Unknown'
            if 'category' in feature.qualifiers:
                category = feature.qualifiers['category'][0]
            if 'product' in feature.qualifiers:
                product = feature.qualifiers['product'][0]
            line = f'{record.id},{category},{product}\n'
            oh.write(line)
oh.close()

ADD COMMENT • link 3.1 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Thank you, this is very helpful! I will go ahead and try implementing some of these changes. Now quick follow up questions. Will this print the scaffold/contig as well? I'm assuming that is what the record.id that you added is for, correct me if I'm wrong though. Also, could you give me a brief overview as to what exactly this addition is doing:

if feature.type == "protocluster":
    category = 'Unknown'
    product = 'Unknown'

ADD REPLY • link 3.1 years ago by Zach G • 0

0

Entering edit mode

Yes, the name of the scaffold/contig will also be present in the CSV file. You are right – the scaffold/contig name is under record.id. As for the additional if statement, I added it to just handle potential cases of GenBank records where the proteocluster feature is present, but it doesn’t have category or product qualifiers. If this is the case, the category and/or product variables will be set to ’Unknown’.

ADD REPLY • link 3.1 years ago by Andrzej Zielezinski 11k