Hi. This question is pertaining to parsing a genbank file. I would like to parse through the file using SeqIO and output specific feature information to a CSV file (one column for each piece of information). My example input looks like the following:
SBxxxxxx.LargeContigs.gbk:
LOCUS       scaffold_31            38809 bp    DNA              UNK 01-JAN-1980
DEFINITION  scaffold_31.
ACCESSION   scaffold_31
VERSION     scaffold_31
KEYWORDS    .
SOURCE      .
      ORGANISM  .
COMMENT     ##antiSMASH-Data-START##
            Version      :: 6.1.1
            Run date     :: 2022-09-21 11:09:55
            ##antiSMASH-Data-END##
FEATURES             Location/Qualifiers
            protocluster    26198..38809
                            /aStool="rule-based-clusters"
                            /category="terpene"
                            /contig_edge="True"
                            /core_location="[36197:37079](-)"
                            /cutoff="20000"
                            /detection_rule="(Terpene_synth or Terpene_synth_C or
                            phytoene_synt or Lycopene_cycl or terpene_cyclase or NapT7
                            or fung_ggpps or fung_ggpps2 or trichodiene_synth or TRI5)"
                            /neighbourhood="10000"
                            /product="terpene"
                            /protocluster_number="1"
                            /tool="antismash"
Now for the output file, I want to create a csv with 3 columns. one column will have the Scaffold information (i.e., scaffold_31), the second column will have the category value in the protocluster feature (i.e., /category = "terpene") and the third column will have the product value in the protocluster feature (i.e., /product="terpene")
This is what I have so far for code. I know it is incomplete but any guidance much appreciated. I am completely new to parsing through gene bank files so have little knowledge in this domain. Thanks!
import Bio
from Bio import SeqIO
import os
input = "/Path to SBxxxxxx.LargeContigs.gbk"
output = open("output.csv", "w")
if not os.path.exists(output):
     for record in SeqIO.parse(input, "genbank")
          for feature in record.features:
              if feature.type == "protocluster" and "category" and "product" in feature.qualifiers:
                  outfile = feature.qualifiers["category"][0] + "," + feature.qualifiers["product"][0] + "\n"
                  output.write(outfile)
Thank you, this is very helpful! I will go ahead and try implementing some of these changes. Now quick follow up questions. Will this print the scaffold/contig as well? I'm assuming that is what the record.id that you added is for, correct me if I'm wrong though. Also, could you give me a brief overview as to what exactly this addition is doing:
Yes, the name of the scaffold/contig will also be present in the CSV file. You are right – the scaffold/contig name is under
record.id. As for the additional if statement, I added it to just handle potential cases of GenBank records where theproteoclusterfeature is present, but it doesn’t havecategoryorproductqualifiers. If this is the case, thecategoryand/orproductvariables will be set to’Unknown’.