7 weeks ago

Hello, I ran antiSMASH from the CLI and got 700 gbk files (one gbk file per analyzed genome).

I used the following script to retrieve the predicted products from the gbk files:

import glob
import os

from Bio import SeqIO

# Make sure the output directory exists before writing to it
os.makedirs("products", exist_ok=True)

for gbk_file in glob.glob("*.gbk"):
    out_file = "products/" + gbk_file.replace(".gbk", "_output.tsv")
    # The with-block closes each output file once its genome is done
    with open(out_file, "w") as cluster_out:
        # Extract cluster info, write to file
        for seq_record in SeqIO.parse(gbk_file, "genbank"):
            for seq_feat in seq_record.features:
                if seq_feat.type == "protocluster":
                    cluster_number = seq_feat.qualifiers["protocluster_number"][0].replace(" ", "_").replace(":", "")
                    cluster_type = seq_feat.qualifiers["product"][0]
                    cluster_out.write("#" + cluster_number + "\tCluster Type:" + cluster_type + "\n")

This way, from those gbk files I produced ".tsv" files that list the predicted products for each genome.

Here is an example:

cat cluster1_bin1.tsv

1 Cluster Type:TfuA-related

1 Cluster Type:terpene

1 Cluster Type:NRPS-like

1 Cluster Type:terpene

1 Cluster Type:terpene

From those ".tsv" files I want to generate a table like this:


How can I produce that table? I could do this manually, but there are 700 ".tsv" files, so I want to know whether I can automate it.

Thanks for your time :)

import glob
import os

import pandas as pd

counts = {}

for file_name in glob.glob("*.tsv"):  # Prefix "*.tsv" with the target directory
    genome_id = os.path.basename(file_name).replace(".tsv", "")
    with open(file_name) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # Skip blank lines
            cluster_type = line.split("\t")[-1].replace("Cluster Type:", "")
            if cluster_type not in counts:
                counts[cluster_type] = {}
            counts[cluster_type][genome_id] = counts[cluster_type].get(genome_id, 0) + 1

# Rows are genomes, columns are cluster types
df = pd.DataFrame(counts)
df = df.fillna(0).astype(int)  # A genome lacking a cluster type gets a count of 0
df.to_csv("table.tsv", sep="\t")

This should do the job, but you could also apply the same logic in your main script, that is, count the products while parsing the gbk files instead of writing the intermediate .tsv files first.
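As a rough sketch of that second option, the counting could happen while the gbk files are parsed. The helper names `count_products`, `write_table` and `build_table` are made up for illustration and the output layout (one row per genome, one column per product type) is an assumption about the table you want. The only parts taken from the question are the "protocluster" feature type and the "product" qualifier. This version uses only the standard library for the counting and writing, so pandas is not needed.

```python
import csv
from collections import Counter


def count_products(records):
    """Tally the "product" qualifier of every protocluster feature."""
    counts = Counter()
    for record in records:
        for feature in record.features:
            if feature.type == "protocluster":
                counts[feature.qualifiers["product"][0]] += 1
    return counts


def write_table(per_genome, out_path):
    """Write one row per genome and one column per product type."""
    products = sorted({p for counts in per_genome.values() for p in counts})
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["genome"] + products)
        for genome, counts in sorted(per_genome.items()):
            # Product types a genome lacks are written as 0
            writer.writerow([genome] + [counts.get(p, 0) for p in products])


def build_table(gbk_dir, out_path):
    """Hypothetical driver: parse every .gbk file in gbk_dir and write the table."""
    import glob
    import os

    from Bio import SeqIO  # Biopython, as used in the question

    per_genome = {}
    for gbk_file in glob.glob(os.path.join(gbk_dir, "*.gbk")):
        genome = os.path.basename(gbk_file).replace(".gbk", "")
        per_genome[genome] = count_products(SeqIO.parse(gbk_file, "genbank"))
    write_table(per_genome, out_path)
```

Calling `build_table(".", "table.tsv")` from the directory that holds the gbk files would then write the whole table in one pass. Unlike the .tsv-based version, a genome without any protocluster still gets a row (all zeros).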

Cheers!! :)

Thanks so much! This helped me a lot!

cheers :)


