parsing gbk files (antismash result)
2.6 years ago

Hello, I ran antiSMASH from the CLI and got 700 gbk files (one gbk file per analyzed genome).

I used the following script to retrieve the predicted products from the gbk files:

from Bio import SeqIO
import glob
import os

os.makedirs("products", exist_ok=True)  # the output directory must exist before writing

for gbk_file in glob.glob("*.gbk"):
    out_file = "products/" + gbk_file.replace(".gbk", "_output.tsv")
    # Extract cluster info and write one line per protocluster
    with open(out_file, "w") as cluster_out:
        for seq_record in SeqIO.parse(gbk_file, "genbank"):
            for seq_feat in seq_record.features:
                if seq_feat.type == "protocluster":
                    cluster_number = seq_feat.qualifiers["protocluster_number"][0].replace(" ", "_").replace(":", "")
                    cluster_type = seq_feat.qualifiers["product"][0]
                    cluster_out.write("#" + cluster_number + "\tCluster Type:" + cluster_type + "\n")

This way, I produced one ".tsv" file per gbk file, each listing the predicted products for that genome.

Here is an example:

cat cluster1_bin1.tsv

1 Cluster Type:TfuA-related

1 Cluster Type:terpene

1 Cluster Type:NRPS-like

1 Cluster Type:terpene

1 Cluster Type:terpene

From those ".tsv" files I want to generate a table like this:

[image: table_example]

How can I produce that table? I could do this manually, but there are 700 ".tsv" files, so I would like to know whether it can be automated.

Thanks for your time :)

awk biopython gbk antismash bash • 1.7k views
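import glob
import os
import pandas as pd

counts = {}  # counts[product][genome] = number of clusters with that product

for fileName in glob.glob("*.tsv"):  # Prefix "*.tsv" with the target directory
    if os.path.basename(fileName) == "table.tsv":
        continue  # skip the output of a previous run
    genomeIdentifier = os.path.basename(fileName).replace(".tsv", "")
    with open(fileName) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            product = line.split("\t")[-1].replace("Cluster Type:", "")
            counts.setdefault(product, {})
            counts[product][genomeIdentifier] = counts[product].get(genomeIdentifier, 0) + 1

df = pd.DataFrame(counts)      # rows: genomes, columns: product types
df = df.fillna(0).astype(int)  # a genome lacking a product gets a count of 0
df.to_csv("table.tsv", sep="\t")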

This should do the job. You could also apply the same logic in your main script, i.e. count the products while you are creating the .tsv files.
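In case it helps, here is a rough sketch of that second suggestion: counting the products straight from the gbk files and skipping the intermediate .tsv files. It assumes the same `protocluster` features and `product` qualifier as the script in the question. The function names (`products_in_gbk`, `write_table`, `main`) and the output name `product_table.tsv` are made up for illustration, and it uses only the standard library plus Biopython rather than pandas.

```python
import csv
import glob
import os
from collections import Counter


def products_in_gbk(gbk_path):
    """Yield the product of every protocluster feature in one antiSMASH gbk file."""
    # Imported here so the table helper below can be used without Biopython.
    from Bio import SeqIO

    for record in SeqIO.parse(gbk_path, "genbank"):
        for feature in record.features:
            if feature.type == "protocluster":
                yield feature.qualifiers["product"][0]


def write_table(counts_by_genome, out_path):
    """Write a genome x product count table; missing combinations become 0."""
    products = sorted({p for counts in counts_by_genome.values() for p in counts})
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["genome"] + products)
        for genome in sorted(counts_by_genome):
            counts = counts_by_genome[genome]
            writer.writerow([genome] + [counts.get(p, 0) for p in products])


def main(pattern="*.gbk", out_path="product_table.tsv"):
    # One Counter per genome, keyed by the gbk file name without its extension.
    counts_by_genome = {}
    for gbk_path in glob.glob(pattern):
        genome = os.path.splitext(os.path.basename(gbk_path))[0]
        counts_by_genome[genome] = Counter(products_in_gbk(gbk_path))
    write_table(counts_by_genome, out_path)


if __name__ == "__main__":
    main()
```

One difference from the .tsv-based approach: a genome with no protoclusters still gets a row here (all zeros), because every gbk file is registered before its products are counted.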

Cheers!! :)


Thanks so much! This helped me a lot!

cheers :)
