Update for Antismash v5:

Question

Counting Gene Clusters from Antismash Output

5

Entering edit mode

5.9 years ago

Barry Digby ★ 1.3k

First time posting, I'll try keep it concise:

After running Antismash, a 'final.gbk' file is generated. Using the following python script I can generate a file listing each Biosynthetic Gene Cluster (BGC) type:

#!/usr/bin/env python
from Bio import SeqIO

# Set input file, output file to write to
gbk_file = "contig_1.final.gbk"
tsv_file = "test2_output.tsv"      # can be .txt, .csv etc.. 
cluster_out = open(tsv_file, "w")

# Extract Cluster info, write to file
for seq_record in SeqIO.parse(gbk_file, "genbank"):
for seq_feat in seq_record.features:
if seq_feat.type == "cluster":
cluster_number = seq_feat.qualifiers["note"][0].replace(" ","_").replace(":","")
cluster_type = seq_feat.qualifiers["product"][0]

cluster_out.write("#"+cluster_number+"\tCluster Type:"+cluster_type+"\n")

The script gives the output1:

#Cluster_number_1   Cluster Type:   other
#Cluster_number_2   Cluster Type:   terpene
#Cluster_number_3   Cluster Type:   cf_putative
#Cluster_number_4   Cluster Type:   cf_putative
#Cluster_number_5   Cluster Type:   t1pks-nrps

and so on.

Using Python, I would like to read in the above file, and count the number of occurrences for "other", "terpene" "t1pks-nrps" etc. Ideally the output file would be 2 columns like so:

BGC Class     Count
other             1
terpene           1
cf_putative       2
t1pks-nrps        1

This file will then be fed into R to generate a basic histogram plot showing BGC quantities.

My initial thought is read in output1 as a panda dataframe, and to check if each row contains an element of a list, which contains the BGC Class names.

Any help for this python beginner would be greatly appreciated.

Update for Antismash v5:

(I have not used antismash since v3, there is more BGC info stored under protocluster, proto_core and cond_cluster [i.e seq_feat.type] that you may want to extract?)

The below code is based on the sample output file given by the website for Aspergillus fumigatus Af293

from Bio import SeqIO

genbank = "/home/barry/Downloads/AAHF00000000.1.gbk"
tsv_out = "/home/barry/test.tsv"
out_file = open(tsv_out, "w")

for seq_record in SeqIO.parse(genbank, "genbank"):
    for seq_feat in seq_record.features:
        if seq_feat.type == "source":
           chromosome = seq_feat.qualifiers["chromosome"][0]
           chromosome = str("Chromosome number: " + chromosome)
        if seq_feat.type == "protocluster":
           number = seq_feat.qualifiers["protocluster_number"][0]
           cluster = seq_feat.qualifiers["product"][0]
           number = str("Cluster number: " + number)
           cluster = str("BGC type: " + cluster)
           out_file.write(chromosome + "\t" + number + "\t" + cluster +"\n")

Gives:

Chromosome number: 1    Cluster number: 1   BGC type: T1PKS
Chromosome number: 1    Cluster number: 2   BGC type: NRPS
Chromosome number: 1    Cluster number: 3   BGC type: terpene
Chromosome number: 1    Cluster number: 4   BGC type: betalactone
Chromosome number: 1    Cluster number: 5   BGC type: NRPS
Chromosome number: 1    Cluster number: 6   BGC type: T1PKS
Chromosome number: 2    Cluster number: 1   BGC type: T1PKS
Chromosome number: 2    Cluster number: 2   BGC type: T1PKS
Chromosome number: 2    Cluster number: 3   BGC type: indole
Chromosome number: 3    Cluster number: 1   BGC type: T1PKS
Chromosome number: 3    Cluster number: 2   BGC type: T1PKS
Chromosome number: 3    Cluster number: 3   BGC type: NRPS-like

Cheers to Valentin Berrios Farías for reaching out on LinkedIn and pointing this out

Data SeqIO Python Mining Antismash • 3.8k views

ADD COMMENT • link 3.0 years ago by Barry Digby ★ 1.3k

0

Entering edit mode

Did this output provide the dataframe that you used in order to count the occurrence of the lists elements for each row in the dataframe, this being the BGC class? Thanks and cheers

ADD REPLY • link 4.6 years ago by biohacker_tobe ▴ 80

0

Entering edit mode

Check the stackoverflow link below. If you are using antismash, and need help parsing the data, I would reccomend posting a new question and figure out how to tag Joe (https://www.biostars.org/u/20598/). He helped me out in the past using BioPython.

ADD REPLY • link 4.6 years ago by Barry Digby ★ 1.3k

0

Entering edit mode

thanks :D I will check it out

ADD REPLY • link 4.6 years ago by biohacker_tobe ▴ 80

0

Entering edit mode

Hello,

I want to count Biosynthetic Gene Cluster (BGC) using the above python command but I get the empty output file (no error during the run). Can you please help me how can I count the BGC type in my .gbk file?

Thanks,

ADD REPLY • link 3.6 years ago by arshad1292 ▴ 100