Question

antismash 5.0 (New region concept) - counting BGCs

4.8 years ago

arshad1292 ▴ 110

Hi,

I am new to antiSMASH analysis and am using the latest version (5.0), so I cannot find answers to my questions in the older threads.

Here are the details:

I ran antiSMASH and obtained .gbk files as well as a new folder called "region1" (I believe this is new in the latest version). This folder contains several .html files with names like "ctg3_14_mibig_hits.html".

When I open this .html file, it contains the following eight columns:

MIBiG Protein
Description
MIBiG Cluster
MiBiG Product
% ID
% Coverage
BLAST Score
E-value

The fourth column (MiBiG Product) contains the name of the product type, e.g. NRP, polyketide, terpene, other, etc., and I am interested in counting the number of BGC types in each sample (maybe from this column?).

Q1. I am confused about which file I should use to count the BGC types: the .html files (I have several) under the "region1" folder, or the .gbk file?

Q2. In either case, I need a method/script to do so. I would really appreciate it if someone could share code for counting the BGCs in each sample, since I have several such files, with tens or hundreds of MIBiG products in each file.

Please help this newbie.

Many thanks,

antismash metagenomics • 2.2k views

ADD COMMENT • link updated 4.3 years ago by timothy.kirkwood ▴ 140 • written 4.8 years ago by arshad1292 ▴ 110

It looks like it would be that 4th column. See the answer in this SO thread.

You may also be able to simply cut/sort/uniq/count that column.
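As a rough sketch of that cut/sort/uniq idea in Python, the snippet below pulls the 4th column out of a hit table and tallies the values. It assumes the hits file is a plain HTML table with one `<tr>` per hit and one `<td>` per column, which is worth checking against a real file first; `TableCells`, `count_column`, and `count_in_folder` are hypothetical helper names, not part of antiSMASH. Note that it simply counts what is in the table, so the totals are numbers of MIBiG hits per product label.

```python
from collections import Counter
from html.parser import HTMLParser
from pathlib import Path


class TableCells(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows made of <th> cells stay empty and are skipped
                self.rows.append(self._row)
            self._row = None


def count_column(html_text, column=3):
    """Tally the values in one column (0-based index; 3 is the 4th column)."""
    parser = TableCells()
    parser.feed(html_text)
    parser.close()
    return Counter(row[column] for row in parser.rows if len(row) > column)


def count_in_folder(folder, pattern="*_mibig_hits.html"):
    """Sum the per-file tallies over every matching file in a folder."""
    total = Counter()
    for path in sorted(Path(folder).glob(pattern)):
        total += count_column(path.read_text(encoding="utf-8"))
    return total
```

Calling `count_in_folder("region1")` would then give one combined tally for the whole folder, assuming the files follow the `*_mibig_hits.html` naming shown in the question.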

The antiSMASH HTML output is thoroughly described on their help page.

ADD REPLY • link 4.8 years ago by GenoMax 152k

Thank you for your response. I have read the antiSMASH output documentation, but I am still struggling to understand the output files. Sorry for my lack of knowledge.

For example, I obtained 116 html files from a single run, and each html file contains tens of MiBiG Product entries (4th column). If each html file has 10 MiBiG Product entries on average, that comes to about 1160 entries in total for each run. Should I count the "MiBiG Product" values (4th column) across all 116 files and then add them up to obtain the totals (NRP, polyketide, etc.)?

ADD REPLY • link 4.8 years ago by arshad1292 ▴ 110

score 0 · Answer 1 · 2021-03-21

Probably too late but:

MIBiG is a database of known biosynthetic clusters and is used to produce the 'known cluster blast' tab data in the antiSMASH web portal. These are not the predicted clusters for your input genome; they are just similar hits that have been published/confirmed to some degree (compared to the antiSMASH DB, which is just predictions for all of the NCBI data without being confirmed). I didn't dig into the html files for my own data that much, but I think the following is correct. You have a single region, with one or more BGCs inside of it. This region has genes encoding proteins. Each of the html files corresponds to a single gene/protein in the region, and the entries in a single html file are MIBiG hits to that single protein. "ctg3_14_mibig_hits.html" would be the html file of MIBiG hits for the region 1 protein annotated '14' in whatever genome you fed into antiSMASH. For me, the html files are inside a folder called 'knownclusterblast'. I have around 30 regions, so maybe yours aren't in this folder because you only have one region, are using a different antiSMASH version, etc.

For one region you can just open the 'index' html file, which shows a web page with a clear summary. If you have lots of regions, I would set something up to parse the JSON file that is also in the output, as this has all the information from the antiSMASH run.
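A minimal sketch of that JSON approach is below. It assumes the JSON has a top-level "records" list, that each record has a "features" list, and that predicted regions are features with "type" equal to "region" carrying their product types in `qualifiers["product"]`. That is my understanding of the antiSMASH 5 layout, but check it against your own file, as the key names are assumptions here. `count_region_products` and `count_from_file` are hypothetical helper names.

```python
import json
from collections import Counter


def count_region_products(results):
    """Count predicted regions per product type in a parsed antiSMASH JSON dict.

    Assumed layout: results["records"][i]["features"][j] with a "type" key
    and a "qualifiers" dict holding a "product" list.
    """
    counts = Counter()
    for record in results.get("records", []):
        for feature in record.get("features", []):
            if feature.get("type") != "region":
                continue
            products = feature.get("qualifiers", {}).get("product", [])
            # A hybrid region lists several products; join them so that
            # each region is counted exactly once.
            label = "+".join(sorted(products)) or "unknown"
            counts[label] += 1
    return counts


def count_from_file(json_path):
    """Load one antiSMASH JSON output file and count its regions."""
    with open(json_path) as handle:
        return count_region_products(json.load(handle))
```

Running `count_from_file` on the JSON from each sample and printing `counts.most_common()` would give a per-sample table of region types. Unlike the MIBiG hit tables, this counts the regions antiSMASH actually predicted in your input.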