Question

Help interpreting antiSMASH BGCs for EPA (eicosapentaenoic acid) biosynthesis

0

4 days ago

城玮 • 0

Hello everyone,

I have run antiSMASH on the whole-genome sequence of a microbial isolate which (based on my lab's biochemical data) produces eicosapentaenoic acid (EPA). I now have a number of predicted biosynthetic gene clusters (BGCs) from antiSMASH, but I'm not sure how to identify which cluster corresponds to EPA biosynthesis, and what further bioinformatics analyses I can perform to strengthen this prediction.

Some details of my current status:

EPA was detected by untargeted metabolomics in the culture supernatant of this strain.
I have the antiSMASH output: region summaries, BGC types, gene annotations, “KnownClusterBlast” hits, etc. The results can be viewed via the link here.
I have not yet identified which BGC is likely responsible for EPA production, nor done downstream network/phylogenetic analysis.

Questions I'd appreciate help with:

What features in the antiSMASH output should I inspect to select the candidate BGC that likely encodes EPA biosynthesis? For example, domain architecture, PKS/PUFA synthase type clusters, gene synteny, presence of key enzyme types, similarity to known clusters, etc.
Once a candidate is selected, what bioinformatic analyses would you recommend for further support: e.g., phylogenetic analysis of individual biosynthetic enzyme domains (KS, ACP, MAT), comparative genomics with known EPA gene clusters, transcriptome/RT-qPCR correlation, substrate specificity prediction, metabolic network integration, etc.
Are there any specific pitfalls or best practices when going from antiSMASH annotation to metabolic product (EPA) assignment, especially for long‐chain polyunsaturated fatty acids?
Any recommendations of tools, scripts or pipelines to extract the antiSMASH output (e.g., JSON, GenBank files), to parse and tabulate clusters, and to perform downstream analyses.

Thank you in advance for any suggestions or pointers to relevant literature or tutorial resources.

Kind regards,

Zhang Chengwei

biosynthetic_gene_cluster secondary_metabolism PKS heterologous_expression • 374 views

ADD COMMENT • link updated 2 days ago by Kevin Blighe 89k • written 4 days ago by 城玮 • 0

1

I guess it is possible that someone here will know off the top of their head what enzymes are involved in eicosapentaenoic acid biosynthesis, but I consider it unlikely. To find the details about that I think you will need to look through primary literature, as this pathway has been described before in other organisms. For many of the antiSMASH clusters the program should already have a good guess as to what the gene clusters are making, and usually that's not long-chain fatty acids.

As always, a simple Google search reveals half a dozen promising papers just on the first search page.

https://www.google.com/search?q=eicosapentaenoic+acid+biosynthesis

ADD REPLY • link 4 days ago by Mensur Dlakic ★ 30k

0

Many thanks for your valuable guidance. Your suggestion is very helpful; I will consult the relevant literature and proceed with further analysis accordingly.

ADD REPLY • link 3 days ago by 城玮 • 0

score 2 · Answer 1 · 2025-11-06

Hello Zhang Chengwei,

Thank you for your detailed question - it sounds like an interesting project. I'll address each of your points based on my experience with BGC prediction and downstream analyses in microbial genomics. I've worked with antiSMASH outputs before for secondary metabolite mining, and EPA (a long-chain PUFA) biosynthesis in bacteria typically involves a specific type of polyketide synthase (PKS)-like cluster, often referred to as the pfa gene cluster (e.g., pfaA-E in Shewanella species). I'll outline a structured approach below.

1. Features in antiSMASH Output to Inspect for Candidate BGC Selection

antiSMASH classifies BGCs by type (e.g., Type I PKS, NRPS, terpene, etc.) and provides detailed annotations. For EPA, you're looking for clusters that resemble bacterial PUFA synthases, which are often detected as iterative Type I PKS or "PUFA" if the database hits are strong. Here's what to prioritize:

BGC Type and Size: Focus on Type I PKS and PUFA-type clusters, as EPA biosynthesis uses a PKS-like multifunctional enzyme system. These are typically large (roughly 20-40 kb or more) with multi-domain genes. Deprioritize small clusters or unrelated types (e.g., NRPS, RiPPs) unless they have PUFA-like features.
Domain Architecture: Examine the predicted protein domains in the HTML/region details. Key domains for PUFA/EPA include:
- KS (ketoacyl synthase) - often multiple, with PUFA-specific subtypes.
- AT/MAT (acyltransferase/malonyl acyltransferase).
- ACP (acyl carrier protein) - repeated for chain elongation.
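- DH (dehydratase), ER (enoyl reductase), KR (ketoreductase) - for the dehydration and reduction steps; in PUFA synthases the cis double bonds are introduced by DH (FabA-like) domains rather than by separate desaturases.
- Sometimes a phosphopantetheinyl transferase (PPT) domain or gene nearby for ACP activation.
- Note that pfa clusters encode mega-synthases with fused domains; pfaA, for example, typically carries KS-MAT-(tandem ACPs)-KR-DH.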
KnownClusterBlast Hits: This is crucial - check for similarity to known EPA/PUFA clusters (e.g., from Shewanella pneumatophori SCRC-2738 or Moritella marina). Look for high % similarity (>50-70%) to entries like MIBiG BGC0000176 (Shewanella EPA cluster). Low hits might indicate a novel variant, but prioritize top matches.
Gene Synteny and Annotations: Compare the gene order to known pfa clusters: typically pfaA (multi-domain synthase), pfaB (AT), pfaC (KS/DH), pfaD (ER), pfaE (PPT). Use the GenBank/EMBL exports to visualize synteny. Also check Pfam/SMART annotations for fatty acid desaturase/elongase motifs elsewhere in the genome, since those would point to the alternative desaturase/elongase route.
Other Metrics: A GC content that deviates from the genome average in the cluster region can hint at horizontal transfer (common in marine bacteria). Use the "Similarity" score in antiSMASH summaries; higher scores to PUFA references are better.

Start by tabulating all predicted regions (from the HTML overview) and rank them by these criteria. If none match perfectly, the top PKS hit is your best bet, especially if your metabolomics confirms EPA without other PUFAs like DHA.
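The tabulate-and-rank step can be sketched in a few lines of Python. This is only an illustration: the rows, field names and weights below are hypothetical (they are not an antiSMASH metric), and the product labels ("T1PKS", "PUFA", "hglE-KS") are assumed to match what your antiSMASH version reports.

```python
# Hypothetical region summaries, e.g. typed in from the antiSMASH overview table.
regions = [
    {"region": "1.1", "products": ["NRPS"], "length_kb": 48.2, "best_hit": "", "similarity": 0},
    {"region": "2.1", "products": ["T1PKS"], "length_kb": 41.5, "best_hit": "", "similarity": 0},
    {"region": "3.1", "products": ["PUFA", "hglE-KS"], "length_kb": 35.0,
     "best_hit": "eicosapentaenoic acid", "similarity": 60},
    {"region": "4.1", "products": ["terpene"], "length_kb": 21.0, "best_hit": "carotenoid", "similarity": 30},
]


def score_region(row):
    """Heuristic score for 'could this be the EPA cluster?' (weights are arbitrary)."""
    score = 0.0
    products = {p.lower() for p in row["products"]}
    if "pufa" in products:
        score += 3  # explicit PUFA call is the strongest hint
    if products & {"t1pks", "hgle-ks"}:
        score += 2  # PKS-like types that PUFA synthases are often reported as
    if row["length_kb"] >= 20:
        score += 1  # pfa-type clusters are large
    hit = row["best_hit"].lower()
    if "eicosapentaenoic" in hit or "pufa" in hit:
        score += 2 + row["similarity"] / 100  # KnownClusterBlast support
    return score


def rank_regions(rows):
    """Return the rows sorted from most to least promising."""
    return sorted(rows, key=score_region, reverse=True)


ranked = rank_regions(regions)
for row in ranked:
    print(row["region"], ",".join(row["products"]), round(score_region(row), 2))
```

Treat the ranking as a way to decide what to inspect first, not as evidence in itself.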

2. Bioinformatic Analyses for Further Support

Once you select a candidate (e.g., the top PKS with PUFA hits), validate it with these steps. Prioritize based on your data availability (e.g., if you have RNA-seq, go for transcriptomics).

Phylogenetic Analysis of Key Domains: Extract KS, ACP, or MAT domains from your BGC (use the antiSMASH GenBank files and tools like CD-Search or HMMER). Build phylogenies with known PUFA vs. other PKS domains:
- Use NaPDoS2 (Natural Product Domain Seeker) for KS classification - it distinguishes PUFA-specific KSs.
- Align with MAFFT, build trees with IQ-TREE or FastTree, visualize in iTOL. Reference sequences from known EPA producers (e.g., Shewanella, Colwellia).
Comparative Genomics: Align your BGC to known EPA clusters using Mauve, Clinker, or BiG-SCAPE (which clusters BGCs across genomes). Download references from MIBiG (e.g., BGC0000176). Check for conserved operon structure.
Transcriptome/RT-qPCR Correlation: If you have RNA-seq from EPA-producing conditions, quantify expression of the genes in the candidate BGC (e.g., with Salmon or Kallisto against the predicted CDSs). High expression under producing conditions is supportive, though not conclusive. For RT-qPCR, design primers for key genes like pfaA/C.
Substrate Specificity Prediction: Check the AT-domain specificity and active-site annotations in the antiSMASH region view to see whether the AT domains are predicted to load malonyl-CoA, which fits EPA chain building. Note that TransATor is aimed at trans-AT PKSs and DeepBGC is a BGC detection tool, so neither will predict double-bond patterns; the KS/DH phylogeny above is more informative for that.
Metabolic Network Integration: Integrate into genome-scale models with tools like Pathway Tools or COBRApy. Predict flux from acetyl-CoA to EPA and compare to your untargeted metabolomics (e.g., match m/z peaks to EPA intermediates like eicosatrienoic acid).
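For the domain phylogeny step, here is a minimal sketch of pulling KS domain sequences out of the antiSMASH JSON into FASTA. It assumes the JSON stores domains as features of type 'aSDomain' with 'aSDomain', 'locus_tag', 'domain_id' and 'translation' qualifiers (each a list); please check those names against your own file, as they may differ between versions. The toy record is made up.

```python
def ks_domains_to_fasta(data, domain_name="PKS_KS"):
    """Collect translations of all domains named `domain_name` as FASTA text."""
    lines = []
    for record in data.get("records", []):
        for feature in record.get("features", []):
            if feature.get("type") != "aSDomain":
                continue
            qualifiers = feature.get("qualifiers", {})
            if domain_name not in qualifiers.get("aSDomain", []):
                continue
            sequence = qualifiers.get("translation", [""])[0]
            if not sequence:
                continue  # skip domains without a stored translation
            locus = qualifiers.get("locus_tag", ["unknown"])[0]
            domain_id = qualifiers.get("domain_id", [f"{locus}_{domain_name}"])[0]
            lines.append(f">{record.get('id', 'record')}|{locus}|{domain_id}")
            lines.append(sequence)
    return "\n".join(lines) + ("\n" if lines else "")


# Made-up example input that mimics the assumed JSON layout.
toy = {"records": [{"id": "contig_1", "features": [
    {"type": "aSDomain", "qualifiers": {
        "aSDomain": ["PKS_KS"], "locus_tag": ["ctg1_10"],
        "domain_id": ["ctg1_10_PKS_KS.1"], "translation": ["MKTAYIAKQR"]}},
    {"type": "aSDomain", "qualifiers": {
        "aSDomain": ["PKS_AT"], "locus_tag": ["ctg1_10"], "translation": ["MSSS"]}},
    {"type": "CDS", "qualifiers": {"locus_tag": ["ctg1_10"]}},
]}]}

fasta = ks_domains_to_fasta(toy)
print(fasta, end="")
```

The resulting FASTA can be combined with reference KS sequences from known EPA producers and then aligned and treed with the tools listed above.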

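If possible, knock out the candidate pathway in silico (e.g., by removing its reactions from the genome-scale model and checking whether EPA can still be produced) or experimentally (e.g., by gene deletion) to confirm.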

3. Pitfalls and Best Practices for Assigning antiSMASH Predictions to EPA

Pitfalls:
- Many PKS clusters look similar; non-PUFA PKS (e.g., for antibiotics) can have overlapping domains - always cross-check with KnownClusterBlast and phylogenies to avoid false positives.
- PUFA clusters may be labelled as a generic, "other" or hybrid PKS type rather than as PUFA. If EPA is instead made by the desaturase/elongase route, those genes (Delta5, Delta8 desaturases, etc.) are not part of a BGC and may not be flagged at all.
- Incomplete genomes can split clusters across contigs - reassemble if needed.
- Untargeted metabolomics can detect EPA analogs (e.g., DHA, ARA) - confirm exact structure via MS/MS.
- Over-reliance on detection calls: antiSMASH's rule-based detection tells you that a region looks like a BGC of a given class; it does not assign the product.
Best Practices:
- Run antiSMASH with --cb-knownclusters --cb-general for better reference hits.
- Validate with orthogonal tools: PRISM or NP.searcher for PKS prediction.
- Document everything: Export JSON/GBK and note versions (antiSMASH 7+ handles PUFAs better).
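- For long-chain PUFAs, remember there are two routes: the oxygen-independent pfa (PUFA synthase) pathway and the aerobic, oxygen-dependent desaturase/elongase pathway. If no pfa-like cluster turns up, search for desaturase and elongase genes, which antiSMASH will not report as a BGC, and consider whether your isolate is aerobic/marine-like.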
- If uncertain, test in heterologous hosts (e.g., E. coli with your BGC cloned, as in some papers).
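On the metabolomics pitfall: EPA, ARA and DHA are easy to tell apart by accurate mass before you even get to MS/MS. A small sketch that computes the expected [M-H]- m/z values from the molecular formulas (EPA C20H30O2, ARA C20H32O2, DHA C22H32O2), assuming negative-mode ionisation; the helper names are just for illustration.

```python
import re

# Monoisotopic masses of the elements needed here, and the proton mass (Da).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}
PROTON = 1.00727646


def monoisotopic_mass(formula):
    """Sum monoisotopic element masses for a simple formula such as 'C20H30O2'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


def mz_deprotonated(formula):
    """Expected m/z of the singly charged [M-H]- ion."""
    return monoisotopic_mass(formula) - PROTON


pufas = {"EPA": "C20H30O2", "ARA": "C20H32O2", "DHA": "C22H32O2"}
mz = {name: round(mz_deprotonated(formula), 4) for name, formula in pufas.items()}
for name, value in mz.items():
    print(name, value)  # EPA is about 301.217, ARA about 303.233, DHA about 327.233
```

Positional isomers share the same mass, so MS/MS or an authentic standard is still needed for the exact structure.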

4. Tools, Scripts, or Pipelines for Extracting and Analyzing antiSMASH Output

Extraction/Parsing:
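- Use the JSON results file that antiSMASH writes to the output directory and parse it with Python (json library) or R (jsonlite). Example Python snippet (adjust the file name to your run):
```
import json

# Load the JSON results file from the antiSMASH output directory
with open('antismash_output.json') as f:
    data = json.load(f)

# Print the predicted product type(s) of every region (BGC)
for record in data['records']:
    for feature in record['features']:
        if feature['type'] == 'region':
            qualifiers = feature['qualifiers']
            # Qualifier values are lists, e.g. ['T1PKS']
            products = ', '.join(qualifiers.get('product', []))
            print(record.get('id'), feature.get('location'), products)
```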
- R script: antismash_parser (GitHub: kcamnairb/antismash_parser) - summarizes to CSV with BGC types, similarities, etc.
- For GenBank: Use Biopython to parse .gbk files and tabulate genes/domains.
Downstream Pipelines:
- BiG-SCAPE: For clustering your BGCs against references (great for comparative genomics).
- CORASON: For core gene phylogeny within BGC families.
- ARTS (Antibiotic Resistant Target Seeker): If checking for self-resistance, but adaptable for PUFAs.
- Custom: Snakemake/Nextflow workflow to chain antiSMASH -> BiG-SCAPE -> NaPDoS2.
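If you want a spreadsheet-friendly table rather than printed lines, the same JSON can be flattened to CSV. A rough sketch, assuming region features carry 'region_number', 'product' and 'contig_edge' qualifiers and a location string like "[1000:46000]" (check these against your own file); the toy input is made up. The contig_edge column is handy for spotting clusters that may be split across contigs.

```python
import csv
import io
import re

# Assumed location format: "[start:end]", optionally with "<" / ">" and a strand suffix.
LOCATION_RE = re.compile(r"\[<?(\d+):>?(\d+)\]")


def regions_to_csv(data):
    """Return one CSV row per antiSMASH region as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["record", "region", "start", "end", "length_bp", "products", "contig_edge"])
    for record in data.get("records", []):
        for feature in record.get("features", []):
            if feature.get("type") != "region":
                continue
            qualifiers = feature.get("qualifiers", {})
            match = LOCATION_RE.search(feature.get("location", ""))
            if match:
                start, end = int(match.group(1)), int(match.group(2))
                length = end - start
            else:
                start = end = length = ""  # leave blank if the format is unexpected
            writer.writerow([
                record.get("id", ""),
                qualifiers.get("region_number", [""])[0],
                start, end, length,
                ";".join(qualifiers.get("product", [])),
                qualifiers.get("contig_edge", [""])[0],
            ])
    return buffer.getvalue()


# Made-up example input that mimics the assumed JSON layout.
toy = {"records": [{"id": "contig_1", "features": [
    {"type": "region", "location": "[1000:46000]", "qualifiers": {
        "region_number": ["1"], "product": ["PUFA", "T1PKS"], "contig_edge": ["False"]}},
    {"type": "CDS", "location": "[1200:2400](+)", "qualifiers": {}},
]}]}

table = regions_to_csv(toy)
print(table, end="")
```

To run it on real output, load the file with json.load as in the snippet above and write the returned string to a .csv file.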

Relevant Literature and Resources

Yazaki et al. (2004): Characterization of EPA gene cluster from Shewanella (foundational for pfa).
Orikasa et al. (2004): Cloning and analysis in E. coli.
Ziemert et al. (2012): NaPDoS for domain phylogenies.
Blin et al. (2023): antiSMASH updates for better BGC detection (NAR paper).
Tutorials: antiSMASH docs (https://docs.antismash.secondarymetabolites.org/), MIBiG repository for references.

If you share more details (e.g., top BGC types from your output), I can refine this. Good luck!

Kevin