Hello Zhang Chengwei,
Thank you for your detailed question - it sounds like an interesting project. I'll address each of your points based on my experience with BGC prediction and downstream analyses in microbial genomics. I've worked with antiSMASH outputs before for secondary metabolite mining, and EPA (a long-chain PUFA) biosynthesis in bacteria typically involves a specific type of polyketide synthase (PKS)-like cluster, often referred to as the pfa gene cluster (e.g., pfaA-E in Shewanella species). I'll outline a structured approach below.
1. Features in antiSMASH Output to Inspect for Candidate BGC Selection
antiSMASH classifies BGCs by type (e.g., Type I PKS, NRPS, terpene, etc.) and provides detailed annotations. For EPA, you're looking for clusters that resemble bacterial PUFA synthases, which are often detected as iterative Type I PKS or "PUFA" if the database hits are strong. Here's what to prioritize:
BGC Type and Size: Focus on clusters labelled as PUFA or Type I PKS, as EPA biosynthesis uses a PKS-like multifunctional enzyme system. These are typically large (roughly 20-40 kb or more) with multi-domain genes. Deprioritize small clusters and unrelated types (e.g., NRPS, RiPPs) unless they have PUFA-like features.
Domain Architecture: Examine the predicted protein domains in the HTML/region details. Key domains for PUFA/EPA include:
- KS (ketoacyl synthase) - often multiple, with PUFA-specific subtypes.
- AT/MAT (acyltransferase/malonyl acyltransferase).
- ACP (acyl carrier protein) - repeated for chain elongation.
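- DH (dehydratase), ER (enoyl reductase), KR (ketoreductase) - for the reduction steps and for introducing double bonds (in PUFA synthases the FabA-like DH domains do this, not oxygen-dependent desaturases).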
- Sometimes a phosphopantetheinyl transferase (PPT) domain or gene nearby for ACP activation.
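In known pfa clusters, pfaA encodes a large multi-domain synthase. Published annotations describe an arrangement along the lines of KS-MAT-tandem ACPs-KR (with a DH-like region in some annotations), so compare your architecture against a reference cluster rather than expecting a canonical modular PKS layout.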
KnownClusterBlast Hits: This is crucial - check for similarity to known EPA/PUFA clusters (e.g., from Shewanella pneumatophori SCRC-2738 or Moritella marina). Look for high % similarity (>50-70%) to the MIBiG entry for a Shewanella EPA cluster (given here as BGC0000176; confirm the accession in MIBiG before relying on it). Low hits might indicate a novel variant, but prioritize top matches.
Gene Synteny and Annotations: Compare the gene order to known pfa clusters: typically pfaA (multi-domain synthase), pfaB (AT), pfaC (KS/DH), pfaD (ER), pfaE (PPT). Use the GenBank/EMBL exports to visualize synteny. Also check Pfam/SMART annotations for fatty acid desaturase/elongase genes, since these would point to the alternative aerobic route to PUFAs.
Other Metrics: A GC content in the cluster region that deviates from the genome average can hint at horizontal transfer (common in marine bacteria). Use the "Similarity" score in antiSMASH summaries; higher scores to PUFA references are better.
Start by tabulating all predicted regions (from the HTML overview) and rank them by these criteria. If none match perfectly, the top PKS hit is your best bet, especially if your metabolomics confirms EPA without other PUFAs like DHA.
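The tabulate-and-rank step can be sketched in a few lines of Python. This is only an illustration: the `Region` fields, the weights, and the product labels in `PKS_LIKE` are assumptions for the example (check the labels against what your antiSMASH version actually reports), and the values would be filled in by hand or from the parsed output.

```python
from dataclasses import dataclass, field

# Domains expected in a PUFA synthase cluster (from the list above).
CORE_DOMAINS = {"KS", "AT", "ACP", "KR", "DH", "ER"}
# Product labels treated as PKS-like; assumed, adjust to your output.
PKS_LIKE = ("PUFA", "T1PKS", "hglE-KS")


@dataclass
class Region:
    name: str
    product: str
    length_kb: float
    domains: set = field(default_factory=set)
    knowncluster_similarity: float = 0.0  # best PUFA/EPA hit, in percent


def score(region):
    """Hypothetical weighted score; the weights are arbitrary starting points."""
    total = 0.0
    if any(label in region.product for label in PKS_LIKE):
        total += 2.0
    total += 3.0 * len(CORE_DOMAINS & region.domains) / len(CORE_DOMAINS)
    total += 4.0 * region.knowncluster_similarity / 100
    if region.length_kb >= 20:
        total += 1.0
    return total


def rank(regions):
    """Return regions sorted from most to least EPA-like."""
    return sorted(regions, key=score, reverse=True)


if __name__ == "__main__":
    demo = [
        Region("region_1", "NRPS", 45, {"A", "C", "PCP"}, 10),
        Region("region_2", "T1PKS", 38, set(CORE_DOMAINS), 80),
        Region("region_3", "terpene", 12),
    ]
    for r in rank(demo):
        print(f"{r.name}\t{r.product}\t{score(r):.1f}")
```

The score is only a way to order candidates for manual inspection; it is not evidence that the top region makes EPA.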
2. Bioinformatic Analyses for Further Support
Once you select a candidate (e.g., the top PKS with PUFA hits), validate it with these steps. Prioritize based on your data availability (e.g., if you have RNA-seq, go for transcriptomics).
Phylogenetic Analysis of Key Domains: Extract KS, ACP, or MAT domains from your BGC (use the antiSMASH GenBank files and tools like CD-Search or HMMER). Build phylogenies with known PUFA vs. other PKS domains:
- Use NaPDoS2 (Natural Product Domain Seeker) for KS classification - it distinguishes PUFA-specific KSs.
- Align with MAFFT, build trees with IQ-TREE or FastTree, visualize in iTOL. Reference sequences from known EPA producers (e.g., Shewanella, Colwellia).
Comparative Genomics: Align your BGC to known EPA clusters using Mauve, Clinker, or BiG-SCAPE (which clusters BGCs across genomes). Download references from MIBiG (e.g., BGC0000176). Check for conserved operon structure.
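As a quick sanity check before running a full alignment, gene order can be compared with a small script. The sketch below assumes you have already labelled the genes in your candidate by homology (e.g., from BLAST hits) and uses pfaA-pfaB-pfaC-pfaD as the reference order; both the labels and the reference order are assumptions to adapt to the cluster you compare against.

```python
# Assumed reference order for the core genes; pfaE is left out because it
# is not always adjacent to the other genes.
REFERENCE_ORDER = ["pfaA", "pfaB", "pfaC", "pfaD"]


def lcs_length(a, b):
    """Length of the longest common subsequence of two lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def collinear_fraction(candidate, reference=REFERENCE_ORDER):
    """Fraction of reference genes found in the same order in the candidate.

    Both orientations are tried, because the cluster may lie on either strand.
    """
    forward = lcs_length(candidate, reference)
    reverse = lcs_length(candidate[::-1], reference)
    return max(forward, reverse) / len(reference)


if __name__ == "__main__":
    genes = ["hypothetical", "pfaA", "pfaB", "pfaC", "pfaD"]
    print(collinear_fraction(genes))
```

A low fraction does not rule a candidate out, since gene order varies between PUFA producers; it just tells you where to look more closely in Clinker or Mauve.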
Transcriptome/RT-qPCR Correlation: If you have RNA-seq from EPA-producing conditions, map reads to your genome and check expression in the candidate BGC (use Salmon or Kallisto for quantification). High expression correlates with production. For RT-qPCR, design primers for key genes like pfaA/C.
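Substrate Specificity Prediction: Use antiSMASH's ActiveSiteFinder and its AT-domain substrate predictions to check for malonyl-CoA usage, which fits EPA chain building. Note that TransATor targets KS domains of trans-AT PKSs and DeepBGC is a BGC detection tool, so neither predicts double-bond patterns directly; for that, compare your DH domains with those of characterized PUFA synthases.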
Metabolic Network Integration: Integrate into genome-scale models with tools like Pathway Tools or COBRApy. Predict flux from acetyl-CoA to EPA and compare to your untargeted metabolomics (e.g., match m/z peaks to EPA intermediates like eicosatrienoic acid).
If possible, simulate a knockout in silico (e.g., by deleting the candidate genes in a genome-scale model) or, better, knock out the cluster experimentally to confirm.
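For the transcriptome check in this section, a minimal sketch of comparing candidate-gene expression between two conditions is shown here. It assumes you already have TPM values per gene (e.g., exported from Salmon or Kallisto) as `(control, EPA-producing)` pairs; the gene names, the pseudocount, and the threshold are illustrative assumptions. With replicates, a proper differential expression tool is the better choice.

```python
import math


def log2_fold_change(tpm_control, tpm_condition, pseudocount=1.0):
    """log2 ratio of condition over control, with a pseudocount to avoid log(0)."""
    return math.log2((tpm_condition + pseudocount) / (tpm_control + pseudocount))


def cluster_expression(tpm_table, cluster_genes, min_log2fc=1.0):
    """Return per-gene log2 fold changes and the genes at or above the threshold.

    tpm_table maps gene name -> (tpm_control, tpm_condition).
    Genes missing from the table are skipped.
    """
    changes = {
        gene: log2_fold_change(*tpm_table[gene])
        for gene in cluster_genes
        if gene in tpm_table
    }
    induced = sorted(g for g, fc in changes.items() if fc >= min_log2fc)
    return changes, induced


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    tpm = {"pfaA": (3.0, 15.0), "pfaC": (7.0, 31.0), "gyrB": (50.0, 50.0)}
    changes, induced = cluster_expression(tpm, ["pfaA", "pfaC"])
    print(changes, induced)
```

Induction under EPA-producing conditions supports the assignment but does not prove it; constitutively expressed clusters will not show a change at all.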
3. Pitfalls and Best Practices for Assigning antiSMASH Predictions to EPA
Pitfalls:
- Many PKS clusters look similar; non-PUFA PKS (e.g., for antibiotics) can have overlapping domains - always cross-check with KnownClusterBlast and phylogenies to avoid false positives.
- PUFA clusters may be misclassified as "other" or as hybrids. Also note that in the PKS-type pathway the double bonds are introduced by DH/isomerase domains, so the absence of annotated Delta5/Delta8-type desaturases does not argue against a candidate.
- Incomplete genomes can split clusters across contigs - reassemble if needed.
- Untargeted metabolomics can detect EPA analogs (e.g., DHA, ARA) - confirm exact structure via MS/MS.
- Over-reliance on scores: antiSMASH's rule-based detection and similarity percentages tell you that a cluster is present, not what its product is.
Best Practices:
- Run antiSMASH with --cb-knownclusters --cb-general for better reference hits.
- Validate with orthogonal tools: PRISM or NP.searcher for PKS prediction.
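- Document everything: Export JSON/GBK and note the antiSMASH version, since detection rules change between releases (recent versions such as 7+ may handle PUFA clusters differently from older ones).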
- For long-chain PUFAs, consider oxygen-dependent desaturases - check if your isolate is aerobic/marine-like.
- If uncertain, test in heterologous hosts (e.g., E. coli with your BGC cloned, as in some papers).
4. Tools, Scripts, or Pipelines for Extracting and Analyzing antiSMASH Output
Extraction/Parsing:
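A minimal sketch for pulling the region table out of the antiSMASH JSON file with the Python standard library is given here. It assumes the JSON layout I have seen in recent versions (a top-level `records` list whose `features` carry `type`, a `location` string such as `[0:35000]`, and `qualifiers` like `product`, `region_number` and `contig_edge`); check these key names against your own file, as they are not guaranteed across versions. The file name in the usage note is a placeholder.

```python
import json
import re

# Matches location strings such as "[0:35000]" or "[<0:>35000]".
_LOCATION = re.compile(r"\[<?(\d+):>?(\d+)\]")


def list_regions(results):
    """Collect one row per 'region' feature from a loaded antiSMASH JSON dict."""
    rows = []
    for record in results.get("records", []):
        for feature in record.get("features", []):
            if feature.get("type") != "region":
                continue
            qualifiers = feature.get("qualifiers", {})
            match = _LOCATION.search(feature.get("location", ""))
            start = int(match.group(1)) if match else None
            end = int(match.group(2)) if match else None
            rows.append({
                "record": record.get("id"),
                "region": qualifiers.get("region_number", ["?"])[0],
                "products": qualifiers.get("product", []),
                "start": start,
                "end": end,
                # Regions on a contig edge may be truncated clusters.
                "contig_edge": qualifiers.get("contig_edge", ["False"])[0] == "True",
            })
    return rows


def load_regions(path):
    """Read an antiSMASH JSON results file and return its region rows."""
    with open(path) as handle:
        return list_regions(json.load(handle))
```

Usage would be something like `for row in load_regions("my_genome.json"): print(row)`. The `contig_edge` flag is worth checking first, because a PUFA cluster split across contigs can look like a weak hit.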
Downstream Pipelines:
- BiG-SCAPE: For clustering your BGCs against references (great for comparative genomics).
- CORASON: For core gene phylogeny within BGC families.
- ARTS (Antibiotic Resistant Target Seeker): If checking for self-resistance, but adaptable for PUFAs.
- Custom: Snakemake/Nextflow workflow to chain antiSMASH -> BiG-SCAPE -> NaPDoS2.
Relevant Literature and Resources
- Yazaki et al. (2004): Characterization of EPA gene cluster from Shewanella (foundational for pfa).
- Orikasa et al. (2004): Cloning and analysis in E. coli.
- Ziemert et al. (2012): NaPDoS for domain phylogenies.
- Blin et al. (2023): antiSMASH updates for better BGC detection (NAR paper).
- Tutorials: antiSMASH docs (https://docs.antismash.secondarymetabolites.org/), MIBiG repository for references.
If you share more details (e.g., top BGC types from your output), I can refine this. Good luck!
Kevin
I guess it is possible that someone here will know off the top of their head what enzymes are involved in eicosapentaenoic acid biosynthesis, but I consider it unlikely. To find the details about that I think you will need to look through primary literature, as this pathway has been described before in other organisms. For many of the antiSMASH clusters the program should already have a good guess as to what the gene clusters are making, and usually that's not long-chain fatty acids.
As always, a simple Google search reveals half a dozen promising papers just on the first search page.
https://www.google.com/search?q=eicosapentaenoic+acid+biosynthesis
Many thanks for your valuable guidance. Your suggestion is very helpful, and I will follow your advice by consulting the relevant literature and proceeding with further analysis accordingly.