Hi all,
I'm running EggNOG-mapper on a metagenomics dataset for functional annotation using the DIAMOND mapping mode on a high-performance virtual machine in Google Cloud Platform (GCP). The dataset is approximately 200 GB in size, consisting of .faa protein sequence files generated by FragGeneScan. I’m using the latest EggNOG-mapper version with the DIAMOND and annotation databases downloaded locally to the instance.
The GCP virtual machine I'm using is an n2-standard-96 with 96 vCPUs and 384 GB RAM, running Ubuntu 22.04 (minimal image). My data disk is a 2 TB SSD persistent disk, with SCSI interface and x86/64 architecture. There's no swap configured, though the available memory is more than sufficient. The EggNOG-mapper database (eggnog_db) and temporary files are stored on the boot disk (/tmp).
I run my annotation script in a loop, processing one .faa file at a time with 64 of the 96 vCPUs. Here are the database setup and the annotation commands I'm using:
mkdir -p ~/workdir/eggnog/eggnog_db && cd ~/workdir/eggnog/eggnog_db
# Download the EggNOG databases (annotation DB, DIAMOND DB, taxa DB) and decompress them
wget -c http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog.db.gz
wget -c http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog_proteins.dmnd.gz
wget -c http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog.taxa.tar.gz
gunzip eggnog.db.gz
gunzip eggnog_proteins.dmnd.gz
tar -xvzf eggnog.taxa.tar.gz
# emapper functional annotation
INPUT_DIR="$HOME/workdir/fraggenescan_out"   # Directory with the FragGeneScan .faa files
OUTPUT_DIR="$HOME/workdir/eggnog"            # Output folder for all results
DB_DIR="$HOME/workdir/eggnog/eggnog_db"      # Path to the EggNOG database
CPU=64
for faa_file in "$INPUT_DIR"/*_FGS.faa; do
    base_name=$(basename "$faa_file" _FGS.faa)
    output_prefix="${base_name}_annotation"
    emapper.py \
        -i "$faa_file" \
        --itype proteins \
        -o "$output_prefix" \
        --output_dir "$OUTPUT_DIR" \
        --cpu "$CPU" \
        --data_dir "$DB_DIR"
done
The DIAMOND step performs well: the CPUs are fully utilized, with over 60% of the time spent in user space. However, once the workflow moves on to the Python-based annotation step, performance drops sharply. CPU time becomes dominated by the kernel, with about 97% system (sy) and only 2–3% user (us) as reported by top. RAM usage stays very low (under 5 GB) even though the machine has ample memory, and no swap is involved since none is configured.
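To capture the us/sy split outside of an interactive top session (for example, to log it while the annotation step runs), something like the following could be used. This is only a minimal sketch, assuming a Linux host that exposes /proc/stat; `cpu_split` is a hypothetical helper name, not part of EggNOG-mapper.

```bash
#!/usr/bin/env bash
# cpu_split BEFORE AFTER
# Each argument is one aggregate "cpu ..." line taken from /proc/stat.
# Prints the integer percentage of user (us), system (sy) and iowait (wa)
# time that elapsed between the two samples.
cpu_split() {
    local -a a b
    local i total=0
    read -r -a a <<< "$1"
    read -r -a b <<< "$2"
    # Fields 1..8 after the "cpu" label: user nice system idle iowait irq softirq steal
    for i in 1 2 3 4 5 6 7 8; do
        total=$(( total + b[i] - a[i] ))
    done
    if (( total <= 0 )); then
        echo "no CPU time elapsed between samples" >&2
        return 1
    fi
    echo "us=$(( 100 * (b[1] - a[1]) / total )) sy=$(( 100 * (b[3] - a[3]) / total )) wa=$(( 100 * (b[5] - a[5]) / total ))"
}

# Example (take two samples a few seconds apart while emapper.py is annotating):
#   before=$(grep '^cpu ' /proc/stat); sleep 5; after=$(grep '^cpu ' /proc/stat)
#   cpu_split "$before" "$after"
```

A high sy value together with a low wa value would suggest the time is going into system calls rather than into waiting on the disk, which is what top seems to show here.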
To try to improve speed, I also tested a parallel approach using GNU parallel (10 jobs × 8 cores each) to annotate multiple .faa files at once, assigning fewer cores to each job. However, this did not help and actually made things slower overall. I suspect I/O contention or an internal bottleneck in EggNOG-mapper's post-DIAMOND steps, but I have not confirmed this.
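For clarity, here is a minimal sketch of that kind of 10 × 8 setup, not necessarily the exact invocation. It assumes the same INPUT_DIR, OUTPUT_DIR and DB_DIR variables as the loop above; `emit_jobs` is a hypothetical helper that only prints one emapper.py command line per input file, so the job list can be inspected before anything is run.

```bash
#!/usr/bin/env bash
# emit_jobs INPUT_DIR OUTPUT_DIR DB_DIR CPUS_PER_JOB
# Prints one shell-quoted emapper.py command line per *_FGS.faa file.
# Nothing is executed here; the lines are meant to be piped to a job runner.
emit_jobs() {
    local input_dir=$1 output_dir=$2 db_dir=$3 cpu=$4
    local f base
    for f in "$input_dir"/*_FGS.faa; do
        [[ -e $f ]] || continue   # skip the unexpanded glob when no file matches
        base=$(basename "$f" _FGS.faa)
        printf 'emapper.py -i %q --itype proteins -o %q --output_dir %q --cpu %q --data_dir %q\n' \
            "$f" "${base}_annotation" "$output_dir" "$cpu" "$db_dir"
    done
}

# Example: 10 concurrent jobs with 8 threads each (80 of the 96 vCPUs).
# GNU parallel runs each input line as a command when no command is given:
#   emit_jobs "$INPUT_DIR" "$OUTPUT_DIR" "$DB_DIR" 8 | parallel -j 10
```

With this layout every job opens the same annotation database under DB_DIR at the same time, which is one reason I suspect contention in the annotation step.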
I’m looking for guidance on how to reduce the system CPU load and speed up the annotation step. Is this behavior typical for large datasets with EggNOG-mapper? Are there additional flags or optimizations that could help with the annotation performance?
Thanks in advance for any suggestions or insights.