EggNOG-mapper problem

3 months ago

Salma ▴ 20

Hi all,

I'm running EggNOG-mapper on a metagenomics dataset for functional annotation using the DIAMOND mapping mode on a high-performance virtual machine in Google Cloud Platform (GCP). The dataset is approximately 200 GB in size, consisting of .faa protein sequence files generated by FragGeneScan. I’m using the latest EggNOG-mapper version with the DIAMOND and annotation databases downloaded locally to the instance.

The GCP virtual machine I'm using is an n2-standard-96 with 96 vCPUs and 384 GB RAM, running Ubuntu 22.04 (minimal image). My data disk is a 2 TB SSD persistent disk, with SCSI interface and x86/64 architecture. There's no swap configured, though the available memory is more than sufficient. The EggNOG-mapper database (eggnog_db) and temporary files are stored on the boot disk (/tmp).

I run my annotation script in a loop, processing each .faa file one at a time with --cpu 64 (64 of the 96 vCPUs). Here's the script I'm using:

mkdir -p ~/workdir/eggnog/eggnog_db && cd ~/workdir/eggnog/eggnog_db 

#Download the EggNOG database (eggnog.db.gz) and unzip it 
wget -c http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog.db.gz 
wget -c http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog_proteins.dmnd.gz 
wget -c http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog.taxa.tar.gz 

gunzip eggnog.db.gz 
gunzip eggnog_proteins.dmnd.gz 
tar -xvzf eggnog.taxa.tar.gz 

#emapper functional annotation 

INPUT_DIR="$HOME/workdir/fraggenescan_out"    # FragGeneScan output (.faa files)
OUTPUT_DIR="$HOME/workdir/eggnog"             # Output folder for all results
DB_DIR="$HOME/workdir/eggnog/eggnog_db"       # Path to the EggNOG database
CPU=64                                        # Threads per emapper.py run


for faa_file in "$INPUT_DIR"/*_FGS.faa; do 
    base_name=$(basename "$faa_file" _FGS.faa) 
    output_prefix="${base_name}_annotation" 

    emapper.py \
      -i "$faa_file" \
      --itype proteins \
      -o "$output_prefix" \
      --output_dir "$OUTPUT_DIR" \
      --cpu "$CPU" \
      --data_dir "$DB_DIR"

done

The DIAMOND step performs very well: the cores are kept busy, with over 60% of CPU time in user space. However, once the workflow moves into the Python-based annotation step, performance drops sharply. CPU time becomes dominated by the kernel, with about 97% system (sy) and only 2–3% user (us), as seen in top. RAM usage stays very low (under 5 GB), and no swap is used, despite the system having ample memory.
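For anyone wanting to reproduce the user/system split outside of top, here is a minimal sketch that samples /proc/stat twice and prints the share of CPU time per category. It assumes a Linux host; the function name cpu_split is made up for this illustration and is not part of EggNOG-mapper or any other tool.

```shell
#!/usr/bin/env bash
# Sketch: sample the aggregate "cpu" line of /proc/stat twice and print the
# share of CPU time spent in user, system and iowait between the two samples.
# Linux only. Field order: user nice system idle iowait irq softirq steal.
# cpu_split is a hypothetical helper name used only for this example.

cpu_split() {
    local interval="${1:-1}"
    local label rest
    local u1 n1 s1 i1 w1 q1 sq1 st1
    local u2 n2 s2 i2 w2 q2 sq2 st2

    [[ -r /proc/stat ]] || { echo "cpu_split: /proc/stat not readable" >&2; return 1; }

    # First line of /proc/stat holds the totals across all CPUs.
    read -r label u1 n1 s1 i1 w1 q1 sq1 st1 rest < /proc/stat
    sleep "$interval"
    read -r label u2 n2 s2 i2 w2 q2 sq2 st2 rest < /proc/stat

    local t1=$(( u1 + n1 + s1 + i1 + w1 + q1 + sq1 + st1 ))
    local t2=$(( u2 + n2 + s2 + i2 + w2 + q2 + sq2 + st2 ))
    local total=$(( t2 - t1 ))
    (( total > 0 )) || total=1   # guard against division by zero

    printf 'user=%d%% system=%d%% iowait=%d%%\n' \
        $(( 100 * (u2 - u1) / total )) \
        $(( 100 * (s2 - s1) / total )) \
        $(( 100 * (w2 - w1) / total ))
}
```

Calling cpu_split 5 while the annotation step is running prints one line averaged over five seconds. A high system share together with a low iowait share would suggest the time is going into kernel work rather than into waiting on the disk.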

To try to improve speed, I also tested a parallel approach using GNU parallel (10 jobs × 8 cores each) to annotate multiple .faa files at once, assigning fewer cores to each. However, this did not improve performance and actually made things slower overall, most likely due to I/O contention or internal bottlenecks within EggNOG-mapper’s post-DIAMOND steps.
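A minimal sketch of such a 10 × 8 run, assuming GNU parallel is installed and reusing the paths from the script above. This illustrates the approach and is not necessarily the exact invocation that was tested; emapper_cmd is a helper name invented for this example.

```shell
#!/usr/bin/env bash
# Sketch: annotate several .faa files concurrently with GNU parallel,
# 10 jobs at a time with 8 threads each. Paths mirror the serial script.
# emapper_cmd is a hypothetical helper, not part of EggNOG-mapper.

INPUT_DIR="$HOME/workdir/fraggenescan_out"
OUTPUT_DIR="$HOME/workdir/eggnog"
DB_DIR="$HOME/workdir/eggnog/eggnog_db"
JOBS=10
CPU_PER_JOB=8

# Print the emapper.py command line for one input file.
emapper_cmd() {
    local faa_file="$1"
    local base_name
    base_name=$(basename "$faa_file" _FGS.faa)
    printf 'emapper.py -i %q --itype proteins -o %q --output_dir %q --cpu %d --data_dir %q\n' \
        "$faa_file" "${base_name}_annotation" "$OUTPUT_DIR" "$CPU_PER_JOB" "$DB_DIR"
}

# Launch only when both tools are available on this machine.
if command -v parallel >/dev/null && command -v emapper.py >/dev/null; then
    for f in "$INPUT_DIR"/*_FGS.faa; do
        [ -e "$f" ] && emapper_cmd "$f"
    done | parallel --jobs "$JOBS"   # each input line is run as one command
fi
```

Printing the commands first makes it easy to inspect them before anything is launched. Note that 10 × 8 asks for 80 threads, more than the 64 used in the serial run.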

I’m looking for guidance on how to reduce the system CPU load and speed up the annotation step. Is this behavior typical for large datasets with EggNOG-mapper? Are there additional flags or optimizations that could help with the annotation performance?

Thanks in advance for any suggestions or insights.

Updated 3 months ago by GenoMax 153k • written 3 months ago by Salma ▴ 20