EggNOG-mapper problem
3 months ago
Salma ▴ 20

Hi all,

I'm running EggNOG-mapper on a metagenomics dataset for functional annotation using the DIAMOND mapping mode on a high-performance virtual machine in Google Cloud Platform (GCP). The dataset is approximately 200 GB in size, consisting of .faa protein sequence files generated by FragGeneScan. I’m using the latest EggNOG-mapper version with the DIAMOND and annotation databases downloaded locally to the instance.

The GCP virtual machine I'm using is an n2-standard-96 with 96 vCPUs and 384 GB RAM, running Ubuntu 22.04 (minimal image). My data disk is a 2 TB SSD persistent disk, with SCSI interface and x86/64 architecture. There's no swap configured, though the available memory is more than sufficient. The EggNOG-mapper database (eggnog_db) and temporary files are stored on the boot disk (/tmp).

I run my annotation script in a loop, processing each .faa file one at a time with 64 of the 96 vCPUs (CPU=64). Here's the script I'm using:

mkdir -p ~/workdir/eggnog/eggnog_db && cd ~/workdir/eggnog/eggnog_db 

#Download the EggNOG database (eggnog.db.gz) and unzip it 
wget -c http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog.db.gz 
wget -c http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog_proteins.dmnd.gz 
wget -c http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog.taxa.tar.gz 

gunzip eggnog.db.gz 
gunzip eggnog_proteins.dmnd.gz 
tar -xvzf eggnog.taxa.tar.gz 

#emapper functional annotation 

INPUT_DIR="$HOME/workdir/fraggenescan_out"   # FragGeneScan output (.faa files)
OUTPUT_DIR="$HOME/workdir/eggnog"            # Output folder for all results
DB_DIR="$HOME/workdir/eggnog/eggnog_db"      # Path to the EggNOG database
CPU=64                                       # Threads per emapper.py run


for faa_file in "$INPUT_DIR"/*_FGS.faa; do
    # Strip the directory and the _FGS.faa suffix to get the sample name
    base_name=$(basename "$faa_file" _FGS.faa)
    output_prefix="${base_name}_annotation"

    emapper.py \
      -i "$faa_file" \
      --itype proteins \
      -o "$output_prefix" \
      --output_dir "$OUTPUT_DIR" \
      --cpu "$CPU" \
      --data_dir "$DB_DIR"

done

The DIAMOND step performs very well: the assigned cores are fully busy, with over 60% of CPU time spent in user space. However, once the workflow moves on to the Python-based annotation step, performance drops sharply. CPU time becomes dominated by the kernel, with about 97% system (sy) and only 2–3% user (us) as reported by top. RAM usage stays very low (under 5 GB) and no swap is used, even though the system has ample memory.
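In case it helps anyone reproduce the us/sy numbers without watching top interactively, here is a minimal sketch that computes the same split from two samples of the aggregate "cpu" line in /proc/stat. The function name cpu_split is just an illustrative helper, not part of EggNOG-mapper, and it assumes a Linux host where fields 2 to 9 of that line are user, nice, system, idle, iowait, irq, softirq and steal.

```shell
#!/usr/bin/env bash
# cpu_split BEFORE AFTER  (hypothetical helper, not an EggNOG-mapper tool)
# Each argument is the aggregate "cpu ..." line read from /proc/stat.
# Prints the user and system share of all CPU time elapsed between the samples.
cpu_split() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) before[i] = $i }
        NR == 2 {
            total = 0
            for (i = 2; i <= 9; i++) {
                delta[i] = $i - before[i]
                total += delta[i]
            }
            if (total <= 0) {
                print "no CPU time elapsed between samples"
                exit 1
            }
            # Field 2 is user time, field 4 is system (kernel) time.
            printf "us=%.1f%% sy=%.1f%%\n", 100 * delta[2] / total, 100 * delta[4] / total
        }'
}

# Example use while the annotation step is running:
#   before=$(grep '^cpu ' /proc/stat); sleep 5; after=$(grep '^cpu ' /proc/stat)
#   cpu_split "$before" "$after"
```

Logging this every few seconds during a run should make it easier to see exactly when the switch from user-dominated to kernel-dominated CPU time happens.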

To try to improve speed, I also tested a parallel approach using GNU parallel (10 jobs × 8 cores each) to annotate multiple .faa files at once, assigning fewer cores to each job. However, this did not help and actually made things slower overall, most likely because of I/O contention or internal bottlenecks in EggNOG-mapper's post-DIAMOND steps.
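A minimal sketch of that kind of GNU parallel setup is below. It is not necessarily the exact invocation I ran. JOBS, CPU_PER_JOB, EMAPPER, annotate_one and run_all are illustrative names introduced here, and the sketch assumes bash (for export -f) and that GNU parallel is installed. The emapper.py flags are the same ones used in the loop above.

```shell
#!/usr/bin/env bash
# Sketch only: 10 concurrent emapper.py runs with 8 threads each.
INPUT_DIR="${INPUT_DIR:-$HOME/workdir/fraggenescan_out}"
OUTPUT_DIR="${OUTPUT_DIR:-$HOME/workdir/eggnog}"
DB_DIR="${DB_DIR:-$HOME/workdir/eggnog/eggnog_db}"
JOBS=10                            # number of concurrent emapper.py runs
CPU_PER_JOB=8                      # threads given to each run
EMAPPER="${EMAPPER:-emapper.py}"   # overridable, e.g. EMAPPER=echo for a dry run

# Annotate a single .faa file with the same flags as the sequential loop.
annotate_one() {
    local faa_file=$1
    local base_name
    base_name=$(basename "$faa_file" _FGS.faa)
    "$EMAPPER" \
      -i "$faa_file" \
      --itype proteins \
      -o "${base_name}_annotation" \
      --output_dir "$OUTPUT_DIR" \
      --cpu "$CPU_PER_JOB" \
      --data_dir "$DB_DIR"
}

# Launch all files through GNU parallel, JOBS at a time.
run_all() {
    # The function and its settings must be exported so that the shells
    # started by GNU parallel can see them.
    export -f annotate_one
    export OUTPUT_DIR DB_DIR CPU_PER_JOB EMAPPER
    parallel -j "$JOBS" annotate_one ::: "$INPUT_DIR"/*_FGS.faa
}
```

Calling run_all starts the batch. Setting EMAPPER=echo first prints the commands instead of running them, which is a cheap way to check the file names and prefixes before committing to a long run.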

I’m looking for guidance on how to reduce the system CPU load and speed up the annotation step. Is this behavior typical for large datasets with EggNOG-mapper? Are there additional flags or optimizations that could help with the annotation performance?

Thanks in advance for any suggestions or insights.
