Better way to speed up the HUMAnN3 metagenomics analysis pipeline?
3.9 years ago
boaty ▴ 220

Hi guys,

This is an open question.

While processing microbiome metagenomics data with HUMAnN3, I found it quite time-consuming. With a 40-core CPU and 240 GB of RAM, a single sample took 4 hours for the nucleotide search alone and 9 hours with the translated search included, which is unacceptable on an institute server.

After some tests, I noticed that disk I/O speed is probably the bottleneck.

Part of the reason is that the HUMAnN3 pipeline is a collection of separate tools: each tool takes an input file, analyzes it, and writes its output to disk, and the next tool then reads that output as its input. The consequence is that processes spend a long time waiting on HDD reads and writes, which slows down the whole analysis.
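For example, one software-only way to keep those intermediate files off the HDD is to point the output at a RAM-backed tmpfs. A minimal sketch, assuming a Linux host where /dev/shm is mounted as tmpfs, enough free RAM to hold the intermediates, and a placeholder input file name:

```shell
# Use a RAM-backed scratch directory so intermediate files never hit the HDD.
# Assumes /dev/shm is tmpfs (default on most Linux distros) and enough free RAM.
TMPOUT=/dev/shm/humann_tmp
mkdir -p "$TMPOUT"
# humann --input merged_filtered_fastq/sample1.fastq --output "$TMPOUT" --threads 40
# cp -r "$TMPOUT"/. results/ && rm -rf "$TMPOUT"   # persist final results to disk
echo "$TMPOUT"
```

The humann call is commented out here as a placeholder; the point is only that the scratch directory lives in RAM and the final results are copied back afterwards.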

The first thing I considered was an M.2 SSD, but that route would actually be slower because of the administrative paperwork. So is there any better way, via software or disk management, to speed up the pipeline?

Thanks in advance

alignment metagenomics humann pipeline • 2.8k views
3.3 years ago
Raygozak ★ 1.4k

Hi, in my experience the most time-consuming part is when DIAMOND runs on the reads that did not match anything in the nucleotide prescreen. DIAMOND has a parameter that sets the block size for processing reads (-b / --block-size), measured in billions of sequence letters per block; the default is 2.0. HUMAnN per se does not have a way to pass this as an argument, so you might need to go into the code and change it manually, or add the functionality to accept this parameter. The larger the value, the more memory DIAMOND uses and the shorter the runtime; the smaller the value, the less memory it uses but the longer it takes.

This is useful if you don't have enough memory but do have time: set it to a low value, and vice versa.
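For reference, this is what the knob looks like on a direct DIAMOND call. A sketch only: the database and file names are placeholders, not files from the original posts.

```shell
# The relevant DIAMOND knobs:
#   -b / --block-size    billions of sequence letters per block (default 2.0)
#   -c / --index-chunks  fewer chunks -> more RAM used, faster search
diamond blastx \
    --db uniref90.dmnd \
    --query unmatched.fastq \
    --out hits.tsv \
    --block-size 6 \
    --index-chunks 1
```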

Hope this helps.


Thanks Raygozak,

I started to run humann3 with GNU parallel, and it is much faster:

ls merged_filtered_fastq/ | parallel --eta -j 10 --load 90% --noswap 'humann3 --input merged_filtered_fastq/{} --metaphlan-options "-t rel_ab_w_read_stats" --search-mode uniref90 --output results --memory-use maximum --threads 150'

Using parallel, the program is able to run the second (or third... depending on the -j parameter, your memory size, and your number of cores) sample while humann3 is writing the results of the first sample to disk.

