Better way to speed up the HUMAnN3 metagenomics analysis pipeline?
2.2 years ago
boaty ▴ 200

Hi guys,

This is an open question.

While analyzing microbiome metagenomics data with humann3, I found it quite time-consuming. With 40 CPU cores and 240 GB of memory, a single sample took 4 hours with the nucleotide search only and 9 hours with the translated search included, which is unacceptable for an institute server.

After some tests, I noticed that disk throughput is probably the bottleneck.

Part of the reason is that the Humann3 pipeline is a collection of separate tools. The first tool takes the input file, analyzes it and writes its output to disk; the second tool then reads that output as its input, and so on. As a result, the process spends a long time waiting on HDD reads and writes, which slows down the whole analysis.

My first thought was an M.2 SSD, but getting one would take even longer because of administrative paperwork. So is there a better way to speed up the pipeline through software or disk management?

alignment metagenomics humann pipeline • 1.6k views
20 months ago
Raygozak ★ 1.4k

Hi, in my experience the most time-consuming part is when it runs DIAMOND on the reads that did not match anything in the prescreen. DIAMOND has a parameter that sets the block size for processing (-b), given in billions of sequence letters; the default is 2.0. HUMAnN per se does not have a way to pass this as an argument, so you might need to go into the code and change it manually, or add the functionality to accept this parameter. The larger the value, the more memory it uses and the shorter the runtime; the smaller the value, the less memory it uses but the longer it takes.
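
To make the memory/runtime trade-off concrete, here is a rough sketch of how a block size could be picked from a memory budget. It assumes the rule of thumb from the DIAMOND manual that memory use is roughly six times the block size (in GB). The helper name `pick_block_size`, the 24 GB budget, the thread count and the file and database names are all hypothetical placeholders. The DIAMOND command is only printed, not run, and it calls DIAMOND standalone rather than through HUMAnN.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive a DIAMOND block size (-b / --block-size)
# from a memory budget in GB.
# Assumption: memory use is roughly 6 x block size (GB), per the DIAMOND manual.
pick_block_size() {
    local mem_gb=$1
    awk -v m="$mem_gb" 'BEGIN { printf "%.1f\n", m / 6 }'
}

# Example budget: 24 GB reserved for DIAMOND in this job (placeholder value).
block_size=$(pick_block_size 24)

# Dry run: print the command instead of executing it.
# Database and file names below are placeholders, not real paths.
echo diamond blastx --db uniref90.dmnd --query unaligned_reads.fa \
    --out hits.tsv --outfmt 6 --block-size "$block_size" --threads 8
```

With a 24 GB budget this gives a block size of 4.0, twice the default. Actual memory use depends on the database and reads, so treat the number as a starting point.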

This is useful if you are short on memory but have time: set it to a low value. If you have plenty of memory and want shorter runs, set it higher.

Hope this helps.


Thanks Raygozak,

I started running humann3 with GNU parallel, and it is much faster:

ls merged_filtered_fastq/ | parallel --eta -j 10 --load 90% --noswap 'humann3 --input merged_filtered_fastq/{} --metaphlan-options "-t rel_ab_w_read_stats" --search-mode uniref90 --output results --memory-use maximum --threads 150'


With parallel, the second sample (or the third, and so on, depending on the -j parameter, your memory size and your number of cores) can start running while humann3 is still writing the results of the first sample to disk.
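
Since the useful value of -j depends on the number of cores, here is a small sketch of one way to budget threads so that jobs times threads stays within the core count. This is only an illustration, not the command used above. The helper `threads_per_job` is hypothetical, the numbers are taken from the server described in the question, and the pipeline is printed as a dry run rather than executed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: threads per job such that jobs x threads <= cores.
# Uses integer division and never returns less than 1.
threads_per_job() {
    local cores=$1 jobs=$2
    local t=$(( cores / jobs ))
    if [ "$t" -lt 1 ]; then
        t=1
    fi
    echo "$t"
}

cores=40   # core count of the server described in the question
jobs=10    # value passed to parallel -j
threads=$(threads_per_job "$cores" "$jobs")

# Dry run: print the pipeline instead of executing it.
echo "ls merged_filtered_fastq/ | parallel --eta -j $jobs --load 90% --noswap 'humann3 --input merged_filtered_fastq/{} --output results --threads $threads'"
```

Whether fewer jobs with more threads or more jobs with fewer threads is faster depends on how much of the run is spent waiting on disk, so it is worth timing a few combinations on a couple of samples.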