Forum: Declining multithreaded DIAMOND performance on a decent dual AMD EPYC 7643 server
7 months ago

I would like to share some interesting observations about DIAMOND's performance on a fairly powerful AMD EPYC based server compared to an older one. Interestingly, as I increase the thread count on the new system, performance decreases, and even with 32 threads I cannot match the speed of the older system.

version: diamond v2.0.14.152

CLI: ./diamond blastx -d NR -q N65-111_dedup.fa -o N65-111 --threads (32) (64) (188) -f 100 -c1 -b20 --un N65-dark-111.fa

Reference system: (Ubuntu 20.04.3 LTS)

• 2x AMD EPYC 7282, 1.5 GHz / 2.8 GHz (boost), 64 threads, 384 GB RAM, total time = 5075.2 s

Decent system: (Ubuntu 21.10)

• 2x AMD EPYC 7643, 2.30 GHz / 3.6 GHz (boost), 188 threads, 1 TB RAM, total time = 6979.27 s

• 2x AMD EPYC 7643, 2.30 GHz / 3.6 GHz (boost), 64 threads, 1 TB RAM, total time = 5638.73 s

• 2x AMD EPYC 7643, 2.30 GHz / 3.6 GHz (boost), 32 threads, 1 TB RAM, total time = 5555.9 s

Reported 33332955 pairwise alignments, 33332955 HSPs. 2305437 queries aligned.

Database: NR.dmnd (type: Diamond database, sequences: 458431797, letters: 174524903011)
N65-111_dedup.fa is a metagenomic readset with 4,297,335 trimmed sequences (lengths: 80-190 nt)
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Target sequences to report alignments for: 25


In particular, the following operations slow down as the thread count increases: Masking reference, Building reference histograms, Computing alignments.

More detailed information is in the attachment:

Diamond_benchmark_AMD_Epyc_platforms.pdf


I would test out the run in multiprocessing mode: instead of a single 32-thread process, run 4x 8-thread processes (or something similar). That could point towards whether it is an I/O-bound problem or thread contention of some sort. Not that I am an expert on either issue, but it would be an interesting data point to collect.
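The split-and-run idea above could be sketched as follows, assuming a POSIX shell and awk. The file names (demo.fa, chunk*.fa) and the DIAMOND invocation are illustrative placeholders, not the poster's actual commands:

```shell
# Round-robin FASTA records into 4 chunks, then run one 8-thread DIAMOND
# process per chunk instead of a single 32-thread process.
# demo.fa stands in for the real query file.
printf '>r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n>r4\nCCGG\n' > demo.fa

# Each header line (^>) starts a new record; record i goes to chunk(i mod 4).fa
awk -v n=4 '/^>/{f=sprintf("chunk%d.fa", ++i % n)} {print > f}' demo.fa

# One DIAMOND process per chunk (commented out here, since the database and
# binary are not part of this sketch):
# for c in chunk*.fa; do
#   ./diamond blastx -d NR -q "$c" -o "${c%.fa}.out" --threads 8 -f 100 -c1 -b20 &
# done
# wait
```

If the four 8-thread runs together finish faster than one 32-thread run on the same data, that points at contention inside a single process rather than raw hardware limits.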

7 months ago
GenoMax 121k

the following operations slow down as the thread count increases: Masking reference, Building reference histograms, Computing alignments.

Unless you are keeping everything on a RAM disk, I/O performance is something of a bottleneck with any bioinformatic analysis. Most operations remain I/O-bound with the current crop of speedy CPUs, and you certainly have some of the best. You have not provided any information about what kind of storage you are using. Considering your systems are probably a generation apart, I/O likely uses different interconnects (PCIe 3/4, etc.), so that may be the main limitation here.
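One quick way to test the I/O-bound hypothesis without extra tooling is to sample the kernel's cumulative iowait counter while DIAMOND is running. A minimal Linux-only sketch (the 2-second window is arbitrary):

```shell
# Field 6 of the aggregate "cpu" line in /proc/stat is cumulative iowait ticks.
# If this number climbs rapidly while DIAMOND runs, the job is waiting on disk;
# if it stays near zero while cores sit idle, look at scheduling or contention.
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 2
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
echo "iowait ticks over 2 s: $((w2 - w1))"
```

Running this on both servers during the slow phases (Masking reference, Computing alignments) would show whether the new box really is waiting on storage.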

Look at it as a glass half full. Many on this forum don't even have 16 GB of RAM to do a particular analysis. You, on the other hand, have a Ferrari of a system in comparison.


Dear GenoMax,

Thanks for your answer. I don't think there are any bottlenecks: all I/O subsystems are working fine and at full benchmarked speed (4 + 6 TB ultra-fast M.2 NVMe drives on PCIe Gen 4, plus 1 TB of 3200 MHz RDIMM DDR4 RAM). The DIAMOND developers have also reported these recent EPYC-related issues; I'm just very curious whether anyone else has had similar experiences or solutions to this problem. In addition, the old reference server only has a PCIe 3 NVMe interface, yet it is faster.


The DIAMOND developers have also reported these recent EPYC-related issues,

If this is a known issue, then there is not much you can do but wait and see whether the devs are able to resolve it. Are you able to try a different Linux distribution?
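Before reinstalling anything, it may be worth recording the kernel version and CPU topology on both machines, since dual-socket EPYC exposes several NUMA nodes and scheduler behavior differs between kernel releases. A minimal check with standard tools (uname from coreutils, lscpu from util-linux):

```shell
# Kernel version: Zen scheduling has improved across kernel releases, so note
# this on both the reference and the new system for comparison.
uname -r

# Socket, core, and NUMA-node layout; a dual EPYC 7643 box typically reports
# 2 sockets and, depending on the BIOS NPS setting, 2-8 NUMA nodes.
lscpu | grep -Ei 'socket|numa|model name'
```

If the layouts differ markedly between the two servers, that would be a useful detail to include in a bug report to the DIAMOND developers.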

While not directly related, there are other reports of software running poorly on EPYC under Linux. Based on the limited reading I did, this could also be an OS/kernel issue. For those not inclined to click the link above, here is a relevant comment:

It appears right now, though, at least on CentOS 7.8 installs, that EPYC CPUs have improperly defined vectorization characteristics. A given vectorized operation, e.g.:

A = magic(20000);
[L,U,P] = lu(A);


may take 10x longer on the exact same AMD EPYC CPU on Linux compared to the same CPU running Windows. Other functions with notable problems include isonormals, gradient, and bwareaopen, to name a few. It appears to run on only a single thread in those sections of code rather than properly multithreading vectorized code operations.