EUPEX consortium partner FORTH had a research paper accepted at ICPP’2023.
Recent processor advances have made feasible HPC nodes with high core counts, capable of hosting tens or even hundreds of processes. Designing MPI collective operations at the intra-node level has therefore received significant attention in recent years. Deriving efficient algorithms for modern HPC nodes, with their complex internal topologies and memory hierarchies, is challenging. Moreover, the cache coherency protocol and its impact on performance further complicate algorithm design for MPI collectives; this latter concern is often only partially addressed.
In this work, we demonstrate a particularly challenging performance degradation scenario for shared-memory-based MPI broadcast on three generations of the Intel Xeon Scalable processor architecture. Based on analysis of hardware performance counters, we conclude that the observed degradation is attributable to the cache coherency protocol and the multi-socket configuration of the execution platforms examined. We present a number of novel approaches designed to mitigate this effect, and apply them in a cache-coherency-aware version of the MPI broadcast implementation. We reduce the overall latency of the broadcast operation by up to 1.5× for small messages and 1.25× for large messages.
A collection of computational artifacts (source code, scripts, datasets, instructions) for reproducing the experiments featured in the associated paper is available at: https://zenodo.org/records/8094307