This section covers the integration of the different components forming the execution environment that will enable the efficient utilisation of all available resources on the modular architecture of the EUPEX platform.
Atos Open MPI is an implementation of the MPI standard. It extends the well-known open-source Open MPI code base with several key components, including support for the BXI interconnect developed by Atos, several implementations of collective communication algorithms, and features not yet covered by the standard, such as Notified RMA and Partitioned collective communications.
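The exact interfaces of the Notified RMA and Partitioned collective extensions are specific to Atos Open MPI and are not reproduced here. As a minimal sketch of the underlying concept, the code below uses the standard partitioned point-to-point operations introduced in MPI 4.0 (MPI_Psend_init, MPI_Pready), which partitioned collectives generalise; the buffer-filling loop is a placeholder.

```c
/* Sender side of a partitioned send: OpenMP threads mark partitions ready
 * as they finish producing them. Requires an MPI >= 4.0 library; calling
 * MPI_Pready from multiple threads requires MPI_THREAD_MULTIPLE. */
#include <mpi.h>

#define PARTITIONS 8
#define COUNT      1024   /* elements per partition */

void partitioned_send(double *buf, int dest, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Psend_init(buf, PARTITIONS, COUNT, MPI_DOUBLE, dest, /*tag=*/0,
                   comm, MPI_INFO_NULL, &req);
    MPI_Start(&req);

    #pragma omp parallel for
    for (int p = 0; p < PARTITIONS; p++) {
        /* ... each thread fills buf[p*COUNT .. (p+1)*COUNT-1] ... */
        MPI_Pready(p, req);          /* partition p may be transferred now */
    }

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Request_free(&req);
}
```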
In EUPEX, Atos Open MPI will provide a second MPI library, alongside ParaStation MPI, to support applications' communications.
Responsible partner: Atos
The StreamFlow Workflow Management System (WMS) declaratively describes cross-application workflows with data dependencies, complex execution environments composed of heterogeneous and independent infrastructures, and mapping of steps onto execution locations. This hybrid workflow approach, based on the Common Workflow Language (CWL) open standard, enables the deployment of different workflow steps (e.g., MPI application, TensorFlow application) onto different modules (e.g., GPU booster, GPP cluster) using different methods (e.g., Slurm, Kubernetes, YSTIA) and the seamless porting onto new modules and deployment methods through self-contained plugins. The topology awareness emerging from these workflow models allows StreamFlow to implement locality-based scheduling strategies, automated data transfers and fault tolerance.
In EUPEX, StreamFlow couples applications via their native file interfaces; thus, it does not require any change to the original code. For this reason, it will be the primary means for rapidly prototyping assemblies of legacy applications onto the MSA system.
Responsible partner: CINI – University of Turin
ACCO (Atos Collective Communication Optimizer) is a solution for tuning the parameters of Open MPI collective communications. Open MPI implements several algorithms for each type of collective operation. Most of these implementations also expose several parameters, such as the segmentation of messages or the degree of the underlying tree-shaped topology when one is employed. Finding the best-performing algorithm and its parameters for a collective communication, based on the considered rank distribution and the message size, is a difficult task. Using a benchmark, ACCO will explore the space of possible values and generate a tuning of these Open MPI parameters that maximises the performance of the corresponding collective communications on the targeted supercomputers.
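ACCO's own interface is not shown here; the minimal sketch below illustrates the kind of measurement it automates: timing one collective (here MPI_Allreduce) over a range of message sizes, so that algorithm and parameter choices, e.g. via Open MPI's coll_tuned MCA parameters, can be compared. This is a generic micro-benchmark, not ACCO itself.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Time MPI_Allreduce for increasing message sizes. To compare algorithms,
 * run e.g. with Open MPI's tuned collective component:
 *   mpirun --mca coll_tuned_use_dynamic_rules 1 \
 *          --mca coll_tuned_allreduce_algorithm <N> ./a.out
 * (algorithm IDs vary across Open MPI versions). */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 100;
    for (size_t n = 1; n <= (1 << 20); n *= 4) {
        double *in  = malloc(n * sizeof(double));
        double *out = malloc(n * sizeof(double));
        for (size_t i = 0; i < n; i++) in[i] = (double)rank;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int it = 0; it < iters; it++)
            MPI_Allreduce(in, out, (int)n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t = (MPI_Wtime() - t0) / iters;

        double tmax;  /* report the slowest rank's average time */
        MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%10zu doubles: %.3f us\n", n, tmax * 1e6);
        free(in);
        free(out);
    }
    MPI_Finalize();
    return 0;
}
```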
In EUPEX, ACCO will be used to tune Open MPI performance, including some work on tuning AI workloads.
Responsible partner: Atos
BHCO (Bull Hybrid Communication Optimizer) is a collection of tools to improve the performance of MPI+OpenMP applications. It currently comprises tools to progress Open MPI communications within OpenMP sections at relevant times, and to balance OpenMP threads between the different MPI processes local to a node.
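BHCO's tools themselves are not shown here; as a hedged illustration of the first problem it addresses, the sketch below shows the manual pattern such tools replace: without dedicated progression, a nonblocking MPI operation may only advance when the application re-enters the MPI library, so one OpenMP thread polls MPI_Test while the others compute. Function and variable names are illustrative.

```c
#include <mpi.h>
#include <omp.h>

/* Overlap an MPI_Iallreduce with OpenMP compute: thread 0 drives MPI
 * progress by polling MPI_Test, the remaining threads keep computing.
 * Requires at least MPI_THREAD_FUNNELED. */
void compute_with_overlap(double *v, long n, double *global_sum, MPI_Comm comm)
{
    double local_sum = 0.0;
    for (long i = 0; i < n; i++) local_sum += v[i];

    MPI_Request req;
    MPI_Iallreduce(&local_sum, global_sum, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    #pragma omp parallel
    {
        int t = omp_get_thread_num(), nt = omp_get_num_threads();
        if (t == 0) {
            int done = 0;              /* progress thread */
            while (!done)
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        } else {                       /* worker threads: independent compute */
            for (long i = t - 1; i < n; i += nt - 1)
                v[i] = v[i] * v[i];
        }
    }
}
```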
In EUPEX, BHCO will be a toolbox to improve the performance of MPI+OpenMP applications.
Responsible partner: Atos
The Portable Hardware Locality (hwloc) software package provides a portable abstraction (across OS, versions, architectures, …) of the hierarchical topology of modern architectures, including NUMA memory nodes, sockets, shared caches, cores and simultaneous multithreading. It also gathers various system attributes such as cache and memory information as well as the locality of I/O devices such as network interfaces, InfiniBand HCAs or GPUs.
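A minimal sketch of the hwloc C API (the calls shown are part of hwloc's public interface): discover the machine topology, count cores and NUMA nodes, and bind the calling thread to the first core.

```c
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);     /* discover the current machine */

    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    int nnuma  = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
    printf("%d cores, %d NUMA nodes\n", ncores, nnuma);

    /* Bind the current thread to the first core's cpuset. */
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
    if (core)
        hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_THREAD);

    hwloc_topology_destroy(topo);
    return 0;
}
```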
Role in EUPEX: hwloc will be in charge of modelling the architecture of the pilot platform and exposing it to upper software layers such as runtimes and applications that want to place tasks and buffers according to hardware and software affinities.
Responsible partner: Inria
XHC (XPMEM-based Hierarchical Collectives) is a software component implementing common MPI collective primitives (e.g., Broadcast, Barrier, Allreduce) in Open MPI, with several optimizations for the intra-node case, especially for nodes with high core counts and complex memory topologies. The collectives are implemented with algorithms that combine the benefits of shared memory for small messages with the single-copy capabilities of the XPMEM Linux kernel module for larger ones. The algorithms' communication hierarchy reflects the internal node structure (i.e., NUMA regions, sockets, L3/SLC caches). These topology-aware algorithms group nearby cores and arrange for the majority of communication to take place locally, limiting crossings of distant topological domains.
Moreover, the hierarchy partitions traffic across levels and avoids the congestion caused by fan-in and fan-out patterns. Pipelining is used to overlap communication at different levels of the hierarchy. XHC provides direct implementations with explicit handling of synchronization, following the single-writer, multiple-readers paradigm, thus avoiding the overhead of atomics and shared-memory locks. Upcoming extensions of XHC include improved inter-node interactions and the utilization of hardware acceleration functionality. Interoperability and integration with Open MPI's HAN (Hierarchical AutotuNed) collectives framework and with UCX/UCC are also under consideration.
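XHC itself is a component inside Open MPI; as a simplified illustration of the hierarchical principle, the sketch below builds a two-level allreduce from standard MPI calls: an intra-node reduce onto a leader, an inter-node allreduce among leaders, and an intra-node broadcast. XHC applies the same idea at finer levels (sockets, NUMA regions, caches) with single-copy transports rather than plain MPI calls.

```c
#include <mpi.h>

/* Two-level allreduce built from standard MPI calls:
 *   1. reduce onto a leader within each shared-memory node,
 *   2. allreduce among the node leaders,
 *   3. broadcast the result back within each node. */
void hier_allreduce_sum(const double *in, double *out, int n, MPI_Comm comm)
{
    MPI_Comm node, leaders;
    int world_rank, node_rank;

    MPI_Comm_rank(comm, &world_rank);
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node);
    MPI_Comm_rank(node, &node_rank);

    /* Node-local rank 0 on each node joins the leaders' communicator. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, world_rank,
                   &leaders);

    MPI_Reduce(in, out, n, MPI_DOUBLE, MPI_SUM, 0, node);
    if (leaders != MPI_COMM_NULL) {
        MPI_Allreduce(MPI_IN_PLACE, out, n, MPI_DOUBLE, MPI_SUM, leaders);
        MPI_Comm_free(&leaders);
    }
    MPI_Bcast(out, n, MPI_DOUBLE, 0, node);
    MPI_Comm_free(&node);
}
```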
Role in EUPEX: Optimized hierarchical collective operations for Open MPI.
Responsible partner: FORTH
XHC as updated by EUPEX is available as open source on GitHub.
Knot is a full-featured, Kubernetes-based software stack for facilitating data science in public or private clouds, by integrating a web-based environment that includes an extensible set of productivity tools and services. At its core, the Knot dashboard supplies the landing page for users, allowing them to launch notebooks and other services from a set of curated templates, design workflows, access data, and specify parameters related to execution through a user-friendly interface. Knot aims to make it straightforward for domain experts to interact with resources in the underlying infrastructure without having to understand lower-level mechanisms, or ever needing to use the command line.
Knot includes JupyterHub, Argo Workflows, Harbor, and Grafana/Prometheus, all accessible through the custom dashboard. Behind the scenes, other popular tools are automatically installed to help with the integration. The underlying Kubernetes environment can also be extended via KNoC (Kubernetes Node on Cluster), a Virtual Kubelet provider implementation that uses an HPC cluster as the container execution environment, effectively bridging Cloud and HPC.
In EUPEX, Knot/KNoC will be used as an application/workflow development and management environment for workloads used to co-design, test and validate the pilot system.
Responsible partner: FORTH
Knot as updated by EUPEX is available as open source on GitHub.
KNoC as updated by EUPEX is available as open source on GitHub.
TeraHeap extends the Java Virtual Machine (JVM) with a second, high-capacity heap over a fast storage device that coexists with the regular Java heap. TeraHeap provides direct access to objects on the second heap, eliminating serialization/deserialization (S/D) costs. It also reduces garbage collection (GC) costs by fencing the garbage collector off from scanning the second heap. TeraHeap leverages the fact that frameworks choose specific objects for off-heap placement and offers them a hint-based interface for moving such objects to the second heap. TeraHeap is implemented in the OpenJDK Java runtime and improves analytics performance when either block-addressable NVMe SSDs or byte-addressable NVM devices are used to extend the application heap.
Role in EUPEX: Provide the ability to use large address spaces that extend over NVM and NVMe, for the JVM and Java applications.
Responsible partner: FORTH
TeraHeap as updated by EUPEX is available as open source on GitHub.