A software ecosystem based on European technologies

One of the objectives of EUPEX is to provide a software ecosystem for the pilot based on European technologies. Its design takes into consideration not only the needs of the key applications identified in EUPEX, but also those of system operators managing large-scale Modular Supercomputing Architecture (MSA) systems.

The EUPEX software stack addresses four objectives:

Management

Define a management software stack to support the administration of modular systems while being versatile enough to meet the requirements of upcoming architectures.

Execution environment

Integrate different components forming the execution environment that will enable the efficient utilisation of all available resources on the modular architecture of the EUPEX platform.

Tools

Provide a set of tools that help application developers and system operators optimise efficiency with respect to performance and energy, i.e., maximise system utilisation.

Storage architecture

Define a multi-tier storage architecture to meet the I/O demands of large-scale MSA systems, to transparently integrate fast storage technologies, and to minimize data movement.

Ocean

Ocean will be the base operating system on the EUPEX pilot. It will provide all necessary dependencies for other software components in EUPEX.

Ocean is an administration software stack providing tools and documentation for heterogeneous cluster management. It is composed of two sub-projects: Ocean-core and Ocean-stack.

Ocean-core is an RPM-based distribution built upon a standard Linux distribution; it provides the packaged tools and software updates needed to manage HPC clusters.

Ocean-stack (also known as Deepblue) describes how to architect the core services required on HPC clusters and how to integrate third-party services, and it provides operating procedures for managing an HPC cluster at scale.

Responsible Partner: CEA

ParaStation Modulo

ParaStation Modulo will serve as a central pillar of the EUPEX management software stack. To this end, it will be integrated with the Deepblue management infrastructure developed by CEA, enriching it to support the management and operation of supercomputing systems that follow the MSA approach.

ParaStation Modulo is the MSA-enabling supercomputing software suite developed by ParTec. It is used extensively in production environments such as the JUWELS Cluster-Booster system at Jülich Supercomputing Centre (JSC) and the modular MeluXina system in Luxembourg. A central pillar of ParaStation Modulo is ParaStation MPI, which provides a versatile process management subsystem and an MPI runtime library. The latter is powered by pscom, a low-level communication library supporting various transports and communication protocols, including gateway capabilities to efficiently bridge MPI traffic across different interconnects. The ParaStation Management execution environment acts as the resource manager and is fully integrated with the Slurm workload manager, constituting a complete framework for enabling modular supercomputing.

In addition to this robust and efficient system middleware, the ParaStation Modulo software suite also comprises sophisticated management components such as the ParaStation ClusterTools (for provisioning and administration), the ParaStation HealthChecker (for automated error detection and integrity checking) and the ParaStation TicketSuite (for analysing and keeping track of issues).

Responsible Partner: ParTec AG

LLview

LLview is a set of software components for monitoring clusters that are controlled by a resource manager and a scheduler. Through its Job Reporting module, it provides detailed information on all individual jobs running on the system. To achieve this, LLview connects to different sources in the system and collects data to present to the user via a web portal. For example, the resource manager provides information about the jobs, while additional daemons may acquire extra information from the compute nodes; the overhead is kept to a minimum, as metrics are sampled at intervals of minutes. The LLview portal links performance metrics to individual jobs to provide a comprehensive job reporting interface.

Responsible partner: Jülich

COUNTDOWN

COUNTDOWN is a methodology and a tool for identifying communication and synchronization phases and automatically reducing the frequency of the computing elements during them in order to save energy. COUNTDOWN is able to filter out phases where scaling would hurt the application's time to solution, transparently to the user, without touching the application code and without requiring recompilation. Besides its primary use as an energy-saving framework, COUNTDOWN can serve as a powerful monitoring tool, as it tracks and records the low-level calls of the parallel application. This enables fine-grained monitoring and performance analysis of the application running on specific hardware.

In EUPEX, COUNTDOWN will have two principal uses: 1) as an energy-saving framework that reduces the CPU frequency during the communication and synchronization phases of parallel applications, and 2) as a fine-grained monitoring tool providing deeper insights into the performance of applications running on the pilot hardware.
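
To illustrate the interception idea, here is a minimal C sketch (not COUNTDOWN's actual implementation): it wraps a blocking MPI collective via the standard PMPI profiling interface and lowers the core frequency for the duration of the call. The sysfs path assumes the userspace cpufreq governor and suitable permissions, and the frequency values are placeholders.

    /* Minimal PMPI-based frequency-scaling sketch in the spirit of
     * COUNTDOWN (illustrative only).  Built as a shared library and
     * injected with LD_PRELOAD, it applies to unmodified binaries. */
    #include <mpi.h>
    #include <stdio.h>

    static void set_khz(long khz)
    {
        /* Assumes the userspace cpufreq governor; only cpu0 is shown. */
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", "w");
        if (f) {
            fprintf(f, "%ld", khz);
            fclose(f);
        }
    }

    int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                      MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
    {
        set_khz(1200000);   /* placeholder low frequency (1.2 GHz)     */
        int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
        set_khz(2600000);   /* placeholder nominal frequency (2.6 GHz) */
        return rc;
    }

A real tool additionally needs heuristics to skip phases that are too short to amortise the frequency transition, which corresponds to the filtering described above.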

Responsible partner: CINI – University of Bologna

StreamFlow

The StreamFlow Workflow Management System (WMS) supports the declarative description of cross-application workflows: data dependencies, complex execution environments composed of heterogeneous and independent infrastructures, and the mapping of steps onto execution locations. This hybrid workflow approach, based on the Common Workflow Language (CWL) open standard, enables the deployment of different workflow steps (e.g., an MPI application, a TensorFlow application) onto different modules (e.g., GPU booster, GPP cluster) using different methods (e.g., Slurm, Kubernetes, YSTIA), as well as the seamless porting onto new modules and deployment methods through self-contained plugins. The topology awareness emerging from these workflow models allows StreamFlow to implement locality-based scheduling strategies, automated data transfers and fault tolerance.

In EUPEX, StreamFlow couples applications via their native file interfaces and thus does not require any change to the original code. For this reason, it will be the primary means for rapidly prototyping assemblies of legacy applications on the MSA system.

Responsible partner: CINI – University of Turin

Open MPI

Atos Open MPI is an implementation of the MPI standard. It extends the well-known open-source Open MPI by adding several key components, including support for the BXI interconnect developed by Atos, several implementations of collective communication algorithms, and features not yet covered by the standard, such as Notified RMA and partitioned collective communications.
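
For context, the sketch below shows standard MPI-3 one-sided communication (RMA), the baseline that Notified RMA builds upon; the notified variants themselves are Atos extensions and are not shown here. Run with at least two ranks.

    /* Standard MPI-3 RMA: rank 0 writes into rank 1's window.  Notified
     * RMA extends this pattern with a notification at the target; that
     * extension is not part of the MPI standard and not shown here. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Expose one int per rank as an RMA window. */
        MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0) {
            int value = 42;
            MPI_Put(&value, 1, MPI_INT, 1 /* target rank */, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);   /* completes the Put on all ranks */

        if (rank == 1)
            printf("rank 1 received %d\n", buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }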

In EUPEX, Atos Open MPI will provide a second MPI library, alongside ParaStation MPI, to support application communication.

Responsible Partner: Atos

CAPIO (Cross Application Programmable I/O)

CAPIO is a user-space middleware that can improve the performance of scientific workflows composed of applications that communicate through files and the file system's POSIX APIs. CAPIO is designed to be programmable: it allows the definition of cross-application parallel data-movement streaming patterns and of parallel in-situ/in-transit transformations on these data streams. This data-movement programming complements the application workflow, providing the application pipeline with cross-layer optimisation between the application and the storage system. The user provides a JSON file that states, for each file or directory, which applications produce and read it, how the communication is performed (e.g., scatter, gather, broadcast), and any in-situ or in-transit data transformations.

In addition, CAPIO enables streaming computation while the applications keep using the POSIX I/O APIs: it automatically transforms synchronous POSIX-based I/O operations into asynchronous ones, removing the file system from the critical path of the computation. This improves the performance of scientific workflows, which are typically organised as a data-flow pipeline of several stages, where each stage can start only once the previous stages have generated all of its input data.
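
The following minimal C sketch (illustrative file name and stages, not part of CAPIO) shows the kind of file-based coupling CAPIO targets. Run as-is, the consumer can only start once the producer has closed the file; under CAPIO, the same unmodified POSIX calls are intercepted so the data can be streamed between the stages.

    /* Two workflow stages coupled through a plain POSIX file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void producer(void)
    {
        int fd = open("stage1.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        for (int i = 0; i < 4; i++) {
            char rec[32];
            int n = snprintf(rec, sizeof rec, "record %d\n", i);
            write(fd, rec, n);   /* synchronous POSIX write */
        }
        close(fd);
    }

    static void consumer(void)
    {
        char buf[128];
        int fd = open("stage1.out", O_RDONLY);
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)   /* synchronous POSIX read */
            write(STDOUT_FILENO, buf, n);
        close(fd);
    }

    int main(int argc, char **argv)
    {
        if (argc > 1 && strcmp(argv[1], "consumer") == 0)
            consumer();
        else
            producer();
        return 0;
    }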

Responsible partner: CINI – University of Pisa and University of Turin

Hwloc

The Portable Hardware Locality (hwloc) software package provides a portable abstraction (across OSes, versions, architectures, …) of the hierarchical topology of modern architectures, including NUMA memory nodes, sockets, shared caches, cores and simultaneous multithreading. It also gathers various system attributes such as cache and memory information, as well as the locality of I/O devices such as network interfaces, InfiniBand HCAs or GPUs.
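
A minimal example of the hwloc C API (error handling abbreviated): it loads the machine topology, counts the cores, and binds the calling thread to the first core.

    /* Compile with: cc hello_hwloc.c -lhwloc */
    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topology;
        hwloc_topology_init(&topology);
        hwloc_topology_load(topology);   /* discover the machine */

        int ncores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
        printf("detected %d cores\n", ncores);

        /* Bind this thread to the cpuset of the first core. */
        hwloc_obj_t core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, 0);
        if (core)
            hwloc_set_cpubind(topology, core->cpuset, HWLOC_CPUBIND_THREAD);

        hwloc_topology_destroy(topology);
        return 0;
    }

This is the same topology information that runtimes such as MPI libraries query to co-locate processes with the memory and devices they use.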

In EUPEX, hwloc will be in charge of modelling the architecture of the pilot platform and exposing it to upper software layers, such as runtimes and applications that want to place tasks and buffers according to hardware and software affinities.

Responsible partner: Inria
