Defining a management software stack to support the administration of modular systems while being versatile enough to meet the requirements of upcoming architectures
Ocean is an open source suite for HPC cluster administration. It is developed by CEA and supports the deployment and day-to-day operation of a robust management infrastructure for HPC systems. Today, Ocean serves as the administration stack on all HPC clusters on CEA’s premises, including the Tier-0 system Irene.
Ocean-core is an RPM-based Linux distribution tailored to HPC cluster administration. It is built upon a standard Linux distribution (CentOS 7.9, Rocky Linux) and gathers a large range of open source tools that allow almost fully automated management of heterogeneous HPC clusters.
In addition, Ocean-stack describes design and implementation guidelines for such HPC clusters, along with best practices and standard operating procedures. These cover the robust management of the main services, such as deployment, content delivery, configuration management and hardware management. Ocean-stack is the handbook for building and operating HPC clusters at scale.
Ocean will serve as the administration stack for the EUPEX Pilot and will provide a solid infrastructure for the Pilot's other software components and applications.
Responsible Partner: CEA
ParaStation Modulo will serve as a central pillar of the EUPEX management software stack. To this end, it will be integrated with the Ocean management infrastructure developed by CEA, enriching it to support the management and operation of supercomputing systems that follow the Modular Supercomputing Architecture (MSA) approach.
ParaStation Modulo is the MSA-enabling supercomputing software suite developed by ParTec. It is extensively used in production environments such as the JUWELS Cluster-Booster system at Jülich Supercomputing Centre (JSC) and the modular MeluXina system in Luxembourg. A central pillar of ParaStation Modulo is ParaStation MPI, which provides a versatile process management subsystem and an MPI runtime library. The latter is powered by pscom, a low-level communication library that supports various transports and communication protocols, including gateway capabilities to efficiently bridge MPI traffic across different interconnects. The ParaStation Management execution environment acts as the resource manager and is fully integrated with the Slurm workload manager, constituting a complete framework for enabling modular supercomputing.
In addition to this robust and efficient system middleware, the ParaStation Modulo software suite also comprises sophisticated management components such as the ParaStation ClusterTools (for provisioning and administration), the ParaStation HealthChecker (for automated error detection and integrity checking) and the ParaStation TicketSuite (for analysing and keeping track of issues).
Responsible Partner: ParTec AG
LLview is a set of software components for monitoring clusters that are controlled by a resource manager and a scheduler system. Within its Job Reporting module, it provides detailed information about each individual job running on the system. To achieve this, LLview connects to different sources in the system and collects data to present to the user via a web portal. For example, the resource manager provides information about the jobs, while additional daemons may be used to acquire extra information from the compute nodes. Because the metrics are sampled at intervals on the order of minutes, the overhead is kept to a minimum. The LLview portal establishes a link between performance metrics and individual jobs to provide a comprehensive job reporting interface.
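The link between performance metrics and individual jobs can be pictured with a small sketch. It is not LLview code and does not reflect LLview's actual data model. The `Job` and `Sample` records, the `cpu_load` metric and the `metrics_per_job` function are hypothetical names chosen for illustration, assuming only that the resource manager reports which nodes a job occupies and when, and that node-level daemons deliver timestamped samples.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Job:
    """Hypothetical job record, as a resource manager might report it."""
    job_id: str
    nodes: tuple
    start: int  # seconds since an arbitrary epoch
    end: int


@dataclass
class Sample:
    """Hypothetical metric sample, as a node-level daemon might report it."""
    node: str
    timestamp: int
    cpu_load: float


def metrics_per_job(jobs, samples):
    """Assign each sample to the job that held the node at that time,
    then average the metric per job."""
    report = {}
    for job in jobs:
        values = [
            s.cpu_load
            for s in samples
            if s.node in job.nodes and job.start <= s.timestamp < job.end
        ]
        report[job.job_id] = {
            "samples": len(values),
            "mean_cpu_load": mean(values) if values else None,
        }
    return report


# Illustrative data only: two jobs and samples taken minutes apart.
jobs = [
    Job("1001", ("n01", "n02"), start=0, end=600),
    Job("1002", ("n03",), start=120, end=480),
]
samples = [
    Sample("n01", 60, 0.5),
    Sample("n02", 60, 1.5),
    Sample("n03", 60, 4.0),   # before job 1002 starts, so not attributed
    Sample("n03", 180, 2.0),
    Sample("n01", 660, 9.0),  # after job 1001 ends, so not attributed
]
report = metrics_per_job(jobs, samples)
```

Matching on node name and time window is the simplest possible join. A production monitoring system would also have to handle shared nodes, clock skew and missing samples, none of which is modelled here.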
Responsible Partner: Jülich