In terms of hardware, EUPEX has a two-fold objective:

Design the hardware architecture of the heterogeneous modular Pilot platform

Deploy and operate an Arm-based Pilot production-class platform, as well as early test platforms, and make them available to the European scientific communities

Heterogeneous Modular Pilot platform

In this respect, EUPEX’s objectives are to:

  • Collect the key technological requirements for EUPEX and for future Exascale systems, and define the architecture of the modular Pilot platform, including the interconnect technology and topology.
  • Define the hardware implementation of EUPEX compute modules.
  • Design the components of EUPEX (mainly nodes and blades, plus improvements to existing components such as interconnects and ancillary equipment), leveraging the high performance and efficiency of the European SiPearl Rhea processor and the OpenSequana specifications, and targeting:
  1. easy serviceability and replacement of the blade
  2. easy download and upgrade of the node firmware, device drivers and low-level code
  3. hot-swap at the blade level
  4. the highest level of MTBF consistent with the adoption of innovative components such as the first generation of the EPI processor.
  • Design the blades and the interconnectivity at the router level to optimize the I/O requirements of the applications while taking into account the architectural constraints and the tools/utilities.
  • Provide a number of spare blades for on-site replacement of faulty blades.

System deployment and operation

Objectives

The objectives are to:

Provide, operate and assess advanced Arm-based production-class platforms including access to early test systems and the final Pilot system.

Make these facilities available to partners and, more broadly, to the European scientific communities, e.g. through PRACE preparatory access.

Ensure a pre-production service level on the Pilot system and assess its stability, maturity, usability and maintainability.

Staged approach

In the early phases of the project, some key elements of the future Pilot system are not available yet, such as the SiPearl Rhea processor. Nevertheless, the EUPEX partners need software development vehicles. Therefore, EUPEX opted for a staged approach to allow for a meaningful co-design process and to allow applications and the SW stack to prepare for the future hardware components:

EUPEX Alpha: first phase deployed and available from month six of the project, provided in-kind by the EUPEX partners.

EUPEX Beta: second phase, starting once the first Rhea samples are available and integrated into the reference board developed in the EPI project. A minimal air-cooled packaging compatible with safety standards will be designed for a EUPEX Beta system offering up to 16 Rhea reference boards, each hosting one Rhea processor connected to one GPU PCIe card and one interconnect PCIe card.

EUPEX Pilot: the complete target HPC system, fully deployed and operational.

EUPEX Alpha

As of July 2022, 6 EUPEX partners provide test system resources in-kind from their supercomputing centres.

The availability of the test systems in this first phase is key to defining the requirements for the architecture and characteristics of the Exascale Pilot. Relying on representative application profile analysis and co-design studies performed on the test systems, our WP3 will provide valuable input to both WP4 and WP5 in order to specify the Pilot hardware and software components.

Application porting, optimization and scaling tests will be performed on these test systems, which are based on state-of-the-art Arm technologies and innovative interconnects, optionally connected to accelerated modules. Some systems, although not based on Arm processors, offer other interesting features.

The 6 EUPEX "Alpha" test systems

Joliot-Curie-Irene A64FX (CEA / GENCI)

This system is identified as the "Software Development Vehicle (SDV)", i.e., the main system for preparing applications and middleware for the EUPEX project. Because it sits within the target infrastructure and centralized administration environment (OCEAN) of the future EUPEX Pilot at the TGCC, it gives all partners the opportunity to prepare applications, runtimes and tools in the target infrastructure, which simplifies their deployment and integration on the Pilot.

Based on Fujitsu PRIMEHPC FX700 technology, the "A64FX" partition consists of 80 single-socket DDR-less A64FX compute nodes connected via Mellanox InfiniBand and integrated into GENCI's Joliot-Curie supercomputer, hosted and operated by CEA.

The A64FX partition gives scientists the opportunity to port their applications and prepare for the future European processor by relying on distinctive features of the A64FX processor, such as its SVE vector instruction set and its fast-access HBM2 memory (32 GB per node).

ARMIDA (E4)

ARMIDA is an Arm cluster based on Cavium ThunderX2 CPUs. It uses the same processor architecture (Arm) as the European Rhea processor, which makes it of great interest for developing and testing code in an environment as close as possible to the production one. With the use of energy consumption sensors (power meter, metered PDU, queries to the OS from the command line), detailed studies of energy efficiency are possible.

KAROLINA GPU partition (IT4Innovations)

IT4Innovations' Karolina is a EuroHPC petascale system with an overall theoretical peak performance of 15.7 PFLOP/s. Its accelerated partition, with 8 GPUs per node, offers a development testbed for the EUPEX applications, in particular for achieving efficient use of the accelerated architecture that will be implemented in the pilot systems.

Moreover, the system allows energy consumption to be measured (AMD RAPL + NVML) and CPU and GPU parameters (frequencies, power capping) to be tuned, so it can be used for energy-efficiency analysis of the EUPEX applications or for the development of energy-efficient/power-aware runtime systems.
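As a rough illustration of the energy-to-solution accounting that such counters make possible, the Python sketch below combines CPU and GPU energy readings taken before and after a run. It assumes, for illustration only, that the CPU counters are cumulative microjoule values that wrap around at a known maximum and that the GPU readings are cumulative millijoule values. The function names and units are hypothetical choices for this sketch and do not describe any EUPEX or Karolina tool.

```python
from typing import Iterable


def counter_delta_uj(start_uj: int, end_uj: int, max_range_uj: int) -> int:
    """Energy (microjoules) between two readings of a cumulative counter.

    Assumption: the counter wraps at most once between the two readings
    and restarts from zero after reaching max_range_uj.
    """
    if end_uj >= start_uj:
        return end_uj - start_uj
    # The counter wrapped around between the two readings.
    return (max_range_uj - start_uj) + end_uj


def energy_to_solution_j(cpu_deltas_uj: Iterable[int],
                         gpu_deltas_mj: Iterable[int]) -> float:
    """Total energy in joules for one run.

    cpu_deltas_uj: one delta per CPU package, in microjoules.
    gpu_deltas_mj: one delta per GPU, in millijoules.
    """
    cpu_j = sum(cpu_deltas_uj) / 1e6
    gpu_j = sum(gpu_deltas_mj) / 1e3
    return cpu_j + gpu_j
```

In practice, the readings would be sampled immediately before and after the application run, once per CPU package and once per GPU, and the resulting deltas passed to `energy_to_solution_j`.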

BARBORA CPU partition (IT4Innovations)

The Barbora cluster is a relatively small Intel CPU-powered cluster. In the context of the EUPEX project, this system is of interest because it allows hardware parameter tuning and precise power monitoring.

The Barbora nodes are equipped with an out-of-band Atos High Definition Energy Efficiency Monitoring (HDEEM) system that precisely measures the energy consumption of the applications executed on the node. The system monitors the power consumption of the whole compute node and of several power subdomains (voltage regulators, VRs) such as CPU, DRAM or NIC.
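To show how a power trace of this kind can be turned into an energy figure, the Python sketch below integrates sampled power over time with the trapezoidal rule, once per monitored domain. It is a generic illustration and not the HDEEM API: the function names, the domain labels and the assumption that all domains share one list of timestamps are hypothetical.

```python
from typing import Dict, Sequence


def integrate_power_j(timestamps_s: Sequence[float],
                      power_w: Sequence[float]) -> float:
    """Energy in joules from power samples, using the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    energy_j = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        # Average of two neighbouring samples times the interval length.
        energy_j += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy_j


def energy_per_domain_j(timestamps_s: Sequence[float],
                        traces_w: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Integrate each domain's power trace (e.g. node, CPU, DRAM) separately."""
    return {name: integrate_power_j(timestamps_s, trace)
            for name, trace in traces_w.items()}
```

Comparing the whole-node figure with the sum of the subdomain figures gives a rough view of how much energy is consumed outside the monitored subdomains.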

MONTE CIMONE (University of Bologna)

Monte Cimone is based on the HiFive Unmatched board, built around the SiFive Freedom U740 RISC-V SoC and integrated in an HPC node form factor. The resulting system is the first RISC-V cluster worldwide. This licence-free processor architecture is among the best candidates for European technology. Monte Cimone makes it possible to assess the maturity of RISC-V with respect to Arm for HPC usage, where memory performance, cache management and concurrency are critical, as is energy efficiency. The benefit for EUPEX is the ability to provide feedback to RISC-V design and to contribute to a wider choice of European technology candidates.

EPI-TO (University of Torino)

The proposed architecture (Arm) is close to that of the European Rhea processor. The EPI-TO cluster integrates two high-end GPUs per node (NVIDIA A100), making it possible to test accelerated Arm codes, and two NVIDIA BlueField-2 (BF2) DPU accelerators per node, which makes it possible to test an entirely different kind of accelerator. EPI-TO is built on the NVIDIA dev-kit, which has been distributed to only 100 universities worldwide (UNITO being the only one in Italy).
