In this respect, EUPEX’s objectives are to:
In the early phases of the project, some key elements of the future Pilot system, such as the SiPearl Rhea processor, are not yet available. Nevertheless, the EUPEX partners need software development vehicles. EUPEX therefore opted for a staged approach that allows a meaningful co-design process and lets applications and the software stack prepare for the future hardware components:
EUPEX Alpha: first phase deployed and available from month six of the project, provided in-kind by the EUPEX partners.
EUPEX Beta: second phase, starting once the first Rhea samples are available and integrated into the reference board developed in the EPI project. A minimal air-cooled packaging compatible with safety standards will be designed for a EUPEX Beta system offering up to 16 Rhea reference boards, each hosting one Rhea processor connected to one GPU PCIe card and one interconnect PCIe card.
EUPEX Pilot: the complete target HPC system, fully deployed and operational.
As of July 2022, six EUPEX partners provide test system resources in-kind from their supercomputing centres.
The availability of the test systems as a first phase is key to defining the requirements for the Exascale Pilot architecture and characteristics. Relying on representative application profile analysis and co-design studies performed on the test systems, our WP3 will provide valuable inputs to both WP4 and WP5 for specifying the Pilot hardware and software components.
Application porting, optimization and scaling tests will be performed on these test systems, which are based on state-of-the-art Arm technologies and innovative interconnects, optionally connected to accelerated modules. Some systems, although not based on Arm processors, offer other interesting features.
This system is identified as the “Software development vehicle” (SDV), i.e., the main system for preparing applications and middleware for the EUPEX project. Because it sits in the target infrastructure of the future EUPEX Pilot within the TGCC, under the same centralized administration (OCEAN), it offers all partners the opportunity to prepare applications, runtimes and tools in the target infrastructure, which simplifies their deployment and integration on the Pilot.
Based on Fujitsu PRIMEHPC FX700 technology, the “A64FX” partition consists of 80 single-socket DDR-less A64FX compute nodes connected via Mellanox InfiniBand and integrated into GENCI’s Joliot-Curie supercomputer, hosted and operated by CEA.
The A64FX partition gives scientists the opportunity to port their applications and prepare for the use of the future European processor by relying on the unique features of the A64FX processor, such as its SVE vector instruction set and its fast-access HBM2 memory (32 GB per node).
ARMIDA is an Arm cluster based on Cavium ThunderX2 CPUs. Its architecture is close to that of the European Rhea processor, which is of great interest for developing and testing code in an environment as close as possible to the production one. Energy consumption sensors (power meter, metered PDU, OS queries from the command line) make detailed energy-efficiency studies possible.
IT4Innovations’ Karolina is a EuroHPC petascale system with an overall theoretical peak performance of 15.7 PFLOP/s. Its accelerated partition, with 8 GPUs per node, offers a development testbed for the EUPEX applications, particularly for reaching efficient use of accelerated architectures, which will be implemented in the pilot systems.
Moreover, the system allows energy consumption to be measured (AMD RAPL + NVML) and CPU and GPU parameters to be tuned (frequencies, power capping), so it can be used for energy-efficiency analysis of the EUPEX applications or for the development of energy-efficient/power-aware runtime systems.
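RAPL-style interfaces typically expose a cumulative energy counter rather than instantaneous power, so an energy figure for a code region comes from the difference between two readings. The following is a minimal Python sketch of that arithmetic, not the tooling used on Karolina. The function names are hypothetical, and it assumes a microjoule counter that wraps at a known maximum range (as the Linux powercap files `energy_uj` and `max_energy_range_uj` do) and wraps at most once between the two readings.

```python
def energy_delta_uj(start_uj: int, end_uj: int, max_range_uj: int) -> int:
    """Energy consumed between two counter readings, in microjoules.

    Assumption: the counter is cumulative and wraps back towards zero after
    reaching max_range_uj, and it wrapped at most once between the readings.
    """
    if end_uj >= start_uj:
        return end_uj - start_uj
    # The counter wrapped: add the part before the wrap and the part after it.
    return (max_range_uj - start_uj) + end_uj


def average_power_w(delta_uj: int, elapsed_s: float) -> float:
    """Average power in watts over an interval of elapsed_s seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return delta_uj / 1e6 / elapsed_s
```

In practice the two readings would be taken immediately before and after the region of interest, with the elapsed time taken from a monotonic clock.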
The Barbora cluster is a relatively small Intel CPU-powered cluster. In the context of the EUPEX project, this system is interesting because it allows hardware parameter tuning and precise power monitoring.
The Barbora nodes are equipped with an out-of-band Atos High Definition Energy Efficiency Monitoring (HDEEM) system, which allows precise measurement of the energy consumption of applications executed on the node. The system monitors the power consumption of the whole compute node and of several power subdomains (voltage regulators, VRs) such as CPU, DRAM or NIC.
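When a monitoring system reports power samples per domain, as described above for the node and its subdomains, the energy of an application run can be estimated by integrating those samples over time. The sketch below uses the trapezoidal rule in Python. It is illustrative only and does not use the HDEEM API: the function name and the `(timestamp_s, power_w)` sample format are assumptions.

```python
from typing import Sequence, Tuple


def integrate_energy_j(samples: Sequence[Tuple[float, float]]) -> float:
    """Estimate energy in joules from (timestamp_s, power_w) samples.

    Uses the trapezoidal rule between consecutive samples. Samples must be
    ordered by timestamp; fewer than two samples yield 0.0.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be sorted by timestamp")
        # Area of the trapezoid spanned by two neighbouring power samples.
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total
```

Applying the same function separately to the samples of each domain (whole node, CPU, DRAM, NIC) gives a per-domain energy breakdown for the run.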
Monte Cimone is based on the SiFive Freedom U740 RISC-V SoC HiFive Unmatched board integrated in an HPC node form factor. The proposed architecture is the first RISC-V cluster worldwide. This licence-free processor architecture is among the best candidates for European technology. Monte Cimone opens the opportunity to assess the maturity of RISC-V with respect to Arm for HPC usage, where memory performance, cache management and concurrency are critical, as well as energy efficiency. The benefit for EUPEX is the ability to provide feedback to RISC-V design and to contribute to a wider choice of European technology candidates.
The proposed architecture (Arm) is close to that of the European Rhea processor. The EPI-TO cluster integrates two high-end GPUs per node (NVIDIA A100), making it possible to test accelerated Arm codes, and two NVIDIA BF2 DPU accelerators per node, which makes it possible to test an entirely different kind of accelerator. EPI-TO is part of the NVIDIA dev-kit, which has been distributed to only 100 universities worldwide (UNITO being the only one in Italy).