EUPEX partners CINI and CINECA had their paper “Rule-based Thermal Anomaly Detection for Tier-0 HPC Systems” accepted at MODA22, the 3rd Workshop on “Monitoring & Operational Data Analytics”, co-located with ISC 2022.
Abstract: Today, significant advances in science and technology cannot be envisioned without high computing capacity. To solve large problems in science, engineering, and business, data centers provide High-Performance Computing (HPC) systems that aggregate the computing capacity of thousands of computing nodes, at a cost of millions of euros per year. In the data center, an anomaly is a suspicious or abnormal pattern in the monitoring signals. The severity of an anomaly can vary, and in extreme conditions it can cause an outage of the data center. By defining complex statistical rule-based anomaly detection methods, this paper investigates the thermal anomaly detection task in one of the most powerful HPC systems in the world, namely Marconi100 hosted at CINECA. The suggested anomaly detection method is successfully validated against real thermal hazard events reported for the studied HPC cluster while in production.
Authors: Mohsen S. Ardebili, Andrea Bartolini, Andrea Acquaviva and Luca Benini
The recording of the presentation can be viewed below.