Start: 2025-02-25 08:00:00
End: Unknown
Updated: 2025-04-02 13:36:42
Level: high
A considerable amount of hardware in the Tier1 Compute infrastructure Hortense is currently broken. On-site investigation has shown that this is caused by corrosion and leakage in the cooling system that runs directly to the processing units. Vendors ATOS (now Eviden) and APAC are aware of the problem and confirm that it stems from a design issue with the cooling system.
The damage is extensive. Among other things, 90% of all GPU nodes in the gpu_rome_a100_40 partition are unavailable.
We are in communication with vendors Eviden (previously ATOS) and APAC to have this damage fixed and hardware replaced as soon as possible. It is already clear that a complete retrofit of the cooling system will need to happen. This, no doubt, will require another substantial maintenance window.
Update 10/03/2025: an interim repair will be carried out starting 17/03/2025 and will likely take a couple of days. It will fix a leak in the cooling system. No full downtime is needed this time, because only some hardware racks are affected. Less hardware will be available during the repair work, but a considerable number of CPU and GPU nodes will remain available.
More extensive retrofit work will still be needed to replace the direct cooling loops. We will inform all Tier1 users by email as soon as vendor Eviden (previously ATOS) confirms a date for these repairs.
No maintenance planned currently.