(RESOLVED) HPC-UGent shared filesystems unstable ($VSC_HOME, $VSC_DATA, $VSC_SCRATCH*)

Start: 2023-09-11 02:20:00
End: Unknown
Updated: 2023-09-22 13:46:37
Level: high

A total lockup of the GPFS shared filesystems occurred on Mon 11 Sept'23 around 02:20.

Cluster worker nodes are currently unable to start new jobs; we are working on fixing the problem ASAP.

Update Mon 11 Sept'23 - 09:39:
To mitigate the problems we are seeing, we need to perform emergency maintenance updates on the Infiniband switches that are used for the shared filesystems.

This means that HPC-UGent shared filesystems will be unavailable on the HPC-UGent Tier-2 login nodes as well, and that for UGent VSC accounts (vsc4****) both $VSC_HOME and $VSC_DATA will be unavailable on Tier-1 Hortense, until further notice.

Update Mon 11 Sept'23 - 13:05
The firmware update for the Infiniband switches and the subsequent reboot of the switches are almost complete.
We do not expect any data loss or corruption problems on the HPC-UGent filesystems because of this event.

Update Mon 11 Sept'23 - 14:54 Problems have been resolved; we are running some test jobs to make sure the HPC-UGent shared filesystems are stable again on all HPC-UGent Tier-2 clusters.

Update Mon 11 Sept'23 - 15:57 $VSC_HOME and $VSC_DATA are available again on Tier-1 Hortense for UGent VSC accounts (vsc4****).

Update Mon 11 Sept'23 - 16:57 Test jobs have not revealed any problems, so HPC-UGent shared filesystems can be considered stable again.
Job schedulers will gradually be resumed on all HPC-UGent Tier-2 clusters in the next couple of hours.

Update Mon 11 Sept'23 - 22:56 Problems with the HPC-UGent shared filesystems have resurfaced. The HPC-UGent $VSC_HOME, $VSC_DATA, and $VSC_SCRATCH* filesystems are currently unavailable everywhere.

Update Tue 12 Sept'23 - 00:10 HPC-UGent shared filesystems are available again on the HPC-UGent Tier-2 login nodes and VSC Tier-1 Hortense, but stability cannot be guaranteed.
Job schedulers on all HPC-UGent Tier-2 clusters have been paused so we can investigate further.

Update Tue 12 Sept'23 - 07:53 The cause of the problems with the HPC-UGent shared filesystems is a core Infiniband switch that is acting up: it spontaneously rebooted on Mon 11 Sept'23 around 22:37 CEST (and again on Tue 12 Sept'23 around 05:25 CEST).
We are looking into resolving the issue ASAP and making the HPC-UGent filesystems stable again.

Update Tue 12 Sept'23 - 13:56 We are still seeing stability problems with the Infiniband network that is used for the HPC-UGent shared filesystems.
We are in touch with the vendor of the hardware, and are doing what we can to resolve the problem ASAP.

Update Tue 12 Sept'23 - 16:56 The Infiniband network has been restored (although with reduced redundancy), and the faulty switch has been removed.
The HPC-UGent shared filesystems should be stable again.
We are running tests to ensure that the clusters are stable again before we resume the HPC-UGent Tier-2 job schedulers.

Update Tue 12 Sept'23 - 19:41 HPC-UGent Tier-2 job schedulers have been resumed; everything should be back to normal for now.
We will follow up on the faulty Infiniband switch with the vendor, and will look into restoring redundancy in the Infiniband network later.

Planned maintenance

No maintenance planned currently.