MT2 Sept 2017 Login Node Outage

Summary

Beginning Thursday, September 14th, POD customers experienced a degradation of MT2's Scyld Cloud Workstation (SCW) and Login Node instances, which became a total outage on Friday, September 15th at 0825 GMT.  All services were restored and made available to users on Monday, September 18th at 0500 GMT.

Impact

During the outage, the following services were unavailable to customers:

  • POD MT2 Login Node Instances

  • POD MT2 Scyld Cloud Workstation Desktops (SCW)

Data storage and compute nodes were unaffected by this outage.  We experienced no data loss on any file system.

Cause of Outage

A bug in open source software, related to XFS LVM volumes running Docker containers for system services, halted the Ceph distributed file system, which hosts the Login Node and SCW instances. As a result, the Login Node and SCW instances could not access their OS partitions and either became unavailable or shut down.

Remediation

POD's Login Node and SCW instance storage was decoupled from the services affected by the bug, and all Login Node and SCW instances were brought back online for customers.  In most cases, the Login Node and SCW instances were assigned the same IP addresses they had before the outage.

Penguin Computing will not bill customers during the September and October billing cycles for MT2 Login Nodes and SCW instances that were affected by the outage.  Please contact sales with questions: podsales@penguincomputing.com.