
Recovering from a failed data disk

Hard disks and solid state disks are components that are likely to fail after a period of use. Ceph keeps multiple copies of all data in the cluster and can automatically recover from one or more disk failures, provided enough disks remain across the nodes to redistribute the data; however, each failed disk reduces the total capacity of the pool. This document describes how to recover from this situation.
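How many copies Ceph keeps, and how much capacity remains, can be inspected with the standard Ceph CLI. A minimal check, assuming a default replicated pool:

    ceph osd pool ls detail   # the "size" field is the number of replicas kept for each pool
    ceph df                   # cluster-wide and per-pool capacity, which shrinks as disks are lost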

Symptoms

Ceph will report HEALTH_WARN or HEALTH_ERR in ceph status, along with a count of osds down.
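The following standard Ceph commands surface the failure; the exact wording of the health messages varies by release:

    ceph status               # overall health; look for HEALTH_WARN/HEALTH_ERR and "osds down"
    ceph health detail        # expands each health check and lists the affected OSDs
    ceph osd tree             # OSDs marked "down" correspond to failed or unreachable disks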

Info

If enough data disks remain to redistribute the data, Ceph may transition back to a HEALTH_OK state once redistribution completes, but ceph status will still report osds down.
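Redistribution progress can be followed with standard Ceph commands; recovery and backfill activity appears in the status output until the data is fully re-replicated:

    ceph status               # shows recovery/backfill progress and remaining degraded objects
    ceph osd stat             # counts of OSDs that are up and in; a down OSD reduces both
    ceph -w                   # watch cluster events until the cluster settles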

Danger

There may be data loss in a HEALTH_ERR state! A ticket MUST be created with the HyperCloud Support Team at support@softiron.com to look into this issue before proceeding.

The Linux kernel may report I/O errors for the disk; these can be viewed in dmesg or /var/log/all.
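A quick way to confirm kernel-level errors on the suspect disk (device names such as /dev/sdX are placeholders):

    dmesg -T | grep -iE 'i/o error|medium error'   # kernel I/O and SCSI medium errors
    grep -i 'i/o error' /var/log/all               # persisted log, as noted above
    smartctl -H /dev/sdX                           # SMART health summary, if smartctl is available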

Recovery

  1. SSH into the storage node with the failed disk.
  2. Prevent ceph-automountd from interfering by running mkdir -p /var/run/ceph-automount && touch /var/run/ceph-automount/suspend.
  3. For each failed OSD, run ceph-decom-osd -failedDisk. The OSD ID can be determined by running ceph osd tree (see the sketch after this list).
  4. Physically remove the failed disk and, optionally, replace it.
  5. Run rm /var/run/ceph-automount/suspend to allow ceph-automountd to re-ingest the disk.
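
Putting the steps together, a minimal sketch of the session on the affected storage node is shown below. The node name and OSD ID are placeholders; the ceph-decom-osd invocation is kept exactly as documented above, so check its built-in help for any additional arguments it may require.

    ssh root@<storage-node>                        # step 1: log in to the node with the failed disk

    # step 2: suspend ceph-automountd so it does not re-mount the disk mid-recovery
    mkdir -p /var/run/ceph-automount && touch /var/run/ceph-automount/suspend

    # step 3: identify the failed OSD(s), then decommission each one
    ceph osd tree                                  # note the IDs of OSDs marked "down"
    ceph-decom-osd -failedDisk                     # run once per failed OSD

    # step 4: physically remove (and optionally replace) the disk, then...

    # step 5: clear the suspend flag so ceph-automountd can re-ingest the new disk
    rm /var/run/ceph-automount/suspend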