Recovering from a failed data disk
Hard disks and solid state disks are components that are likely to fail after a period of use. Ceph keeps multiple copies of all data in the cluster and can automatically recover from one or more disk failures, provided enough disks remain on the nodes to adequately distribute the data; however, the total size of the pool is reduced as disks fail. This document describes how to recover from this situation.
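As a quick, hedged illustration using standard Ceph CLI commands (the exact output columns vary by Ceph release), the raw and per-pool capacity can be observed before and after a disk failure:

```shell
# Overall and per-pool capacity; compare before and after an OSD is lost.
ceph df

# Per-OSD utilisation, useful for spotting the affected disk.
ceph osd df tree
```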
Symptoms
Ceph will report `HEALTH_WARN` or `HEALTH_ERR` in `ceph status`, with OSDs reported as down.
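A minimal sketch of how to check for this, using standard Ceph CLI tools (the exact wording of the health summary varies by Ceph release):

```shell
# Summarise cluster health; look for HEALTH_WARN/HEALTH_ERR and "osds down".
ceph status

# Show the detailed reason for the warning or error state.
ceph health detail

# List OSDs and their state; a failed disk's OSD is shown as "down".
ceph osd tree
```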
Info
If there are enough data disks remaining to redistribute the data, Ceph may transition back to a `HEALTH_OK` state once the data has been redistributed, but `ceph status` will still report OSDs down.
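To confirm that OSDs are still down even after health returns to `HEALTH_OK`, the up/in counters can be compared against the total. A minimal sketch using standard Ceph commands (the `down` filter to `ceph osd tree` is available on recent Ceph releases):

```shell
# "X up" lower than the total OSD count indicates OSDs that are still down.
ceph osd stat

# List only the OSDs currently marked down, if any.
ceph osd tree down
```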
Danger
There may be data loss in a `HEALTH_ERR` state! A ticket MUST be created with the HyperCloud Support Team at support@softiron.com to look into this issue before proceeding.
The Linux kernel may also report I/O errors for the failing disk, which can be viewed in `dmesg` or `/var/log/all`.
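A hedged sketch of how those messages might be located; the exact error strings depend on the disk driver and kernel version:

```shell
# Kernel ring buffer with human-readable timestamps, filtered for common
# disk error strings (exact wording varies by driver and kernel version).
dmesg -T | grep -iE 'i/o error|medium error|blk_update_request'

# The same messages as captured in the system log on this platform.
grep -i 'i/o error' /var/log/all
```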
Recovery
- SSH into the storage node with the failed disk.
- Suspend `ceph-automountd` so it does not get in the way by running `mkdir -p /var/run/ceph-automount && touch /var/run/ceph-automount/suspend`.
- For each OSD that has failed, run `ceph-decom-osd -failedDisk`. The OSD ID can be determined by running `ceph osd tree`.
- Physically remove and optionally replace the disk.
- Run `rm /var/run/ceph-automount/suspend` to allow `ceph-automountd` to re-ingest the disk. A consolidated sketch of the full sequence follows this list.
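The steps above, gathered into a single hedged sketch. The `ceph-decom-osd -failedDisk` invocation is reproduced exactly as documented in this procedure; confirm the arguments it expects on the node itself before running it.

```shell
# Stop ceph-automountd from re-ingesting the disk while it is being handled.
mkdir -p /var/run/ceph-automount && touch /var/run/ceph-automount/suspend

# Identify the failed OSD(s); the failed disk's OSD is marked "down".
ceph osd tree

# Decommission each failed OSD (invocation reproduced verbatim from this
# procedure; confirm the expected arguments on the node before running).
ceph-decom-osd -failedDisk

# ... physically remove and optionally replace the disk ...

# Allow ceph-automountd to re-ingest disks again.
rm /var/run/ceph-automount/suspend
```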