
Failure Recovery on ZFS

Miscellaneous log of ZFS issues I have had over the years.

Replacing failed disks

I had a disk that was labeled “degraded”. I ordered a new disk and hot swapped it with the degraded disk.

NOTE TO FUTURE SELF - offline the disk before hot swapping next time.

NOTE FROM FUTURE SELF - if the disk was already faulted, you can’t actually offline it, so it’s fine to just pull out the faulted disk and then mark it as offline after.

Once I did that, the disk in ZFS was labeled as “unavailable” instead. After some googling, I decided to retroactively offline the bad disk using the command zpool offline <pool name> <device name>.

<device name> is the name that shows up when you run zpool status <pool name>.

Once I did that, the status of the disk changed to “offline”. I think I probably should have done this before I hot swapped the disks.
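As a concrete sketch (the pool name tank and the device name below are made up; use whatever zpool status shows for your pool):

# confirm the pool state and the name of the pulled disk
zpool status tank
# mark the already-removed disk as offline
zpool offline tank ata-WDC_WD40EFRX-68N32N0_WD-OLD1234567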

Next I had to tell ZFS that I had a new disk to replace the old one. After some trial and error, I stumbled upon the right command:

zpool replace <pool name> <old device name> <new device name>

You can get the new device name by checking /dev/disk/by-id/. In my case, with my WD disks, I could run ls /dev/disk/by-id/ata-WDC* and compare the last chunk of alphanumeric characters with the disk serial number. I could get the serial number either from the Proxmox web UI (I knew which slot the new disk was in, since it replaced the bad disk at /dev/sdX) or from the label on the physical drive itself.
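Putting that together, a minimal sketch with made-up names (tank and the two by-id device names below are placeholders, not my real ones):

# list WD disks by their stable by-id names; the suffix after the model is the serial number
ls /dev/disk/by-id/ata-WDC*
# resilver onto the new disk
zpool replace tank ata-WDC_WD40EFRX-68N32N0_WD-OLD1234567 ata-WDC_WD40EFPX-68C6CN0_WD-NEW7654321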

After that, zpool status should show a replacing vdev (something like replacing-2) containing both disks for a while, until the resilver finishes.

False positive failed disks

If there is a disk that shows up as faulted but you don’t think it actually failed (e.g. you just bought it, there was a power failure, etc.), you can do the following:

zpool clear <pool name> <device name>
zpool online <pool name> <device name> 
zpool scrub <pool name>

You can get the device name from zpool status <pool name>.
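For example, with the same made-up pool and device names as above:

zpool clear tank ata-WDC_WD40EFRX-68N32N0_WD-OLD1234567
zpool online tank ata-WDC_WD40EFRX-68N32N0_WD-OLD1234567
zpool scrub tank
# the scrub runs in the background; check progress and any new errors with
zpool status tank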

Online swaps

I had an issue come up when trying to swap a drive in place: the replacement drive would also appear as faulted, with the errors “too many errors” and “insufficient replicas”.

Because of this, I started over by removing the new drive with zpool detach <pool name> <device name> and tried to do an online swap instead, installing the new drive in an extra drive bay that I had.

An online swap is similar to what I was doing before, but I don’t run zpool offline beforehand. It requires a spare drive bay; see the sketch below.
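Roughly, the online swap looked like this (made-up names again; the new disk sits in the spare bay while the old one stays connected):

# undo the earlier in-place replacement attempt
zpool detach tank ata-WDC_WD40EFPX-68C6CN0_WD-NEW7654321
# with both disks attached, resilver from the old disk onto the new one
zpool replace tank ata-WDC_WD40EFRX-68N32N0_WD-OLD1234567 ata-WDC_WD40EFPX-68C6CN0_WD-NEW7654321
# once the resilver finishes, ZFS detaches the old disk automatically and it can be pulled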

Drives to use for a NAS backed by ZFS

I had a WD Red drive that started appearing as faulted in ZFS after only about three months of use. I did some research:

https://blog.westerndigital.com/wd-red-nas-drives/

https://www.reddit.com/r/synology/comments/upqty6/are_wd_reds_the_best_drives_for_a_new_nas_owner/

And found that I should stay away from plain WD Red, but WD Red Plus or Pro are okay. The difference is between shingled (SMR) and conventional (CMR) magnetic recording. Apparently shingled recording is bad for ZFS workloads, and the drive I had was an SMR drive.
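One way to check which kind you have (a sketch; smartctl comes from the smartmontools package and /dev/sda is just a placeholder) is to pull the drive’s model number and compare it against WD’s published CMR/SMR model lists:

# print drive identity info, including the exact model number
smartctl -i /dev/sda
# or list model and serial numbers for all drives at once
lsblk -o NAME,MODEL,SERIAL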

Almost catastrophic failure?

I ran into a scary scenario where one drive in a raidz1 pool faulted and another drive became unresponsive. The pool was automatically offlined, and I did a hard reboot of the server. When it booted, there was a loud whining noise, but that turned out to be a small fan in the case that was coming out of its enclosure.

After some investigation, I think I know what the issue was. I was running a large file transfer on two other disks, and I think the power consumption was too high and caused things to get messed up. When I redid the scrub, I also restarted the file transfer and the issue happened again… my disk drives are too power hungry and my power supply sucks!