ZFS: Maintenance

Replace failed drive of a ZFS pool

by ross at 12:53:21 on October 20, 2016

Prepare

We assume the pool is named system. The failed drive is da0, which holds two labeled GPT partitions: swap0 and system0.

First, take the failed drive's ZFS device offline:

# zpool offline system /dev/gpt/system0
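Before pulling the disk, it can be worth confirming that the device really went offline. The helper below is a minimal sketch, not part of the original procedure: device_state is a hypothetical function name, and it assumes the usual column layout of zpool status (device name first, state second).

```shell
# Print the STATE column for one device in a pool (for example ONLINE or OFFLINE).
# Sketch only: it relies on the column layout of "zpool status" output.
device_state() {
    pool=$1
    dev=$2
    # Match the row whose first field is the device name, print its state.
    zpool status "$pool" | awk -v d="$dev" '$1 == d { print $2; exit }'
}

# Example call (run as root on the affected machine):
#   device_state system gpt/system0
```

If the call prints anything other than OFFLINE, re-check the device name passed to zpool offline before going further.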

Replace the disk

If your hardware does not support hot-plugging, remove this disk's swap device from /etc/fstab, shut down, replace the disk, and boot again.

If it does support hot-plugging:

# swapoff /dev/gpt/swap0
# camcontrol stop da0

Now physically replace the drive.

Partition the new disk

Repeat the commands you used to create the partitions (this guide might help):

# gpart create -s gpt da0
# gpart add -b 34 -s 512k -t freebsd-boot da0
# gpart add -s 2G -t freebsd-swap -l swap0 da0
# swapon /dev/gpt/swap0
# gpart add -t freebsd-zfs -l system0 da0
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

Don't forget to add swap0 back to /etc/fstab.
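As a sketch of that last step, the function below appends a swap entry only if one is not already there. add_swap_entry is a hypothetical helper name, and the entry format (none / swap / sw / 0 / 0) is the conventional FreeBSD swap line; compare it with the line you removed earlier before relying on it.

```shell
# Append a swap entry for a device unless one is already present.
# The fstab path is a parameter so the function can be tried on a copy first.
add_swap_entry() {
    dev=$1
    fstab=${2:-/etc/fstab}
    # Only add the line if no entry for this device starts a line already.
    if ! grep -q "^${dev}[[:space:]]" "$fstab"; then
        printf '%s\tnone\tswap\tsw\t0\t0\n' "$dev" >> "$fstab"
    fi
}

# Example call (run as root):
#   add_swap_entry /dev/gpt/swap0
```

Running it twice is harmless, since the second call finds the existing entry and does nothing.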

Tell ZFS about the new drive

# zpool replace system /dev/gpt/system0

Checking the resilvering process

Resilvering is the process of copying existing data from the good mirror disk to the new disk. Do not reboot or shut down your server until the resilvering process is finished.
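If you want a script to block until it is safe to reboot, something like the following could work. It is a sketch under one assumption: that zpool status prints the phrase "resilver in progress" while the copy is running, as it does in the sample output on this page. The function names are made up for illustration.

```shell
# Succeeds (exit status 0) while the pool reports a running resilver.
resilver_running() {
    zpool status "$1" | grep -q 'resilver in progress'
}

# Poll the pool until the resilver is no longer reported as in progress.
# The second argument is the polling interval in seconds (default 60).
wait_for_resilver() {
    pool=$1
    interval=${2:-60}
    while resilver_running "$pool"; do
        sleep "$interval"
    done
}

# Example call (run as root):
#   wait_for_resilver system && echo "resilver finished"
```

This only tells you the resilver has stopped; check the final zpool status output for errors before trusting the pool.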

Run the following command to check the status of the resilvering process. The sample output below was captured after replacing the other mirrored disk, da1 (gpt/system1), so the device names differ from the steps above. Only 43 GB were in use when that disk was replaced, and, as the final status shows, resilvering took 20 minutes.

root@server ~ # zpool status -v system
pool: system
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
       continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Wed Sep 19 14:54:09 2012
   17.3G scanned out of 43.1G at 42.5M/s, 0h10m to go
   17.3G resilvered, 40.27% done
config:

       NAME                       STATE     READ WRITE CKSUM
       system                     DEGRADED     0     0     0
         mirror-0                 DEGRADED     0     0     0
           gpt/system0            ONLINE       0     0     0
           replacing-1            OFFLINE      0     0     0
             6740534297703757750  OFFLINE      0     0     0  was /dev/gpt/system1/old
             gpt/system1          ONLINE       0     0     0  (resilvering)

errors: No known data errors
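The "0h10m to go" estimate in this output can be checked by hand: remaining data divided by the current rate. The snippet below is only a rough sketch of that arithmetic; eta_minutes is a hypothetical helper, and it assumes G and M are binary units (1G = 1024M).

```shell
# Rough minutes remaining: (total - scanned) gigabytes at a rate in megabytes/s.
eta_minutes() {
    awk -v total="$1" -v scanned="$2" -v rate="$3" \
        'BEGIN { printf "%d\n", (total - scanned) * 1024 / rate / 60 }'
}

# Values taken from the status output above.
eta_minutes 43.1 17.3 42.5   # prints 10
```

The rate changes as the resilver proceeds, so treat the figure as an estimate rather than a deadline.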

If the resilvering completes normally, you should see something like this:

root@server ~ # zpool status -v system
pool: system
state: ONLINE
scan: resilvered 43.1G in 0h20m with 0 errors on Wed Sep 19 15:14:50 2012
config:

       NAME             STATE     READ WRITE CKSUM
       system           ONLINE       0     0     0
         mirror-0       ONLINE       0     0     0
           gpt/system0  ONLINE       0     0     0
           gpt/system1  ONLINE       0     0     0

errors: No known data errors

Comments

Great walk-through, thanks. I had to add the -f flag to `zpool replace`, but everything else was spot on.
-- c0w
Tuesday, April 8, 2014, 22:16:42