A Gentle Introduction to ZFS, Part 4: Optimizing and Maintaining a ZFS Storage Pool for NAS

Distributed RAID Theory

dRAID is the new hotness.
[Figure: Hard disk drive geometry. A sector is a small patch of physical space on a disk.]

Optimizing Hard Disks

[Figure: ZFS storage pool design with hot spare for 9 HDDs]

Simulate Drive Failures with QEMU

Buckle up, this is where the fun starts!
$ sudo apt install qemu-utils pv
$ qemu-img create -f qcow2 /tmp/0.qcow2 100M
$ qemu-img create -f qcow2 /tmp/1.qcow2 100M
$ qemu-img create -f qcow2 /tmp/2.qcow2 100M
$ qemu-img create -f qcow2 /tmp/3.qcow2 100M
$ qemu-img create -f qcow2 /tmp/4.qcow2 100M
$ qemu-img create -f qcow2 /tmp/5.qcow2 100M
$ qemu-img create -f qcow2 /tmp/6.qcow2 100M
$ qemu-img create -f qcow2 /tmp/7.qcow2 100M
$ qemu-img create -f qcow2 /tmp/8.qcow2 100M
$ sudo modprobe nbd max_part=32
$ sudo qemu-nbd -c /dev/nbd0 /tmp/0.qcow2
$ sudo qemu-nbd -c /dev/nbd1 /tmp/1.qcow2
$ sudo qemu-nbd -c /dev/nbd2 /tmp/2.qcow2
$ sudo qemu-nbd -c /dev/nbd3 /tmp/3.qcow2
$ sudo qemu-nbd -c /dev/nbd4 /tmp/4.qcow2
$ sudo qemu-nbd -c /dev/nbd5 /tmp/5.qcow2
$ sudo qemu-nbd -c /dev/nbd6 /tmp/6.qcow2
$ sudo qemu-nbd -c /dev/nbd7 /tmp/7.qcow2
$ sudo qemu-nbd -c /dev/nbd8 /tmp/8.qcow2
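The nine image files and NBD attachments above can also be produced with a single loop; this is just a convenience sketch that repeats the same steps (same /tmp paths, same 100M size, nbd module already loaded as shown):
$ for i in $(seq 0 8); do qemu-img create -f qcow2 /tmp/$i.qcow2 100M; sudo qemu-nbd -c /dev/nbd$i /tmp/$i.qcow2; done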
[Figure: ZFS storage pool design with hot spare for 9 HDDs]
$ sudo zpool create -f test draid1:3d:1s:9c /dev/nbd0 /dev/nbd1 /dev/nbd2 /dev/nbd3 /dev/nbd4 /dev/nbd5 /dev/nbd6 /dev/nbd7 /dev/nbd8
$ sudo chmod 0777 /test
$ pv < /dev/urandom > /test/urandom.bin
155MiB 0:00:29 [1.64MiB/s] [ <=> ]
pv: write failed: No space left on device
$ ls /test -lh
total 158M
-rw-rw-r-- 1 simeon simeon 158M Dec 30 00:24 urandom.bin
$ zpool status
  pool: test
 state: ONLINE
config:

        NAME                 STATE     READ WRITE CKSUM
        test                 ONLINE       0     0     0
          draid1:3d:9c:1s-0  ONLINE       0     0     0
            nbd0             ONLINE       0     0     0
            nbd1             ONLINE       0     0     0
            nbd2             ONLINE       0     0     0
            nbd3             ONLINE       0     0     0
            nbd4             ONLINE       0     0     0
            nbd5             ONLINE       0     0     0
            nbd6             ONLINE       0     0     0
            nbd7             ONLINE       0     0     0
            nbd8             ONLINE       0     0     0
        spares
          draid1-0-0         AVAIL

errors: No known data errors
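Before simulating any failures, it is worth checking how much usable space this layout actually provides, since both the parity and the distributed spare come out of the raw capacity. For example:
$ zpool list test
$ zfs list test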
$ sudo qemu-nbd -d /dev/nbd0
/dev/nbd0 disconnected
$ sudo qemu-nbd -d /dev/nbd1
/dev/nbd1 disconnected
$ sudo zpool scrub test
$ zpool status test
  pool: test
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:00:00 with 0 errors on Wed Dec 30 00:22:48 2020
config:

        NAME                 STATE     READ WRITE CKSUM
        test                 DEGRADED     0     0     0
          draid1:3d:9c:1s-0  DEGRADED     0     0     0
            nbd0             UNAVAIL      0     0     0  corrupted data
            nbd1             ONLINE       0     0     0
            nbd2             ONLINE       0     0     0
            nbd3             ONLINE       0     0     0
            nbd4             ONLINE       0     0     0
            nbd5             ONLINE       0     0     0
            nbd6             ONLINE       0     0     0
            nbd7             ONLINE       0     0     0
            nbd8             ONLINE       0     0     0
        spares
          draid1-0-0         AVAIL
errors: No known data errors
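In the steps below the failed child is replaced with a brand-new device. With dRAID there is another option: the distributed spare shown above as draid1-0-0 can be activated immediately, without attaching any new hardware, with something along the lines of:
$ sudo zpool replace test nbd0 draid1-0-0
Rebuilding onto a distributed spare is the main selling point of dRAID, because all surviving drives take part in the rebuild.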
$ qemu-img create -f qcow2 /tmp/9.qcow2 100M
$ sudo qemu-nbd -c /dev/nbd9 /tmp/9.qcow2
$ sudo zpool replace test /dev/nbd0 /dev/nbd9
$ zpool status
  pool: test
 state: ONLINE
  scan: resilvered 16.6M in 00:00:02 with 0 errors on Wed Dec 30 00:42:21 2020
config:

        NAME                 STATE     READ WRITE CKSUM
        test                 ONLINE       0     0     0
          draid1:3d:9c:1s-0  ONLINE       0     0     0
            nbd9             ONLINE       0     0     0
            nbd1             ONLINE       0     0     0
            nbd2             ONLINE       0     0     0
            nbd3             ONLINE       0     0     0
            nbd4             ONLINE       0     0     0
            nbd5             ONLINE       0     0     0
            nbd6             ONLINE       0     0     0
            nbd7             ONLINE       0     0     0
            nbd8             ONLINE       0     0     0
        spares
          draid1-0-0         AVAIL
errors: No known data errors
$ sudo su
# pv < /dev/urandom > /dev/nbd1
77.1MiB 0:00:01 [77.1MiB/s] [==============================================================================> ] 77% ETA 0:00:00
# exit
$ sudo zpool scrub test
$ zpool status
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Wed Dec 30 00:49:25 2020
        212M scanned at 52.9M/s, 212M issued at 52.9M/s, 212M total
        8.73M repaired, 100.00% done, 00:00:00 to go
config:

        NAME                 STATE     READ WRITE CKSUM
        test                 ONLINE       0     0     0
          draid1:3d:9c:1s-0  ONLINE       0     0     0
            nbd9             ONLINE       0     0     0
            nbd1             ONLINE       0     0   273  (repairing)
            nbd2             ONLINE       0     0     0
            nbd3             ONLINE       0     0     0
            nbd4             ONLINE       0     0     0
            nbd5             ONLINE       0     0     0
            nbd6             ONLINE       0     0     0
            nbd7             ONLINE       0     0     0
            nbd8             ONLINE       0     0     0
        spares
          draid1-0-0         AVAIL
errors: No known data errors
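Once the scrub has finished and you are satisfied that the drive itself is fine (here the damage was inflicted by hand), the checksum counters can be reset as the status message suggests:
$ sudo zpool clear test nbd1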
$ sudo su
# pv < /dev/urandom > /dev/nbd2
77.1MiB 0:00:01 [77.1MiB/s] [==============================================================================> ] 77% ETA 0:00:00
pv: write failed: No space left on device
# pv < /dev/urandom > /dev/nbd7
77.4MiB 0:00:01 [77.3MiB/s] [==============================================================================> ] 77% ETA 0:00:00
pv: write failed: No space left on device
# exit
$ sudo zpool scrub test
$ sudo zpool status -v
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 52.8M in 00:00:04 with 0 errors on Wed Dec 30 02:31:08 2020
config:

        NAME                 STATE     READ WRITE CKSUM
        test                 ONLINE       0     0     0
          draid1:3d:9c:1s-0  ONLINE       0     0     0
            nbd0             ONLINE       0     0     7
            nbd1             ONLINE       0     0     0
            nbd2             ONLINE       0     0   410
            nbd3             ONLINE       0     0     0
            nbd4             ONLINE       0     0     0
            nbd5             ONLINE       0     0     0
            nbd6             ONLINE       0     0     0
            nbd7             ONLINE       0     0   935
            nbd8             ONLINE       0     0     6
        spares
          draid1-0-0         AVAIL
errors: No known data errors
$ sudo su
# pv < /dev/urandom > /dev/nbd2
77.1MiB 0:00:01 [77.1MiB/s] [==============================================================================> ] 77% ETA 0:00:00
pv: write failed: No space left on device
# pv < /dev/urandom > /dev/nbd4
77.5MiB 0:00:01 [77.4MiB/s] [==============================================================================> ] 77% ETA 0:00:00
pv: write failed: No space left on device
# pv < /dev/urandom > /dev/nbd6
76.9MiB 0:00:01 [76.8MiB/s] [=============================================================================> ] 76% ETA 0:00:00
pv: write failed: No space left on device
# exit
$ sudo zpool scrub test
$ sudo zpool status -v
  pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 7.77M in 00:00:02 with 713 errors on Wed Dec 30 02:42:47 2020
config:

        NAME                 STATE     READ WRITE CKSUM
        test                 ONLINE       0     0     0
          draid1:3d:9c:1s-0  ONLINE       0     0     0
            nbd0             ONLINE       0     0   166
            nbd1             ONLINE       0     0   391
            nbd2             ONLINE       0     0   557
            nbd3             ONLINE       0     0     0
            nbd4             ONLINE       0     0   768
            nbd5             ONLINE       0     0     0
            nbd6             ONLINE       0     0   784
            nbd7             ONLINE       0     0   209
            nbd8             ONLINE       0     0   370
        spares
          draid1-0-0         AVAIL

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x3f>
        test:<0x0>
        /test/urandom.bin
$ rm /test/urandom.bin 
$ pv < /dev/urandom > /test/urandom.bin
156MiB 0:00:29 [1.11MiB/s] [ <=> ]
pv: write failed: No space left on device
$ sudo zpool status -v
  pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 7.77M in 00:00:02 with 713 errors on Wed Dec 30 02:42:47 2020
config:

        NAME                 STATE     READ WRITE CKSUM
        test                 ONLINE       0     0     0
          draid1:3d:9c:1s-0  ONLINE       0     0     0
            nbd0             ONLINE       0     0   166
            nbd1             ONLINE       0     0   391
            nbd2             ONLINE       0     0   557
            nbd3             ONLINE       0     0     0
            nbd4             ONLINE       0     0   768
            nbd5             ONLINE       0     0     0
            nbd6             ONLINE       0     0   784
            nbd7             ONLINE       0     0   209
            nbd8             ONLINE       0     0   370
        spares
          draid1-0-0         AVAIL

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x3f>
        test:<0x0>
        test:<0x2>
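When you are done experimenting, the simulated pool and devices can be torn down roughly as follows:
$ sudo zpool destroy test
$ for i in $(seq 0 9); do sudo qemu-nbd -d /dev/nbd$i; done
$ rm -f /tmp/{0..9}.qcow2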
[Figure: ZFS storage pool design without hot spare for 9 HDDs]
[Figure: RAID speed-up factor, Phi (higher is better), from Qiao et al.]
[Figure: Examples of RAID speed-up factors for various pool configurations]

Optimizing ZFS Parameters

$ sudo zpool create -f \
-m /tank \
-O atime=off \
-O canmount=on \
-O overlay=on \
-O checksum=skein \
-O compression=zstd-1 \
-O dedup=off \
-O filesystem_limit=none \
-O quota=none \
-O refquota=none \
-O refreservation=none \
-O reservation=none \
-O snapshot_limit=none \
-O sharenfs=off \
-O sharesmb=off \
tank \
draid3:6d:0s:9c \
wwn-0x0000000000000001 \
wwn-0x0000000000000002 \
wwn-0x0000000000000003 \
wwn-0x0000000000000004 \
wwn-0x0000000000000005 \
wwn-0x0000000000000006 \
wwn-0x0000000000000007 \
wwn-0x0000000000000008 \
wwn-0x0000000000000009
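After creating the pool, it is easy to confirm that the properties set at creation time took effect and that the dRAID layout looks as intended, for example:
$ zfs get atime,checksum,compression,dedup tank
$ zpool status tank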

Configuring Error Recovery Control

Each sector on a hard drive stores a sector number and an error-correcting code (ECC) alongside the data, which enables the drive to detect and respond to read errors such as bad sectors. When such an error occurs during a read, the drive will retry the read many times until it either succeeds or concludes that the data cannot be read. The latter case can take a substantial amount of time, and consequently I/O to the drive will stall.

Enterprise hard drives and some consumer hard drives implement a feature for this, called Time-Limited Error Recovery (TLER) by Western Digital, Error Recovery Control (ERC) by Seagate, and Command Completion Time Limit by Hitachi and Samsung. It permits the system administrator to limit the amount of time a drive will spend on such recovery attempts.

Drives that lack this functionality can be expected to have arbitrarily high limits; several minutes is not impossible. Drives that have it typically default to 7 seconds. ZFS does not currently adjust this setting on drives. However, until ZFS is modified to control it, it is advisable to write a script that sets the error recovery time to a low value, such as 0.1 seconds, and to run it on every boot. A sketch of such a script follows the smartctl examples below.

$ sudo smartctl -l scterc /dev/sda1
smartctl 6.6 2016-05-31 r4324 [aarch64-linux-4.9.140-tegra] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
$ sudo smartctl -l scterc,0,0 /dev/sda
smartctl 6.6 2016-05-31 r4324 [aarch64-linux-4.9.140-tegra] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control set to:
Read: Disabled
Write: Disabled
$ sudo smartctl -l scterc,0,0 /dev/sdd
smartctl 6.6 2016-05-31 r4324 [aarch64-linux-4.9.140-tegra] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control command not supported
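To apply the recommendation on every boot, a small script run from a systemd unit, rc.local or a @reboot cron entry will do. The sketch below is only an outline: it assumes the pool members appear under /dev/disk/by-id/wwn-* (adjust the pattern to your disks), sets the limit to 0.1 seconds (smartctl takes the value in tenths of a second, so 1 here; 70 would be 7 seconds), and drives that do not support the command, like /dev/sdd above, simply report that and are left alone.

#!/bin/bash
# Set SCT Error Recovery Control to 0.1 s on every whole disk found by the glob.
shopt -s nullglob
for disk in /dev/disk/by-id/wwn-*; do
    case "$disk" in *-part*) continue ;; esac    # skip partition device nodes
    smartctl -q errorsonly -l scterc,1,1 "$disk"
done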

ZFS Storage Pool Maintenance

$ sudo crontab -e
# scrub the pool during the night of the first day of every month
0 2 1 * * /usr/local/bin/zpool scrub tank
# report the pool status later the same day (cron mails the output if mail is configured)
0 13 1 * * /usr/local/bin/zpool status -v
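If you would rather hear about the pool only when something is wrong, the second entry can point at a small wrapper around zpool status -x, which prints "all pools are healthy" when there is nothing to report. A sketch (the script path and logger tag are made up):

#!/bin/bash
# /usr/local/sbin/zpool-health.sh -- print and log only when a pool has a problem
status="$(/usr/local/bin/zpool status -x)"
if [ "$status" != "all pools are healthy" ]; then
    echo "$status"                       # cron mails anything printed here, if mail is configured
    logger -t zpool-health "$status"     # also record it in the system log
fi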

Summary

Further Resources

