A Gentle Introduction to ZFS, Part 4: Optimizing and Maintaining a ZFS Storage Pool for NAS

Simeon Trieu
16 min read · Jan 3, 2021

In the previous articles, we built the ZFS kernel module to gain access to the new dRAID vdev, set up and mounted our ZFS storage pool, and then shared it over the network through NFS. Now we will optimize and maintain our ZFS storage pool, both to increase performance and to protect against data loss.

Distributed RAID Theory

dRAID is the new hotness.

Declustered RAID (dRAID) distributes parity across all disks in the pool, but to do this effectively the disks should ideally be the same size. dRAID uses a fixed stripe width and pads any missing data with zeros, so it is most space-efficient with drives of equal capacity.

Also, dRAID is better suited to storing large files than many small ones. This is because the fixed stripe width causes blocks to be allocated in units of D * S, where:

  • D is the number of data drives, and
  • S is the disk sector size. This is the smallest unit of accessible data from a disk. It’s common for HDDs these days to use the Advanced Format (AF) disk sector size of 4KB. Each sector is striped (matched) with a sector from another hard disk drive in the array.
Hard Disk Drive Geometry: Sectors are a small patch of physical space on a disk

Optimizing Hard Disks

For D = 6 and a 4KB sector size, that makes the minimum allocation size 6 * 4KB = 24KB. This means that if a file is 1KB, it will still take up 24KB of space. So, storage pools that store many small files will see a lot of wasted space due to the minimum allocation size. This is less of a problem for files much larger than 24KB.
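
If you are not sure what sector size your drives actually report, both lsblk and smartctl can tell you. A quick sketch, using a hypothetical /dev/sda as the target:

$ lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sda
$ sudo smartctl -i /dev/sda | grep -i 'sector size'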

Also, if the disks are not all the same capacity, then dRAID, like RAIDz{1,2,3} before it, treats every disk as if it were the size of the smallest one. For example, with a mix of 2TB, 4TB, 6TB, and 8TB drives, each disk contributes only 2TB, so the pool sees just 4 x 2TB = 8TB of raw capacity before parity is subtracted. Capacity of the ZFS storage pool is therefore traded off against parity (safety), and mixed drive sizes waste space.

ZFS Storage Pool Design with Hot Spare for 9 HDDs

Finally, dRAID can provide the same level of redundancy and performance as RAIDz{1,2,3}, while also providing a fast, integrated, distributed hot spare. Before dRAID, a hot spare simply sat idle; because the “spare” capacity is now distributed across the pool, just like the data and parity allocation units, all drives can participate in driving I/O bandwidth instead of lying dormant.

Simulate Drive Failures with QEMU

QEMU Logo

It’s often helpful to practice on virtual disks first before using real physical disks. We’ll do this via Quick EMUlator (QEMU) tools. QEMU is a generic and open source machine emulator and virtualizer. We’ll be using it to analyze two case studies: a dual replication group storage pool with hot spare and a striped group with triple parity and no hot spare. Hang tight and buckle up; this is where it gets fun!

Buckle up, this is where the fun starts!

Install QEMU and dependencies for this exercise:

$ sudo apt install qemu-utils pv

Create 9 x 100MB images:

$ qemu-img create -f qcow2 /tmp/0.qcow2 100M
$ qemu-img create -f qcow2 /tmp/1.qcow2 100M
$ qemu-img create -f qcow2 /tmp/2.qcow2 100M
$ qemu-img create -f qcow2 /tmp/3.qcow2 100M
$ qemu-img create -f qcow2 /tmp/4.qcow2 100M
$ qemu-img create -f qcow2 /tmp/5.qcow2 100M
$ qemu-img create -f qcow2 /tmp/6.qcow2 100M
$ qemu-img create -f qcow2 /tmp/7.qcow2 100M
$ qemu-img create -f qcow2 /tmp/8.qcow2 100M

Next, we need to make virtual block devices with qemu-nbd (the QEMU Disk Network Block Device Server). First, let’s make sure the kernel module is loaded (with a limit on the number of partitions that can be created as a safeguard):

$ sudo modprobe nbd max_part=32

Create the virtual block devices:

$ sudo qemu-nbd -c /dev/nbd0 /tmp/0.qcow2
$ sudo qemu-nbd -c /dev/nbd1 /tmp/1.qcow2
$ sudo qemu-nbd -c /dev/nbd2 /tmp/2.qcow2
$ sudo qemu-nbd -c /dev/nbd3 /tmp/3.qcow2
$ sudo qemu-nbd -c /dev/nbd4 /tmp/4.qcow2
$ sudo qemu-nbd -c /dev/nbd5 /tmp/5.qcow2
$ sudo qemu-nbd -c /dev/nbd6 /tmp/6.qcow2
$ sudo qemu-nbd -c /dev/nbd7 /tmp/7.qcow2
$ sudo qemu-nbd -c /dev/nbd8 /tmp/8.qcow2
  • -c connects a file to a virtual NBD block device
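
As an aside, the nine create/connect pairs above can be condensed into a loop; a minimal sketch, assuming the same /tmp image paths and /dev/nbdN devices:

$ for i in $(seq 0 8); do qemu-img create -f qcow2 /tmp/$i.qcow2 100M; sudo qemu-nbd -c /dev/nbd$i /tmp/$i.qcow2; done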

Create the virtual ZFS storage pool (3 data and 1 parity per replication group and 1 hot spare) and fill it with random bits. We’ll use /dev/urandom to generate the random bits and use pv to observe the pipe.

Case Study 1: dRAID Configuration (2 replication groups, 3 data, 1 parity, 1 hot spare)

ZFS Storage Pool Design with Hot Spare for 9 HDDs

We’ll study two configurations to see which works best for 9 hard disk drives (HDD). The first has the following configuration:

  • 2 replication groups
  • 3 data drives per replication group
  • 1 parity drive per replication group
  • 1 hot spare

Again, we’ll be using QEMU NBD to simulate virtual disk drives to test out various ZFS storage pool configurations.

$ sudo zpool create -f test draid1:3d:1s:9c /dev/nbd0 /dev/nbd1 /dev/nbd2 /dev/nbd3 /dev/nbd4 /dev/nbd5 /dev/nbd6 /dev/nbd7 /dev/nbd8
$ sudo chmod 0777 /test
$ pv < /dev/urandom > /test/urandom.bin
155MiB 0:00:29 [1.64MiB/s] [ <=> ]
pv: write failed: No space left on device
$ ls /test -lh
total 158M
-rw-rw-r-- 1 simeon simeon 158M Dec 30 00:24 urandom.bin
$ zpool status
  pool: test
 state: ONLINE
config:

        NAME                 STATE     READ WRITE CKSUM
        test                 ONLINE       0     0     0
          draid1:3d:9c:1s-0  ONLINE       0     0     0
            nbd0             ONLINE       0     0     0
            nbd1             ONLINE       0     0     0
            nbd2             ONLINE       0     0     0
            nbd3             ONLINE       0     0     0
            nbd4             ONLINE       0     0     0
            nbd5             ONLINE       0     0     0
            nbd6             ONLINE       0     0     0
            nbd7             ONLINE       0     0     0
            nbd8             ONLINE       0     0     0
        spares
          draid1-0-0         AVAIL

errors: No known data errors

Now that we’ve created the virtual storage pool and filled it with random bits, let’s simulate a complete drive failure:

$ sudo qemu-nbd -d /dev/nbd0
/dev/nbd0 disconnected
$ sudo zpool scrub test
$ zpool status test
  pool: test
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:00:00 with 0 errors on Wed Dec 30 00:22:48 2020
config:

        NAME                 STATE     READ WRITE CKSUM
        test                 DEGRADED     0     0     0
          draid1:3d:9c:1s-0  DEGRADED     0     0     0
            nbd0             UNAVAIL      0     0     0  corrupted data
            nbd1             ONLINE       0     0     0
            nbd2             ONLINE       0     0     0
            nbd3             ONLINE       0     0     0
            nbd4             ONLINE       0     0     0
            nbd5             ONLINE       0     0     0
            nbd6             ONLINE       0     0     0
            nbd7             ONLINE       0     0     0
            nbd8             ONLINE       0     0     0
        spares
          draid1-0-0         AVAIL

errors: No known data errors

Notice that the virtual drive nbd0 is unavailable but isn’t being replaced by the hot spare. That’s because ZFS only engages a hot spare automatically when data errors are detected, not when a drive disappears completely. In a real-life scenario, we would physically swap in a new disk and then tell ZFS to replace the failed device. We’ll simulate the replacement by creating a new virtual block device, /dev/nbd9, and using it to replace /dev/nbd0:

$ qemu-img create -f qcow2 /tmp/9.qcow2 100M
$ sudo qemu-nbd -c /dev/nbd9 /tmp/9.qcow2
$ sudo zpool replace test /dev/nbd0 /dev/nbd9
$ zpool status
  pool: test
 state: ONLINE
  scan: resilvered 16.6M in 00:00:02 with 0 errors on Wed Dec 30 00:42:21 2020
config:

        NAME                 STATE     READ WRITE CKSUM
        test                 ONLINE       0     0     0
          draid1:3d:9c:1s-0  ONLINE       0     0     0
            nbd9             ONLINE       0     0     0
            nbd1             ONLINE       0     0     0
            nbd2             ONLINE       0     0     0
            nbd3             ONLINE       0     0     0
            nbd4             ONLINE       0     0     0
            nbd5             ONLINE       0     0     0
            nbd6             ONLINE       0     0     0
            nbd7             ONLINE       0     0     0
            nbd8             ONLINE       0     0     0
        spares
          draid1-0-0         AVAIL

errors: No known data errors

Notice that the drive has been replaced, resilvered, and brought back ONLINE. This is what a real-world drive replacement would look like. But what happens when a block device is corrupted rather than lost? Let’s simulate that condition on /dev/nbd1:

$ sudo su
# pv < /dev/urandom > /dev/nbd1
77.1MiB 0:00:01 [77.1MiB/s] [==============================================================================> ] 77% ETA 0:00:00
$ exit
$ sudo zpool scrub test
$ zpool status
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Wed Dec 30 00:49:25 2020
        212M scanned at 52.9M/s, 212M issued at 52.9M/s, 212M total
        8.73M repaired, 100.00% done, 00:00:00 to go
config:

        NAME                 STATE     READ WRITE CKSUM
        test                 ONLINE       0     0     0
          draid1:3d:9c:1s-0  ONLINE       0     0     0
            nbd9             ONLINE       0     0     0
            nbd1             ONLINE       0     0   273  (repairing)
            nbd2             ONLINE       0     0     0
            nbd3             ONLINE       0     0     0
            nbd4             ONLINE       0     0     0
            nbd5             ONLINE       0     0     0
            nbd6             ONLINE       0     0     0
            nbd7             ONLINE       0     0     0
            nbd8             ONLINE       0     0     0
        spares
          draid1-0-0         AVAIL

errors: No known data errors

Notice that nbd1 is being repaired. Let’s try corrupting two virtual block devices:

$ sudo su
# pv < /dev/urandom > /dev/nbd2
77.1MiB 0:00:01 [77.1MiB/s] [==============================================================================> ] 77% ETA 0:00:00
pv: write failed: No space left on device
# pv < /dev/urandom > /dev/nbd7
77.4MiB 0:00:01 [77.3MiB/s] [==============================================================================> ] 77% ETA 0:00:00
pv: write failed: No space left on device
$ exit
$ sudo zpool scrub test
$ sudo zpool status -v
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 52.8M in 00:00:04 with 0 errors on Wed Dec 30 02:31:08 2020
config:

        NAME                 STATE     READ WRITE CKSUM
        test                 ONLINE       0     0     0
          draid1:3d:9c:1s-0  ONLINE       0     0     0
            nbd0             ONLINE       0     0     7
            nbd1             ONLINE       0     0     0
            nbd2             ONLINE       0     0   410
            nbd3             ONLINE       0     0     0
            nbd4             ONLINE       0     0     0
            nbd5             ONLINE       0     0     0
            nbd6             ONLINE       0     0     0
            nbd7             ONLINE       0     0   935
            nbd8             ONLINE       0     0     6
        spares
          draid1-0-0         AVAIL

errors: No known data errors

We got lucky. In previous runs I’ve corrupted two drives and seen both become UNAVAIL. Let’s try corrupting three virtual drives:

$ sudo su
# pv < /dev/urandom > /dev/nbd2
77.1MiB 0:00:01 [77.1MiB/s] [==============================================================================> ] 77% ETA 0:00:00
pv: write failed: No space left on device
# pv < /dev/urandom > /dev/nbd4
77.5MiB 0:00:01 [77.4MiB/s] [==============================================================================> ] 77% ETA 0:00:00
pv: write failed: No space left on device
# pv < /dev/urandom > /dev/nbd6
76.9MiB 0:00:01 [76.8MiB/s] [=============================================================================> ] 76% ETA 0:00:00
pv: write failed: No space left on device
$ exit
$ sudo zpool scrub test
$ sudo zpool status -v
  pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 7.77M in 00:00:02 with 713 errors on Wed Dec 30 02:42:47 2020
config:

        NAME                 STATE     READ WRITE CKSUM
        test                 ONLINE       0     0     0
          draid1:3d:9c:1s-0  ONLINE       0     0     0
            nbd0             ONLINE       0     0   166
            nbd1             ONLINE       0     0   391
            nbd2             ONLINE       0     0   557
            nbd3             ONLINE       0     0     0
            nbd4             ONLINE       0     0   768
            nbd5             ONLINE       0     0     0
            nbd6             ONLINE       0     0   784
            nbd7             ONLINE       0     0   209
            nbd8             ONLINE       0     0   370
        spares
          draid1-0-0         AVAIL

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x3f>
        test:<0x0>
        /test/urandom.bin

Okay, now we’re seeing both metadata and data corruption. I expected three drive failures to compromise the integrity of the pool, or even take it down entirely, so I’m pleasantly surprised to see it still online, albeit with some permanent errors. Let’s try to restore the urandom.bin file from “backup”:

$ rm /test/urandom.bin 
$ pv < /dev/urandom > /test/urandom.bin
156MiB 0:00:29 [1.11MiB/s] [ <=> ]
pv: write failed: No space left on device
$ sudo zpool status -v
  pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 7.77M in 00:00:02 with 713 errors on Wed Dec 30 02:42:47 2020
config:

        NAME                 STATE     READ WRITE CKSUM
        test                 ONLINE       0     0     0
          draid1:3d:9c:1s-0  ONLINE       0     0     0
            nbd0             ONLINE       0     0   166
            nbd1             ONLINE       0     0   391
            nbd2             ONLINE       0     0   557
            nbd3             ONLINE       0     0     0
            nbd4             ONLINE       0     0   768
            nbd5             ONLINE       0     0     0
            nbd6             ONLINE       0     0   784
            nbd7             ONLINE       0     0   209
            nbd8             ONLINE       0     0   370
        spares
          draid1-0-0         AVAIL

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x3f>
        test:<0x0>
        test:<0x2>

So, I’m still able to write to the virtual storage pool, but the pool metadata is still corrupted.

From these results, I would conclude that parity drives > hot spares for home NAS purposes. I could not get the hot spare to trigger, and more problems arise from a lack of parity, such as permanent errors in metadata and file data.
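
Before experimenting with other layouts, the virtual pool can be torn down and the NBD devices released. A cleanup sketch, assuming the same names used above:

$ sudo zpool destroy test
$ for i in $(seq 0 9); do sudo qemu-nbd -d /dev/nbd$i; done
$ rm /tmp/{0..9}.qcow2

Disconnecting a device that was already disconnected should be harmless; qemu-nbd simply reports it as disconnected.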

Case Study 2: dRAID Configuration (1 replication group, 6 data, 3 parity, 0 hot spares)

ZFS Storage Pool without Hot Spare for 9 HDDs

9 HDDs can also be configured as 6 data and 3 parity. dRAID will still provide the benefits of declustering (the “d” in dRAID) and sequential resilvering. Giving up the spare and having only a single replication group will, however, make resilvering take longer (almost three times as long). Qiao et al. explored dRAID and derived a simple formula for the RAID speed-up as a function of replication groups (k), data drives (d), parity drives (p), hot spares (s), and failed drives (f). The authors also define a dRAID overhead factor (H), but the overhead does not appear to be a significant factor when comparing different dRAID configurations.

RAID speed-up factor, Phi (higher is better), from Qiao et al.
Examples of RAID Speed Up Factors for Various Pool Configurations

One of the reasons to choose multiple replication groups is the speed increase when resilvering (and during the subsequent scrub). Because parity is declustered, every drive can participate instead of just a few parity drives, and adding replication groups increases the speed dramatically. The dRAID syntax for the Case 1 configuration is draid1:3d:1s:9c, which works out to a speed-up factor of 2.40. The Case 2 configuration, draid3:6d:0s:9c, has a speed-up of only 0.9, roughly 2.7 times slower than Case 1.

What do we get from Case 2’s configuration? Not extra space: both layouts expose six data drives’ worth of usable capacity (Case 1 has two groups of three data drives; Case 2 has one group of six). What we gain is safety: triple parity means any three drives can fail without data loss, versus one per replication group in Case 1. Both configurations should deliver similar data access speeds, especially since the drives are reached over Ethernet, so the extra safety at no cost in capacity is a good deal.
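
As a rough sanity check on the capacity claim, ignoring dRAID padding and metadata overhead and assuming hypothetical 4TB drives, usable capacity is approximately (children - spares) x data / (data + parity) x drive size:

$ awk 'BEGIN { tb=4; printf "draid1:3d:1s:9c: %.0f TB usable\n", (9-1)*3/(3+1)*tb; printf "draid3:6d:0s:9c: %.0f TB usable\n", 9*6/(6+3)*tb }'
draid1:3d:1s:9c: 24 TB usable
draid3:6d:0s:9c: 24 TB usable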

Optimizing ZFS Parameters

ZFS has accumulated a large number of tunable parameters over its roughly two decades of development. There are many more than listed here, but these are some I recommend setting explicitly:

sudo zpool create -f \
-m /tank \
-O atime=off \
-O canmount=on \
-O overlay=on \
-O checksum=skein \
-O compression=zstd-1 \
-O dedup=off \
-O filesystem_limit=none \
-O quota=none \
-O refquota=none \
-O refreservation=none \
-O reservation=none \
-O snapshot_limit=none \
-O sharenfs=off \
-O sharesmb=off \
tank \
draid3:6d:0s:9c \
wwn-0x0000000000000001 \
wwn-0x0000000000000002 \
wwn-0x0000000000000003 \
wwn-0x0000000000000004 \
wwn-0x0000000000000005 \
wwn-0x0000000000000006 \
wwn-0x0000000000000007 \
wwn-0x0000000000000008 \
wwn-0x0000000000000009

There’s a lot there, so let’s take this one section at a time:

$ sudo zpool create -f \
-m /tank \

This creates a ZFS storage pool. The -f flag forces creation even if the devices appear to contain existing partitions or ZFS labels, and -m /tank mounts the storage pool at /tank.

-O atime=off \
-O canmount=on \
-O overlay=on \
  • atime=off turns off recording of file access times, so reading a file no longer triggers a follow-up metadata write, saving I/O.
  • canmount=on allows the ZFS storage pool to be mounted to the local file system.
  • overlay=on allows the ZFS storage pool to be mounted even if the directory isn’t empty.
-O checksum=skein \
-O compression=zstd-1 \
-O dedup=off \
  • checksum=skein uses the Skein hashing algorithm by Schneier et al. (2010), which is roughly 80% faster than the previous SHA-256 default without sacrificing security the way Edon-R does.
  • compression=zstd-1 uses Facebook’s zstandard compression algorithm at the lightest compression level (zstd-1). This is to reduce CPU load on read/write access while still providing good compression.
  • dedup=off disables deduplication. Dedup keeps a table of block checksums in memory so that identical blocks are stored only once, which helps when data is heavily duplicated. Blocks of large media files like photos and video rarely repeat, though, and the dedup table consumes RAM that is scarce on an embedded system, so I disable it and leave that memory to the ARC instead.
-O filesystem_limit=none \
-O quota=none \
-O refquota=none \
-O refreservation=none \
-O reservation=none \
-O snapshot_limit=none \

These are all limits, minimums, and quotas for various pool features, and I will not be limiting my pool in any way. I state the parameters explicitly because the documentation didn’t make it clear to me what the defaults are.

-O sharenfs=off \
-O sharesmb=off \

In my previous article on ZFS and NFS, we exported the share with the standalone NFS server and fstab, simply mounting the ZFS file system locally rather than managing NFS through ZFS, so both sharing properties stay off here.

tank \
draid3:6d:0s:9c \
wwn-0x0000000000000001 \
wwn-0x0000000000000002 \
wwn-0x0000000000000003 \
wwn-0x0000000000000004 \
wwn-0x0000000000000005 \
wwn-0x0000000000000006 \
wwn-0x0000000000000007 \
wwn-0x0000000000000008 \
wwn-0x0000000000000009
  • tank is the name of the storage pool.
  • draid3:6d:0s:9c is the dRAID configuration. It means triple parity, 6 data drives, 0 hot spares, and 9 children (vdevs) in total.
  • wwn-0x000000000000000{1–9} is the World Wide Name (WWN) of each drive. Unlike /dev/sdX names, a WWN is unique and won’t change across reboots. A partition can also be used here if necessary, but placing multiple partitions from the same drive in the pool is generally frowned upon, since a single drive failure would then take out multiple vdevs at once, increasing the risk of data loss.
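
Once the pool has been created, it’s worth confirming that the properties took effect, for example:

$ zfs get atime,checksum,compression,dedup tank
$ zpool status tank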

Configuring Error Recovery Control

When a hard drive has trouble reading a sector, it may retry the read many times, or eventually time out and conclude that the sector can’t be read at all. Since ZFS already has parity for error correction, this in-drive recovery is redundant and slow, because I/O to the drive stalls while error recovery control (ERC) takes place:

The sector number and ECC enables hard drives to detect and respond to such events. When either event occurs during a read, hard drives will retry the read many times until they either succeed or conclude that the data cannot be read. The latter case can take a substantial amount of time and consequently, IO to the drive will stall.

Enterprise hard drives and some consumer hard drives implement a feature called Time-Limited Error Recovery (TLER) by Western Digital, Error Recovery Control (ERC) by Seagate and Command Completion Time Limit by Hitachi and Samsung, which permits the time drives are willing to spend on such events to be limited by the system administrator.

Drives that lack such functionality can be expected to have arbitrarily high limits. Several minutes is not impossible. Drives with this functionality typically default to 7 seconds. ZFS does not currently adjust this setting on drives. However, it is advisable to write a script to set the error recovery time to a low value, such as 0.1 seconds until ZFS is modified to control it. This must be done on every boot.

So, we will tune the ERC to 0.1s per drive, if possible, as recommended in the ZFS documentation. But first, let’s check what the current setting is:

$ sudo smartctl -l scterc /dev/sda
smartctl 6.6 2016-05-31 r4324 [aarch64-linux-4.9.140-tegra] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)

As it turns out, at least for my drives, I’m going to disable ERC, since the drive firmware allows exactly two settings: 7 seconds or disabled:

$ sudo smartctl -l scterc,0,0 /dev/sda
smartctl 6.6 2016-05-31 r4324 [aarch64-linux-4.9.140-tegra] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control set to:
Read: Disabled
Write: Disabled

Do this for all drives, if it’s supported. Sometimes, it’s not:

$ sudo smartctl -l scterc,0,0 /dev/sdd
smartctl 6.6 2016-05-31 r4324 [aarch64-linux-4.9.140-tegra] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control command not supported
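
Since the setting does not persist across reboots, the OpenZFS documentation recommends scripting it, for example from a cron @reboot entry or a systemd unit. A minimal sketch; the drive list is an example and should be replaced with your pool members:

#!/bin/sh
# Cap the drive error recovery timer at boot. A value of 1 means 0.1 seconds;
# use scterc,0,0 instead on firmware that only offers 7 seconds or disabled.
for drive in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    smartctl -l scterc,1,1 "$drive"
done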

ZFS Storage Pool Maintenance

A scrub is when ZFS reads every block in the storage pool, verifies its checksum, and corrects data as necessary. It’s recommended to scrub the pool once a month. We don’t need to remember to do this; instead, set up a cron job:

$ crontab -e

Enter the cron scrub job (replace tank with your ZFS storage pool name). The second entry reports the pool status later the same day, after the scrub has had time to run:

# zpool scrub every month
0 2 1 * * /usr/local/bin/zpool scrub tank
0 13 1 * * /usr/local/bin/zpool status -v
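
Between the monthly scrubs, zpool status -x offers a quick health check: it prints details only for pools with problems, and otherwise reports that all pools are healthy:

$ zpool status -x
all pools are healthy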

Summary

  • Learned about declustered RAID theory.
  • Optimized hard drive size, quantity, and topology.
  • Simulated drive failures for various ZFS storage pool topologies with QEMU NBD.
  • Configured and optimized the ZFS storage pool.
  • Reduced or disabled the Error Recovery Control for each hard disk drive to speed up access and let ZFS take over error correction.
  • Set up a cron job to scrub the ZFS storage pool every month.

Further Resources
