[2018.06.14]
The goal of this setup is to provide:
High-Availability (HA) with genuinely no Single Point Of Failure (SPOF), including the storage resources (disks) themselves, as opposed to multipath iSCSI setups, where the hosting bays (or SATA/SAS backplanes, or … whatever) end up being SPOFs (however unlikely).
In such a way as to limit re-synchronization between peers as much as possible, especially in the (likely) event of one peer becoming unavailable (e.g. during a reboot).
The idea of layering ZFS on top of DRBD spawns from several considerations:
The opposite setup - DRBD on top of ZFS - requires dealing with ZVOLs, which come with their own hindrances (see below), as well as a mandatory additional filesystem layer on top of the DRBD resource, which ought to be ZFS again to benefit from its data integrity guarantees (checksums) where they are most needed.
DRBD is very efficient at managing de-/re-synchronization of data blocks, thanks to its block-change tracking bitmap. Thus, when resynchronization is needed, only the desynchronized blocks are transferred between peers - as opposed to ZFS resilvering, which must go through all used data blocks to resynchronize a device should it become (even slightly) desynchronized.
OpenZFS does not support hybrid raidz/mirror VDEVs, thus invalidating iSCSI as an alternative to DRBD, where each mirror VDEV would be constituted of the local disk and the remote iSCSI-shared disk. Even if such a setup were possible, it would suffer from ZFS resilvering being suboptimal compared to DRBD resynchronization.
Having ZFS at the very top, as the filesystem, allows one to benefit optimally from its many advantages: checksums, snapshots, etc.
The ZVOL hindrances hinted at in the first point above are:
ZVOLs have their size fixed at creation time.
ZVOL used/free space may not match the used/free space of the layer(s) above if TRIM is not fully supported (by all layers). This may in turn severely inflate snapshot sizes (and consume available capacity much faster than anticipated).
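For illustration, the fixed-size hindrance looks like the following (the `tank/vol0` ZVOL name and sizes are hypothetical):

```shell
# a ZVOL's size must be chosen at creation time ...
zfs create -V 100G tank/vol0

# ... and can only be changed afterwards by explicitly resizing it
# (the layers stacked above must then cope with the size change themselves)
zfs set volsize=200G tank/vol0
```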
For the sake of consistency in the following chapters, we’ll refer to:
Local resources/devices/disks as the ones located on the peer where the DRBD resources are Primary and the ZFS pool imported.
Remote resources/devices/disks as the ones on the other peer, namely where DRBD resources are all Secondary and the ZFS pool exported.
Setting up ZFS on top of DRBD is rather straightforward, and works perfectly well until one starts testing its high availability and resilience to failures.
ZFS on top of DRBD will fail miserably - a full system lockup - should a local disk fail, even though DRBD is supposed to continue operating using the (still healthy) remote disk.
The key to having ZFS on top of DRBD work even when failures start to show up is to make sure to detach both the local and the remote DRBD resource in case of I/O errors.
Only then will ZFS be aware of the underlying DRBD resource I/O error (be it local or remote), mark the corresponding ZPOOL device FAULTED, and continue operation thanks to its redundant VDEV (raidz or mirror).
In order to heed the above warning, DRBD must use a custom on-io-error handler (script). Note the big WARNING (again):
## !!! WARNING !!! WARNING !!! WARNING !!! WARNING !!! WARNING !!! WARNING !!!
#
# ??? DRBD bug <-???-> ZFS bug ???
# DRBD resources MUST be detached on both peers to prevent total failure
# of all network interfaces, and all other DRBD resources, and eventual
# full system deadlock.
# ??? DRBD bug <-???-> ZFS bug ???
#
# Also, DRBD 'on-io-error' handler MUST complete (exit) for the I/O error
# to be propagated to ZFS. Any ZFS transaction group (TXG) WILL stall until
# it does, with corresponding process (txg_sync) deadlock! As well as any
# other attempted ZFS command!
#
## !!! WARNING !!! WARNING !!! WARNING !!! WARNING !!! WARNING !!! WARNING !!!
This handler achieves its purpose in four steps:
Detach the DRBD resource on both the local and remote peers (the latter via SSH). NO ZFS commands are executed whatsoever, to avoid deadlocking with txg_sync
Call itself again - asynchronously (forking) - with different arguments, to carry on with the ZFS commands that are to be executed after txg_sync resumes
Exit and allow DRBD to continue processing the error and pass it on to the upper layer - namely ZFS - and allow txg_sync to resume
The forked sibling will be able to carry on - along with its required ZFS commands - now that txg_sync is back in business
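The four steps above can be sketched as a minimal handler. This is only a sketch: the pool name (tank) and spare (/dev/drbd11) are taken from the zpool layout shown further below, while the peer hostname and device mapping are assumptions; a real handler would add logging, error checking and spare-vs-data-device handling.

```shell
#!/bin/sh
# Hypothetical sketch of /etc/drbd.d/drbd+zfs-io-error.sh
RES="${DRBD_RESOURCE:-drbd0}"   # set by DRBD when invoking the handler
PEER="${2:-peer-host}"          # assumption: passed as '--remote <peer>'

detach_both() {
    # Step 1: detach the resource on BOTH peers; NO ZFS commands here,
    # to avoid deadlocking with txg_sync
    drbdadm detach "$RES"
    ssh "$PEER" drbdadm detach "$RES"
}

fault_device() {
    # Step 4 (runs in the forked sibling, once txg_sync is back):
    # fault the corresponding ZPOOL device and activate the spare
    zpool offline -f tank "/dev/$RES"
    zpool replace tank "/dev/$RES" /dev/drbd11
}

main() {
    case "${1:-drbd-detach}" in
        drbd-detach)
            detach_both
            main zfs-offline &   # step 2: fork the ZFS part ...
            ;;                   # step 3: ... and exit, so DRBD can
                                 # propagate the I/O error up to ZFS
        zfs-offline)
            fault_device
            ;;
    esac
}

if [ -n "${DRBD_RESOURCE:-}" ]; then
    main "$@"   # only when actually invoked by DRBD
fi
```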
To enable the handler, DRBD must be configured as per the following snippets:
handlers {
    local-io-error "/etc/drbd.d/drbd+zfs-io-error.sh drbd-detach --remote <peer> | mail -s '[DRBD/ZFS] I/O Error' root";
}
disk {
    on-io-error call-local-io-error;
}
Given our setup specifics:
The DRBD “synchronization unit” is 4KiB - matching ZFS ashift=12 - and corresponds to a single bit in the block-change tracking bitmap.
The HA stack being eventually responsible for importing/exporting the ZFS pool(s), we must make sure to use a ZPOOL cache file that will not result in the system automatically attempting to import the pool at boot (e.g. through the systemd zfs-import-cache service).
zpool create \
    -o cachefile=/etc/ha.d/zpool-tank.cache \
    -o ashift=12 \
    tank \
    raidz3 /dev/drbd{0..10} \
    spare /dev/drbd11
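The HA stack can then bring the pool up and down using that dedicated cache file; a sketch of what its start/stop actions boil down to (the exact commands depend on the resource agent implementation):

```shell
# start: import the pool from its dedicated cache file
# (and keep the cachefile property pointing at it, away from the default)
zpool import -c /etc/ha.d/zpool-tank.cache -o cachefile=/etc/ha.d/zpool-tank.cache tank

# stop: cleanly export the pool before the other peer takes over
zpool export tank

# and make sure the system-wide import service is out of the way
systemctl disable zfs-import-cache.service
```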
The standard Corosync+Pacemaker Linux HA stack does not provide the OCF Resource Agents (RA) required to properly manage a ZFS on top of DRBD setup:
Multiple DRBD resources must be managed as one primitive, where a given number of failed individual DRBD resources shall not be considered a global failure (since ZFS VDEV redundancy is there to handle the corresponding failed devices).
Importing/exporting the ZPOOL must also be managed as an HA primitive, along with the underlying DRBD resources (primitive).
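As a sketch, the "tolerated failures" logic of such a multi-resource agent might look as follows (the resource list and the OCF return-code mapping are simplified assumptions):

```shell
# Hypothetical monitor logic: tolerate up to ALLOW_NONSTARTED unhealthy
# DRBD resources before reporting the whole primitive as failed
ALLOW_NONSTARTED=3
RESOURCES="drbd0 drbd1 drbd2 drbd3"

count_unhealthy() {
    unhealthy=0
    for res in $RESOURCES; do
        # 'drbdadm dstate' prints e.g. 'UpToDate/UpToDate' when healthy
        case "$(drbdadm dstate "$res" 2>/dev/null)" in
            UpToDate/*) ;;                        # local disk healthy
            *) unhealthy=$((unhealthy + 1)) ;;
        esac
    done
    echo "$unhealthy"
}

monitor() {
    # ZFS VDEV redundancy covers up to ALLOW_NONSTARTED failed devices
    [ "$(count_unhealthy)" -le "$ALLOW_NONSTARTED" ]   # 0 = OCF_SUCCESS
}
```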
Thus the following custom RAs:
Which are to be used in Pacemaker as:
<resources>
  <group id="ZPool-tank">
    <primitive id="ZPool-tank-drbd" class="ocf" provider="custom" type="DrbdMultiResources">
      <instance_attributes id="ZPool-tank-drbd-IA">
        <nvpair id="ZPool-tank-drbd-IA-resources" name="resources" value="drbd{0..11}"/>
        <nvpair id="ZPool-tank-drbd-IA-allow_nonstarted" name="allow_nonstarted" value="3"/>
      </instance_attributes>
      <operations>
        <op id="ZPool-tank-drbd-OP-monitor" name="monitor" interval="60s" timeout="10s"/>
      </operations>
    </primitive>
    <primitive id="ZPool-tank-zpool" class="ocf" provider="custom" type="ZfsPool">
      <instance_attributes id="ZPool-tank-zpool-IA">
        <nvpair id="ZPool-tank-zpool-IA-pool" name="pool" value="tank"/>
        <nvpair id="ZPool-tank-zpool-IA-cachefile" name="cachefile" value="/etc/ha.d/zpool-tank.cache"/>
      </instance_attributes>
      <operations>
        <op id="ZPool-tank-zpool-OP-monitor" name="monitor" interval="60s" timeout="10s"/>
      </operations>
    </primitive>
  </group>
</resources>
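Such a fragment can be loaded into the cluster configuration, for instance (the file name is an assumption):

```shell
# replace the resources section of the CIB with the XML fragment above
cibadmin --replace --scope resources --xml-file /etc/ha.d/zpool-tank-resources.xml

# and check that the group comes up
crm_mon -1
```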
The very first rule that one must abide by:
Thou shalt always detach both local and remote DRBD resources in case of failure on either side (or face the wrath of the System Deadlock Demon)!
drbdadm detach <resource>
This ought to be taken care of by the DRBD on-io-error handler.
Once the DRBD resources are detached on both peers:
ZFS should gracefully offline the corresponding device (-> FAULTED).
For reasons yet unknown, ZED might not automatically activate the spare device.
To be on the safe side, the DRBD on-io-error handler should have taken care of it too.
zpool offline -f <pool> <device>
zpool replace <pool> <device> <spare>
# OR (should the faulty resource be a spare)
zpool remove <pool> <spare>
Now on to repairing the faulty disk/resource/device:
Make sure to switch the local DRBD resource to Secondary (which will be possible only if the ZPOOL device is properly FAULTED)
drbdadm secondary <resource>
Replace the faulty disk and initialize it for DRBD usage
drbdadm create-md <resource>
Re-attach both DRBD resources (which is safe once the corresponding ZPOOL device is FAULTED)
drbdadm attach <resource>
Re-connect both DRBD resources (this will be necessary only if one of the peers has been rebooted in the meantime)
drbdadm connect <resource>
Verify the DRBD resource is fully resynchronized
cat /proc/drbd
# => ... Connected Secondary/Secondary UpToDate/UpToDate
Switch the DRBD resource back to Primary
drbdadm primary <resource>
Online the ZPOOL device again
zpool clear <pool> <device>
# OR (should the repaired resource be a spare)
zpool add <pool> spare <spare>
If disks are hooked to a hardware RAID controller, they must be exported as JBOD, so each can then be used individually as the backing disk of a DRBD resource.
Example given for an Adaptec controller:
arcconf CREATE 1 JBOD <channel> <ID> [<channel> <ID> ...]
echo '- - -' > /sys/class/scsi_host/host0/scan
ls -l /dev/disk/by-path/pci-*-scsi-*
PS: Why not use the controller’s RAID for redundancy? Because ZFS is so much better!