Got Linux ?

Blah blah blah... Mostly technical thoughts, rants and gibberish

Highly-available ZFS-on-Linux, on top of DRBD


The goal of this setup is to provide:

The idea of layering ZFS on top of DRBD stems from several considerations:

The ZVOL hindrances hinted at in the first point above are:

For the sake of consistency in the following chapters, we’ll refer to:


Setting up ZFS on top of DRBD is rather straightforward and works perfectly well until one starts testing its high availability and resilience to failures.

ZFS on top of DRBD will fail miserably, with a full system lockup, should a local disk fail, even though DRBD is supposed to continue operating using the (still healthy) remote disk.

The key to having ZFS on top of DRBD work even when failures start to show up is to make sure to detach both the local and the remote DRBD resource in case of I/O errors.

Only then will ZFS be aware of the underlying DRBD resource I/O error (be it local or remote), mark the corresponding ZPOOL device FAULTED, and continue operation thanks to its redundant VDEV (raidz or mirror).
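
To check that ZFS did mark the device FAULTED, the output of zpool status can be filtered. The snippet below is a minimal sketch: the function name faulted_devices is made up for illustration, and it assumes the usual column layout of zpool status (device name first, state second).

```shell
# List pool members reported as FAULTED.
# Reads 'zpool status <pool>' output on stdin; assumes the usual
# "NAME STATE READ WRITE CKSUM" column layout.
faulted_devices() {
  # Skip the "state: ..." summary line, keep config lines in FAULTED state.
  awk '$1 != "state:" && $2 == "FAULTED" { print $1 }'
}
```

Typical usage would be: zpool status tank | faulted_devices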

DRBD on-io-error Handler

In order to heed the above warning, DRBD must use a custom on-io-error handler (script):


Mark the big WARNING (again):

#                     ??? DRBD bug <-???-> ZFS bug ???
#  DRBD resources MUST be detached on both peers to prevent total failure
#  of all network interfaces, and all other DRBD resources, and eventual
#  full system deadlock.
#                     ??? DRBD bug <-???-> ZFS bug ???
#  Also, DRBD 'on-io-error' handler MUST complete (exit) for the I/O error
#  to be propagated to ZFS. Any ZFS transaction group (TXG) WILL stall until
#  it does, with corresponding process (txg_sync) deadlock! As well as any
#  other attempted ZFS command!

This handler achieves its purpose in four steps:

  1. Detach the DRBD resource on both local and remote peers (the latter via SSH). NO ZFS commands are executed whatsoever, to avoid deadlocking with txg_sync.

  2. Call itself again - asynchronously (forking) - with different arguments, to carry on with the ZFS commands that are to be executed after txg_sync resumes

  3. Exit and allow DRBD to continue processing the error and pass it on to the upper layer - namely ZFS - and allow txg_sync to resume

  4. The forked sibling will be able to carry on - along with its required ZFS commands - now that txg_sync is back in business
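
The four steps above can be sketched as follows. This is not the actual handler script, only an illustration of its two-phase structure. The function names, the zfs-followup phase name, the ZFS_FOLLOWUP_DELAY variable and the use of 'zpool status -x' as the follow-up command are all assumptions. The sketch also assumes DRBD hands the resource name to the handler in the DRBD_RESOURCE environment variable.

```shell
#!/bin/sh
# Hypothetical two-phase I/O error handler (sketch, not the author's script).
# Phase 1 (run by DRBD):  <script> drbd-detach --remote <peer>
# Phase 2 (forked child): <script> zfs-followup
# Assumes DRBD exports the resource name to handlers as $DRBD_RESOURCE.

# Overridable command names, so the flow can be dry-run with "echo".
: "${DRBDADM:=drbdadm}"
: "${SSH:=ssh}"

# Step 1: detach the resource locally and on the peer. No ZFS command here.
detach_both() {  # usage: detach_both <resource> <peer>
  $DRBDADM detach "$1" || echo "local detach of $1 failed" >&2
  $SSH "$2" drbdadm detach "$1" || echo "remote detach of $1 on $2 failed" >&2
}

# Step 4: ZFS commands, safe only once the handler has exited and txg_sync
# has resumed. 'zpool status -x' is a placeholder for the real follow-up.
zfs_followup() {
  sleep "${ZFS_FOLLOWUP_DELAY:-10}"   # assumed grace period, not from the source
  zpool status -x
}

case "${1:-}" in
  drbd-detach)
    # Expected arguments: drbd-detach --remote <peer>
    detach_both "$DRBD_RESOURCE" "$3"
    # Step 2: fork ourselves; detach stdio so DRBD's pipe is not held open.
    "$0" zfs-followup </dev/null >/dev/null 2>&1 &
    # Step 3: return to DRBD so the error propagates up to ZFS.
    exit 0
    ;;
  zfs-followup)
    zfs_followup
    ;;
esac
```

The important design point is that phase 1 never touches ZFS and exits quickly. Everything that needs zpool or zfs is deferred to the forked child.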

To enable the handler, DRBD must be configured as per the following snippets:

handlers {
  local-io-error "/etc/drbd.d/drbd+zfs-io-error.sh drbd-detach --remote <peer> | mail -s '[DRBD/ZFS] I/O Error' root";
}

disk {
  on-io-error call-local-io-error;
}

ZFS Initialization

Given our setup specifics:

HA Resource Agents and Configuration

The standard Corosync+Pacemaker Linux HA stack does not provide the OCF Resource Agents (RA) required to properly manage a ZFS on top of DRBD setup:

  1. Multiple DRBD resources must be managed as one primitive, where a given number of failed individual DRBD resources shall not be considered a global failure (since ZFS VDEV redundancy is there to handle the corresponding failed devices).

  2. Importing/exporting the ZPOOL must also be managed as an HA primitive, along with the underlying DRBD resources (primitive).

Thus the following custom RAs:

Which are to be used in Pacemaker as:

  <group id="ZPool-tank">
    <primitive id="ZPool-tank-drbd" class="ocf" provider="custom" type="DrbdMultiResources">
      <instance_attributes id="ZPool-tank-drbd-IA">
        <nvpair id="ZPool-tank-drbd-IA-resources" name="resources" value="drbd{0..11}"/>
        <nvpair id="ZPool-tank-drbd-IA-allow_nonstarted" name="allow_nonstarted" value="3"/>
      </instance_attributes>
      <operations>
        <op id="ZPool-tank-drbd-OP-monitor" name="monitor" interval="60s" timeout="10s"/>
      </operations>
    </primitive>
    <primitive id="ZPool-tank-zpool" class="ocf" provider="custom" type="ZfsPool">
      <instance_attributes id="ZPool-tank-zpool-IA">
        <nvpair id="ZPool-tank-zpool-IA-pool" name="pool" value="tank"/>
        <nvpair id="ZPool-tank-zpool-IA-cachefile" name="cachefile" value="/etc/ha.d/zpool-tank.cache"/>
      </instance_attributes>
      <operations>
        <op id="ZPool-tank-zpool-OP-monitor" name="monitor" interval="60s" timeout="10s"/>
      </operations>
    </primitive>
  </group>
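
The allow_nonstarted attribute suggests a monitor that tolerates a bounded number of non-started DRBD resources. The actual RA code is not reproduced here, so the snippet below is only a guess at that logic. The function names and the ROLE_CMD variable are hypothetical, and it assumes 'drbdadm role <resource>' prints the local role first (e.g. Primary/Secondary).

```shell
# Hypothetical monitor logic for a multi-resource DRBD primitive (sketch only).
# ROLE_CMD prints the role(s) of a resource, local role first.
: "${ROLE_CMD:=drbdadm role}"

# Print how many of the given resources are NOT Primary locally.
count_nonstarted() {  # usage: count_nonstarted <resource>...
  n=0
  for res in "$@"; do
    case "$($ROLE_CMD "$res" 2>/dev/null)" in
      Primary*) ;;               # started: local role is Primary
      *) n=$((n + 1)) ;;         # anything else counts as non-started
    esac
  done
  echo "$n"
}

# Succeed (0, i.e. OCF_SUCCESS) while the number of non-started resources
# stays within the allowed budget; fail (1, i.e. OCF_ERR_GENERIC) otherwise.
monitor_verdict() {  # usage: monitor_verdict <allow_nonstarted> <resource>...
  allowed="$1"; shift
  if [ "$(count_nonstarted "$@")" -le "$allowed" ]; then
    return 0
  fi
  return 1
}
```

With allow_nonstarted set to 3, up to three failed resources out of twelve would leave the primitive healthy under this reading. That budget should stay consistent with the redundancy of the ZPOOL VDEVs.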

Dealing with Failures

The very first rule that one must abide by:

Once DRBD resources are detached on both peers:

Now on to repairing the faulty disk/resource/device:

  1. Make sure to switch the local DRBD resource to Secondary (which will be possible only if the ZPOOL device is properly FAULTED)

    drbdadm secondary <resource>
  2. Replace the faulty disk and initialize it for DRBD usage

    drbdadm create-md <resource>
  3. Re-attach both DRBD resources (which is safe once the corresponding ZPOOL device is FAULTED)

    drbdadm attach <resource>
  4. Re-connect both DRBD resources (this will be necessary only if one of the peers has been rebooted in the meantime)

    drbdadm connect <resource>
  5. Verify the DRBD resource is fully resynchronized

    cat /proc/drbd
    # => ... Connected Secondary/Secondary UpToDate/UpToDate
  6. Switch the DRBD resource back to Primary

    drbdadm primary <resource>
  7. Online the ZPOOL device again

    zpool clear <pool> <device>
    # OR (should the repaired resource be a spare)
    zpool add <pool> spare <spare>
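
Step 5 can take a while on large disks, so it may help to poll rather than watch /proc/drbd by hand. The helper below is a sketch under assumptions: wait_for_sync and DSTATE_CMD are made-up names, and it relies on 'drbdadm dstate <resource>' printing UpToDate/UpToDate once both sides are in sync, as it does with a single peer on DRBD 8.x.

```shell
# Poll a DRBD resource until both disks report UpToDate (sketch only).
# DSTATE_CMD prints "<local>/<peer>" disk states for a resource.
: "${DSTATE_CMD:=drbdadm dstate}"

wait_for_sync() {  # usage: wait_for_sync <resource> [max_tries] [delay_seconds]
  tries="${2:-60}"
  while [ "$tries" -gt 0 ]; do
    # Done as soon as both local and peer disks are UpToDate.
    [ "$($DSTATE_CMD "$1" 2>/dev/null)" = "UpToDate/UpToDate" ] && return 0
    tries=$((tries - 1))
    sleep "${3:-10}"
  done
  return 1   # gave up: still resynchronizing, or resource unavailable
}
```

For instance, 'wait_for_sync <resource> && drbdadm primary <resource>' would only promote the resource once the resynchronization has completed.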

Annex: JBOD Creation

If disks are hooked to a hardware RAID controller, they must be exported as JBOD, so that each can then be used individually as the backing disk of a DRBD resource.

Example given for an Adaptec controller:

arcconf CREATE 1 JBOD <channel> <ID> [<channel> <ID> ...]
echo '- - -' > /sys/class/scsi_host/host0/scan
ls -l /dev/disk/by-path/pci-*-scsi-*

PS: Why not use the controller’s RAID for redundancy? Because ZFS is so much better!