โ—index ๐Ÿ”’luks-dropbear-raid6.md ๐Ÿท๏ธtags ๐Ÿ‘คabout

🔒 Encrypting Everything: LUKS + Dropbear + RAID6 on a Headless Cluster

Second post in the k3s homelab series. If you missed the first one, it's about tunneling 40 services through a VPS to escape CGNAT.

So: corellia, my control plane node (an 8-core N305 mini PC), has six SATA drives hanging off it. Five are active in a RAID6 array, one is a hot spare. That's ~33 TB of usable storage, serving the entire cluster over NFS. Media, backups, databases, git repos... everything lives on /share.

But here's the thing: every amd64 node in the cluster has full-disk encryption. Root partitions on corellia, mandalore, and tatooine are all LUKS-encrypted. The Raspberry Pis are the exception: they boot from SD cards, and the threat model is different (and honestly, encrypting a Pi SD card is more pain than it's worth).

This is great until one of the machines reboots and nobody is home to type the passphrase 😅.

The problem: encrypted disks on headless machines

LUKS full-disk encryption means the kernel can't mount the root filesystem without a passphrase at boot time. On a desktop you type it in. On a headless server in a closet? You're stuck.

The common solutions are:

  • Don't encrypt: fine, if you don't care about someone walking off with your drives
  • TPM auto-unlock: works great on modern hardware, but my mini PCs don't have TPM chips
  • Tang/Clevis: network-bound encryption, elegant but requires running a Tang server somewhere
  • Dropbear in initramfs: SSH into the machine during boot and type the passphrase remotely

I went with Dropbear. It's the simplest thing that works, and I already SSH into everything anyway.

How Dropbear initramfs works

The idea is beautifully stupid. Linux initramfs (the tiny filesystem that loads before the real root) can run a minimal SSH server. You SSH in, pipe the passphrase to cryptsetup, and boot continues normally.

Here's what happens when corellia reboots:

  1. BIOS → GRUB → kernel loads initramfs
  2. Initramfs brings up the network interface with a static IP
  3. Dropbear SSH server starts on port 22
  4. I SSH in and unlock the root LUKS volume
  5. Root mounts, RAID auto-unlocks using a keyfile on root (/etc/luks-md0)
  6. RAID assembles, NFS starts, k3s joins the cluster
โฏ_bashโ€บ2 lines
  1# The unlock command, pipe passphrase to cryptsetup
  2โฏโฏโฏ ssh root@corellia "echo -n 'hunter2' > /lib/cryptsetup/passfifo"

That one command is all it takes; the passphrase goes to a named pipe that cryptsetup reads from during initramfs, and the machine finishes booting on its own.
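
If you'd rather not put the passphrase on the ssh command line, Debian's cryptsetup-initramfs also ships a small helper inside the initramfs that prompts for it interactively. A minimal sketch, assuming a reasonably recent Debian:

❯ bash
  # Interactive alternative: SSH into the initramfs, then run the unlock helper
  ❯❯❯ ssh root@corellia
  # ...now inside the BusyBox shell Dropbear drops you into
  ❯❯❯ cryptroot-unlock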

The clever part is the RAID encryption. Corellia has two LUKS layers: the root partition (nvme0n1p3) unlocked by Dropbear, and the RAID array (md0) unlocked automatically by a keyfile stored on the now-decrypted root. So I only type one passphrase, but both the OS and the 33 TB array end up encrypted. mandalore and tatooine only have the root partition to unlock, no RAID.
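
In crypttab terms the two layers look roughly like this. The mapping name for root and the UUID are placeholders; only the md0_crypt name and the /etc/luks-md0 keyfile path come from the actual setup:

❯ bash
  # /etc/crypttab on corellia (sketch, illustrative names)
  nvme0n1p3_crypt  UUID=<root-partition-uuid>  none           luks,discard
  md0_crypt        /dev/md0                    /etc/luks-md0  luks

The first entry is unlocked in the initramfs (passphrase via Dropbear); the second is processed once the root filesystem is mounted, by which point the keyfile is readable.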

The RAID6 array

[storage stack diagram]

The storage stack on corellia is layered like a cake:

📄 text
  6 × SATA drives → RAID6 (md0) → LUKS (md0_crypt) → ext4 → /share → NFS

RAID6 gives me two-disk fault tolerance. Any two drives can fail simultaneously and I lose nothing. The hot spare kicks in automatically on the first failure, so in practice three drives would have to die before I lose data. With 8,000+ power-on hours on most drives and no SMART errors to speak of (except sdb, with 68 read errors that I'm keeping an eye on 👀), I sleep fine.

Current status from Prometheus:

📄 text
  RAID6 (md0): HEALTHY, 5/5 active, 0 failed, 1 spare
  Used: 9,794 GB / 33,393 GB (29.3%)
  Drives: sda 44°C, sdb 41°C, sdc 45°C, sdd 42°C, sde 43°C, sdf 42°C
  NVMe boot: 40°C, wear 3%, power-on 2,523h

Why RAID6 and not ZFS?

I know, I know. Every r/homelab post will tell you to use ZFS. But RAID6 via mdadm is:

  • Dead simple. mdadm --detail /dev/md0 tells me everything
  • No special kernel modules or memory requirements
  • Works with any filesystem on top (I use ext4 because it's boring and reliable)
  • Been around for decades; I trust it with my data

ZFS is great software. But I don't need snapshots, dedup, or inline compression badly enough to take on the operational complexity. mdadm + LUKS + ext4 is a stack I understand completely, and when something breaks at 3 AM I want simple.
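
Checking on the array is correspondingly boring. These are stock mdadm commands, nothing specific to this setup:

❯ bash
  # The kernel's view of all md arrays, then the detailed per-array report
  ❯❯❯ cat /proc/mdstat
  ❯❯❯ mdadm --detail /dev/md0 | grep -E 'State|Devices'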

Ansible: 16 roles, 7 nodes, one command

The entire bare-metal provisioning, from fresh Debian install to k3s-ready node, is handled by Ansible. 16 roles, applied in order:

📋 yaml
  # site.yml, the full playbook structure
  - Base provisioning (all hosts):       packages, sysctl
  - Static networking (home + parents):  networking
  - LUKS + Dropbear remote unlock:       luks_dropbear    # corellia, mandalore, tatooine
  - RAID6 array (corellia):              raid
  - NFS server (corellia):               nfs_server
  - NFS client mounts:                   nfs_client       # mandalore, tatooine, kamino, jakku
  - Tailscale:                           tailscale        # all nodes
  - SD card wear minimization:           sdcard           # RPi nodes only
  - Scarif firewall:                     scarif_firewall
  - Headscale:                           headscale        # scarif only
  - Startup scripts:                     startup          # per-node tuning

The inventory is grouped by function, not just location:

📋 yaml
  # Functional groups in hosts.yml
  luks_hosts:    [corellia, mandalore, tatooine]   # encrypted amd64 nodes
  nfs_server:    [corellia]                        # the one true NFS server
  nfs_clients:   [mandalore, tatooine, kamino, jakku]
  rpi_hosts:     [kamino, jakku, dagobah]          # SD card wear tricks

This means I can provision a single host (ansible-playbook site.yml -l corellia), a single role (--tags sysctl), or the entire cluster in one shot. Dry-run with --check --diff before anything destructive.
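
Spelled out, those invocations look like this (same flags as above, nothing hidden):

❯ bash
  # One host, one role, or a dry run of everything
  ❯❯❯ ansible-playbook site.yml -l corellia
  ❯❯❯ ansible-playbook site.yml --tags sysctl
  ❯❯❯ ansible-playbook site.yml --check --diff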

The RAID role is intentionally defensive

The RAID Ansible role does not create the array. It only verifies it exists and prints instructions if it doesn't:

📋 yaml
  # raid/tasks/main.yml, verify, don't create
  - name: Fail if RAID array does not exist
    ansible.builtin.fail:
      msg: |
        RAID array {{ raid_array }} does not exist.
        To create it manually, run:
          mdadm --create {{ raid_array }} --level=6 \
            --raid-devices=5 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sdf \
            --spare-devices=1 /dev/sde
        Then set up LUKS:
          cryptsetup luksFormat {{ raid_array }}
          cryptsetup luksOpen {{ raid_array }} md0_crypt
    when: raid_status.rc != 0

Creating a RAID array is a one-time, destructive operation. I don't want Ansible doing that automatically. The role verifies the array, checks mdadm.conf and /etc/crypttab, and warns if anything is missing. The actual creation was done by hand, once, with me staring at the terminal making sure I had the right drives.
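
The check that registers raid_status isn't shown above; a sketch of what it plausibly looks like (task name assumed):

📋 yaml
  # raid/tasks/main.yml (sketch), feeds the fail task above
  - name: Check whether the RAID array exists
    ansible.builtin.command: mdadm --detail {{ raid_array }}
    register: raid_status
    changed_when: false
    failed_when: false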

Dropbear setup with Ansible

The Dropbear role deploys three things:

  1. SSH authorized keys into the initramfs
  2. Dropbear config with hardened options
  3. Static IP for the initramfs network
โฏ_bashโ€บ2 lines
  1# Dropbear options: no password auth, no forwarding, 10 min idle timeout
  2DROPBEAR_OPTIONS="-I 600 -j -k -p 22 -s"

The static IP is the interesting part. Each encrypted host needs network configured inside initramfs, before the real OS loads. The format is a single string that the kernel's ip= parameter parses:

📄 text
  IP::GATEWAY:NETMASK:HOSTNAME:INTERFACE

For corellia that's 10.0.1.252::10.0.1.254:255.255.254.0:corellia:enp4s0. Every change triggers update-initramfs -u to rebuild the boot image.
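
Wiring that string in is two steps with initramfs-tools. The conf.d filename below is arbitrary; the IP string is corellia's real one from above:

❯ bash
  # Drop the initramfs network config, then rebuild the boot image
  ❯❯❯ echo 'IP=10.0.1.252::10.0.1.254:255.255.254.0:corellia:enp4s0' > /etc/initramfs-tools/conf.d/network
  ❯❯❯ update-initramfs -u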

NFS: serving 33 TB to the cluster

Once corellia's RAID is unlocked and mounted at /share, it's exported over NFS to every worker node:

โฏ_bashโ€บ3 lines
  1# /etc/exports, two subnets: LAN + Tailscale overlay
  2/share    10.0.0.0/23(rw,async,no_root_squash,no_subtree_check,wdelay)
  3/share    100.64.0.0/10(rw,async,no_root_squash,no_subtree_check,wdelay)

Two subnets because worker nodes reach corellia either via the LAN (10.0.x.x) or via the Tailscale overlay (100.64.x.x). The no_root_squash is required for k3s: containers run as root and need to create files on NFS volumes.
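
After editing /etc/exports, the standard exportfs dance applies:

❯ bash
  # Re-export everything, then verify what's actually being served
  ❯❯❯ exportfs -ra
  ❯❯❯ exportfs -v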

The client mount options are tuned for k3s workloads:

โฏ_bashโ€บ3 lines
  1# NFS client mount on worker nodes
  2โฏโฏโฏ mount | grep share
  3nfs.crisidev.lan:/share on /share type nfs4 (rw,vers=4.2,hard,nconnect=4,rsize=1048576,wsize=1048576)

The key options:

  • vers=4.2: NFSv4.2, the modern protocol with better locking and performance
  • hard: retry forever if the server is unreachable (don't fail pods just because NFS hiccupped)
  • nconnect=4: four parallel TCP connections per mount (a huge throughput improvement)
  • rsize=1048576,wsize=1048576: 1 MB I/O buffers (the default is 32 KB)
  • x-systemd.automount: lazy mount on first access, plays nice with boot ordering (see the fstab sketch below)
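
Put together as a single fstab line, a sketch (the exact line on the workers isn't shown in the mount output above):

❯ bash
  # /etc/fstab on a worker node (sketch)
  nfs.crisidev.lan:/share  /share  nfs4  rw,vers=4.2,hard,nconnect=4,rsize=1048576,wsize=1048576,x-systemd.automount  0  0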

Keeping Raspberry Pis alive: the SD card war

kamino and jakku are Raspberry Pi 4Bs. They're great little arm64 workers, but they boot from SD cards. SD cards have limited write endurance. Every log entry, every journal write, every atime update is slowly killing the card.

The sdcard Ansible role is ruthlessly focused on minimizing writes:

Disable atime: one of the biggest wins. By default, Linux updates the "last accessed" timestamp on every file read. That's a write for every read. Insane on flash storage.

โฏ_bashโ€บ2 lines
  1# /etc/fstab, noatime on root
  2/dev/mmcblk0p2  /  ext4  defaults,noatime  0  1

Volatile journal: the systemd journal writes to RAM only and never touches the SD card:

📄 text
  [Journal]
  Storage=volatile
  RuntimeMaxUse=50M

50 MB of logs in RAM. When the Pi reboots, logs are gone. That's fine, I have Loki collecting everything anyway.

Kill swap, use zram: disk-based swap on an SD card is murder. Instead, systemd-zram-generator creates compressed in-memory swap. The RAM compresses roughly 2:1, so the 8 GB Pi effectively gets ~12 GB of memory without touching the card.
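
The generator is driven by a tiny config file. A sketch, with the sizing chosen to match the ~2:1 math above (an 8 GB zram device held in roughly 4 GB of compressed RAM); the exact values on the Pis are an assumption:

📄 text
  # /etc/systemd/zram-generator.conf (sketch)
  [zram0]
  zram-size = ram
  compression-algorithm = zstd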

Performance tuning: the startup script

Every node runs a startup script via a systemd oneshot service. The default script enables GRO (Generic Receive Offload) forwarding for NFS performance:

โฏ_bashโ€บ4 lines
  1# default-startup.sh, all nodes
  2โฏโฏโฏ cat /usr/local/bin/startup.sh
  3NETDEV=$(ip -o route get 8.8.8.8 | cut -f 5 -d " ")
  4ethtool -K "$NETDEV" rx-udp-gro-forwarding on rx-gro-list off
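
The oneshot wrapper around that script is about as small as systemd units get. A sketch; the unit name is an assumption:

📄 text
  # /etc/systemd/system/startup.service (sketch)
  [Unit]
  Description=Per-node startup tuning
  After=network-online.target

  [Service]
  Type=oneshot
  RemainAfterExit=yes
  ExecStart=/usr/local/bin/startup.sh

  [Install]
  WantedBy=multi-user.target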

Corellia gets extra tuning for the RAID array:

โฏ_bashโ€บ8 lines
  1# corellia-startup.sh, RAID performance
  2โฏโฏโฏ cat /usr/local/bin/startup.sh
  3NETDEV=$(ip -o route get 8.8.8.8 | cut -f 5 -d " ")
  4ethtool -K "$NETDEV" rx-udp-gro-forwarding on rx-gro-list off
  5
  6echo "max_performance" | tee /sys/class/scsi_host/host*/link_power_management_policy
  7echo 256 | tee /sys/block/sd*/queue/read_ahead_kb
  8echo 32768 | tee /sys/block/md0/md/stripe_cache_size

Three tuning knobs:

  • Link power management: disable aggressive power saving on SATA links (latency vs power)
  • Read-ahead: 256 KB per disk (better sequential throughput for media streaming)
  • Stripe cache: 32,768 cache entries for RAID6 parity calculations (the default of 256 is absurdly small)

The startup role looks for {hostname}-startup.sh first and falls back to default-startup.sh. Simple convention, no conditional logic needed.
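
That convention maps onto a single Ansible task via the first_found lookup. A sketch, with the task name and file layout assumed:

📋 yaml
  # startup role (sketch): prefer {hostname}-startup.sh, fall back to the default
  - name: Install the node startup script
    ansible.builtin.copy:
      src: "{{ lookup('ansible.builtin.first_found', [inventory_hostname ~ '-startup.sh', 'default-startup.sh']) }}"
      dest: /usr/local/bin/startup.sh
      mode: "0755"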

The 3 AM reboot scenario

Let's walk through what happens when corellia reboots unexpectedly:

  1. T+0s: Power comes back, BIOS → GRUB → kernel → initramfs
  2. T+15s: Dropbear starts, static IP configured on enp4s0
  3. T+15s: Alertmanager fires NodeDown → Gotify notification on my phone 📱
  4. T+?: I wake up, see the notification, grab my phone
  5. T+?: ssh root@corellia "echo -n '<passphrase>' > /lib/cryptsetup/passfifo"
  6. T+30s: Root unlocks, RAID keyfile available, md0 auto-decrypts
  7. T+45s: RAID assembles, NFS starts, k3s agent rejoins, pods reschedule
  8. T+60s: Cluster healthy, go back to sleep

The gap between step 3 and step 5 is the problem. If I'm traveling, asleep with my phone on silent, or just not paying attention, the cluster runs without its storage node. Worker nodes with hard NFS mounts will hang waiting for /share to come back; they won't crash, but they'll be useless until corellia unlocks.

Is this acceptable? For a homelab, yes. I've considered Tang/Clevis for automated unlock, but that means the encryption key is recoverable from the network, which defeats part of the purpose. For now, Dropbear + a phone notification gets me unlocked within minutes on a normal day.

Temperatures and health

The whole cluster runs cool. Corellia, with six spinning drives, maxes out at 45°C. The RPis run warmer (kamino at 59°C, jakku at 64°C) because they're passively cooled. mandalore hits 71°C under load; it's the busiest worker node and could use better ventilation.

📄 text
  corellia:  45°C (8-core N305 + 6 SATA + NVMe)
  mandalore: 71°C (4-core N100, busiest worker)
  tatooine:  55°C (4-core N100)
  kamino:    59°C (RPi 4B, passive cooled)
  jakku:     64°C (RPi 4B, passive cooled)
  scarif:    cloud VPS, no sensors

All within spec, but mandalore is on my radar. A 3D-printed fan duct is in the future.

The result

The RAID6 array is LUKS-encrypted at rest and served over NFSv4.2 to the rest of the cluster, unlockable remotely via SSH in the initramfs 🔐. The amd64 nodes all have encrypted root partitions, with corellia's RAID auto-unlocking via a keyfile on the decrypted root. Every node provisions from bare Debian to k3s-ready with a single Ansible command, and the Raspberry Pis run on SD cards that should last years instead of months thanks to aggressive write minimization.

The whole provisioning repo is a modest pile of Ansible roles, and that's the pleasant thing about mature infrastructure work: boring code that reliably does its job tends to be the code worth keeping 💗.

Next up: 150 Pods on 32 Cores, multi-arch scheduling across x86 and ARM, priority classes, and why my cluster uses less power than a gaming PC.

Discuss: share / comment on Mastodon →