The Alerting Pipeline That Pages My Phone
Last post in the k3s homelab series. Previously: CGNAT tunneling, LUKS + Dropbear + RAID6, multi-arch scheduling, self-healing automation, and GitOps.
I run a full observability stack on a homelab. Prometheus scrapes a pile of time series across every target group I could think of. Loki ingests logs from every node and pod. Grafana ties it all together. Alertmanager routes to my phone. This is how it's wired up.
The stack
Five components, deployed as subcharts of a single monitoring Helm chart:
- kube-prometheus-stack: Prometheus, Alertmanager, Grafana, and all the default scrape configs
- Loki: Log aggregation (single-binary mode)
- Alloy: Log collector, runs on every node as a DaemonSet
- Gotify: Notification server with an Android app
- OpenTelemetry Collector: Trace collection for services that support OTEL
All five deploy to the monitoring namespace. Prometheus and Loki use local-path storage for I/O performance. Alertmanager and Gotify use NFS.
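For reference, the umbrella chart is just a thin wrapper around those five subcharts. A minimal sketch of its Chart.yaml dependencies, using the usual upstream chart repositories; the version pins (and the Gotify repo URL) are illustrative, not the exact ones I run:

apiVersion: v2
name: monitoring
version: 0.1.0
dependencies:
  - name: kube-prometheus-stack
    repository: https://prometheus-community.github.io/helm-charts
    version: 65.x                 # illustrative pin
  - name: loki
    repository: https://grafana.github.io/helm-charts
    version: 6.x
  - name: alloy
    repository: https://grafana.github.io/helm-charts
    version: 1.x
  - name: gotify
    repository: https://charts.example.org/gotify   # hypothetical repo URL
    version: 1.x
  - name: opentelemetry-collector
    repository: https://open-telemetry.github.io/opentelemetry-helm-charts
    version: 0.x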
Prometheus: 617K series, 1 year retention
Prometheus is the heart of it. It scrapes every 30 seconds and keeps data for a year:
retention: "1y"
retentionSize: "200GiB"
storageSpec:
  volumeClaimTemplate:
    spec:
      storageClassName: local-path
      resources:
        requests:
          storage: 200Gi
A year of retention is overkill for a homelab, but it means I can compare this December's power consumption to last December's, or see how memory usage changed after a Kubernetes version upgrade six months ago. The 200GiB size limit acts as a safety net: if series cardinality explodes, Prometheus drops the oldest data instead of filling the disk.
Right now, Prometheus tracks 3,598 unique metric names across 617,405 active time series, ingesting about 21,000 samples per second. The priority class is critical so the descheduler won't touch it.
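That pin is a one-liner in the kube-prometheus-stack values; the class name here is a stand-in for whichever high-priority class the descheduler is told to leave alone:

prometheus:
  prometheusSpec:
    # assumption: "homelab-critical" is a placeholder for the actual priority class name
    priorityClassName: homelab-critical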
What gets scraped
Kubernetes service discovery handles most of it automatically. Any pod with a prometheus.io/scrape: "true" annotation gets picked up. But non-Kubernetes targets need static scrape configs:
# External scrape targets (via Tailscale IPs)
- job_name: node-exporter
  static_configs:
    - targets: [...]        # falcon (monitoring host)
      labels: {...}
    - targets: [...]        # hoth (router)
      labels: {...}

- job_name: smartctl-exporter
  static_configs:
    # All 6 nodes with disks
    - targets: [...]
The scrape secret also covers traefik-edge metrics on scarif, the headscale coordination server, and CrowdSec intrusion detection stats. Everything reachable via the Tailscale overlay, so no ports exposed to the internet.
The relabel_configs copy the hostname label to instance, so Grafana dashboards show "corellia" instead of "10.0.1.252:9100". Small detail, big readability improvement.
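The rule itself is tiny; a sketch assuming the static targets carry a hostname label as in the config above:

relabel_configs:
  # copy the friendly hostname label over the default ip:port instance label
  - source_labels: [hostname]
    target_label: instance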
Loki: 90 days of logs
Loki runs as a single binary with filesystem-backed storage. There's no object store and no microservice split, just one replica backed by one PVC:
deploymentMode: SingleBinary
loki:
  storage:
    type: filesystem
  schemaConfig:
    configs:
      - from: "2026-01-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: index_
          period: 24h
  limits_config:
    retention_period: 2160h # 90 days
  compactor:
    retention_enabled: true
    delete_request_cancel_period: 2h
90 days is enough to debug anything I've ever needed to go back and check. The compactor runs hourly and garbage-collects expired chunks.
Alloy: the log collector
Alloy replaces Promtail as the log shipping agent. It runs on all 7 nodes as a DaemonSet (including scarif), collecting:
- Container logs from the Kubernetes API
- Systemd journal entries (kernel messages, k3s agent logs, SSH auth)
Everything ships to Loki with namespace, pod, container, and node labels. A typical LogQL query looks like:
{namespace="pangolin"} |= "error" | logfmt
The DaemonSet tolerates the scarif edge taint, so logs from traefik-edge and gerbil are captured too.
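The toleration is the usual DaemonSet pattern; a sketch with a placeholder taint key, since the exact key on scarif is whatever the scheduling post set up:

tolerations:
  # assumption: scarif is tainted with something like this edge role key
  - key: node-role.kubernetes.io/edge
    operator: Exists
    effect: NoSchedule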
Alert rules: 25 categories of things that can break
The stock kube-prometheus-stack alert rules cover the basics (pod crashlooping, node down, disk pressure). But a homelab has its own failure modes. I've added custom rules for everything that's ever broken:
Hardware:
- MaxTempHigh: any hwmon sensor above 95°C (the N100s run warm under sustained load)
- NodeDiskIOSaturation: weighted I/O queue time above threshold (12 hours for RAID, since it's always somewhat busy; 30 minutes for single disks)
- SD card saturation on the Pis: mmcblk0 I/O time above 90%
Storage:
- PersistentVolumeUsageHigh (>85%) and Critical (>95%)
- FillingUp4h: linear prediction that a PVC will fill within 4 hours
- Filesystem read-only detection
Network & services:
- TargetDown: any scrape target unreachable for more than 5 minutes
- QBittorrentDown, NetworkDisconnected, Firewalled: the VPN tunnel is fragile
- GarageDown, GarageNodeDown: S3 storage health
Databases:
- CNPGDown: PostgreSQL cluster health (CNPG operator)
- MysqlDown, MysqlReplicationLag
- etcd alerts (k3s embeds etcd): no leader, high fsync durations, high commit latency, database size approaching limit
Certificates:
- CertificateNotReady: cert-manager certificate stuck for more than 15 minutes
- CertificateExpiringSoon: less than 7 days until expiry
Cluster operations:
- K3sUpgradeAvailable: a new k3s version exists that the system-upgrade controller can apply
I've disabled a handful of upstream defaults that don't apply to k3s (standalone etcd rules, some kubelet alerts) and replaced them with custom versions that account for the cluster's quirks.
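To give a sense of the shape these rules take, here's a sketch of the temperature alert as a PrometheusRule. The expression is an assumption built on node-exporter's hwmon collector, and the rule/group names and the for-duration are placeholders; my real rule differs in the details:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: homelab-hardware
  namespace: monitoring
spec:
  groups:
    - name: hardware
      rules:
        - alert: MaxTempHigh
          # node-exporter exposes hwmon sensors as node_hwmon_temp_celsius
          expr: max by (instance) (node_hwmon_temp_celsius) > 95
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.instance }} has a sensor above 95°C"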
Alertmanager: routing to my phone
Alertmanager receives firing alerts and routes them to Gotify:
route:
  receiver: gotify
  group_by: [...]          # grouping labels elided
  routes:
    - matchers:
        - severity = "critical"
      repeat_interval: 15m
    - matchers:
        - severity = "warning"
      repeat_interval: 2h
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["alertname"]
Critical alerts repeat every 15 minutes until resolved. Warnings repeat every 2 hours. The inhibit rule prevents duplicate noise: if NodeDown fires as critical, the warning-level version is suppressed. The Watchdog alert (a dead man's switch that fires continuously to prove the pipeline is working) routes to a null receiver.
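That null route is just a matcher pointing at an empty receiver, roughly:

route:
  routes:
    # Watchdog is supposed to fire forever; swallow it so it never pages me
    - matchers:
        - alertname = "Watchdog"
      receiver: "null"
receivers:
  - name: "null"   # defined with no notification integrations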
Gotify bridge
The alertmanager-gotify-bridge translates Alertmanager webhooks into Gotify API calls. It formats messages with emoji prefixes and links back to the alert source:
🔥 [FIRING] NodeFilesystemUsageHigh
⚠️ warning | firing

Filesystem on mandalore is at 87% capacity.

Prometheus | Alertmanager | Silence
Gotify runs in-cluster with its own Android app, and push notifications arrive within seconds of an alert firing without routing through any third-party notification service.
Grafana: the dashboards
Grafana uses PostgreSQL as its backend (via the CNPG database operator in the database namespace) and authenticates through Google OAuth. Dashboards are backed up to charts/grafana-dashboards/ as JSON files and can be restored with a script.
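In the Grafana values this boils down to a database block pointing at the CNPG service and an auth.google block. A trimmed sketch, where the service hostname and the secret mount paths are assumptions:

grafana:
  grafana.ini:
    database:
      type: postgres
      host: grafana-db-rw.database.svc:5432   # assumption: the CNPG cluster's read-write service
      name: grafana
      user: grafana
      # password supplied via a mounted secret, omitted here
    auth.google:
      enabled: true
      client_id: $__file{/etc/secrets/google/client_id}        # assumption: OAuth secret mounted as files
      client_secret: $__file{/etc/secrets/google/client_secret}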
The dashboards I actually look at regularly:
- Home: cluster overview with CPU, memory, disk, network, power consumption per node
- Traefik: request rates, response codes, latency percentiles per ingress
- Loki: log volume by namespace, error rate trends
- Node: per-node deep dive with all hardware metrics (temperatures, fan speeds, disk I/O, network)
- RAID: md0 health, individual disk SMART attributes, rebuild progress if a drive fails
Alert rules link directly to relevant Grafana panels. When PersistentVolumeUsageHigh fires, the Gotify message includes a link to the disk usage panel with the right time range.
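The wiring for that is nothing fancier than an annotation on the rule carrying a dashboard URL; the hostname, dashboard UID, and panel id below are made up for illustration:

annotations:
  summary: "PVC {{ $labels.persistentvolumeclaim }} is above 85% capacity"
  # hypothetical dashboard UID and panel id; the notification template turns this into a link
  dashboard_url: "https://grafana.internal.example/d/abc123/home?viewPanel=42&from=now-6h&to=now"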
Extra exporters
Some things need their own exporter to get metrics into Prometheus:
- smartctl-exporter: SMART disk health attributes from every node with physical disks. Runs as a DaemonSet.
- shelly-exporter: Queries Shelly smart plugs over the local network. Reports real-time power consumption; that 102W cluster figure from the scheduling post comes from this.
- smokeping-prober: Tailscale latency probes to every node. Measures mesh network health and detects when a node's tunnel goes stale.
- exportarr: Sidecar containers in the media stack (Sonarr, Radarr, Prowlarr) that expose application metrics to Prometheus.
The dead man's switch
The Watchdog alert is always firing, and that's intentional. If I stop getting Watchdog notifications, it means something in the pipeline is broken: Prometheus isn't evaluating rules, Alertmanager isn't routing, or Gotify is down.
It's the simplest and most important alert in the system. Everything else could have a bug in its PromQL expression. Watchdog just fires, unconditionally, forever. If it stops, I know the fire alarm itself is broken.
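For reference, the stock kube-prometheus-stack Watchdog rule is essentially just this:

- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none
  annotations:
    summary: An always-firing alert that proves the whole pipeline is alive.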
The result
Prometheus tracks over 600,000 active series across the cluster at around 21,000 samples per second, Loki holds about ninety days of logs from every pod and node, and the custom alert rules cover the usual suspects: hardware, storage, network, databases, certificates, cluster operations. Alerts route to my phone in seconds via Gotify, with smart deduplication and severity-based repeat intervals.
The monitoring stack runs on about 2.5GB of memory total (Prometheus is the hungriest). It's pinned to amd64 nodes for performance, but the Alloy log collector and node-exporter DaemonSets run on all 7 nodes, including the Pis and the edge VPS. The whole thing deploys as a single Helm chart with one helm install monitoring command.
It's more monitoring than a homelab needs. But the alternative is finding out something broke two days later when you try to watch a movie and the media server is down 🍿. I'd rather get a push notification at 3 AM and fix it in the morning.
That wraps up the series. Six posts covering the full stack: CGNAT tunneling, encrypted storage, multi-arch scheduling, self-healing automation, GitOps, and monitoring. The whole thing runs on a handful of nodes, under a hundred watts from the wall. Thanks for reading.