The Alerting Pipeline That Pages My Phone
Last post in the k3s homelab series. Previously: CGNAT tunneling, LUKS + Dropbear + RAID6, multi-arch scheduling, self-healing automation, and GitOps.
I run a full observability stack on a homelab. Prometheus scrapes a pile of time series across every target group I could think of. Loki ingests logs from every node and pod. Grafana ties it all together. Alertmanager routes to my phone. This is how it's wired up.
The stack
Five components, deployed as subcharts of a single monitoring Helm chart:
- kube-prometheus-stack: Prometheus, Alertmanager, Grafana, and all the default scrape configs
- Loki: Log aggregation (single-binary mode)
- Alloy: Log collector, runs on every node as a DaemonSet
- Gotify: Notification server with an Android app
- OpenTelemetry Collector: Trace collection for services that support OTEL
All five deploy to the monitoring namespace. Prometheus and Loki use local-path storage for I/O performance. Alertmanager and Gotify use NFS.
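For reference, the umbrella chart is just a thin wrapper around those five subcharts. A minimal sketch of its Chart.yaml dependencies, using the usual upstream chart repositories; the version pins (and the Gotify repo URL) are illustrative, not the exact ones I run:

apiVersion: v2
name: monitoring
version: 0.1.0
dependencies:
  - name: kube-prometheus-stack
    repository: https://prometheus-community.github.io/helm-charts
    version: 65.x                 # illustrative pin
  - name: loki
    repository: https://grafana.github.io/helm-charts
    version: 6.x
  - name: alloy
    repository: https://grafana.github.io/helm-charts
    version: 1.x
  - name: gotify
    repository: https://charts.example.org/gotify   # hypothetical repo URL
    version: 1.x
  - name: opentelemetry-collector
    repository: https://open-telemetry.github.io/opentelemetry-helm-charts
    version: 0.x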
Prometheus: 617K series, 1 year retention
Prometheus is the heart of it. It scrapes every 30 seconds and keeps data for a year:
retention: "1y"
retentionSize: "200GiB"
storageSpec:
  volumeClaimTemplate:
    spec:
      storageClassName: local-path
      resources:
        requests:
          storage: 200Gi
A year of retention is overkill for a homelab, but it means I can compare this December's power consumption to last December's, or see how memory usage changed after a Kubernetes version upgrade six months ago. The 200GiB size limit acts as a safety net: if series cardinality explodes, Prometheus drops the oldest data instead of filling the disk.
Right now, Prometheus tracks 3,598 unique metric names across 617,405 active time series, ingesting about 21,000 samples per second. The priority class is critical so the descheduler won't touch it.
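That pin is a one-liner in the kube-prometheus-stack values; the class name here is a stand-in for whichever high-priority class the descheduler is told to leave alone:

prometheus:
  prometheusSpec:
    # assumption: "homelab-critical" is a placeholder for the actual priority class name
    priorityClassName: homelab-critical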
What gets scraped
Kubernetes service discovery handles most of it automatically. Any pod with a prometheus.io/scrape: "true" annotation gets picked up. But non-Kubernetes targets need static scrape configs:
# External scrape targets (via Tailscale IPs)
- job_name: node-exporter
  static_configs:
    - targets: [...]        # falcon (monitoring host)
      labels: {...}
    - targets: [...]        # hoth (router)
      labels: {...}

- job_name: smartctl-exporter
  static_configs:
    # All 6 nodes with disks
    - targets: [...]
The scrape secret also covers traefik-edge metrics on scarif, the headscale coordination server, and CrowdSec intrusion detection stats. Everything reachable via the Tailscale overlay, so no ports exposed to the internet.
The relabel_configs copy the hostname label to instance, so Grafana dashboards show "corellia" instead of "10.0.1.252:9100". Small detail, big readability improvement.
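The rule itself is tiny; a sketch assuming the static targets carry a hostname label as in the config above:

relabel_configs:
  # copy the friendly hostname label over the default ip:port instance label
  - source_labels: [hostname]
    target_label: instance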
Loki: 90 days of logs
Loki runs as a single binary with filesystem-backed storage. There's no object store and no microservice split, just one replica backed by one PVC:
deploymentMode: SingleBinary
loki:
  storage:
    type: filesystem
  schemaConfig:
    configs:
      - from: "2026-01-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: index_
          period: 24h
  limits_config:
    retention_period: 2160h # 90 days
  compactor:
    retention_enabled: true
    delete_request_cancel_period: 2h
90 days is enough to debug anything I've ever needed to go back and check. The compactor runs hourly and garbage-collects expired chunks.
Alloy: the log collector
Alloy replaces Promtail as the log shipping agent. It runs on all 7 nodes as a DaemonSet (including scarif), collecting:
- Container logs from the Kubernetes API
- Systemd journal entries (kernel messages, k3s agent logs, SSH auth)
Everything ships to Loki with namespace, pod, container, and node labels. A typical LogQL query looks like:
{namespace="pangolin"} |= "error" | logfmt
The DaemonSet tolerates the scarif edge taint, so logs from traefik-edge and gerbil are captured too.
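The toleration is the usual DaemonSet pattern; a sketch with a placeholder taint key, since the exact key on scarif is whatever the scheduling post set up:

tolerations:
  # assumption: scarif is tainted with something like this edge role key
  - key: node-role.kubernetes.io/edge
    operator: Exists
    effect: NoSchedule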
Alert rules: 25 categories of things that can break
The stock kube-prometheus-stack alert rules cover the basics (pod crashlooping, node down, disk pressure). But a homelab has its own failure modes. I've added custom rules for everything that's ever broken:
Hardware:
- MaxTempHigh: any hwmon sensor above 95°C (the N100s run warm under sustained load)
- NodeDiskIOSaturation: weighted I/O queue time above threshold (12 hours for RAID, since it's always somewhat busy; 30 minutes for single disks)
- SD card saturation on the Pis: mmcblk0 I/O time above 90%
Storage:
- PersistentVolumeUsageHigh (>85%) and Critical (>95%)
- FillingUp4h: linear prediction that a PVC will fill within 4 hours
- Filesystem read-only detection
Network & services:
- TargetDown: any scrape target unreachable for more than 5 minutes
- QBittorrentDown, NetworkDisconnected, Firewalled: the VPN tunnel is fragile
- GarageDown, GarageNodeDown: S3 storage health
Databases:
- CNPGDown: PostgreSQL cluster health (CNPG operator)
- MysqlDown, MysqlReplicationLag
- etcd alerts (k3s embeds etcd): no leader, high fsync durations, high commit latency, database size approaching limit
Certificates:
- CertificateNotReady: cert-manager certificate stuck for more than 15 minutes
- CertificateExpiringSoon: less than 7 days until expiry
Cluster operations:
- K3sUpgradeAvailable: a new k3s version exists that the system-upgrade controller can apply
I've disabled a handful of upstream defaults that don't apply to k3s (standalone etcd rules, some kubelet alerts) and replaced them with custom versions that account for the cluster's quirks.
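To give a sense of the shape these rules take, here's a sketch of the temperature alert as a PrometheusRule. The expression is an assumption built on node-exporter's hwmon collector, and the rule/group names and the for-duration are placeholders; my real rule differs in the details:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: homelab-hardware
  namespace: monitoring
spec:
  groups:
    - name: hardware
      rules:
        - alert: MaxTempHigh
          # node-exporter exposes hwmon sensors as node_hwmon_temp_celsius
          expr: max by (instance) (node_hwmon_temp_celsius) > 95
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.instance }} has a sensor above 95°C"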
Alertmanager: routing to my phone
Alertmanager receives firing alerts and routes them to Gotify:
route:
  receiver: gotify
  group_by: [...]          # grouping labels elided
  routes:
    - matchers:
        - severity = "critical"
      repeat_interval: 15m
    - matchers:
        - severity = "warning"
      repeat_interval: 2h
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["alertname"]
Critical alerts repeat every 15 minutes until resolved. Warnings repeat every 2 hours. The inhibit rule prevents duplicate noise: if NodeDown fires as critical, the warning-level version is suppressed. The Watchdog alert (a dead man's switch that fires continuously to prove the pipeline is working) routes to a null receiver.
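That null route is just a matcher pointing at an empty receiver, roughly:

route:
  routes:
    # Watchdog is supposed to fire forever; swallow it so it never pages me
    - matchers:
        - alertname = "Watchdog"
      receiver: "null"
receivers:
  - name: "null"   # defined with no notification integrations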
Gotify bridge
The alertmanager-gotify-bridge translates Alertmanager webhooks into Gotify API calls. It formats messages with emoji prefixes and links back to the alert source:
🔥 [FIRING] NodeFilesystemUsageHigh
⚠️ warning | firing

Filesystem on mandalore is at 87% capacity.

Prometheus | Alertmanager | Silence
Gotify runs in-cluster with its own Android app, and push notifications arrive within seconds of an alert firing without routing through any third-party notification service.
Grafana: the dashboards
Grafana uses PostgreSQL as its backend (via the CNPG database operator in the database namespace) and authenticates through Google OAuth. Dashboards are backed up to charts/grafana-dashboards/ as JSON files and can be restored with a script.
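In the Grafana values this boils down to a database block pointing at the CNPG service and an auth.google block. A trimmed sketch, where the service hostname and the secret mount paths are assumptions:

grafana:
  grafana.ini:
    database:
      type: postgres
      host: grafana-db-rw.database.svc:5432   # assumption: the CNPG cluster's read-write service
      name: grafana
      user: grafana
      # password supplied via a mounted secret, omitted here
    auth.google:
      enabled: true
      client_id: $__file{/etc/secrets/google/client_id}        # assumption: OAuth secret mounted as files
      client_secret: $__file{/etc/secrets/google/client_secret}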
The dashboards I actually look at regularly:
- Home: cluster overview with CPU, memory, disk, network, power consumption per node
- Traefik: request rates, response codes, latency percentiles per ingress
- Loki: log volume by namespace, error rate trends
- Node: per-node deep dive with all hardware metrics (temperatures, fan speeds, disk I/O, network)
- RAID: md0 health, individual disk SMART attributes, rebuild progress if a drive fails
Alert rules link directly to relevant Grafana panels. When PersistentVolumeUsageHigh fires, the Gotify message includes a link to the disk usage panel with the right time range.
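The wiring for that is nothing fancier than an annotation on the rule carrying a dashboard URL; the hostname, dashboard UID, and panel id below are made up for illustration:

annotations:
  summary: "PVC {{ $labels.persistentvolumeclaim }} is above 85% capacity"
  # hypothetical dashboard UID and panel id; the notification template turns this into a link
  dashboard_url: "https://grafana.internal.example/d/abc123/home?viewPanel=42&from=now-6h&to=now"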
Extra exporters
Some things need their own exporter to get metrics into Prometheus:
- smartctl-exporter: SMART disk health attributes from every node with physical disks. Runs as a DaemonSet.
- shelly-exporter: Queries Shelly smart plugs over the local network. Reports real-time power consumption; that 102W cluster figure from the scheduling post comes from this.
- smokeping-prober: Tailscale latency probes to every node. Measures mesh network health and detects when a node's tunnel goes stale.
- exportarr: Sidecar containers in the media stack (Sonarr, Radarr, Prowlarr) that expose application metrics to Prometheus.
The dead man's switch
The Watchdog alert is always firing, and that's intentional. If I stop getting Watchdog notifications, it means something in the pipeline is broken: Prometheus isn't evaluating rules, Alertmanager isn't routing, or Gotify is down.
It's the simplest and most important alert in the system. Everything else could have a bug in its PromQL expression. Watchdog just fires, unconditionally, forever. If it stops, I know the fire alarm itself is broken.
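For reference, the stock kube-prometheus-stack Watchdog rule is essentially just this:

- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none
  annotations:
    summary: An always-firing alert that proves the whole pipeline is alive.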
The result
Prometheus tracks over 600,000 active series across the cluster at around 21,000 samples per second, Loki holds about ninety days of logs from every pod and node, and the custom alert rules cover the usual suspects: hardware, storage, network, databases, certificates, cluster operations. Alerts route to my phone in seconds via Gotify, with smart deduplication and severity-based repeat intervals.
The monitoring stack runs on about 2.5GB of memory total (Prometheus is the hungriest). It's pinned to amd64 nodes for performance, but the Alloy log collector and node-exporter DaemonSets run on all 7 nodes, including the Pis and the edge VPS. The whole thing deploys as a single Helm chart with one helm install monitoring command.
It's more monitoring than a homelab needs. But the alternative is finding out something broke two days later when you try to watch a movie and the media server is down 🍿. I'd rather get a push notification at 3 AM and fix it in the morning.
That wraps up the series. Six posts covering the full stack: CGNAT tunneling, encrypted storage, multi-arch scheduling, self-healing automation, GitOps, and monitoring. The whole thing runs on a handful of nodes, under a hundred watts from the wall. Thanks for reading.