150 Pods on 32 Cores: Multi-Arch Scheduling Across x86 and ARM
Third post in the k3s homelab series. Previously: tunneling through CGNAT and LUKS + Dropbear + RAID6.
My entire cluster uses less power than a gaming PC. Seven nodes, 32 cores, 151 pods, and the busiest machine idles at 20% CPU. The trick isn't raw hardware, it's scheduling: putting the right pod on the right node, and letting Kubernetes handle the rest.
The nodes
Here's what I'm working with:
```
Node       CPU  RAM   Arch   Zone     Role
---------------------------------------------------------
corellia   8c   16GB  amd64  home     control-plane + worker
mandalore  4c   16GB  amd64  home     worker
tatooine   4c   16GB  amd64  home     worker
kamino     4c   8GB   arm64  home     worker (RPi 4B)
jakku      4c   4GB   arm64  home     worker (RPi 4B)
dagobah    4c   2GB   arm64  parents  worker (RPi 4B)
scarif     4c   8GB   amd64  cloud    edge VPS
```
Three Intel mini PCs (N305 + two N100s), three Raspberry Pi 4Bs, and one cloud VPS. Total: 32 cores, ~70 GB of RAM.
The mini PCs do the heavy lifting: databases, media transcoding, storage. The Pis handle lightweight workloads: ArgoCD, cert-manager, Glance, event-watcher, MetalLB. Scarif sits in a datacenter running the internet-facing edge stack (Pangolin, gerbil, traefik-edge).
Labels: teaching Kubernetes about your hardware
Stock Kubernetes knows two things about your nodes: architecture (kubernetes.io/arch) and hostname. That's not enough. I add custom labels to encode the things that matter for scheduling:
```yaml
# Node labels (applied via k3s agent flags)
kubernetes.io/zone: home         # corellia, mandalore, tatooine, kamino, jakku
kubernetes.io/zone: cloud        # scarif
kubernetes.io/location: parents  # dagobah
```
Zone separates home nodes from the cloud VPS. Location tags dagobah at my parents' house, so workloads that need low-latency access to NFS or the local network can exclude it. Architecture labels come for free from k3s.
These labels power three scheduling patterns: "run only on amd64", "run only at home", and "never run on the edge".
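The "run only at home" pattern, for instance, is a one-line selector against the custom zone label; a minimal sketch of the pod spec fragment:

```yaml
# "Run only at home": match the custom zone label applied above.
nodeSelector:
  kubernetes.io/zone: home
```

The other two patterns are the same shape: swap the key for kubernetes.io/arch, or invert the logic with a taint as shown in the next section.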
The scarif taint: keeping the edge clean
Scarif is special. It has a public IP, runs the Pangolin/gerbil/traefik-edge stack, and should not run random workloads. A taint keeps everything off unless explicitly tolerated:
```
# Scarif taint (applied via k3s agent flags)
dedicated=edge:NoSchedule
```
Only a handful of things tolerate this taint: the pangolin chart (pangolin + gerbil + traefik-edge + crowdsec), the newt edge tunnel, the geoblock-manager, and DaemonSets that need to run everywhere (node-exporter, alloy).
```yaml
# Pangolin chart values.yaml
nodeSelector:
  kubernetes.io/hostname: scarif
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "edge"
    effect: "NoSchedule"
```
Everything else, all 140+ pods, stays on home nodes. The taint is the fence, the toleration is the key.
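For the DaemonSets that must run everywhere (node-exporter, alloy), a blanket toleration with `operator: Exists` is simpler than matching the exact value; a sketch:

```yaml
# DaemonSet-style toleration: Exists matches the "dedicated"
# taint regardless of its value, so these pods can land on
# scarif without being pinned to it.
tolerations:
  - key: "dedicated"
    operator: "Exists"
    effect: "NoSchedule"
```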
Multi-arch: making arm64 a first-class citizen
kamino and jakku are Raspberry Pi 4Bs. ARM. Most container images ship multi-arch manifests these days, so they just work. But a few don't:
```
# Images that only support amd64
ghcr.io/kieraneglin/pinchflat  # YouTube archiver
dxflrs/amd64_garage            # Garage S3 (it's in the name)
ghcr.io/keel-hq/keel           # Image auto-updater
```
For these, I add an explicit architecture selector:
```yaml
nodeSelector:
  kubernetes.io/arch: amd64
```
Before adding any new workload, I check which architectures the image manifest actually supports. If arm64 is in the list, it can schedule anywhere. If not, it gets the amd64 selector. Simple rule, no surprises.
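One way to run that check (the image name here is one of the amd64-only examples above, the tag is illustrative) is `docker manifest inspect`, filtering for the architecture fields:

```shell
# List the platforms an image's manifest supports.
# Requires docker with registry access; image name/tag illustrative.
docker manifest inspect ghcr.io/keel-hq/keel:latest \
  | grep '"architecture"' \
  | sort -u
```

If the output contains an `arm64` line, the image can run on the Pis.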
The Pis currently run 53 pods between them. That's a third of the cluster. ArgoCD controllers, cert-manager, MetalLB, Glance, event-watcher, reflector, NFS CSI nodes, monitoring agents, the newt home tunnel, and more. They're not decoration, they're load-bearing.
GPU pinning: distributing hardware transcoding
All three mini PCs have Intel integrated GPUs with QuickSync hardware transcoding. The GPU is exposed as /dev/dri via hostPath mounts, and workloads that need it are pinned to specific nodes:
```yaml
# GPU-accelerated deployment (pinned to a specific node)
volumes:
  - name: dri
    hostPath:
      path: /dev/dri
nodeSelector:
  kubernetes.io/hostname: corellia
```
Media transcoding is distributed across all three mini PCs. Each node runs a transcode worker as an independent deployment, pinned by hostname. Three nodes, three GPUs, three parallel transcode streams. Background library transcoding uses a DaemonSet with a node affinity to exclude the node where the server runs:
```yaml
# Transcode DaemonSet, exclude the server node
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: NotIn
              values:
                - mandalore
```
The NotIn operator is underrated. Instead of listing every node where something should run, you list the ones where it shouldn't. When I add a fourth amd64 node someday, the DaemonSet will automatically pick it up without changing any config.
Priority classes: who gets evicted first
When a node runs low on resources, Kubernetes needs to decide what to kill. Priority classes make this explicit:
```yaml
# Three tiers
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical
value: 1000000       # never evict these
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: normal
value: 100000        # default for all pods
globalDefault: true
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: best-effort
value: 10000         # evict these first
```
Critical (1,000,000): databases, Prometheus, Alertmanager, cert-manager, Vaultwarden, Loki. Things where data loss or downtime actually matters. The descheduler won't touch anything at or above this threshold.
Normal (100,000): everything else by default. Media apps, ArgoCD, Gitea. Important but restartable.
Best-effort (10,000): expendable services. Recyclarr, cleanuparr. Nice to have, fine to kill.
When corellia's memory hits 75% (which it does, it's the busiest node), the kubelet evicts best-effort pods first, then normal, never critical. The databases stay up.
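Opting a workload into a tier is one line in the pod spec (the surrounding deployment is elided; the class name matches the tiers above):

```yaml
# A database pod claiming the critical tier.
spec:
  priorityClassName: critical
```

Everything without an explicit class picks up `normal` via `globalDefault: true`.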
The descheduler: rebalancing every 5 minutes
Kubernetes has a dirty secret: it schedules pods once. After that, they stay where they landed forever, even if the cluster becomes wildly unbalanced. Deploy 10 services in a row? They might all land on corellia because it had the most free memory at the time. The other nodes sit idle while corellia sweats.
The descheduler fixes this. It runs every 5 minutes and evicts pods that should move:
```yaml
# Descheduler LowNodeUtilization config
profiles:
  - name: default
    plugins:
      balance:
        enabled:
          - LowNodeUtilization
          - RemoveDuplicates
      deschedule:
        enabled:
          - RemovePodsViolatingNodeTaints
          - RemovePodsViolatingNodeAffinity
          - RemovePodsHavingTooManyRestarts
    pluginConfig:
      - name: LowNodeUtilization
        args:
          thresholds:
            cpu: 25
            memory: 35
            pods: 25
          targetThresholds:
            cpu: 50
            memory: 55
            pods: 40
```
If a node drops below 25% CPU / 35% memory / 25% pod count, it's "underutilized". If another node is above 50% CPU / 55% memory / 40% pods, it's "overutilized". The descheduler evicts pods from overutilized nodes so the default scheduler can place them on underutilized ones.
The DefaultEvictor respects priority: anything at 1,000,000 (critical) is untouchable. It also handles pods with PVCs, which is important since most of my stateful apps use NFS persistent volumes.
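That priority cutoff is configurable on the DefaultEvictor plugin; a sketch, assuming a recent descheduler version where DefaultEvictor takes a `priorityThreshold` argument:

```yaml
# DefaultEvictor args (sketch): refuse to evict anything at or
# above the critical tier defined earlier.
pluginConfig:
  - name: "DefaultEvictor"
    args:
      priorityThreshold:
        value: 1000000
```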
RemoveDuplicates ensures replicas of the same deployment spread across nodes; there's no point having two replicas on the same machine. RemovePodsHavingTooManyRestarts catches crashlooping pods (threshold: 100 restarts) and evicts them, giving them a fresh start on a potentially different node. Sometimes a pod crashloops because of something specific to that node, and a reschedule is all it takes.
DaemonSets: the everywhere pattern
Some workloads need to run on every node. Or almost every node. The "almost" is where it gets interesting.
Monitoring is the obvious one: alloy (log collector) and node-exporter (hardware metrics) run on all 7 nodes. No exceptions, no excuses. If a node exists, I want its logs and metrics.
But most DaemonSets need carve-outs. MetalLB speakers only run on home nodes (scarif doesn't participate in L2 announcements). The transcode workers only run on amd64 nodes with GPUs, and I exclude mandalore because it's already the busiest worker. LibreTranslate runs everywhere except dagobah, because a translation model on a 2 GB Raspberry Pi at my parents' house is just cruel.
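The MetalLB carve-out is the same zone label doing the work again. A sketch of the Helm values, assuming the chart's `speaker` block exposes a `nodeSelector` field:

```yaml
# MetalLB Helm values (field layout assumed): keep L2 speakers
# on home-zone nodes, off the cloud edge.
speaker:
  nodeSelector:
    kubernetes.io/zone: home
```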
The pattern that surprised me most: k3s upgrades. I run the system-upgrade-controller to auto-upgrade k3s, but scarif runs a different build (cloud-optimised, different flannel config). I learned that the hard way when an upgrade broke scarif's networking and took down all external traffic for 20 minutes. Now upgrades are zone-restricted:
```yaml
# k3s system-upgrade controller: home zone only
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/zone
              operator: In
              values:
                - home
```
Storage affinity: keeping data close
Most apps use NFS PVCs (storage class nfs-csi-k3s), which work from any home node since they all mount /share from corellia. But a few things use local-path for performance: databases, Loki, Garage. These PVs are bound to specific nodes.
I learned this lesson when the descheduler moved a PostgreSQL pod from corellia to kamino. The local-path PV stayed on corellia. The pod couldn't mount its data. Database down. Prometheus went nuts. My phone lit up like a Christmas tree at 2 AM.
Now databases have nodeAffinity pinning them to the node where their data lives:
```yaml
# Local PV, bound to home amd64 nodes only
nodeAffinity:
  required:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
              - corellia
              - mandalore
              - tatooine
```
Scarif uses local-path exclusively (no NFS access from the cloud), so its PVs are bound to scarif. The Pis use NFS for everything, keeping their SD cards write-free, a whole topic I cover in the LUKS + RAID post.
Power consumption: 102 watts from the wall
A Shelly Plug S sits between the power strip and the wall, reporting real-time wattage to Prometheus via a shelly-exporter pod in the monitoring namespace.
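The Grafana panel behind the headline number boils down to a single instant query. The metric name here is a guess at what a typical Shelly exporter exposes; mine may differ:

```
# Hypothetical PromQL: instantaneous power draw reported by the
# plug (metric and label names assumed, not taken from my setup).
shelly_power_watts{job="shelly-exporter"}
```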
That's everything: three mini PCs, three Raspberry Pis, six SATA spinning drives, a network switch, and a router. For context, an RTX 4090 graphics card alone pulls 450W under load. At £0.25/kWh, that's about £18/month for the whole cluster. Less than a single cloud VM with equivalent specs.
The result
151 pods on 32 cores across three architectures (amd64, arm64, and "cloud VPS"). Seven nodes, three zones, three priority tiers. GPU transcoding distributed across all three mini PCs. The descheduler keeps things balanced. The taint keeps the edge clean. Custom labels and affinities put workloads where they belong. And the whole thing sips 102 watts.
The scheduling isn't complicated, but it is deliberate. Every nodeSelector, every taint, every priority class exists because something went wrong without it. Pods landing on scarif when they shouldn't. Databases getting evicted during a transcoding spike. A Pi running out of memory because the descheduler dumped 40 pods on it. You learn by breaking things, then you write the rule that prevents them from breaking again.
Next up: self-healing DNS, auto-generated diagrams, and automation that runs while I sleep.