โ—index ๐Ÿ—๏ธscheduling.md ๐Ÿท๏ธtags ๐Ÿ‘คabout

๐Ÿ—๏ธ 150 Pods on 32 Cores: Multi-Arch Scheduling Across x86 and ARM

Third post in the k3s homelab series. Previously: tunneling through CGNAT and LUKS + Dropbear + RAID6.

My entire cluster uses less power than a gaming PC 💡. Seven nodes, 32 cores, 151 pods, and the busiest machine idles at 20% CPU. The trick isn't raw hardware; it's scheduling: putting the right pod on the right node and letting Kubernetes handle the rest.

The nodes

Here's what I'm working with:

```text
Node         CPU    RAM    Arch    Zone      Role
─────────────────────────────────────────────────────────────────
corellia     8c     16GB   amd64   home      control-plane + worker
mandalore    4c     16GB   amd64   home      worker
tatooine     4c     16GB   amd64   home      worker
kamino       4c     8GB    arm64   home      worker (RPi 4B)
jakku        4c     4GB    arm64   home      worker (RPi 4B)
dagobah      4c     2GB    arm64   parents   worker (RPi 4B)
scarif       4c     8GB    amd64   cloud     edge VPS
```

Three Intel mini PCs (N305 + two N100s), three Raspberry Pi 4Bs, and one cloud VPS. Total: 32 cores, ~70 GB of RAM.

The mini PCs do the heavy lifting: databases, media transcoding, storage. The Pis handle lightweight workloads: ArgoCD, cert-manager, Glance, event-watcher, MetalLB. Scarif sits in a datacenter running the internet-facing edge stack (Pangolin, gerbil, traefik-edge).

Labels: teaching Kubernetes about your hardware

Stock Kubernetes knows only a few things about your nodes: architecture (kubernetes.io/arch), OS, and hostname. That's not enough. I add custom labels to encode the things that matter for scheduling:

```yaml
# Node labels (applied via k3s agent flags)
kubernetes.io/zone: home        # corellia, mandalore, tatooine, kamino, jakku
kubernetes.io/zone: cloud       # scarif
kubernetes.io/location: parents # dagobah
```

Zone separates home nodes from the cloud VPS. Location tags dagobah at my parents' house, so workloads that need low-latency access to NFS or the local network can exclude it. Architecture labels come for free from k3s.
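
The same labels can live in the k3s config file instead of flags, so they survive reinstalls. A minimal sketch, assuming the standard config path:

```yaml
# /etc/rancher/k3s/config.yaml on a home node (sketch)
node-label:
  - "kubernetes.io/zone=home"
```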

These labels power three scheduling patterns: "run only on amd64", "run only at home", and "never run on the edge".

The scarif taint: keeping the edge clean

Scarif is special. It has a public IP, runs the Pangolin/gerbil/traefik-edge stack, and should not run random workloads. A taint keeps everything off unless explicitly tolerated:

```yaml
# Scarif taint (applied via k3s agent flags)
dedicated=edge:NoSchedule
```
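
In k3s config-file form, that's one entry; a sketch:

```yaml
# scarif's /etc/rancher/k3s/config.yaml (sketch)
node-taint:
  - "dedicated=edge:NoSchedule"
```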

Only a handful of things tolerate this taint: the pangolin chart (pangolin + gerbil + traefik-edge + crowdsec), the newt edge tunnel, the geoblock-manager, and DaemonSets that need to run everywhere (node-exporter, alloy).

```yaml
# Pangolin chart values.yaml
nodeSelector:
  kubernetes.io/hostname: scarif
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "edge"
    effect: "NoSchedule"
```

Everything else, all 140+ pods, stays on home nodes. The taint is the fence, the toleration is the key.

Multi-arch: making arm64 a first-class citizen

kamino and jakku are Raspberry Pi 4Bs. ARM. Most container images ship multi-arch manifests these days, so they just work. But a few don't:

```yaml
# Images that only support amd64
ghcr.io/kieraneglin/pinchflat     # YouTube archiver
dxflrs/amd64_garage               # Garage S3 (it's in the name)
ghcr.io/keel-hq/keel              # Image auto-updater
```

For these, I add an explicit architecture selector:

```yaml
nodeSelector:
  kubernetes.io/arch: amd64
```

Before adding any new workload, I check:

โฏ_bashโ€บ3 lines
  1โฏโฏโฏ docker manifest inspect ghcr.io/some/image:latest | jq '.[].platform.architecture'
  2"amd64"
  3"arm64"

If arm64 is in the list, it can schedule anywhere. If not, it gets the amd64 selector. Simple rule, no surprises.

The Pis currently run 53 pods between them. That's a third of the cluster. ArgoCD controllers, cert-manager, MetalLB, Glance, event-watcher, reflector, NFS CSI nodes, monitoring agents, the newt home tunnel, and more. They're not decoration, they're load-bearing.

GPU pinning: distributing hardware transcoding

All three mini PCs have Intel integrated GPUs with QuickSync hardware transcoding. The GPU is exposed as /dev/dri via hostPath mounts, and workloads that need it are pinned to specific nodes:

```yaml
# GPU-accelerated deployment (pinned to a specific node)
volumes:
  - name: dri
    hostPath:
      path: /dev/dri
nodeSelector:
  kubernetes.io/hostname: corellia
```
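
The container side mounts the same device; a minimal sketch (container name is hypothetical):

```yaml
# Corresponding container spec: mount the GPU device node
containers:
  - name: transcode-worker   # hypothetical name
    volumeMounts:
      - name: dri
        mountPath: /dev/dri
```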

Media transcoding is distributed across all three mini PCs. Each node runs a transcode worker as an independent deployment, pinned by hostname. Three nodes, three GPUs, three parallel transcode streams. Background library transcoding uses a DaemonSet with a node affinity to exclude the node where the server runs:

```yaml
# Transcode DaemonSet, exclude the server node
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: NotIn
              values:
                - mandalore
```

The NotIn operator is underrated. Instead of listing every node where something should run, you list the ones where it shouldn't. When I add a fourth amd64 node someday, the DaemonSet will automatically pick it up without changing any config.

Priority classes: who gets evicted first

When a node runs low on resources, Kubernetes needs to decide what to kill. Priority classes make this explicit:

```yaml
# Three tiers
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical
value: 1000000        # never evict these
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: normal
value: 100000         # default for all pods
globalDefault: true
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: best-effort
value: 10000          # evict these first
```

Critical (1,000,000): databases, Prometheus, Alertmanager, cert-manager, Vaultwarden, Loki. Things where data loss or downtime actually matters. The descheduler won't touch anything at or above this threshold.

Normal (100,000): everything else by default. Media apps, ArgoCD, Gitea. Important but restartable.

Best-effort (10,000): expendable services. Recyclarr, cleanuparr. Nice to have, fine to kill.
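
Workloads opt into a tier with a single field in the pod template; a minimal sketch:

```yaml
# Deployment excerpt: pin a database to the critical tier
spec:
  template:
    spec:
      priorityClassName: critical
```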

When corellia's memory hits 75% (which it does; it's the busiest node), the kubelet evicts best-effort pods first, then normal, and never critical. The databases stay up.
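
That 75% line is a kubelet eviction threshold. Roughly how it would look in k3s config (a sketch; the exact signal and number are the point, not my literal config):

```yaml
# /etc/rancher/k3s/config.yaml kubelet args (sketch):
# start evicting when free memory drops below 25%
kubelet-arg:
  - "eviction-hard=memory.available<25%"
```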

The descheduler: rebalancing every 5 minutes

Kubernetes has a dirty secret: it schedules pods once. After that, they stay where they landed forever, even if the cluster becomes wildly unbalanced. Deploy 10 services in a row? They might all land on corellia because it had the most free memory at the time. The other nodes sit idle while corellia sweats.

The descheduler fixes this. It runs every 5 minutes and evicts pods that should move:

```yaml
# Descheduler LowNodeUtilization config
profiles:
  - name: default
    plugins:
      balance:
        enabled:
          - LowNodeUtilization
          - RemoveDuplicates
      deschedule:
        enabled:
          - RemovePodsViolatingNodeTaints
          - RemovePodsViolatingNodeAffinity
          - RemovePodsHavingTooManyRestarts
    pluginConfig:
      - name: LowNodeUtilization
        args:
          thresholds:
            cpu: 25
            memory: 35
            pods: 25
          targetThresholds:
            cpu: 50
            memory: 55
            pods: 40
```

If a node drops below 25% CPU / 35% memory / 25% pod count, it's "underutilized". If another node is above 50% CPU / 55% memory / 40% pods, it's "overutilized". The descheduler evicts pods from overutilized nodes so the default scheduler can place them on underutilized ones.

The DefaultEvictor respects priority: anything at 1,000,000 (critical) is untouchable. It's also allowed to evict pods with PVCs, which matters since most of my stateful apps use NFS persistent volumes.
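
Both behaviours live in the DefaultEvictor's args; a sketch of the relevant pluginConfig entry:

```yaml
# DefaultEvictor excerpt (sketch)
- name: DefaultEvictor
  args:
    ignorePvcPods: false      # pods with PVCs are fair game
    priorityThreshold:
      value: 1000000          # only pods below this priority can be evicted
```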

RemoveDuplicates spreads replicas of the same deployment across nodes; no point having two replicas on the same machine. RemovePodsHavingTooManyRestarts catches crashlooping pods (threshold: 100 restarts) and evicts them, giving them a fresh start on a potentially different node. Sometimes a pod crashloops because of something specific to that node, and a reschedule is all it takes.
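
The 5-minute cadence is just a cron schedule; a sketch of the Helm values, assuming the upstream descheduler chart:

```yaml
# Descheduler Helm values (sketch): CronJob mode, every 5 minutes
kind: CronJob
schedule: "*/5 * * * *"
```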

DaemonSets: the everywhere pattern

Some workloads need to run on every node. Or almost every node. The "almost" is where it gets interesting.

Monitoring is the obvious one: alloy (log collector) and node-exporter (hardware metrics) run on all seven nodes. No exceptions, no excuses. If a node exists, I want its logs and metrics.

But most DaemonSets need carve-outs. MetalLB speakers only run on home nodes (scarif doesn't participate in L2 announcements). The transcode workers only run on amd64 nodes with GPUs, and I exclude mandalore because it's already the busiest worker. LibreTranslate runs everywhere except dagobah, because a translation model on a 2 GB Raspberry Pi at my parents' house is just cruel.
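
The MetalLB carve-out is one values entry, assuming the upstream Helm chart:

```yaml
# MetalLB values (sketch): L2 speakers only on home nodes
speaker:
  nodeSelector:
    kubernetes.io/zone: home
```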

The pattern that surprised me most: k3s upgrades. I run the system-upgrade-controller to auto-upgrade k3s, but scarif runs a different build (cloud-optimised, different flannel config). I learned that the hard way when an upgrade broke scarif's networking and took down all external traffic for 20 minutes 😅. Now upgrades are zone-restricted:

```yaml
# k3s system-upgrade controller: home zone only
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/zone
              operator: In
              values: ["home"]
```

Storage affinity: keeping data close

Most apps use NFS PVCs (storage class nfs-csi-k3s), which work from any home node since they all mount /share from corellia. But a few things use local-path for performance: databases, Loki, Garage. These PVs are bound to specific nodes.
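
Requesting node-local storage is just a storage-class choice; a minimal sketch (claim name hypothetical):

```yaml
# PVC excerpt (sketch): local-path binds the volume to one node's disk
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data        # hypothetical name
spec:
  storageClassName: local-path
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```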

I learned this lesson when the descheduler moved a PostgreSQL pod from corellia to kamino. The local-path PV stayed on corellia. The pod couldn't mount its data. Database down. Prometheus went nuts. My phone lit up like a Christmas tree at 2 AM 🎄.

Now databases have nodeAffinity pinning them to the node where their data lives:

```yaml
# Local PV, bound to home amd64 nodes only
nodeAffinity:
  required:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
              - corellia
              - mandalore
              - tatooine
```

Scarif uses local-path exclusively (no NFS access from the cloud), so its PVs are bound to scarif. The Pis use NFS for everything, keeping their SD cards write-free, a whole topic I cover in the LUKS + RAID post.

Power consumption: 102 watts from the wall

A Shelly Plug S sits between the power strip and the wall, reporting real-time wattage to Prometheus via a shelly-exporter pod in the monitoring namespace.
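
Prometheus scrapes the exporter like any other target; a sketch with a hypothetical service name and port:

```yaml
# Prometheus scrape job (sketch; target address is an assumption)
scrape_configs:
  - job_name: shelly
    static_configs:
      - targets: ["shelly-exporter.monitoring.svc.cluster.local:9924"]
```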

โฏ_bashโ€บ2 lines
  1โฏโฏโฏ promql 'shelly_meter_overpower_watts{channel="0"}'
  2Studio Lamp: 102.11 W

That's everything: three mini PCs, three Raspberry Pis, six SATA spinning drives, a network switch, and a router. For context, an RTX 4090 graphics card alone pulls 450W under load. At £0.25/kWh, 102W works out to roughly 73 kWh a month, about £18 for the whole cluster. Less than a single cloud VM with equivalent specs.

The result

151 pods on 32 cores across three architectures (amd64, arm64, and "cloud VPS"). Seven nodes, three zones, three priority tiers. GPU transcoding distributed across all three mini PCs. The descheduler keeps things balanced. The taint keeps the edge clean. Custom labels and affinities put workloads where they belong. And the whole thing sips 102 watts.

The scheduling isn't complicated, but it is deliberate. Every nodeSelector, every taint, every priority class exists because something went wrong without it. Pods landing on scarif when they shouldn't. Databases getting evicted during a transcoding spike. A Pi running out of memory because the descheduler dumped 40 pods on it 🤦. You learn by breaking things, then you write the rule that prevents it from breaking again.

Next up: self-healing DNS, auto-generated diagrams, and automation that runs while I sleep.

Share / comment on Mastodon →