MrWolf: What I Learned Giving an AI the Keys to My Infrastructure
Third post in the MrWolf series. Previously: I Gave an AI Tools to Run My Homelab and Rust, Zero Boilerplate.
MrWolf is about two months old. In that time it went from a weekend project with a handful of tools to a sprawling collection of them running in the cluster. It's how I operate the homelab now: not as a novelty, but as the primary interface.
But two months is enough to learn what works, what breaks, and what I'd do differently.
What works better than expected
The killer feature isn't any individual tool. It's tool chaining. I didn't design MrWolf to diagnose problems; I designed it to expose data. Claude does the diagnosis by combining tools in ways I didn't anticipate.
The Pangolin healthcheck story is the canonical example: list resources → pull logs → correlate timestamps → fix with API call → write a CronJob to prevent recurrence. Five tools, one conversation, zero planning on my part. I didn't build a "diagnose Pangolin" tool. Claude assembled one from parts.
This happens constantly. "Why is this pod crashlooping?" becomes: check pod status → get logs → check node pressure → check events → "the node is out of memory because Loki is eating 12 GB, here's what I'd do." I get a diagnosis and a recommendation, built from raw data I could have gathered myself, but gathering it by hand would have taken 10 minutes of terminal juggling.
The other surprise: Claude is really good at reading Prometheus output. I was worried the raw metric format would confuse it. Nope. It parses the node names, compares percentages, spots anomalies, and explains them in plain English. The resolve_node_name function that maps IPs to Star Wars planet names probably helps: "corellia is at 89% CPU" is easier to reason about than "10.0.1.252:9100 is at 89% CPU."
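For flavour, here's roughly what that mapping might look like. A minimal sketch, assuming a static IP-to-name table; MrWolf's real resolve_node_name may differ:

```rust
use std::collections::HashMap;

// Hypothetical sketch of the IP-to-planet lookup described above.
fn resolve_node_name(names: &HashMap<&str, &str>, instance: &str) -> String {
    // Prometheus instance labels look like "10.0.1.252:9100"; strip the port.
    let ip = instance.split(':').next().unwrap_or(instance);
    names
        .get(ip)
        .map(|n| n.to_string())
        .unwrap_or_else(|| instance.to_string()) // unknown IPs pass through as-is
}

fn main() {
    let names = HashMap::from([("10.0.1.252", "corellia")]);
    assert_eq!(resolve_node_name(&names, "10.0.1.252:9100"), "corellia");
}
```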
What doesn't work (yet)
Large responses fill the context window. This is the biggest practical issue. query_loki can return thousands of log lines. list_pods across all namespaces returns 150+ pods. Claude tries to process all of it, the context fills up, and subsequent tool calls get slower or start losing earlier context. I've added limit parameters to most tools, but Claude doesn't always use them on the first call. It'll ask for all pods, realise it's too much, and then re-query with a filter. Two calls instead of one.
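A sketch of the mitigation, with invented constants and helper names. The idea is to clamp limits and tell the model what was cut, so its second call can be targeted instead of blind:

```rust
// Hypothetical defaults; not MrWolf's actual values.
const DEFAULT_LIMIT: usize = 100;
const MAX_LIMIT: usize = 500;

fn effective_limit(requested: Option<usize>) -> usize {
    requested.unwrap_or(DEFAULT_LIMIT).min(MAX_LIMIT)
}

fn truncate(lines: Vec<String>, limit: usize) -> String {
    let total = lines.len();
    let mut out = lines.into_iter().take(limit).collect::<Vec<_>>().join("\n");
    if total > limit {
        // Tell the model what was cut so it can re-query with a filter.
        out.push_str(&format!(
            "\n... {} more lines omitted; pass a filter or a higher limit.",
            total - limit
        ));
    }
    out
}

fn main() {
    let lines: Vec<String> = (0..250).map(|i| format!("log line {i}")).collect();
    println!("{}", truncate(lines, effective_limit(None)));
}
```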
Claude sometimes picks the wrong tool. With so many tools available there's overlap. get_cluster_health and query_prometheus can both answer "what's the CPU usage?" but the first gives a curated summary and the second gives raw PromQL results. Claude usually picks the right one, but not always. Better tool descriptions help; I've rewritten several to be more specific about when to use each.
Formatting for LLMs is different from formatting for humans. Early versions of MrWolf returned output optimised for terminal readability: aligned columns, ASCII tables, fancy separators. Turns out Claude doesn't care about alignment. What it needs is structured, compact text with clear labels. "corellia: 23.4% CPU, 67.8% memory" is better than a pretty table. I've been progressively simplifying the output format.
The MCP protocol itself has quirks. Parameters sometimes arrive as strings instead of numbers ("10" instead of 10), which is why I have those lenient serde deserializers. Session management can be finicky: long-running conversations occasionally lose the MCP connection and need to reconnect. These are protocol-level issues that'll improve as the ecosystem matures.
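Here's a minimal sketch of that lenient pattern, assuming serde and serde_json; MrWolf's actual deserializers may differ:

```rust
use serde::{Deserialize, Deserializer};

// Accept either a JSON number or a numeric string for the same field.
fn lenient_u64<'de, D: Deserializer<'de>>(de: D) -> Result<u64, D::Error> {
    #[derive(Deserialize)]
    #[serde(untagged)]
    enum NumOrString {
        Num(u64),
        Str(String),
    }
    match NumOrString::deserialize(de)? {
        NumOrString::Num(n) => Ok(n),
        NumOrString::Str(s) => s.trim().parse().map_err(serde::de::Error::custom),
    }
}

#[derive(Deserialize)]
struct QueryParams {
    #[serde(deserialize_with = "lenient_u64")]
    limit: u64,
}

fn main() {
    // Both "10" (string) and 10 (number) deserialize to the same value.
    let a: QueryParams = serde_json::from_str(r#"{"limit": "10"}"#).unwrap();
    let b: QueryParams = serde_json::from_str(r#"{"limit": 10}"#).unwrap();
    assert_eq!(a.limit, b.limit);
}
```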
Observability: watching the watcher
Here's a fun one: MrWolf instruments itself. Every tool call gets three Prometheus metrics: call count, duration, and response size. Every upstream HTTP request gets tracked with service labels. I have a Grafana dashboard that shows me how Claude uses the cluster.
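A sketch of what that instrumentation can look like with the prometheus crate; the metric and label names here are assumptions modelled on the description above:

```rust
use once_cell::sync::Lazy;
use prometheus::{
    register_histogram_vec, register_int_counter_vec, HistogramVec, IntCounterVec,
};

static TOOL_CALLS: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!("mrwolf_tool_calls_total", "Tool invocations", &["tool"]).unwrap()
});
static TOOL_DURATION: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!("mrwolf_tool_duration_seconds", "Tool latency", &["tool"]).unwrap()
});
static TOOL_RESPONSE_BYTES: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!("mrwolf_tool_response_bytes_total", "Response size", &["tool"])
        .unwrap()
});

// Wrap every tool invocation so all three metrics are recorded in one place.
fn instrumented(tool: &str, f: impl FnOnce() -> String) -> String {
    TOOL_CALLS.with_label_values(&[tool]).inc();
    let timer = TOOL_DURATION.with_label_values(&[tool]).start_timer();
    let out = f();
    timer.observe_duration();
    TOOL_RESPONSE_BYTES
        .with_label_values(&[tool])
        .inc_by(out.len() as u64);
    out
}

fn main() {
    let _ = instrumented("get_cluster_health", || "all nodes healthy".into());
}
```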
Some things I've learned from the metrics:
- get_cluster_health is by far the most called tool. Makes sense: it's the "start here" tool.
- list_pods has the largest average response size. I should probably add namespace filtering as a default.
- query_prometheus is the slowest tool on average; some PromQL queries take 2-3 seconds on my single-node Prometheus. Not MrWolf's fault, but it shows up in the latency histogram.
- The media tools get used in bursts, usually when I ask "what downloaded overnight?" and Claude chains 4-5 calls together.
- The retry middleware catches about 3% of upstream requests as transient failures, mostly Loki being slow under load. Without retries, those would be tool errors that Claude would have to handle.
The mrwolf_upstream_up gauge is the one I alert on. If Prometheus or Alertmanager goes down, MrWolf's HTTP client records it, and ironically Alertmanager fires an alert about itself being unreachable. The snake eating its own tail.
Lessons learned
Two months, a few hard lessons. Here's what I'd tell someone building their own MCP server for infrastructure.
Start with read-only tools. I built all the monitoring and query tools first, used them for weeks, and only then added write operations. By the time I added restart_pod and scale_deployment, I trusted the confirmation pattern and understood how Claude would use them. If I'd started with write tools, I would have been too nervous to actually use them.
Every tool must handle errors gracefully. An LLM seeing Error: reqwest::Error { kind: Request, url: "http://prometheus:9090/api/v1/query" ... } is worse than useless. It'll try to parse the error, get confused, and give bad advice. The tool_body! macro catches every error and returns "Error: [human-readable message]. The upstream service may be restarting." Claude reads that and either retries or tells the user. Simple, predictable.
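Stripped to its core, the idea looks something like this. The real tool_body! macro surely does more (response wrapping, logging), so treat this as a sketch assuming anyhow:

```rust
// Run the tool body, flattening any failure into a short, human-readable
// string the model can act on, instead of a Debug dump of reqwest internals.
macro_rules! tool_body {
    ($body:expr) => {
        match (|| -> anyhow::Result<String> { $body })() {
            Ok(text) => text,
            Err(e) => format!("Error: {e}. The upstream service may be restarting."),
        }
    };
}

fn main() {
    let out = tool_body!({
        let status: anyhow::Result<String> = Err(anyhow::anyhow!("Prometheus returned 503"));
        status
    });
    assert_eq!(
        out,
        "Error: Prometheus returned 503. The upstream service may be restarting."
    );
}
```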
Auto-enable based on credentials, not feature flags. I used to have MRWOLF_MEDIA_ENABLED=true and MRWOLF_PANGOLIN_ENABLED=true. Then I'd deploy with the API key but forget the flag, or set the flag but forget the key. Now the server checks if the key is non-empty and enables itself. Zero booleans, zero "why isn't this working" debugging sessions.
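The pattern is tiny; the env var name below is an assumption:

```rust
use std::env;

// A toolset is enabled iff its credential is present and non-empty.
fn credential(var: &str) -> Option<String> {
    env::var(var).ok().filter(|v| !v.trim().is_empty())
}

fn main() {
    match credential("MRWOLF_PANGOLIN_API_KEY") {
        Some(_key) => println!("Pangolin tools enabled"),
        None => println!("Pangolin tools disabled (no credential set)"),
    }
}
```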
Format output for your consumer, not for humans. MrWolf's consumer is an LLM. It doesn't need aligned columns or box-drawing characters. It needs labeled data in a consistent format. Every tool returns plain text with clear structure: "Node: corellia, CPU: 23.4%, Memory: 67.8%". Claude parses this perfectly every time.
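A minimal sketch of that label-first formatting:

```rust
// Compact, consistently labeled output: no alignment, no box drawing.
struct NodeStats {
    name: String,
    cpu_pct: f64,
    mem_pct: f64,
}

fn format_node(n: &NodeStats) -> String {
    format!("Node: {}, CPU: {:.1}%, Memory: {:.1}%", n.name, n.cpu_pct, n.mem_pct)
}

fn main() {
    let n = NodeStats { name: "corellia".into(), cpu_pct: 23.4, mem_pct: 67.8 };
    assert_eq!(format_node(&n), "Node: corellia, CPU: 23.4%, Memory: 67.8%");
}
```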
Confirmation gates are non-negotiable for write ops. Every mutating tool requires confirmed=true. No exceptions. It's a simple pattern and it's the reason I sleep at night while an AI has kubectl access to my cluster.
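The gate itself fits in a few lines; the parameter shape below is an assumption:

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct RestartPodParams {
    namespace: String,
    pod: String,
    #[serde(default)] // absent means false: never confirm by accident
    confirmed: bool,
}

fn restart_pod(p: &RestartPodParams) -> String {
    if !p.confirmed {
        // Refuse, and tell the model exactly how to proceed deliberately.
        return format!(
            "Refusing to restart {}/{}. Re-run with confirmed=true to proceed.",
            p.namespace, p.pod
        );
    }
    format!("Restarted {}/{}", p.namespace, p.pod) // actual Kubernetes call elided
}

fn main() {
    let p: RestartPodParams =
        serde_json::from_str(r#"{"namespace": "media", "pod": "sonarr-0"}"#).unwrap();
    println!("{}", restart_pod(&p));
}
```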
What's next
The tools I have now are individual queries. Each one answers a specific question. The next step is composite tools: tools that answer compound questions by combining multiple data sources in one call.
"Is my backup healthy?" should check the CronJob's last success time, verify the restic repository exists in S3, check the backup size trend in Prometheus, and maybe even run restic snapshots to verify integrity. Right now Claude chains several tools to answer that. A single get_backup_status tool could do it in one call with a richer, pre-correlated answer.
Same for "is the media pipeline working?", check if the VPN tunnel is up, check download queue health, check disk space on the media volume, check if any services are crashlooping. One tool instead of five, with the correlation already done.
The other direction is smarter context management. MrWolf currently returns everything and lets Claude figure out what's relevant. A better approach might be progressive disclosure: return a summary first, let Claude ask for details on specific items. "Here are 150 pods, 3 have issues" is better than dumping all 150 pod descriptions into the context window.
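A sketch of that summary-first shape; describe_pod is a hypothetical follow-up tool:

```rust
struct Pod {
    name: String,
    healthy: bool,
}

// Return a one-line summary and name only the pods worth drilling into.
fn summarize(pods: &[Pod]) -> String {
    let bad: Vec<&str> = pods
        .iter()
        .filter(|p| !p.healthy)
        .map(|p| p.name.as_str())
        .collect();
    format!(
        "{} pods total, {} with issues: {}. Call describe_pod on any of them for detail.",
        pods.len(),
        bad.len(),
        bad.join(", ")
    )
}

fn main() {
    let pods = vec![
        Pod { name: "loki-0".into(), healthy: false },
        Pod { name: "grafana-abc".into(), healthy: true },
    ];
    println!("{}", summarize(&pods));
}
```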
And honestly? I want to open-source it. MrWolf is tightly coupled to my cluster right now: hardcoded node names, specific service URLs, my particular stack. But the patterns are generic. The tool_body! macro, the composite server dispatch, the middleware sandwich, the confirmation gates: all of that works for any MCP server talking to any infrastructure. Extracting the framework from the implementation is the goal.
The bigger picture
We're early. MCP is young. The tooling is rough, the ecosystem is small, and most people haven't tried giving an AI real infrastructure access yet. But the experience is transformative.
I've been managing this homelab for years. I know every service, every quirk, every failure mode. And yet, having an AI operator has changed how I think about the cluster. I spend less time in terminals and more time thinking about what I want the cluster to do. The gap between intent and execution got smaller.
The homelab is a playground, but the pattern scales. Every ops team has the same problem: too many dashboards, too many runbooks, too much context in one person's head. An AI with the right tools doesn't replace the engineer, it gives them a faster feedback loop. "Check this, now check that, now fix it" in 30 seconds instead of 10 minutes.
Every homelab operator should build their own MrWolf. Not because it's practical (it's a homelab, nothing is practical), but because it's the best way to understand what AI-assisted infrastructure actually feels like. You can read about it, or you can build a pile of tools yourself and watch an AI chain them together to fix a problem you didn't even know you had.
MrWolf started as a weekend project. Two months later it's how I run my cluster. The best tools are the ones you build for yourself; they fit your hands perfectly because you shaped them that way.