Infrastructure as Intent – O’Reilly

There’s an open secret in the world of DevOps: nobody trusts the CMDB. The Configuration Management Database (CMDB) is meant to be the “source of truth,” the central map of every server, service, and application in your enterprise. In theory, it’s the foundation for security audits, cost analysis, and incident response. In practice, it’s a work of fiction. The moment you populate a CMDB, it begins to rot. Engineers deploy a new microservice but forget to register it. An autoscaling group spins up 20 new nodes, but the database only records the original three…
We call this configuration drift, and for decades, our industry’s solution has been to throw more scripts at the problem. We write vast, brittle ETL (Extract-Transform-Load) pipelines that attempt to scrape the world and shove it into a relational database. It never works. The “world,” especially the modern cloud native world, moves too fast.
We realized we couldn’t solve this problem by writing better scripts. We had to change the fundamental architecture of how we sync data. We stopped trying to boil the ocean and fix the entire enterprise at once. Instead, we focused on one notoriously difficult environment: Kubernetes. If we could build an autonomous agent capable of reasoning about the complex, ephemeral state of a Kubernetes cluster, we could prove a pattern that works everywhere else. This article explores how we used the newly open-sourced Codex CLI and the Model Context Protocol (MCP) to build that agent. In the process, we moved from passive code generation to active infrastructure operation, transforming the “stale CMDB” problem from a data entry job into a logic puzzle.
The Shift: From Code Generation to Infrastructure Operation with Codex CLI and MCP
The reason most CMDB initiatives fail is ambition. They try to track every switch port, virtual machine, and SaaS license simultaneously. The result is a data swamp: too much noise, not enough signal. We took a different approach. We drew a small circle around a specific domain: Kubernetes workloads. Kubernetes is the perfect testing ground for AI agents because it’s high-velocity and declarative. Things change constantly. Pods die; deployments roll over; services change selectors. A static script struggles to distinguish between a CrashLoopBackOff (a temporary error state) and a deliberate scale-down. We hypothesized that a large language model (LLM), acting as an operator, could understand this nuance. It wouldn’t just copy data; it would interpret it.
The Codex CLI turned this hypothesis into a tangible architecture by enabling a shift from “code generation” to “infrastructure operation.” Instead of treating the LLM as a junior programmer that writes scripts for humans to review and run, Codex empowers the model to execute code itself. We provide it with tools, executable functions that act as its hands and eyes, via the Model Context Protocol. MCP defines a clear interface between the AI model and the outside world, allowing us to expose high-level capabilities like cmdb_stage_transaction without teaching the model the complex internal API of our CMDB. The model learns to use the tool, not the underlying API.
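A minimal sketch of that idea, with a plain-Python registry standing in for a real MCP server, and with cmdb_stage_transaction’s signature and internals invented for illustration: the model sees a named capability and its docstring, never the CMDB’s internal API.

```python
# Hypothetical tool registry illustrating the MCP pattern: functions are
# registered by name and invoked by the agent with JSON-style arguments.
TOOLS = {}

def tool(fn):
    """Register a function as an agent-callable tool."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def cmdb_stage_transaction(record_id: str, action: str, reason: str) -> dict:
    """Stage a CMDB change for review; nothing touches production here."""
    return {"record": record_id, "action": action,
            "reason": reason, "status": "staged"}

# The agent calls the tool by name; the CMDB internals stay hidden.
result = TOOLS["cmdb_stage_transaction"]("svc-042", "update",
                                         "replica count drifted")
```

The decorator is the whole trick: adding a capability means registering one more function, not teaching the model a new API surface.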
The architecture of agency
Our system, which we call k8s-agent, consists of three distinct layers. This isn’t a single script running top to bottom; it’s a cognitive architecture.
The cognitive layer (Codex + contextual instructions): This is the Codex CLI running a specific system prompt. We don’t fine-tune the model weights. Infrastructure moves too fast for fine-tuning: a model trained on Kubernetes v1.25 would be hallucinating by v1.30. Instead, we use context engineering, the art of designing the environment in which the AI operates. This involves tool design (creating atomic, deterministic functions), prompt architecture (structuring the system prompt), and knowledge architecture (deciding what information to hide or expose). We feed the model a persistent context file (AGENTS.md) that defines its persona: “You are a meticulous infrastructure auditor. Your goal is to ensure the CMDB accurately reflects the state of the Kubernetes cluster. You must prioritize safety: Do not delete records unless you have positive confirmation that they are orphans.”
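Condensed, the shape of such a context file might look like the following sketch (the persona is quoted from our prompt; the surrounding structure and remaining rules are a paraphrase for illustration):

```markdown
# AGENTS.md: k8s-agent operating instructions

## Persona
You are a meticulous infrastructure auditor. Your goal is to ensure the
CMDB accurately reflects the state of the Kubernetes cluster.

## Safety rules
- Prioritize safety: do not delete records unless you have positive
  confirmation that they are orphans.
- Never write to production directly; stage every change for review.
- Track workloads (Deployments, StatefulSets), never individual Pods.
```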
The tool layer: Using MCP, we expose deterministic Python functions to the agent.
- Sensors: k8s_list_workloads, cmdb_query_service, k8s_get_deployment_spec
- Actuators: cmdb_stage_create, cmdb_stage_update, cmdb_stage_delete
Note that we track workloads (Deployments, StatefulSets), not Pods. Pods are ephemeral; tracking them in a CMDB is an antipattern that creates noise. The agent understands this distinction, a semantic rule that’s hard to enforce in a rigid script.
The state layer (the safety net): LLMs are probabilistic; infrastructure must be deterministic. We bridge this gap with a staging pattern. The agent never writes directly to the production database. It writes to a staged diff. This allows a human (or a policy engine) to review the proposed changes before they are committed.
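The staging pattern can be sketched in a few lines, assuming an in-memory list stands in for the staging store and a callback stands in for the reviewer (all names here are illustrative, not our production code):

```python
# Minimal sketch of the staging safety net: the agent appends proposed
# diffs; a human or policy engine later commits or rejects each one.
staged: list[dict] = []
committed: dict[str, dict] = {"payment-processor-v1": {"kind": "Deployment"}}

def cmdb_stage_delete(record_id: str, comment: str) -> None:
    """The agent's actuator: propose a deletion, never perform it."""
    staged.append({"op": "delete", "id": record_id, "comment": comment})

def commit_staged(approve) -> None:
    """Apply only the diffs the reviewer approves; drop the rest."""
    for diff in staged:
        if approve(diff) and diff["op"] == "delete":
            committed.pop(diff["id"], None)
    staged.clear()

cmdb_stage_delete("payment-processor-v1",
                  "delete event confirmed in cluster history")
commit_staged(lambda diff: True)  # reviewer approves everything in this sketch
```

The key property is that the probabilistic component can only ever append to `staged`; mutation of `committed` sits behind a deterministic gate.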
The OODA Loop in Action
How does this differ from a standard sync script? A script follows a linear path: Connect → Fetch → Write. If any step fails or returns unexpected data, the script crashes or corrupts data. Our agent follows the Observe-Orient-Decide-Act (OODA) loop, popularized by military strategists. Unlike a linear script that executes blindly, the OODA loop forces the agent to pause and synthesize information before taking action. This cycle allows it to handle incomplete data, verify assumptions, and adapt to changing conditions, traits essential for operating in a distributed system.
Let’s walk through a real scenario we encountered during our pilot, the Ghost Deployment, to explore the benefits of using an OODA loop. A developer had deleted a deployment named payment-processor-v1 from the cluster but forgot to remove the record from the CMDB. A standard script might pull the list of deployments, see that payment-processor-v1 is missing, and immediately issue a DELETE to the database. The risk is obvious: What if the API server was just timing out? What if the script had a bug in its pagination logic? The script blindly destroys data based on the absence of evidence.
The agent approach is fundamentally different. First, it observes: calling k8s_list_workloads and cmdb_query_service and noticing the discrepancy. Second, it orients: checking its context instructions to “verify orphans before deletion” and deciding to call k8s_get_event_history. Third, it decides: seeing a “delete” event in the logs, it reasons that the resource is missing because there was a deletion event. Finally, it acts: calling cmdb_stage_delete with a comment confirming the deletion. The agent didn’t just sync data; it investigated. It handled the anomaly that usually breaks automation.
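The four steps above can be sketched as one pass of the loop. The sensor functions here are stubs returning canned data for the Ghost Deployment scenario; real ones would call the Kubernetes API and the CMDB through MCP, and the decision itself would be made by the model rather than hardcoded:

```python
# Stub sensors reproducing the Ghost Deployment state of the world.
def k8s_list_workloads():
    return ["checkout", "inventory"]          # payment-processor-v1 is gone

def cmdb_query_service():
    return ["checkout", "inventory", "payment-processor-v1"]

def k8s_get_event_history(name):
    return [{"type": "delete", "object": name}]

def ooda_pass():
    # Observe: compare the cluster's view with the CMDB's view.
    orphans = set(cmdb_query_service()) - set(k8s_list_workloads())
    actions = []
    for name in sorted(orphans):
        # Orient: the context instructions say "verify orphans before deletion".
        events = k8s_get_event_history(name)
        # Decide: absence of evidence is not enough; require a delete event.
        if any(e["type"] == "delete" for e in events):
            # Act: stage the deletion rather than executing it.
            actions.append(("cmdb_stage_delete", name))
    return actions
```

If the event history showed no delete event, the loop would stage nothing, which is exactly the API-timeout case a linear script gets wrong.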
Closing the “Semantic Gap”
This specific Kubernetes use case highlights a broader problem in IT operations: the “semantic gap.” The data in our infrastructure (JSON, YAML, logs) is full of implicit meaning. A label “env: production” changes the criticality of a resource. A status of CrashLoopBackOff means “broken,” but Completed means “finished successfully.” Traditional scripts require us to hardcode every permutation of this logic, resulting in thousands of lines of unmaintainable if/else statements. With the Codex CLI, we replace those thousands of lines of code with a few sentences of English in the system prompt: “Ignore Jobs that have completed successfully. Sync failing Jobs so we can track instability.” The LLM bridges the semantic gap. It understands what “instability” implies in the context of a Job status. We are describing our intent, and the agent is handling the implementation.
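For contrast, here is a sliver of the hardcoded status logic that the prompt sentence replaces. This fragment is illustrative (the status strings are examples, not an exhaustive or official list), and it only covers one resource type; a real sync script needs a branch like this for every kind of object it tracks:

```python
# The brittle approach: encode the meaning of each status by hand.
def should_sync_job(status: str) -> bool:
    if status == "Complete":
        return False          # finished successfully: ignore
    if status in ("Failed", "BackoffLimitExceeded", "DeadlineExceeded"):
        return True           # failing: sync so we can track instability
    if status in ("Running", "Pending"):
        return False          # transient: not CMDB-worthy yet
    return True               # every new status needs a new branch, forever
```

The final `return True` is the giveaway: a rigid script can only guess at statuses it has never seen, while the agent can reason about what an unfamiliar status implies.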
Scaling Beyond Kubernetes
We started with Kubernetes because it’s the “hard mode” of configuration management. In a production environment with thousands of workloads, things change constantly. A standard script sees a snapshot and often gets it wrong. An agent, however, can work through the complexity. It might run its OODA loop multiple times to resolve a single issue, checking logs, verifying dependencies, and confirming rules before it ever makes a change. This ability to chain reasoning steps allows it to handle the scale and uncertainty that break traditional automation.
But the pattern we established, agentic OODA loops via MCP, is universal. Once we proved the model worked for Pods and Services, we realized we could extend it. For legacy infrastructure, we could give the agent tools to SSH into Linux VMs. For SaaS management, we could give it access to Salesforce or GitHub APIs. For cloud governance, we can ask it to audit AWS Security Groups. The beauty of this architecture is that the “brain” (the Codex CLI) stays the same. To support a new environment, we don’t have to rewrite the engine; we just hand it a new set of tools. However, moving to an agentic model forces us to confront new trade-offs. The most immediate is cost versus context. We learned the hard way that you shouldn’t give the AI the raw YAML of a Kubernetes deployment: it consumes too many tokens and distracts the model with irrelevant details. Instead, you create a tool that returns a digest, a simplified JSON object with only the fields that matter. This is context optimization, and it’s the secret to running agents cost-effectively.
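A minimal sketch of the digest idea, assuming a raw deployment dict shaped like Kubernetes API output (the particular fields kept here are our own illustrative choices):

```python
# Reduce a raw deployment object to a token-cheap digest: keep only the
# fields the agent actually reasons about, drop everything else.
def deployment_digest(raw: dict) -> dict:
    meta = raw.get("metadata", {})
    spec = raw.get("spec", {})
    status = raw.get("status", {})
    containers = spec.get("template", {}).get("spec", {}).get("containers", [{}])
    return {
        "name": meta.get("name"),
        "namespace": meta.get("namespace"),
        "env": meta.get("labels", {}).get("env"),
        "replicas_desired": spec.get("replicas"),
        "replicas_ready": status.get("readyReplicas", 0),
        "image": containers[0].get("image"),
    }

raw = {
    "metadata": {"name": "checkout", "namespace": "shop",
                 "labels": {"env": "production"}},
    "spec": {"replicas": 3,
             "template": {"spec": {"containers": [{"image": "shop/checkout:1.4"}]}}},
    "status": {"readyReplicas": 3},
    # ...dozens of other fields (managedFields, annotations) are dropped
}
digest = deployment_digest(raw)
```

Six fields instead of a multi-kilobyte manifest: the agent still sees name, environment, and replica health, but pays a fraction of the token cost per observation.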
Conclusion: The Human in the Cockpit
There’s a fear that AI will replace the DevOps engineer. Our experience with the Codex CLI suggests the opposite. This technology doesn’t remove the human; it elevates them. It promotes the engineer from a “script writer” to a “mission commander.” The stale CMDB was never really a data problem; it was a labor problem. It was simply too much work for humans to track manually and too complex for simple scripts to automate. By introducing an agent that can reason, we finally have a mechanism capable of keeping up with the cloud.
We started with a small Kubernetes cluster. But the destination is an infrastructure that’s self-documenting, self-healing, and fundamentally intelligible. The era of the brittle sync script is over. The era of infrastructure as intent has begun!