
Anatomy of a Production AI Agent

Sam Alba · 9 min read

Last month, a team running 200K+ CI jobs per week asked us why they shouldn't just point Claude Code at their failing builds. Fair question. We use Claude Code ourselves every day, and we love it. But after watching Mendral close 16,000+ CI investigations a month autonomously, we can explain exactly why a specialist agent outperforms a generalist one, even when both run on the same LLMs.

This post is about how we built Mendral's agent architecture, the technical decisions behind it, and the principles we've learned shipping an AI agent in production for teams like PostHog and Luminai.

Why we're building this

Coding agents are accelerating the creation of code. That's great for shipping features. It's terrible for CI.

The teams we work with are seeing it firsthand: adopting AI coding tools means significantly more CI activity. More PRs, more test runs, more failures surfacing. Pipelines are slower because there's more code being tested. Flaky tests that were annoying at 10 engineers become a tax on everyone's productivity at 100. And the engineers generating all that code with Copilot and Claude Code aren't the ones debugging the CI failures. They've already moved on.

We spent a decade building and scaling CI systems at Docker and Dagger. The work was always the same: stare at logs, correlate failures, figure out what changed. Mendral is the agent we wished we'd had back then.

Specialist vs. generalist

Claude Code is a generalist software engineer. Mendral is a specialist. Despite running on the same Anthropic models, Mendral consistently outperforms Claude Code at diagnosing and fixing CI failures.

The difference isn't the model. It's the context.

Claude Code sees your codebase. It can read files, run commands, and reason about code. But when a CI job fails, the useful signal isn't just in the code. It's in the logs from this run, the logs from the last 50 runs, the test execution history, the failure patterns across branches, the infrastructure conditions at the time of execution. Claude Code doesn't have access to any of that.

Mendral does. We built a log ingestion pipeline that processes billions of CI log lines per week into ClickHouse, compressed at 35:1, queryable in milliseconds. Our agent writes its own SQL queries to investigate failures. A typical investigation scans 335K rows across 3+ queries. At P95, it scans 940 million rows. The agent can trace a flaky test back to a dependency bump three weeks ago by correlating across hundreds of CI runs simultaneously, something no human would have the patience to do.

The whole implementation is ours, from the system prompt to every tool. Our agent can grab specific logs from a run, query historical failure rates across months, trace which commit introduced a regression, check if a test has been flaky on other branches, and cross-reference all of this in seconds. Claude Code can't do any of that because it doesn't have the tools or the data.

One agent to the customer, a team of agents behind the scenes

From the outside, Mendral is one agent. You install a GitHub App, it joins your Slack, and it starts investigating CI failures. Internally, it's a team of specialized agents coordinating through our Go backend.

Mendral's multi-agent architecture: triggers, model routing, tools, and outputs wrapped in durable execution

We use all three Anthropic model tiers: Haiku, Sonnet, and Opus. Not because we want to use everything available, but because different tasks have fundamentally different cognitive demands, and using the wrong model for a task is either wasteful or insufficient.

Opus handles root cause analysis and implementation. When the agent needs to form a hypothesis about why a test is failing, reason about complex interactions between test suites, or write a non-trivial fix that touches CI configuration and test code simultaneously, Opus takes over. This is where the frontier model makes a measurable difference. The cost is higher, but for root cause analysis, the quality gap justifies it.

Sonnet collects facts and deduplicates issues. It reads logs, writes SQL queries, gathers evidence from the repository, and correlates failures with code changes. Sonnet is the right balance of intelligence and cost for structured, evidence-gathering work.

Haiku handles log parsing and data extraction. Classifying failure types, formatting structured output, extracting relevant snippets from raw logs. These are tasks where the solution space is constrained and we need throughput. We process thousands of these per day.

How we route between models is an area we're actively iterating on. We review model assignments regularly as models improve. Work that required Sonnet six months ago sometimes runs fine on Haiku today. We'll write more about our model routing architecture in a future post.

This multi-agent architecture means we can keep costs predictable while delivering quality where it matters. A full CI investigation might involve a dozen sub-agent calls across all three tiers.

The agent loop

Our agent loop runs on our Go backend. This is a deliberate choice. We don't use LangChain, LangGraph, or any off-the-shelf agent framework. The loop is ours, written in Go, because we need full control over execution, concurrency, and failure handling.

The core loop is straightforward: the agent receives a trigger (a CI failure, a Slack message, a scheduled analysis), assembles context, makes an LLM call, processes tool calls, and iterates until it reaches a conclusion or exhausts its budget. Each iteration is an LLM call with the accumulated context and available tools.

Some tools are pure Go functions. Querying ClickHouse, fetching GitHub metadata, looking up repository structure, checking PR status. These are fast, deterministic operations that don't need isolation. They run in-process.

Some tools require a sandbox. When the agent needs to clone a repository, run tests, apply patches, or execute arbitrary code to validate a fix, it needs an isolated environment. We provision Firecracker microVMs on Blaxel for this. Each sandbox is a lightweight VM with its own kernel, providing hardware-level isolation between tenants. The sandbox boots in under 125ms, the agent operates on it, and when the session ends, the sandbox is destroyed. No data leaks between customers, no shared kernel vulnerabilities.

Between tool calls, the sandbox is suspended. The agent doesn't hold compute while it's thinking. When the LLM returns the next tool call, the sandbox resumes in under 25ms with full filesystem and memory state preserved. This matters because a single investigation can involve 10+ tool calls with LLM reasoning in between. Paying for idle compute during LLM inference would be wasteful. Suspend and resume eliminates that cost.

There's another pattern specific to CI: the agent sometimes needs to wait hours for a pipeline to complete after pushing a fix. The sandbox suspends during that wait. When CI finishes and the agent needs to verify the result, the sandbox resumes with full state intact. Without suspend/resume, you'd either pay for hours of idle compute or lose the entire execution context and start over.

This architecture means the agent gets the security of full VM isolation, the performance of near-instant resume, and we only pay for compute when tools are actually executing.

LLMs are messy. Plan for it.

Here's the thing nobody tells you about building agents in production: the models are the easy part. Everything around them is hard.

LLM APIs are slow. A single Sonnet call takes 2-10 seconds depending on context size. Opus can take 30+ seconds for complex reasoning. Tool calls hit external APIs (GitHub, Slack, ClickHouse) that have their own latency and failure modes. A single CI investigation involves 10-20 LLM calls and 30-50 tool executions. The whole chain takes minutes, and any step can fail.

An LLM call that fails costs you the entire accumulated context if you have to start over. A GitHub API call that times out after you've already spent 30 seconds on an Opus reasoning step is expensive to retry from scratch. The failure modes compound: rate limits, network timeouts, API errors, malformed LLM output, context window overflows.

We solve this with durable execution. Both our agent loop and our data ingestion pipeline run on Inngest, a durable execution engine. Every meaningful operation is a step that can be retried independently. If a GitHub API call fails on step 7 of a 15-step investigation, we retry step 7, not the entire investigation. The state of all previous steps is persisted and memoized.

This is critical for agent reliability. Without durable execution, you need to build your own retry logic, state recovery, and deduplication for every function. Every interrupted operation needs to be reconciled. With Inngest, a rate limit response from GitHub is just a pause. We read the Retry-After header, add jitter to avoid thundering herd, and suspend execution. When the wait is over, the function resumes at exactly the point it left off. No re-initialization, no duplicate work.

The practical effect: our agent doesn't crash on transient failures. It doesn't re-do expensive LLM calls because a downstream API hiccupped. It doesn't lose state when infrastructure restarts. It just picks up where it left off.

We break our agent functions into steps at every boundary that can fail: LLM calls, API calls, database writes, sandbox operations. Each step is individually retried with configurable backoff. This is the difference between an agent that works in demos and one that runs reliably on production CI for teams processing hundreds of thousands of jobs per week.

A single Mendral investigation traced in Inngest. Each step is independently retried. LLM calls take 3-8 seconds. Tool calls return in under a second.

A decade of CI expertise, encoded

A common mistake in agent development is assuming that a powerful model plus the right tools equals good performance. It doesn't. The system prompt matters, but it's not the whole picture. Prompts, tools, and the data you feed the model all work together. The expertise isn't in any one of these. It's in how they combine.

We spent a decade debugging CI at Docker and Dagger. We know the patterns. Race conditions in parallel test execution. Shared state between test suites causing order-dependent failures. Infrastructure variance causing timing-sensitive tests to fail on slower runners. Dependency resolution differences between CI and local environments. Cache invalidation bugs that surface only under specific build orders.

All of that is encoded in Mendral. In the prompts, in the tools, in what data gets retrieved and when. The agent knows to check for these patterns because we built it to, based on years of doing it manually. It knows that a test failing intermittently on CI but passing locally is almost never "random." It knows to look at resource constraints, concurrent test execution, and shared state before blaming the code. It knows that a sudden spike in failures after a dependency bump is likely a transitive dependency issue, not a flake.

This is what makes Mendral feel like hiring an experienced Platform Engineer rather than pointing a general-purpose AI at your CI. The prompts encode judgment that takes years to develop.

And we keep refining them. Every session Mendral runs, the customer can provide a thumbs up or thumbs down on the result. We track these signals across all sessions and use them to identify where the agent's reasoning breaks down. When we see a pattern of negative feedback on a specific failure type, we update the prompts and tools to handle it better. This is continuous improvement driven by real production data, not synthetic benchmarks.

We're currently helping teams like PostHog (575K CI jobs per week, 1.18 billion log lines), Korint, and Stockline keep their pipelines healthy. Each of these teams pushes the agent in different ways, surfacing edge cases we couldn't have anticipated. The prompts and tools get better every week because we're learning from real failures at real scale.

Over time, this compounds. Mendral evolves into the most experienced DevOps engineer you can hire. It works nights and weekends, it never context-switches away from your CI, and it costs a fraction of what a human Platform Engineer costs. Not because the models are magic, but because we're continuously encoding expertise into a system that applies it consistently across every failure, every commit, every repository.

Observability is not optional

You can't improve what you can't see. Every LLM call, every tool execution, every decision point in the agent loop is traced. We log the full prompt, the response, the tool calls, the results, and the time each step took. When a session produces a bad diagnosis, we can replay the exact sequence of decisions and identify where the reasoning went wrong.

This is table stakes for production agents. Without it, debugging is guesswork. "The agent gave a bad answer" is not actionable. "The agent queried failure rates for the wrong time window at step 4, which caused it to misclassify a regression as a flake at step 7" is actionable.

We version our prompts and tools together. A prompt change ships with corresponding tool changes and evaluation results. If a new prompt version causes a regression in diagnosis quality, we can pin it to the exact change and roll back.

What we'd tell you if you're building an agent

If you're building a production agent, here's what we've learned:

Use the right model for each task. Don't default to the biggest model. Most agent work is structured and repetitive. Use Haiku or Sonnet for 80% of it and route to Opus only when the task genuinely requires deeper reasoning. Review your model assignments regularly as models improve.

Own your agent loop. Frameworks are fine for prototypes. In production, you need control over execution flow, concurrency, error handling, and context management. Write the loop yourself.

Invest in durable execution early. Retrying at the function level instead of restarting entire workflows saves enormous time and cost. Every external call can fail. Plan for it from day one.

Sandbox anything that touches untrusted code. If your agent clones repos, runs tests, or applies patches, it needs VM-level isolation. Containers share a kernel. Firecracker microVMs don't. The security difference matters when you're running across multiple customer tenants.

Your domain expertise is your product. The model is a commodity. What matters is the combination of tools, prompts, and data that you build around it. Version your prompts and tools together. Test them. Improve them continuously based on production feedback.

Build feedback loops from day one. Thumbs up/down on agent sessions gives you a continuous signal for improvement. Without it, you're flying blind.

Human software engineers should stay focused on building the product. Let a specialist agent handle the CI.


We're building Mendral (YC W26). We spent a decade building and scaling CI systems at Docker and Dagger. Mendral is an always-on AI DevOps engineer that diagnoses CI failures, catches flaky tests, and opens PRs with fixes. If your team is burning engineering time on CI issues, we'd love to look at your setup.

Get started.