The Product Context Chain: Why AI Coding Agents Forget Everything That Matters

An AI agent wiped a startup’s production database in 9 seconds. The model wasn’t broken — it just had no idea what “production” meant in that company’s world. That’s the missing layer in AI-assisted dev, and bigger context windows won’t fix it.

Mark Roach, Founder, Codara · 5 min read

In April 2026, a Cursor agent running Claude Opus 4.6 wiped a startup's entire production database — and the volume-level backups — in a single API call. The whole thing took nine seconds. The agent had been asked to fix a credential mismatch in staging. The token it was holding was overscoped, and the cleanest path to the goal, as the model saw it, was to delete the production volume and rebuild.

The model was working exactly as designed. It simply had no idea that “staging” and “production” were different things that mattered in that company's world.

This is the central failure mode of AI-assisted software development in 2026, and a better model won't fix it.

TL;DR

  • AI agents see code. They don't see the chain of decisions that produced the code — initiatives, specs, designs, architecture decisions. That chain is where intent lives.
  • Bigger context windows, MCP, and “let the agent search the wiki” don't fix it. Product context lives in tools the agent has no structured path to.
  • The fix is a connected data model — a product context chain — that every agent reads end to end. That's what we're building.

The gap is bigger than it feels

In July 2025, METR, a nonprofit that evaluates frontier AI models, paid 16 experienced open-source maintainers to complete 246 tasks on repositories they already knew well. Half the tasks were done with Cursor Pro and Claude Sonnet; the other half unaided.

Before starting, the developers predicted AI would speed them up by 24%. After completing the tasks, they reported they'd been 20% faster. They were 19% slower.

AI coding tools earn their place. The trouble is that the gap between how productive they feel and how productive they are is wide, and it is structural. Generating code feels fast. The slow part comes afterward: reviewing code that doesn't fit the codebase and unwinding the wrong assumptions it baked in. That work never registers as effort. It just quietly eats the afternoon.

The everyday version is duller than a deleted database. It's the PR where the agent reimplemented logic that already existed two services away. The diff that contradicted a decision the team made three sprints ago. The “clean” refactor that quietly broke a constraint the legal team added in a Slack thread last quarter. In every case the agent had the intelligence to write the code and no way to reach the context that would have made it the right code.

Context engineering, practiced wrong

“Context engineering” is the consensus name for the discipline that's replacing “prompt engineering.” Anthropic's own engineering team puts it bluntly: “Claude is already smart enough — intelligence is not the bottleneck, context is.”

In practice, the work under that heading mostly means writing better CLAUDE.md files, tuning the agent harness, adding a retrieval step, wiring up a few MCP servers. All of that helps at the margins. The root cause sits upstream of all of it.

The information the agent needs doesn't live in the code. It lives upstream: the initiative the work belongs to, the spec the PM wrote, the design the designer signed off on, the ADR from three months ago, the comment thread where scope quietly shifted. Every tool that holds that information was built for a human reader, someone who would go read Slack, ask a colleague, dig through commits, and make a call.

The product context chain

Every meaningful piece of software work follows the same chain. The labels vary; the structure doesn't:

  1. An initiative. A funded bet on a business outcome.
  2. A product spec. Problem framing, user, success metrics, scope, acceptance criteria.
  3. A design. Decisions about surface, flow, content. Has a version that's the agreed one and several that aren't.
  4. A technical design or ADR. Architectural tradeoffs. The thing you write so the team doesn't relitigate the same question in two months.
  5. A story. The scoped unit of work. Acceptance criteria, owner, a thread of comments where scope shifted.
  6. Code. The artifact. The thing AI coding agents currently operate on.

Each layer constrains the one below it and records the decisions the next layer is supposed to honor. The chain is how a team turns a business goal into code that does what they meant.

AI coding agents see layer 6. Maybe a slice of layer 5, if you remembered to paste it in. The rest stays out of view. So they reimplement logic that already exists two directories over, make tradeoffs the team already weighed and rejected, and break invariants nobody wrote down because everyone in the room already knew them. Hand the model the same five layers a senior engineer carries into the task, and the output changes. Today it codes without them, and it shows.
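
To make the structure concrete, here is a minimal sketch of the chain as a linked data model. It is an illustration under stated assumptions, not a description of any real product's schema: the `Artifact` type, its field names, the two helper functions, and all the example data are hypothetical. It shows only the core idea from this section: each layer points at the layer that constrains it, so starting from a piece of code you can walk upward and collect the decisions it is supposed to honor.

```python
from dataclasses import dataclass
from typing import Optional

# The six layers named in the chain above, ordered top to bottom.
LAYERS = ("initiative", "spec", "design", "adr", "story", "code")


@dataclass(frozen=True)
class Artifact:
    """One node in the chain. All field names here are illustrative."""
    layer: str                            # one of LAYERS
    title: str
    decisions: tuple[str, ...] = ()       # decisions the layers below must honor
    parent: Optional["Artifact"] = None   # the layer that constrains this one

    def __post_init__(self) -> None:
        if self.layer not in LAYERS:
            raise ValueError(f"unknown layer: {self.layer!r}")


def context_chain(artifact: Artifact) -> list[Artifact]:
    """Follow parent links upward and return the chain from initiative down."""
    chain: list[Artifact] = []
    node: Optional[Artifact] = artifact
    while node is not None:
        chain.append(node)
        node = node.parent
    chain.reverse()
    return chain


def inherited_decisions(artifact: Artifact) -> list[str]:
    """Collect every decision recorded at or above this artifact, top-down."""
    return [
        f"[{a.layer}] {d}"
        for a in context_chain(artifact)
        for d in a.decisions
    ]


# Hypothetical example data, invented for illustration.
initiative = Artifact("initiative", "Cut checkout drop-off")
spec = Artifact("spec", "One-page checkout", ("Guest checkout stays in scope",), initiative)
design = Artifact("design", "Checkout v3 (agreed)", (), spec)
adr = Artifact("adr", "Reuse the payments service", ("Do not duplicate tax logic",), design)
story = Artifact("story", "Add address autocomplete", (), adr)
code = Artifact("code", "checkout/address.py", (), story)

for line in inherited_decisions(code):
    print(line)
```

Run as written, this prints the two decisions recorded upstream of the code, one from the spec and one from the ADR. The point of the sketch is the shape, not the details: an agent handed only `code` sees none of those decisions, while an agent handed the chain sees all of them without searching for them.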

The serious objections don't hold

“Bigger context windows will solve this.” Anthropic ships a 1M-token Claude. Gemini ships 2M. The natural assumption is that we're a model generation or two from windows big enough to hold an org's entire context.

Chroma's 2025 “context rot” research shows every frontier model degrading as input length grows, even on tasks well within the nominal window. Lost-in-the-middle is a property of how attention works over long sequences, and it is not something you buy your way out of with more tokens. A bigger window also says nothing about what belongs in it. Dump the whole company Confluence into every prompt and the model drowns.

“Stripe ships 1,300 AI PRs a week. The problem must be solvable.” This is the most important objection, because it's grounded in a working example. Stripe's autonomous “Minions” produce roughly 1,300 PRs per week across their codebase.

It works because Stripe is Stripe. Their agents run in cloud dev environments with internal-tool access already wired in. A custom harness, forked from the open-source Goose project, injects internal-tool context into every session. A decade of testing, CI/CD, ADRs, and house conventions gets pre-loaded before the agent writes a line. Stripe paid the context-engineering tax years ago, by hand. The Minions inherit the receipt.

Stripe's agents working was never the surprising part. What's surprising is that anyone expects the same outcome from buying Cursor and bolting on a Jira MCP server. Stripe is the proof of the thesis. The leverage came from the context chain they built by hand; the model just spends it.

The real moat

It's tempting to file this under problems the next model release will solve. Each generation widens the window and sharpens the reasoning, and eventually, the assumption goes, the agent just works out what we meant on its own.

It won't. The models will keep improving, and it still won't, because the information was never in the training data or the codebase to begin with, and scale won't conjure it into the context window. It lives in the decisions your team made. Until those decisions are captured and connected somewhere the agent can reach them, the agent keeps doing what it does today: shipping plausible code that misses what you meant.

The previous generation of SDLC tools was good enough because humans were the compensating layer. People held the chain in their heads. Take the humans out of the loop, or just hand the typing to agents, and the gaps in those tools turn from minor friction into the place where production disappears in nine seconds.

Whoever builds the connected data model first gets a context chain their agents can read end to end. No model release closes that gap.

That's the bet.
