Back to blog

How-to

How to Write a Spec an AI Coding Agent Won’t Misinterpret

AI agents fill gaps in a spec by guessing. This is the six-part template — problem, success criteria, acceptance tests, constraints, out of scope, prior decisions — with the weak version vs. the strong version for each.

Mark RoachFounder, Codara7 min read

Here is how to write a spec for an AI coding agent that it won't misinterpret: write down the things you assume it already knows. An agent will not ask. When your spec leaves a gap, the model fills it with whatever its training data says is most likely, and “most likely in general” is rarely “correct in your codebase.” The template below closes the gaps a human reader would have closed on their own.

This is the practical follow-on to our argument about the product context chain. That post explains why agents miss intent. This one is the part you can act on this afternoon, with no new tooling: six sections, and for each one, the weak version that an agent gets wrong next to the strong version that pins it down.

TL;DR

  • A human reading a thin ticket asks around. An agent can't, so it guesses. A 2026 benchmark found that even strong coding models “step into the generator role” and write code rather than ask when a request is ambiguous.
  • What works is a short spec that names the few things that are load-bearing and easy to get wrong. Piling on requirements backfires — it has its own failure mode, the “curse of instructions.”
  • Six sections cover almost every gap: problem, success criteria, acceptance tests, named constraints, out of scope, and prior decisions to inherit.
  • Write acceptance tests an agent can run, not adjectives it has to interpret. “Returns 429 after the 6th request in 60s” beats “handles rate limiting gracefully.”

Why agents misread specs in the first place

Two failure modes, pulling in opposite directions.

The first is under-specification. When a request is ambiguous, a coding agent does not pause to ask what you meant. A 2026 benchmark called ClarEval tested eleven state-of-the-art code agents under deliberately ambiguous instructions and found that strong coders like GPT-5-Coder mostly skip the clarifying question and generate code directly. The model treats your missing detail as a detail it gets to invent. Most of the time the invention is plausible, compiles, and is wrong in a way you only catch at review.

The second is over-specification. The instinct after a bad result is to write more rules. But a January 2026 paper on the interference between instruction-following and task-solving showed that adding even self-evident constraints to a prompt measurably degraded performance on code generation, math, and multi-hop QA, including for Claude Sonnet 4.5. The failed runs spent more of their attention on the constraints and less on the task. A wall of instructions is its own kind of noise.

So the job of a good spec is narrow. Name the things a competent engineer would know that the model doesn't. Skip the things the model already handles. The six sections below are sorted roughly by how often the gap bites.

The six-part spec, weak version vs. strong version

Here is the whole template at a glance. The rest of the post works through each row with a worked example. The examples are illustrative, not from a real ticket.

SectionWeak version (agent guesses)Strong version (agent obeys)
Problem“Improve checkout performance.”“Checkout p95 is 3.1s; target under 1s. The bottleneck is N+1 tax queries.”
Success criteria“Make it fast and reliable.”“p95 < 1s at 200 rps; no increase in error rate.”
Acceptance tests“Handle errors gracefully.”“On tax-service timeout, return cached rate and log a warning; never block checkout.”
Constraints(unstated)“Don't add a caching dependency; use the existing Redis client in lib/cache.”
Out of scope(unstated)“Do not touch the tax provider's API contract or the checkout UI.”
Prior decisions(unstated)“We rejected a CDN edge cache in ADR-114 for compliance reasons. Don't reintroduce it.”

1. Problem statement

State what is wrong and how you know, with a number. The weak version, “improve checkout performance,” tells the agent the topic and nothing else. It will pick a plausible target — caching, a query rewrite, maybe a whole new service — and optimize for it. If the real issue is an N+1 query on tax lookups, a generic caching layer is wasted work that also adds a dependency you didn't want.

The strong version names the symptom (“p95 is 3.1s”) and the diagnosis (“N+1 tax queries”), then sets the target. Now the agent is solving your problem instead of a problem shaped like your sentence.

2. Success criteria

Success criteria are the measurable conditions that decide whether the work is done. The test is simple: could two people disagree about whether you met it? “Fast and reliable” fails — fast is a number nobody wrote down. “p95 under 1s at 200 rps with no rise in error rate” passes, because you can run it and get a yes or a no.

This is the section that prevents the agent from declaring victory early. Without a threshold, a model that shaved 200ms off a 3-second request will happily report that it made checkout faster. Technically true. Useless.

3. Acceptance tests

This section does the most work, and it's the one teams skip. Acceptance tests describe observable behavior at the edges — what happens on the unhappy path, when a dependency times out, when the input is empty, when two requests race — written as concrete input/output pairs the agent can turn into real, runnable tests instead of adjectives it has to interpret. Write the edge cases down. That is where the guessing happens.

“Handle errors gracefully” is an adjective. The agent will guess what graceful means — maybe a retry loop, maybe a swallowed exception. “On a tax-service timeout, return the last cached rate and log a warning; never block the checkout” is behavior. There is exactly one way to read it, and you can verify it ran.

This maps onto the older agile idea of a definition of done, and the mapping is worth making explicit for an agent. As one engineer put it when adapting DoR/DoD for coding agents, acceptance criteria should say what the code should do, what it should not do, and what happens when an external dependency fails. The third clause is the one humans leave in their heads and agents never recover.

4. Named constraints

Constraints are the boundaries the agent should treat as fixed: which library to use, which directory to put things in, what it must never touch. GitHub's engineering team analyzed more than 2,500 agents.md files and found a clear pattern: the most effective ones spell out commands, project structure, and explicit boundaries, and the single most common helpful constraint was simply “never commit secrets.” (Published November 2025.)

Their other finding is the useful framing for a spec: sort instructions into always do, ask first, and never do. That gives the model a decision rule instead of a paragraph it has to parse for intent. “Use the existing Redis client in lib/cache; never add a new caching dependency” is a constraint the agent can follow without judgment. Left unsaid, it'll reach for whatever package its training data favors.

5. Explicit out of scope

A human knows not to refactor the auth module while fixing a checkout query, because they know it's someone else's in-flight work. The agent doesn't. Left to its own sense of tidiness, it will “helpfully” reformat files, rename variables across a module, or upgrade a dependency it noticed was old — and bury your one-line fix in a 300-line diff.

One or two lines of out-of-scope shrink the blast radius. “Do not change the tax provider's API contract or touch the checkout UI” tells the agent where the fence is. This is also what makes the resulting PR reviewable, which is where the real time goes — see our note on the felt-vs-measured productivity gap.

6. Prior decisions to inherit

The last section is the one the other tools can't hold. Every codebase carries decisions that were argued out once and written down somewhere the agent can't reach — an ADR, a Slack thread, a comment on a closed ticket. The agent has no memory of them and, as we've argued, no way to follow the pointer to where they live.

So inline the verdict. “We rejected a CDN edge cache in ADR-114 for compliance reasons; don't reintroduce it” saves the agent from re-proposing the exact solution your team already killed. You don't need to paste the whole ADR. You need the one sentence that stops the rerun.

How long should the spec be?

Shorter than you fear, longer than a one-liner. The instruction-interference researchsays every extra constraint costs some attention, so don't pad. Addy Osmani's test in his O'Reilly piece on specs for agents is the right bar: a good spec is one where the reader could build the thing “without asking a single clarifying question.” Aim for that, and stop. Most useful specs for a scoped change land in a screen or two, not a document.

Questions people ask

Should I just ask the agent to write the spec?

Use it to draft, then edit hard. An agent will happily expand your one-liner into a tidy SPEC.md, and the structure it produces is usually fine. What it can't supply is the part that matters: your constraints, your out-of-scope, and the prior decisions it has no access to. Treat the generated spec as a skeleton you fill with the things only you know.

What's the difference between success criteria and acceptance tests?

Success criteria are the conditions that define done for this task (“p95 under 1s”). Acceptance tests are the specific, runnable checks that prove each one, especially on the unhappy path. Success criteria are the what; acceptance tests are the how you'll verify it. An agent needs both: the target, and the runnable definition of hitting it.

Does this replace a PRD?

No. A PRD frames the product bet for humans. This spec is the scoped, machine-readable unit a coding agent acts on — closer to a well-formed story than a strategy doc. The PRD sits a layer up; the agent works from the spec the PRD eventually produces.

Where should the spec live?

Wherever the agent reads from. Many teams keep a SPEC.mdin the repo and feed sections in per session, because it survives between runs. The deeper problem is that your constraints and prior decisions live in five other tools the agent can't see, which is why you end up retyping them into every ticket. That's the structural issue we're building Codara to fix.

The one-line test

Before you hand a spec to an agent, read it as if you were the agent: you have no memory of this project, you can't ask anyone, and you have to start typing now. Every place you find yourself wanting to ask a question is a place the agent will instead guess. Answer those questions in the spec and the guessing stops.

That's the whole discipline. Specs don't fail because they're too short. They fail because they assume a reader who fills the gaps, and the reader changed.

Codara

AI agents that share one product context, end to end.

We're rolling out access to early teams. Tell us your stack when you sign up and we'll prioritise the ones we can help fastest.