The Ticket Isn’t Enough: 21 Real Tasks on What AI Agents Actually Need
We ran the same model on 21 real tasks twice — bare ticket vs. the upstream decisions behind it — and blind-graded both against the merged PR. From the ticket alone, the agent shipped a problem on 20 of 21. Most failures compiled and looked done.
One of our agents was asked to add storage and verification for a long-lived, high-privilege API token. It wrote a clean handler. It returned the secret once, kept only a hash, validated the input, wrote an audit entry. It compiled. A tired reviewer would have approved it. The hash was a plain SHA-256 digest: fast, unsalted, exactly the thing you do not use for a credential that lives for months. Leak that one row and nothing slows an offline attack on it: no salt, no work factor, just raw hash speed.
Nothing in the diff looked wrong. That is the whole problem. The agent had the ticket and the codebase. What it did not have was the one decision that mattered — this class of secret is hashed with an adaptive, salted algorithm, and the convention was sitting one folder away in a sibling token module. The bare ticket never said so, and the agent never went looking.
So we ran the experiment properly. Twenty-one real tasks, the same model on each one twice: once with the bare ticket, once with the upstream context. Then we graded both, blind, against the diff the team actually shipped. Here is what came back.
TL;DR
- We gave the same coding agent 21 real tasks from our own codebase, each run twice — bare ticket vs. a full-context decisions brief — and blind-graded both against the merged PR.
- The bare-ticket agent shipped a problem on 20 of 21 tasks (95%): 14 subtly wrong, 6 outright broken. The full-context agent passed cleanly on 76%. Context won 18 tasks, tied 2, lost 1.
- Most failures compiled and looked finished. The dangerous ones clustered in security and data-model work. But context is not magic — it only carries what you put in it, and on one task it made the agent worse.
Do AI coding agents need context, or is a good ticket enough?
On this evidence, a good ticket is not enough. A frontier coding agent given only the ticket produced output that compiled but failed on intent on 20 of 21 logic-bearing tasks. Hand the same agent the upstream decisions behind the work and the clean-pass rate went from 5% to 76%. The bottleneck was not the model's ability to write code. It was that the ticket does not contain the decisions the code is supposed to honor.
This is the testable version of an argument we've made before — that AI agents see the code but not the chain of decisions that produced it. That post was the theory. This one is us trying to break it with data from our own backlog.
What we actually did
We took 21 merged, human-reviewed pull requests from the Codara monorepo and treated each merged diff as ground truth: the answer the team converged on after review, which is a fairer bar than “one true solution.” Then for each task we ran the agent from the exact parent commit, twice:
- Condition A — bare ticket. The terse ask, plus full read access to the repository. Nothing else.
- Condition B — full context. The same ask, plus a short brief of the upstream decisions and conventions the work had to respect. The brief carried the why and the constraints, never a recipe or the answer.
Same model, same starting commit, A/B order randomized per task. A grading agent then scored every output blind, without knowing which condition produced which diff, against five dimensions: does it solve the stated problem, does it match the decision the team shipped, does it honor prior decisions, does it avoid rebuilding code that already exists, and does it stay in scope. An output that compiled but failed any of those we call subtly wrong: it looks shippable and survives a careless review, but it misses what the team meant.
Selection was mechanical and reverse-chronological over right-sized feat/fix PRs, narrowed to logic-bearing work: auth, the API gateway, web behavior. We dropped pure config and design-token plumbing on purpose, because a task with no hidden intent gives a bare ticket nothing to lose.
The full method, the task list, and the ground-truth commit hashes live in a protocol we froze before collecting any data. More on the caveats below — there are real ones, and they matter.
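The two conditions and the blind hand-off to the grader can be sketched in a few lines of Python. This is an illustration of the protocol as described above, not our actual harness: the `Task` fields, the function names, and the X/Y labels are all assumptions, and treating a full pass as "all five dimensions pass" is our reading of the rubric.

```python
import random
from dataclasses import dataclass, field

# The five rubric dimensions named in the text. The identifiers are illustrative.
DIMENSIONS = (
    "solves_stated_problem",
    "matches_shipped_decision",
    "honors_prior_decisions",
    "avoids_rebuilding_existing_code",
    "stays_in_scope",
)


@dataclass
class Task:
    task_id: str
    ticket: str  # the terse ask, identical in both conditions
    decisions: list[str] = field(default_factory=list)  # the why and the constraints


def build_prompt(task: Task, condition: str) -> str:
    """Condition A gets the ticket alone; condition B gets the ticket plus the brief."""
    if condition == "A":
        return task.ticket
    brief = "\n".join(f"- {d}" for d in task.decisions)
    return f"{task.ticket}\n\nUpstream decisions and conventions:\n{brief}"


def run_order(rng: random.Random) -> list[str]:
    """Randomize which condition runs first for a given task."""
    order = ["A", "B"]
    rng.shuffle(order)
    return order


def blind(diffs: dict[str, str], rng: random.Random) -> tuple[dict[str, str], dict[str, str]]:
    """Relabel the two diffs as X and Y so the grader cannot see the condition.

    Returns (what the grader sees, the key held back for unblinding).
    """
    conditions = list(diffs)
    rng.shuffle(conditions)
    key = dict(zip(("X", "Y"), conditions))
    shown = {label: diffs[cond] for label, cond in key.items()}
    return shown, key


def is_full_pass(scores: dict[str, bool]) -> bool:
    """Assumed reading of the rubric: a full pass needs all five dimensions."""
    return all(scores[d] for d in DIMENSIONS)
```

The point of keeping `key` separate from `shown` is that the grader only ever receives the relabeled diffs; the mapping back to A and B is applied after scoring.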
The headline numbers
Across 21 tasks, the gap was not subtle. The bare-ticket agent fully passed exactly one task. The full-context agent fully passed sixteen.
| Condition | Full pass | Subtly wrong | Outright broken |
|---|---|---|---|
| A — bare ticket | 1 (5%) | 14 (67%) | 6 (29%) |
| B — full context | 16 (76%) | 4 (19%) | 1 (5%) |
Head-to-head, context won 18 of the 21 tasks, tied 2, and lost 1. The single most worrying line in that table is the bare-ticket “subtly wrong” column: 14 outputs that compiled, looked done, and were wrong about something the team had already decided.
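For anyone who wants to check the arithmetic, the percentages in the table follow directly from the raw counts. The short Python sketch below recomputes them; the dictionary keys and the `pct` helper are our own names, and the counts are copied from the table.

```python
TOTAL_TASKS = 21

# Counts copied from the table above.
counts = {
    "A (bare ticket)": {"full_pass": 1, "subtly_wrong": 14, "outright_broken": 6},
    "B (full context)": {"full_pass": 16, "subtly_wrong": 4, "outright_broken": 1},
}
head_to_head = {"context_won": 18, "tied": 2, "context_lost": 1}


def pct(count: int, total: int = TOTAL_TASKS) -> int:
    """Share of tasks as a whole-number percentage."""
    return round(100 * count / total)


for condition, row in counts.items():
    assert sum(row.values()) == TOTAL_TASKS  # every task lands in exactly one bucket
    shares = ", ".join(f"{name} {n} ({pct(n)}%)" for name, n in row.items())
    print(f"{condition}: {shares}")

assert sum(head_to_head.values()) == TOTAL_TASKS
problems_a = TOTAL_TASKS - counts["A (bare ticket)"]["full_pass"]
print(f"Bare ticket shipped a problem on {problems_a} of {TOTAL_TASKS} ({pct(problems_a)}%)")
```

Because each share is rounded on its own, the bare-ticket row adds up to 101% rather than 100%. That is a rounding artifact, not a counting error.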
Where the bare ticket got dangerous
The six outright-broken cases didn't scatter randomly. They clustered in security-sensitive, data-model-heavy backend work, exactly where an unstated decision does the most damage. The SHA-256 token from the opening was one. Four of the others rhyme:
- A registration flow that skipped email verification, audit logging, and rate limiting, and checked uniqueness in a way that ignored soft-deleted accounts.
- A personal-API-token surface exposed on an unauthenticated route, with no caller identity or org scoping.
- A team-management feature with no role model and hard deletes, where the team had decided on soft deletes and a team-lead role.
- A single-sign-on initiation endpoint with the wrong transport, no replay protection, and a slug-enumeration hole.
Every one of those compiled. Every one looked like a finished feature. None of the missing pieces was exotic — they were ordinary house decisions, the kind a returning engineer carries in their head and a new agent has no way to know. The ticket said “add SCIM token management.” It did not say “and obviously hash it the way we hash every other long-lived secret,” because to the person who wrote the ticket, that part was obvious.
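To make the token-hashing failure concrete, here is a minimal Python sketch of the gap between a fast unsalted digest and an adaptive, salted one. It is not the code from our repo: scrypt from the standard library stands in for whatever adaptive algorithm the sibling token module uses, and the function names and cost parameters are assumptions chosen for illustration.

```python
import hashlib
import hmac
import secrets

# Cost parameters are illustrative assumptions, not the team's actual settings.
SCRYPT_N, SCRYPT_R, SCRYPT_P = 2**14, 8, 1


def fast_unsalted_digest(token: str) -> str:
    """The shape the bare-ticket agent shipped: one fast, unsalted SHA-256 pass."""
    return hashlib.sha256(token.encode()).hexdigest()


def issue_token() -> tuple[str, bytes, bytes]:
    """Create a token. Return it once, plus the salt and hash to store."""
    token = secrets.token_urlsafe(32)
    salt = secrets.token_bytes(16)  # unique per token
    digest = hashlib.scrypt(
        token.encode(), salt=salt, n=SCRYPT_N, r=SCRYPT_R, p=SCRYPT_P
    )
    return token, salt, digest


def verify_token(candidate: str, salt: bytes, stored: bytes) -> bool:
    """Recompute the salted hash and compare in constant time."""
    digest = hashlib.scrypt(
        candidate.encode(), salt=salt, n=SCRYPT_N, r=SCRYPT_R, p=SCRYPT_P
    )
    return hmac.compare_digest(digest, stored)
```

Both versions return the secret once and keep only a hash, which is why the diff looked fine in review. The difference is entirely in the choice of function, and that choice is the house decision the ticket never stated.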
The subtly-wrong cases were quieter and just as instructive. One agent reinvented an environment-parsing utility that already existed in the repo. One re-architected a settings page onto a client-side data loader with duplicate schemas, when the house convention was server-side rendering through a typed wrapper. One fixed a hydration bug with the wrong-but-plausible technique instead of the SSR-correct one the team had standardized on. Every one was a plausible diff, and every one was wrong.
Context is not magic — and here's where it broke
The honest part of this study is the part that doesn't flatter the thesis. Full context was not a clean sweep. The context agent was subtly wrong four times and outright broken once, and on one task it lost to the bare ticket outright. If we'd buried that, you should distrust every other number here.
Three failure modes are worth naming, because they bound the claim:
- Near-misses. Twice the context agent got the architecture exactly right and fumbled one detail. In one case it stored an org's ID where the rest of the code expected its slug, a one-field interop bug inside an otherwise faithful implementation. Context bought the right shape, not flawless wiring.
- Over-building. On the one task it lost, the context agent built an elaborate preview of an IP-allowlist feature and then never wired up the enforcement, so the allowlist was decorative. The bare-ticket version was simpler and actually enforced the rule. More context can also mean more rope.
- Context only helps for what's in it. On one task both conditions were subtly wrong, for the same reason: our decisions brief omitted one gateway-path convention, so neither agent hit the live API. The brief is a context chain with a missing link, and the agent inherits the gap. Garbage in, plausible garbage out.
There's a fourth nuance hiding in the one task the bare ticket fully passed. It was the most trivial in the set — a one-line naming-convention fix the agent could read straight off the sibling code. Context matters least when the answer is already visible nearby, which is also the only time a bare ticket is safe. The trouble is you can't tell, from the ticket alone, which kind of task you're holding.
The caveats, stated plainly
A research post is only worth the integrity of its method, so here is everything that should make you hold these numbers loosely:
- The grades are model-produced and pending human verification. A blind grading agent scored every output; a human is verifying 100% of the subtly-wrong and broken calls plus a 20% random sample of the rest. The figures here are the model-graded baseline, not the final hand-checked numbers. We'll update if verification moves them.
- The context briefs were reconstructed, not contemporaneous. Condition B's briefs were rebuilt from each shipped PR's evident decisions, not from independently-sourced specs written at the time. We held them to decisions and conventions, not recipes, and condition A had the identical ticket and the same code access.
- n = 21, one codebase, one model, one harness. This is a directional single-repo study, not a powered multi-repo benchmark. It measures one team's conventions on one model on one day. Treat the direction as solid and the exact percentages as ours, not yours.
The selection criteria, the two conditions, and the scoring rubric were all fixed in writing before a single task ran, so the result isn't a method reverse-engineered from the outcome. We're working on a way to share the anonymized dataset so the numbers can be checked independently.
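One way to see how loosely n = 21 pins down a percentage is to put an interval around each clean-pass rate. The sketch below uses a Wilson score interval, which is our choice for illustration and not part of the frozen protocol. It treats the 21 tasks as independent trials and ignores the paired design and any grading error, so read it as a rough bound rather than an analysis.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Clean-pass counts from the headline table: 1 of 21 and 16 of 21.
for label, passes in (("bare ticket", 1), ("full context", 16)):
    low, high = wilson_interval(passes, 21)
    print(f"{label}: {passes}/21 clean passes, interval {low:.0%} to {high:.0%}")
```

Under those assumptions the intervals come out at roughly 1% to 23% for the bare ticket and 55% to 89% for full context. They sit far apart, which fits treating the direction as solid, but each is wide enough that the exact percentages should not be read as precise.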
Why this happens, and what fixes it
The pattern underneath all 21 tasks is the same one. A ticket records a conclusion and trusts the reader to supply the reasoning behind it. That works when the reader is a human who'll scroll back through Slack, remember last quarter's incident, or just ask. It breaks the moment the reader is an agent that carries nothing between sessions and never asks a question. We've argued the structural version of this about the issue tracker itself — the data model assumes a human will go find what the ticket left out.
The fix is not a bigger model. Every broken output in this study came from a capable model that wrote competent code; intelligence was never the missing input. The fix is to put the decisions where the agent can read them — the same brief that took our clean-pass rate from 5% to 76%, made durable and connected to the work instead of reconstructed after the fact. That connected chain of decisions is what we're building Codara to hold.
Common questions
Does giving an AI coding agent more context actually improve its output?
In this experiment, yes, and by a wide margin: the clean-pass rate rose from 5% on bare tickets to 76% with a brief of the upstream decisions, on the same model and the same 21 tasks. The gain came from honoring decisions the ticket left implicit, with the model held constant.
What does “subtly wrong” mean here?
Code that compiles but fails on intent — it solves the wrong version of the problem, contradicts a prior decision, rebuilds logic that already exists, or quietly creeps out of scope. It's the dangerous category precisely because it passes a quick review. The bare-ticket agent produced 14 of these across 21 tasks.
Will a bigger context window solve this on its own?
A bigger window doesn't help if the decision was never written down anywhere the agent can reach. The information lives upstream of the code — in specs, designs, and prior decisions — and most of it isn't in the repository at all. The bottleneck is connection: getting those decisions to where the agent can read them.
Is one model on one codebase enough to conclude anything?
It's enough to establish a direction, not a universal rate. n = 21 on a single repo with one model and model-produced grades is directional evidence; the effect size on your stack will differ. We published the frozen method so you can run it on yours.
The takeaway
A capable coding agent with a bare ticket shipped a problem on 20 of 21 real tasks, and 14 of those 20 were subtly wrong: code that compiled, looked done, and missed the intent. The same agent, handed the decisions behind the work, passed cleanly on about three-quarters of the same tasks. The model didn't get smarter between the two runs. It got the context. Until those decisions live somewhere an agent can read, the agent will keep doing what it did here: shipping plausible code that misses what you meant.