A Coding Agent's Value Wasn't "Writing Fast"

  • #context engineering
  • #ai
  • #productivity
  • #opensource

I spent three months building a personal open-source project, karasu, a text-based architecture modeling tool, alongside the coding agent Claude Code. The whole time I held to one method: document-driven development. ADRs, Design Docs, a Test Perspective Library (TPL), and acceptance tests all lived in the same repository. This is what I learned.

A coding agent’s greatest value wasn’t writing code fast. It was exposing the holes in my own thinking, early. Document-driven development was the machine that drew them out.


Why document-driven development?

The idea came from earlier work with AI coding tools.

Devin writes code on its own. But when my instructions were vague, it piled up experimental commits and never converged. When my own goal was fuzzy, it never built the feature I wanted. It would produce something that ran, and then I’d touch it and find it far from what I needed.

Cursor is a good environment, and its rules (project-specific rule definitions) impressed me most. On that project we kept the design docs in Notion and finished the design before building a feature. I pasted the Notion URL into Cursor to feed it the design, but its grip on the design itself stayed weak. I couldn’t count on it to follow related documents on its own and keep everything consistent.

I wanted to try Claude Code too, so I read a hands-on guide, 実践 Claude Code 入門, which describes spec-driven development. You commit not just the spec but a steering file that records what you just did, leaving it as material the agent can review later. That taught me something: put documents inside the repository and the agent can use them.

So what changes if the design docs and the code live in the same repository? I started karasu to find out.


Three months, by the numbers

What strikes me first is the density. Over the last 90 days: 1,032 commits, 244 ADRs, 210 acceptance tests, 67 TPLs. A solo OSS project, built while five or six sessions ran in parallel.

When I asked Claude Code what helped, its first answer surprised me: CLAUDE.md runs only 82 lines and does nothing but index the other documents. An agent whose context resets every time needs a map up front, a “when in doubt, look here.” That map cuts the cost of searching. Stuff in too many instructions and the agent skims past them. The restraint paid off.


How development actually ran

karasu started with a concept phase, and after that it ran as a single loop, from requirements to tests. Follow one turn of that loop and you can see how the ADRs, ATs, and TPLs piled up, and how the agent’s role shifted.

Concept and setup

My first job was to put into words what I wanted from karasu. I worked through that concept phase in conversation with Sonnet 4.6. We hammered out a modeling notation for describing architecture as text, and I captured the early technical decisions as ADRs.

I left those conclusions in the repository as CLAUDE.md and ADRs, then handed them to Claude Code to run the initial project setup.

From requirements to implementation

Development begins when I write the requirement into a GitHub Issue. Even at that stage, I work through the requirement with the coding agent.

Hand it the Issue link and the agent drafts an implementation plan. But I can’t see the background, or which options it weighed. So I asked it to write a Design Doc from the requirement first.

Once it wrote the Design Doc, the agent began listing the options it had considered and the information still missing for the design, as “open questions.” That let me choose a direction from real options, and find my own blind spots through those questions. I settle the questions I can answer before implementation, and once we agree on the approach, I hand the work to the agent. It implements against the Design Doc.

Review and recording

When the agent finishes implementing, it opens a PR. Early in karasu I reviewed every line myself. Later I started launching the code-review skill, which spins up subagents that check the recent Git history and run the review. Now I look closely at the code only after that review finishes.

For a PR that changes the UI, or one that touches MCP or an API, I always have the agent write a manual-check list. I operate the thing myself and confirm it does what I originally asked.

Once I’m satisfied, I merge the PR and record the Design Doc’s decisions as an ADR. I keep the ADRs so that next time, when the agent writes a Design Doc for a similar feature, it can design from the past technical decisions.

Tests and cross-cutting impact

As features piled up, I lost track of the manual-test steps themselves. So I added the acceptance-test skill, which leaves acceptance tests in the repository. When I open an implementation PR, the manual test steps stay in the repo and the PR points to them.

Once the acceptance tests (ATs) grew, I could run regression passes, but doing them by hand every time got tedious. I had the agent pick which ones to automate and brought in E2E tests. Since it marks which ATs are automated, I can even report an automation rate.

ATs and E2E only confirm that a given PR does its job. They miss changes with cross-cutting impact. I noticed this in one moment: while building the layout adjustment for the System diagram, the change never reached the Deploy and Org diagrams. Touch one diagram and another that should move with it breaks.

At my day job too, we wrote test perspectives before implementation and closed design gaps in advance. So I introduced a TPL (Test Perspective Library). A TPL isn’t mere insurance against missing test cases. It’s a cross-cutting impact map that states, at design time, “if you touch this, has the connected thing over there broken?” For bugs in existing features, or new features that ripple outward, I had the agent write these perspectives. Proactive TPL, a perspective that checks ahead of time whether a new feature breaks existing ones, grows from the same idea.

Once the ADRs, ATs, and TPLs were in place, the agent could draw on them to run requirements, design, implementation, and testing. And it started pointing out the holes in my thinking, one after another.


Where the bottleneck moved

For the consistency of ADRs and TPLs, I built my own auto-checkers, adr-tools and tpl-tools. With those, the documents almost never drifted out of sync. The ATs cut missing test cases, and E2E doubled as a coverage signal. Machinery could now string the net for tests and consistency.

Then the problem moved. It went from “not enough test cases” to “gaps in the spec itself.” The hard part settled onto the question of what to build in the first place. I read that as a sign that working with the agent had matured a step.

This shift is the flip side of the unease I felt with Devin before karasu.

  • Devin style: it runs off → something that sort of works appears → not what I wanted
  • karasu style: it asks back → the holes in the requirements surface → it converges

Document-driven development isn’t a discipline for writing documents. It’s an apparatus that keeps the holes in your thinking exposed, all the time, inside the repository.


The cost of high parallelism, and rules that don’t hold

It wasn’t all upside. Two shadows, and I’ll be straight about them.

High parallelism and double starts

Run this process and a gap opens before the agent comes back for a decision. I started handing other tasks into that gap, and before long five or six sessions were running all the time.

More work moved in parallel. But I also kept causing the same accident: several sessions starting the same Issue twice. I tracked work items as GitHub Issues and showed their status with labels. The dev skill even says to check the label before starting and update it once you do. Still, when things got busy the agent forgot, and started the same item twice.

You can run a hard item across several sessions on purpose, speculatively, and keep the best result. When you didn’t intend that, a double start is pure waste.

Why prose rules don’t hold

Writing a rule into a handbook or a skill is one thing. Whether it holds is another. To an agent, a prose procedure is something it follows probabilistically, not something guaranteed.

“Work under a worktree” and “don’t push straight to main” hold up because memory, skills, and a mechanical guard (lefthook and the like) back them in layers. The label check breaks because it lives only as a prose promise, with no mechanical enforcement.

A rule you want an agent to keep, you carve into the machinery, not the text. That is why harness engineering earns its place.


How to think about the cost

To verify quality I ran CI and E2E on GitHub Actions. On top of the GitHub Pro plan, another $20 a month landed on the bill. Document-driven development held the quality, but verifying that quality didn’t come cheap. That’s the raw feeling.

Flip the axis to person-months, though, and the verdict reverses. The karasu I built in three months would take at least two or three people to staff as a contract project. At roughly $1,000 total for Claude MAX and GitHub Actions, folding two or three person-months into that is a steal. The $20 overage on CI startled me, yet against person-months the whole thing costs almost nothing. That two-sidedness is my honest conclusion for now.


Takeaways

  • Put the documents in the same repository and the agent can follow related ones on its own to keep them consistent.
  • Putting documents in the repository means designing the context you hand the agent. Context is something you engineer.
  • Keep CLAUDE.md to an index. Cramming breeds skimming.
  • Design the TPL as a cross-cutting impact map, not as insurance against missing test cases.
  • The agent’s value is exposing the holes in the requirements early, not raw coding speed.
  • High parallelism breeds double starts. Carve the rules you want kept into the machinery.
  • By person-months, the cost is a steal.

Use a coding agent as a fast code generator and it won’t converge, the way Devin didn’t. Use it as a mirror for your own thinking and the story changes. You leave context in the repository as documents so the agent can read the reasoning behind a design. That craft, engineering the context, is what sharpens the mirror. Engineer not just the code but the context itself. That is where three months have landed me, for now.