beshrkayali 1 day ago [-]
> long contexts are still expensive and can also introduce additional noise (if there is a lot of irrelevant info)
I think spec-driven generation is the antithesis of chat-style coding for this reason. With tools like Claude Code, you are the one tracking what was already built, what interfaces exist, and why something was generated a certain way.
I built Ossature[1] around the opposite model. You write specs describing behavior, it audits them for gaps and contradictions before any code is written, then produces a TOML build plan where each task declares exactly which spec sections and upstream files it needs. The LLM never sees more than that, and there is no accumulated conversation history to drift from. Every prompt and response is saved to disk, so traceability is built in rather than something you reconstruct by scrolling back through a chat. I used it over the last couple of days to build a CHIP-8 emulator entirely from specs[2]. I have some more example projects on GitHub[3].
I've been thinking a lot about this lately. It seems like what is missing with most coding agents is a central source of truth. Before, the truth of what the company was building, and alignment around it, was distributed: people had context about what they did and what others did and were doing.
Now the coding agent starts fresh each time and it's up to you to understand what you asked it and provide the feedback loop.
Instead of chat -> code, I think chat -> spec and then spec -> code is much more the future.
the spec -> code phase should be independent from any human. If the spec is unclear, ask the human to clarify the spec, then use the spec to generate the code.
What happens today is that something is unclear and there is a loop where the agent starts to uncover some broader understanding, but then it is lost in the next chat. And the human also doesn't learn why their request was unclear. "Memories" and AGENTS files are all duct tape over this problem.
dkersten 7 hours ago [-]
I’m building something similar. It’s not public yet because it’s still early and I’m still working on exactly what it is supposed to be.
But the idea is similar in that I start with a spec and feed the LLM context that is a projection of the code and spec, rather than a conversation. The context is specific to the specific workflow stage (eg planning needs different context to implementing) and it doesn’t accumulate and grow (at least, the growth is limited and based on the tool call loop, not on the entire process).
My main goals are more focused context, no drift due to accumulated context, and code-driven workflows (the LLM doesn’t control the RPI workflow, my code does).
It’s built as a workflow engine so that it’s easy for me to experiment with and iterate on ideas.
I like your idea of using TOML as the artifact that flows between workflow stages, I will see if that's something that might be useful for me too!
beshrkayali 6 hours ago [-]
Very much the same thinking. Ossature already structures work that way at the plan level during audit, so curious to see where you take it. Happy to share more about the TOML approach if useful. Feel free to reach out (me at my domain)
comboy 23 hours ago [-]
Hey, you seem to have similar view on this. I know ideas are cheap but hear me out:
You talk with agent A it only modifies this spec, you still chat and can say "make it prettier" but that agent only modifies the spec, the spec could also separate "explicit" from "inferred".
And of course agent B which builds only sees the spec.
Users can actually care about diffs generated by agent A again, because nobody wants to verify diffs on agent-generated code that is full of repetition and created by search and replace. I believe if somebody implements this right it will be the way things are done.
And of course with better models spec can be used to actually meaningfully improve the product.
Long story short what industry misses currently and what you seem to be understanding is that intent is sacred. It should be always stored, preferably verbatim and always with relevant context ("yes exactly" is obviously not enough). Current generation of LLMs can already handle all that. It would mean like 2-3x cost but seem so much worth it (and the cost on the long run could likely go below 1x given typical workflows and repetitions)
beshrkayali 21 hours ago [-]
Right, the spec/build separation is exactly the idea and Ossature is already built that way on the build side.
I agree a dedicated layer for intent capture makes a lot of sense. I thought about that as well, I am just not fully convinced it has to be conversational (or free-form conversational). Writing a prompt to get the right spec change is still a skill in itself, and it feels like it'd just be shifting the problem upstream rather than actually solving it. A structured editing experience over specs feels like it'd be more tractable to me. But the explicit vs inferred distinction you mention is interesting and worth thinking through more.
comboy 21 hours ago [-]
A spec manually crafted by the user is ideal.
It's just that we're lazy. After being able to chat, I don't see people going back. You can't just paste some error into the specs, you can't paste an image and say "make it look more like this". Plus, however well designed the spec, something like "actually make it always wait for user feedback" can trigger changes in many places (even just for the sake of removing contradictions).
ithkuil 19 hours ago [-]
The spec can be wrong for many reasons:
1. You can write a spec that builds something that is not what you actually wanted
2. You can write a spec that is incoherent with itself or with the external world
3. You can write a spec that doesn't have sufficient mechanical sympathy with the tooling you have, so it requires you to spec out more and more of the surrounding tech than you practically can.
All of those issues can be addressed by iterating on the spec with the help of agents. It's just an engineering practice, one that we have to become better at understanding
beshrkayali 3 hours ago [-]
All three of these are real. The audit pass in Ossature is meant to catch the first two before generation starts, it reads across all specs and flags underspecified behavior, missing details, and contradictions. You resolve those, update the specs, and re-audit until the plan is clean. It's not perfect but it shifts a lot of the discovery earlier in the process.
The third point is harder. You still need to know your tooling well enough to write a spec that works with it. That part hasn't gone away.
whattheheckheck 14 hours ago [-]
And what is a spec other than a program in a programming language? How do you prove the code artifact matches the spec or state machine?
comboy 8 hours ago [-]
Program defines the exact computer instructions. Most of the time you don't care about that level of detail. You just have some intent and some constraints.
Say "I want an HN client for mobile", "must notify me about comments", you see it and you add "should support dark mode". Can you see how that is much less than anything in any programming language?
visarga 12 hours ago [-]
My own approach also has intent sitting at the top: intent justifies plan justifies code justifies tests. And the other way around, tests satisfy code, satisfy plan, satisfy intent. These threads bottom up and top down are validated by judge agents.
I also make individual task md files (task.md), which lets them carry intent and plan, not just checkbox-driven "- [ ]" gates; they get annotated with outcomes and become a workbook after execution. The same task.md is seen twice by judge agents which run without extra context: the plan judge and the implementation judge.
I ran tests to see which component of my harness contributes the most and it came out that it is the judges. Apparently Claude Code can solve a task with or without a task file just as well, but the existence of this task file makes plans and work more auditable, and not just for bugs, but for intent-following.
Coming back to user intent, I have a post user message hook that writes user messages to a project scoped chat_log.md file, which means all user messages are preserved (user text << agent text, it is efficient), when we start a new task the chat log is checked to see if intent was properly captured. I also use it to recover context across sessions and remember what we did last.
Once every 10-20 tasks I run a retrospective task that inspects all task.md files since last retro and judges how the harness performs and project goes. This can detect things not apparent in task level work, for example when using multiple tasks to implement a more complex feature, or when a subsystem is touched by multiple tasks. I think reflection is the one place where the harness itself and how we use it can be refined.
claude plugin marketplace add horiacristescu/claude-playbook-plugin
source at https://github.com/horiacristescu/claude-playbook-plugin/tree/main
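The post-message logging idea is easy to sketch. A minimal version of such a hook, assuming the harness delivers the user message as JSON on stdin with a `prompt` field (the field name, payload shape, and `chat_log.md` path here are assumptions; check your harness's hook docs):

```python
import datetime
import json
import sys


def log_user_message(raw_stdin, log_path="chat_log.md"):
    """Append the user's message to a project-scoped markdown log.

    Assumes the hook payload is JSON with a `prompt` field; adjust to
    whatever schema your harness actually passes in.
    """
    payload = json.loads(raw_stdin)
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a") as f:
        # One heading per message keeps the log scannable for later recall.
        f.write(f"\n## {stamp}\n\n{payload['prompt']}\n")


if __name__ == "__main__":
    log_user_message(sys.stdin.read())
```

Since user text is tiny compared to agent text, appending every message verbatim costs almost nothing and preserves intent across sessions.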
beshrkayali 6 hours ago [-]
The hierarchy you describe (intent -> plan -> code -> tests) maps well to how Ossature works. The difference is that your approach builds scaffolding around Claude Code to recover structure that chat naturally loses, whereas Ossature takes chat out of the generation pipeline entirely. Specs are the source of truth before anything is generated, so there's no drift to compensate for, the audit and build plan handle that upfront.
The judge finding is interesting though. Right now verification during build for each task in Ossature is command-based, compile, tests, that kind of thing. A judge checking spec-to-code fidelity rather than (or maybe in addition to?) runtime correctness is worth thinking about.
visarga 3 hours ago [-]
Yes, judges should not just look for bugs, they should also validate intent-following, but that can only happen when intent was preserved. I chose to save the user messages as a compromise; they are probably 10 or 100x smaller than the full session. I think tasks themselves are one step lower than pure user intent. Anyway, if you didn't log user messages you can still recover them from session files if they have not been removed.
One interesting data point: I counted the word count in my chat messages vs the final code and they came out about 1:1, but in reality a programmer would type 10x the final code during development. From a different perspective, I found I have created 10x more projects since I started relying on Claude and my harness than before. So it looks like user intent is 10x more effective than manual coding now.
I'm using something similar-ish that I built for myself (much smaller, less interesting, not yet published, and with prettier syntax). Something like:
a->b # b must always be true if a is true
a<->b # works both ways
a=>b # when a happens, b must happen
a->fail, a=>fail # a can never be true / can never happen
a # a is always true
So you can write:
Product.alcoholic? Product in Order.lineItems -> Order.customer.can_buy_alcohol?
u1 = User(), u2=User(), u1 in u2.friends -> u2 in u1.friends
new Source() => new Subscription(user=Source.owner, source=Source)
Source.subscriptions.count>0 # delete otherwise
This is a much more compact way to write desired system properties than writing them out in English (or Allium), but helps you reason better about what you actually want.
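Properties like these map naturally onto runtime assertions. A toy Python sketch of the alcohol rule above (the object shapes and field names are made up purely for illustration):

```python
def implies(a, b):
    """a -> b: if a holds, b must hold; vacuously true when a is false."""
    return (not a) or b


def check_order(order):
    """Toy check of: Product.alcoholic? Product in Order.lineItems
    -> Order.customer.can_buy_alcohol?"""
    has_alcohol = any(p["alcoholic"] for p in order["line_items"])
    return implies(has_alcohol, order["customer"]["can_buy_alcohol"])
```

The compact notation buys you exactly this: each line of spec becomes one checkable predicate, rather than a paragraph of prose to interpret.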
beshrkayali 2 hours ago [-]
Allium looks interesting, making behavioral intent explicit in a structured format rather than prose is very close to what I'm trying to do with Ossature actually.
Ossature uses two markdown formats, SMD[1] for describing behavior and AMD for structure (components, file paths, data models). AMDs[2] link back to their parent SMD so behavior and structure stay connected. Both are meant to be written, reviewed, and/or owned by humans, the LLM only reads the relevant parts during generation. One thing I am thinking about for the future is making the template structure for this customizable per project, because "spec" means different things to different teams/projects. Right now the format is fixed, but I am thinking about a schema-based way to declare which sections are required, their order, and basic content constraints, so teams can adapt the spec structure to how they think about software without having to learn a grammar language to do it (though maybe peg-based underneath anyway, not sure).
The formal approach you describe is probably more precise for expressing system properties. Would be interesting to see how practical it is to maintain it as a project grows.
I like it a lot, I find the chat driven workflow very tiring and a lot of information gets lost in translation until LLMs just refuse to be useful.
How does the human intervention work out? Do you use a mix of spec and audit editing to get into the ready to generate state? How high is the success/error rate if you generate from tasks to code, do LLMs forget/mess up things or does it feel better?
The spec driven approach is potentially better for writing things from scratch, do you have any plans for existing code?
beshrkayali 23 hours ago [-]
Thanks!
> How does the human intervention work out? Do you use a mix of spec and audit editing to get into the ready to generate state?
Yes, the flow is: you write specs then you validate them with `ossature validate` which parses them and checks they are structurally sound (no LLM involved), then you run `ossature audit` which flags gaps or contradictions in the content as INFO, WARNING, or ERROR level findings. The audit has its own fixer loop that auto-resolves ERROR level findings, but you can also run it interactively, manually fix things yourself, address the INFO and WARNING findings as you see fit, and rerun until you are happy. From that it produces a TOML build plan that you can read and edit directly before anything is generated. You can reorder tasks, add notes for the LLM, adjust verification commands, or skip steps entirely. So when you run `ossature build` to generate, the structure is already something you have signed off on. There are a few more details under the hood; I wrote more in an intro post[1] about Ossature, might be useful.
> The spec driven approach is potentially better for writing things from scratch, do you have any plans for existing code?
Right now it is best for greenfield, as you said. I have been thinking about a workflow where you generate specs from existing code and then let Ossature work from those, but I am honestly not sure that is the right model either. The harder case is when engineers want to touch both the code and the specs, and keeping those in sync through that back and forth is something I want to support but have not figured out a clean answer for yet. It's on the list, if you have any thoughts please feel free to open an issue! I want to get through some of the issues I am seeing with just spec editing workflow (and re-audit/re-planning) first, specifically around how changes cascade through dependent tasks.
Regarding success rate, each task requires a verification command to run and pass after generation and if it fails, a separate fixer agent tries to repair it using the error output. The number of retry attempts is configurable. I did notice that the more concise and clear the spec is the more likely it is for capable models to generate code that works (obviously) but that's what auditing is supposed to help with. One interesting case about the chip-8 emulator I mentioned above is that even mentioning the correct name of the solution to a specific problem was not enough, I had to spell out the concrete algorithm in the spec (wrote more details here[2]). But the full prompt and response for every task is saved to disk, so when something does go wrong one can read the exact prompt/response and fix-attempts prompt/response for each task.
Totally agreed! I've had good success using Claude Code with Cucumber, where I start with the spec and have Claude iterate on the code. How does Ossature compare to that approach?
xrd 16 hours ago [-]
This is really fascinating and lines up with my way of development.
I notice you support ollama. Have you found it effective with any local models? Gemma 4?
I'm definitely going to play with this.
peterm4 24 hours ago [-]
This looks great, and I’ve bookmarked to give it a go.
Any reason you’ve opted for custom markdown formats with the @ syntax rather than using something like frontmatter?
Very conscious that this would prevent any markdown rendering in github etc.
beshrkayali 23 hours ago [-]
I've answered this exact question in a previous hn comment thread a few weeks ago, maybe I should reconsider front-matter? My previous answer:
> Yeah, I did briefly consider front-matter, but ended up with inline @ tags because I thought it kept the entire document feeling like one coherent spec instead of header-data + body, front matter felt like config to me, but this is 0.0.1 so things might change :)
4b11b4 17 hours ago [-]
need both
4b11b4 17 hours ago [-]
nice but can't be only text based
alfiedotwtf 14 hours ago [-]
How does this differ from Superpowers?
straydusk 18 hours ago [-]
This is basically what Augment Intent is
dboreham 22 hours ago [-]
Waterfall!
AnimalMuppet 20 hours ago [-]
There are two problems with waterfall. First, if it takes too long to implement, the world moved on and your spec didn't move. Second, there are often gaps in the spec, and you don't discover them until you try to implement it and discover that the spec doesn't specify enough.
Well, for the first problem, if an AI can generate the code in a day or a week, the world hasn't moved very much in that time. (In the future, if everything is moving at the speed of AI, that may no longer be true. For now it is.)
The second problem... if Ossature (or equivalent) warns you of gaps rather than just making stuff up, you could wind up with iterative development of the spec, with the backend code generation being the equivalent of a compiler pass. But at that point, I'm not sure it's fair to call it "waterfall". It's iterative development of the spec, but the spec is all there is - it's the "source code".
beshrkayali 7 hours ago [-]
You framed it better than I would. The part I'm still working through is making re-planning feel cheap when specs change. Right now if you change something early, downstream tasks get invalidated and the cascade isn't always obvious. Ideally, once the project is built and the specs change, none of the generated code should change if only an irrelevant part of the spec changed. This is a bit harder to do properly, but I have some ideas.
I agree that this is what makes it not waterfall. You're iterating on the spec and not backtracking from broken code. The spec is the "source code"; replanning and rebuilding is just "recompiling".
armcat 1 day ago [-]
I still find it incredible how much power was unleashed by surrounding an LLM with a simple state machine and giving it access to bash.
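It really is mostly a loop. A minimal sketch of that state machine in Python, where `call_llm` is a hypothetical stand-in for whatever model API you use (not a real SDK call), and the model's replies are assumed to be either a bash request or a final answer:

```python
import subprocess


def call_llm(messages):
    """Hypothetical model call: returns {"type": "bash", "command": ...}
    or {"type": "answer", "content": ...}. Swap in a real API client."""
    raise NotImplementedError


def run_bash(command, timeout=60):
    """The single tool: run a shell command, capture stdout and stderr."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr


def agent_loop(task, max_steps=20):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)  # model decides: act or answer
        if reply["type"] == "bash":
            # Real harnesses also append the model's own request here;
            # omitted to keep the sketch short.
            messages.append({"role": "tool", "content": run_bash(reply["command"])})
        else:
            return reply["content"]  # done: final answer
    return None  # step budget exhausted
```

Everything else in a production agent (permissions, truncation, compaction, subagents) is elaboration on this skeleton.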
Yokohiii 1 day ago [-]
That is why I am currently looking into building my own simple, heavily isolated coding agent. The bloat is already scary, but the bad decisions should make everyone shiver.
Ten years ago people would rant endlessly about things with more than one edge that require a glimpse of responsibility to use. Now everyone seems to be either in panic or hype mode, ignoring all good advice just to stay somehow relevant in a chaotic timeline.
HarHarVeryFunny 1 day ago [-]
At its heart it's prompt/context engineering. The model has a lot of knowledge baked into it, but how do you get it out (and make it actionable for a semi-autonomous agent)? ... you craft the context to guide generation and maintain state (still interacting with a stateless LLM), and provide (as part of context) skills/tools to "narrow" model output into tool calls to inspect and modify the code base.
I suspect that more could be done in terms of translating semi-naive user requests into the steps that a senior developer would take to enact them, maybe including the tools needed to do so.
It's interesting that the author believes that the best open source models may already be good enough to compete with the best closed source ones with an optimized agent and maybe a bit of fine tuning. I guess the bar isn't really being able to match the SOTA model, but being close to competent human level - it's a fixed bar, not a moving one. Adding more developer expertise by having the agent translate/augment the user's request/intent into execution steps would certainly seem to have potential to lower the bar of what the model needs to be capable of one-shotting from the raw prompt.
Serberus 20 hours ago [-]
[dead]
emp17344 22 hours ago [-]
If you saw the Claude Code leak, you’d know the harness is anything but simple. It’s a sprawling, labyrinthine mess, but it’s required to make LLMs somewhat deterministic and useful as tools.
girvo 19 hours ago [-]
That’s also because of how Claude Code was written. It doesn’t have to be that way per se.
efromvt 20 hours ago [-]
It's pretty easy to get determinism with a simple harness for a well-defined set of tasks with the recent models that are post-trained for tool use. CC probably gets some bloat because it tries to do a LOT more; and some bloat because it's grown organically.
emp17344 20 hours ago [-]
>It's pretty easy to get determinism with a simple harness for a well-defined set of tasks with the recent models that are post-trained for tool use.
Do you have a source? Claude Code is the only agentic system that seems to really work well enough to be useful, and it's equipped with an absolutely absurd amount of testing and redundancy to make it useful.
xstas1 21 hours ago [-]
Hypothesis: it's a sprawling, labyrinthine mess because it was grown at high speed using Claude Code.
emp17344 21 hours ago [-]
There’s a lot of redundancy, because there has to be to make the system useful. It’s a hacked together mess.
stanleykm 1 day ago [-]
unfortunately all the agent cli makers have decided that simply giving it access to bash is not enough. instead we need to jam every possible functionality we can imagine into a javascript “TUI”.
HarHarVeryFunny 1 day ago [-]
If all you want is a program that calls the model in a loop and offers a bash tool, then ask Claude Code to build that. You won't like it though!
For a preview of what it'd be like, just tell your AI chat app that you'll run bash commands for it, and please change the app in your "current directory" to "sort the output before printing it", or some such request.
senko 23 hours ago [-]
Claude Code with Opus 4.6 regularly uses sed for multi-line edits, in my experience. On top of it, Pi is famously only exposing 4 tools, which is not just Bash, but far more constrained than CCs 57 or so tools.
So, yes, it can work.
HarHarVeryFunny 23 hours ago [-]
I think the problem/limitation would be as much due to context management as tools. Obviously bash plus a few utilities is sufficient to explore/edit the code base, but I can't imagine this working reliably without the models being specifically trained to use specific tools, and recognize/adapt to different versions of them etc.
Context management, both within and across sessions, seems the bigger issue. Without the agent supporting this, you are at the mercy of the model compacting/purging the context as needed, in some generic fashion, as well as being smart enough to decide to create notes for itself tracking what it is doing, etc.
Apparently CC is 512K LOC, which seems massively bloated, but I do think that things like tools, skills, context management and subagents are all needed to effectively manage context and avoid the issues that might be anticipated by just telling the model it's got a bash tool, and go figure.
stanleykm 21 hours ago [-]
You don’t really need most of that stuff. Have sensible steering files. Have the agent keep state itself. Don’t bother compacting. It’s fine.
HarHarVeryFunny 23 hours ago [-]
I thought CC only supports its find/replace edit tool (implemented by CC itself, using Node.js for file access), and is platform agnostic. Are you saying that on Linux CC offers "sed" as a tool too? I can't imagine it offers "bash" since that's way too dangerous.
senko 21 hours ago [-]
Yes, Claude Code has a Bash tool, and Claude in some cases uses the CLI sed utility (via the Bash tool) for file changes (although it has built-in file update), at least on my Linux machine.
HarHarVeryFunny 21 hours ago [-]
Interesting - thanks.
I just asked Claude, and apparently CC makes its bash tool available on all platforms it runs on (Linux, macOS, Windows WSL, Git for Windows), and doesn't do platform-specific filtering of bash commands, which would seem to make for some interesting incompatibilities: GNU utils (sed, grep, find) on Linux and Windows, but BSD variants on macOS.
girvo 19 hours ago [-]
Claude code will semi-regularly try to use GNU utils on my Mac
Yokohiii 24 hours ago [-]
I think you're getting him wrong? He is already concerned about "bash on steroids", and current tools add concerning amounts of steroids to everything.
girvo 19 hours ago [-]
> If all you want is a program that calls the model in a loop and offers a bash tool, then ask Claude Code to build that. You won't like it though!
Okay sure it’s technically more than just bash, but my own for-fun coding agent and pi-coding-agent work this way. The latter is quite useful. You can get surprisingly far with it.
stanleykm 23 hours ago [-]
i did.. and thats what i use. obviously its a little more than just a tool that calls bash but it is considerably less than whatever they are doing in coding agents now.
slopinthebag 23 hours ago [-]
Claude Code gets smoked on benchmarks by an agent that has a single tool: tmux. So I think they might actually like that quite a bit.
HarHarVeryFunny 22 hours ago [-]
What benchmarks are you referring to?
alfiedotwtf 14 hours ago [-]
I found replacing bash with python to be more useful… that way, it can craft whatever it desires without having to pipe a billion pieces of gum together
esafak 1 day ago [-]
Tools gave humans the edge over other animals.
Yokohiii 24 hours ago [-]
And those tools regularly burnt cities to ashes. Took a long time to get it under control.
y0eswddl 23 hours ago [-]
*burn - I'm not sure we've gotten that under control quite yet
agdexai 4 hours ago [-]
[dead]
Yokohiii 1 day ago [-]
The example is really lean and straightforward. I don't use coding agents, but this is a good overview and should help everyone understand that coding agents may have sophisticated outcomes, but the raw interaction isn't magical at all.
It's also a good example that you can turn any useful code component that requires 1k LOC into a mess of 500k LOC.
IceWreck 20 hours ago [-]
> This is speculative, but I suspect that if we dropped one of the latest, most capable open-weight LLMs, such as GLM-5, into a similar harness, it could likely perform on par with GPT-5.4 in Codex or Claude Opus 4.6 in Claude Code.
People have been doing that for over a year already? GLM officially recommends plugging into Claude Code https://docs.z.ai/devpack/tool/claude and any model can be plugged into Codex CLI (it's open source and can be set via config file).
girvo 19 hours ago [-]
And while it’s not Opus level, it is incredibly good. I use it basically exclusively (and qwen3.5-plus) on my personal projects.
gburgett 17 hours ago [-]
Loved this writeup. I have built an agent for a specific niche use case for my clients (not a coding agent) but the principles are similar. I've only implemented 1-4 so far. Going to work on long-term memory next, but I worry about prompt injection issues when allowing the LLM to write its own notes.
Since my agent works over email, the core agent loop only processes one message then hits the send_reply tool to craft a response. Then the next incoming email starts the loop again from scratch, only injecting the actual replies sent between user and agent. This naturally prunes the context preventing the long context window problem.
I also had a challenge deciding what context needs injecting into the initial prompt vs what to put into tools. It's a tradeoff between context bloat and the cost of tool lookups, which can get expensive paying per token. There's also caching to consider here.
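That per-message loop can be sketched roughly like this (`call_llm` and `send_reply` are hypothetical stand-ins, not the actual implementation):

```python
def handle_email(incoming, history, call_llm, send_reply):
    """One agent turn per email.

    `history` holds only the user/agent replies actually exchanged, not
    the agent's internal tool chatter. Rebuilding context from scratch
    each message keeps it from growing unbounded.
    """
    context = [{"role": "system", "content": "You answer client emails."}]
    for user_msg, agent_reply in history:
        context.append({"role": "user", "content": user_msg})
        context.append({"role": "assistant", "content": agent_reply})
    context.append({"role": "user", "content": incoming})

    reply = call_llm(context)          # any tool loop happens inside here
    send_reply(reply)                  # terminal tool: ends the turn
    history.append((incoming, reply))  # only the final reply is retained
    return reply
```

The key design choice is that `send_reply` is terminal: each turn is bounded, and the tool-call transcript is thrown away rather than carried forward.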
Isn't there a better word than harness? I understand the metaphor of leading and constraining a raw power - but I don't like it.
sweetjuly 20 hours ago [-]
What's the concern? Harness tends to be fairly common in the context of "shim program which manages some other program" (see: "test harness", "fuzzing harness", etc.)
addandsubtract 8 hours ago [-]
It's kinda ironic that everything has become an "app" over the past 10 years. Facebook is an "app", Reddit is an "app", your bank is an "app". However, the one time we actually introduce an app to execute our LLM calls, we don't call it an "app"? Wat.
paradite 9 hours ago [-]
It’s just a fancy way of saying scaffolding.
arcanemachiner 8 hours ago [-]
Perhaps you would care to propose an alternative?
zbyforgotpass 1 hour ago [-]
My favorite would be llm runtime.
hsaliak 19 hours ago [-]
Tool output truncation helps a lot and is one of the best ways to reduce context bloat. In my coding agent the context is assembled from SQLite. I suffix the message ID to rehydrate the truncated tool call if it’s needed and it works great.
My exploration on context management is mostly documented here https://github.com/hsaliak/std_slop/blob/main/docs/CONTEXT_M...
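The truncate-and-rehydrate idea can be sketched like so (my own guess at the shape, not the actual implementation; the table schema and threshold are made up):

```python
import sqlite3

MAX_CHARS = 2000  # truncation threshold; tune per model/context budget


def store_tool_output(db, message_id, output):
    """Persist the full output so the truncated view can be rehydrated later."""
    db.execute(
        "INSERT OR REPLACE INTO tool_outputs (id, body) VALUES (?, ?)",
        (message_id, output),
    )
    db.commit()


def truncated_view(db, message_id, output):
    """What actually enters the context window. The suffixed ID lets the
    agent ask for the full output only if it turns out to be needed."""
    store_tool_output(db, message_id, output)
    if len(output) <= MAX_CHARS:
        return output
    return output[:MAX_CHARS] + f"\n[truncated; rehydrate with id={message_id}]"


def rehydrate(db, message_id):
    row = db.execute(
        "SELECT body FROM tool_outputs WHERE id = ?", (message_id,)
    ).fetchone()
    return row[0] if row else None
```

Most tool outputs are never read again, so the context only pays for the rare rehydration rather than for every verbose dump.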
Asuka-wx 5 hours ago [-]
The useful framing here is that coding agents get better less from raw model gains and more from better scaffolding around the model. Once you give them tools, repo context, and a simple state machine, the bottleneck shifts to context quality.
MrScruff 1 day ago [-]
> This is speculative, but I suspect that if we dropped one of the latest, most capable open-weight LLMs, such as GLM-5, into a similar harness, it could likely perform on par with GPT-5.4 in Codex or Claude Opus 4.6 in Claude Code.
Unless I'm misunderstanding what's being described here, running Claude Code with different backend models is pretty common.
It doesn't perform on par with Anthropic's models in my experience.
barnabee 21 hours ago [-]
I've found that on some projects maybe 70-80% of what can be done with Sonnet 4.6 in OpenCode can be done with a cheaper model like MiMo V2 Pro or similar. On others Sonnet completely outperforms. I'm not sure why. I only find Opus to be worth the extra cost maybe 5% of the time.
I also find OpenCode to be drastically better than Claude Code, to the extent that I'm buying OpenRouter API credits rather than Claude Max because Claude Code just isn't good enough.
I'm frankly amazed at what OpenCode can do with a few custom commands (just for common things like doing a quality review, etc.), and maybe an extra "agent" definition or two. For many projects even most of this isn't necessary. Often I just ask it to write an AGENTS.md that encapsulates a good development workflow, git branch/commit policy, testing and quality standards, and ROADMAP.md plus per milestone markdown files with phases and task tracking, and this is enough.
I'm somewhat interested in these more involved harnesses that automated or enforce more, but I don't know that they'd give me much that I don't have and I think they'd be tough to keep up with the state of the art compared to something less specific.
kamikazeturtles 1 day ago [-]
> It doesn't perform on par with Anthropic's models in my experience.
Why do you think that is the case? Is Anthropic's models just better or do they train the models to somehow work better with the harness?
mmargenot 1 day ago [-]
It is more common now to improve models in agentic systems "in the loop" with reinforcement learning. Anthropic is [very likely] doing this in the backend to systematically improve the performance of their models specifically with their tools. I've done this with Goose at Block with more classic post-training approaches because it was before RL really hit the mainstream as an approach for this.
It's a good question, I've wondered that myself. I haven't used GLM-5 with CC but I've used GLM-4.7 a fair amount, often swapping back and forth with Sonnet/Opus. The difference is fairly obvious: on occasions I've mistakenly left GLM running when I thought I was using Sonnet, and could tell pretty quickly just from the gap in problem-solving ability.
esafak 1 days ago [-]
They're just dumber. I've used plenty of models. The harness is not nearly as important.
vidarh 24 hours ago [-]
The harness, if anything, matters more with those other models because of how much dumber they are... You can compensate for some of the stupidity (but by no means all) with harnesses that try to compensate in ways that e.g. Claude Code does not, because it isn't necessary for Anthropic's own models.
rbren 19 hours ago [-]
Strong article! I’ve been using the engine/car analogy for a while now.
Totally agree. Chat history feels like a side effect, not a source of truth. Having an explicit markdown file for goals and constraints has been a game changer for my workflow. It turns out you don't need a complex setup; you just need the agent to be explicit about what it’s doing and why.
apotheora 15 hours ago [-]
Compounding is probably the breaking point: one agent's output is another agent's input, so does the garbage-in, garbage-out rule apply?
crustycoder 1 days ago [-]
A timely link - I've just spent the last week failing to get a ChatGPT Skill to produce a reproducible management reporting workflow. I've figured out why, and this article pretty much confirms my conclusions about the strengths and weaknesses of "pure" LLMs, and how to work around them. This article is for a slightly different problem domain, but the general problems and the architecture needed to address them seem very similar.
I think spec-driven generation is the antithesis of chat-style coding for this reason. With tools like Claude Code, you are the one tracking what was already built, what interfaces exist, and why something was generated a certain way.
I built Ossature[1] around the opposite model. You write specs describing behavior, it audits them for gaps and contradictions before any code is written, then produces a build plan toml where each task declares exactly which spec sections and upstream files it needs. The LLM never sees more than that, and there is no accumulated conversation history to drift from. Every prompt and response is saved to disk, so traceability is built in rather than something you reconstruct by scrolling back through a chat. I used it over the last couple of days to build a CHIP-8 emulator entirely from specs[2]. I have some more example projects on GitHub[3]
1: https://github.com/ossature/ossature
2: https://github.com/beshrkayali/chomp8
3: https://github.com/ossature/ossature-examples
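Based on the description above, a build-plan entry might look roughly like this. The field names here are illustrative guesses, not Ossature's actual schema:

```toml
# Hypothetical build-plan entry; field names are illustrative,
# not Ossature's actual schema.
[[task]]
id = "cpu-decode"
description = "Implement the CHIP-8 instruction decoder"
spec_sections = ["cpu.smd#instruction-set"]   # only these spec parts are shown to the LLM
upstream_files = ["src/memory.rs"]            # only these files enter the prompt
verify = "cargo test decode"                  # must pass before the task is done
```

The key idea is that each task's prompt is assembled from exactly the declared sections and files, so there is no accumulated conversation for the model to drift from.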
Now the coding agent starts fresh each time and it's up to you to understand what you asked it and provide the feedback loop.
Instead of chat -> code, I think chat -> spec and then spec -> code is much more the future.
The spec -> code phase should be independent from any human. If the spec is unclear, ask the human to clarify the spec, then use the spec to generate the code.
What happens today is that something is unclear and there is a loop where the agent starts to uncover some broader understanding, but then it is lost in the next chat. And the human also doesn't learn why their request was unclear. "Memories" and agents files are all duct tape over this problem.
But the idea is similar in that I start with a spec and feed the LLM context that is a projection of the code and spec, rather than a conversation. The context is specific to the specific workflow stage (eg planning needs different context to implementing) and it doesn’t accumulate and grow (at least, the growth is limited and based on the tool call loop, not on the entire process).
My main goals are more focused context, no drift due to accumulated context, and code-driven workflows (the LLM doesn’t control the RPI workflow, my code does).
It’s built as a workflow engine so that it’s easy for me to experiment with and iterate on ideas.
I like your idea of using TOML as the artifact that flows between workflow stages; I'll see if that's something that might be useful for me too!
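The projection idea can be sketched minimally like this; all names are hypothetical, and the point is only that each stage assembles its context from the spec and code rather than from conversation history:

```python
# Sketch of stage-specific context projection (all names hypothetical).
# Each stage assembles its prompt from the spec and the code, never from
# earlier conversation turns, so context cannot accumulate or drift.

def project_context(stage: str, spec: dict, files: dict) -> str:
    if stage == "planning":
        # Planning sees only the behavioral spec, not the code.
        return "\n\n".join(spec[name] for name in sorted(spec))
    if stage == "implementing":
        # Implementation sees the relevant spec sections plus the
        # upstream files the current task declared it needs.
        parts = [spec[name] for name in sorted(spec)]
        parts += [f"# {path}\n{body}" for path, body in files.items()]
        return "\n\n".join(parts)
    raise ValueError(f"unknown stage: {stage}")

spec = {"lexer": "Split input into whitespace-separated tokens."}
files = {"src/lexer.py": "def lex(s): return s.split()"}
context = project_context("implementing", spec, files)
```

Because the projection is recomputed from scratch at every stage, its size is bounded by the spec and declared files, not by how long the session has been running.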
You talk with agent A and it only modifies the spec; you still chat and can say "make it prettier", but that agent only ever modifies the spec. The spec could also separate "explicit" from "inferred".
And of course agent B which builds only sees the spec.
Users can actually care about diffs generated by agent A again, because nobody wants to verify diffs on agent-generated code full of repetition and created by search and replace. I believe if somebody implements this right, it will be the way things are done.
And of course, with better models, the spec can be used to actually meaningfully improve the product.
Long story short: what the industry currently misses, and what you seem to understand, is that intent is sacred. It should always be stored, preferably verbatim and always with relevant context ("yes exactly" is obviously not enough). The current generation of LLMs can already handle all that. It would mean something like 2-3x the cost, but it seems so much worth it (and in the long run the cost could likely go below 1x given typical workflows and repetitions).
I agree a dedicated layer for intent capture makes a lot of sense. I thought about that as well, I am just not fully convinced it has to be conversational (or free-form conversational). Writing a prompt to get the right spec change is still a skill in itself, and it feels like it'd just be shifting the problem upstream rather than actually solving it. A structured editing experience over specs feels like it'd be more tractable to me. But the explicit vs inferred distinction you mention is interesting and worth thinking through more.
It's just that we're lazy. After being able to chat, I don't see people going back. You can't just paste some error into the specs; you can't paste an image and say "make it look more like this". Plus, however well designed the spec, something like "actually, make it always wait for user feedback" can trigger changes in many places (even just for the sake of removing contradictions).
1. You can write a spec that builds something that is not what you actually wanted
2. You can write a spec that is incoherent with itself or with the external world
3. You can write a spec that doesn't have sufficient mechanical sympathy with the tooling you have, so it requires you to spec out more and more of the surrounding tech than you practically can.
All of those issues can be addressed by iterating on the spec with the help of agents. It's just an engineering practice, one we have to get better at.
The third point is harder. You still need to know your tooling well enough to write a spec that works with it. That part hasn't gone away.
Say "I want HN client for mobile", "must notify me about comments", you see it and you add "should support dark mode". Can you see how that is much less than anything in any programming language?
I also create individual task files (task.md) which carry intent and plan, not just checkbox-driven "- [ ]" gates; they get annotated with outcomes and become a workbook after execution. The same task.md is seen twice by judge agents which run without extra context: a plan judge and an implementation judge.
I ran tests to see which component of my harness contributes the most, and it came out that it's the judges. Apparently Claude Code can solve a task just as well with or without a task file, but the existence of the task file makes plans and work more auditable, and not just for bugs, but for adherence to intent.
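Reading that description, a task.md in this spirit might look roughly like the sketch below. This is a reconstruction for illustration, not the author's actual template:

```markdown
# Task: add dark mode toggle

## Intent
User wants a persistent dark mode setting ("should support dark mode").

## Plan
- [x] Add theme state and a toggle control
- [x] Persist the choice to local storage

## Outcomes
Toggle works; choice persists across restarts. Judge notes: plan
followed; one unplanned refactor of theme constants (flagged for retro).
```

The point is that the same file carries intent, the checkbox gates, and the post-execution annotations, so both judges can audit it without any extra context.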
Coming back to user intent: I have a post-user-message hook that writes user messages to a project-scoped chat_log.md file, so all user messages are preserved (user text << agent text, so it's efficient). When we start a new task, the chat log is checked to see whether intent was properly captured. I also use it to recover context across sessions and remember what we did last.
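A hook like that can be very small. Here is a minimal sketch; the payload shape (a JSON object with a `prompt` field) is an assumption for illustration, not a documented hook contract:

```python
# Sketch of a post-user-message hook that appends each user message to
# a project-scoped chat_log.md. The {"prompt": ...} payload shape is an
# assumption for illustration, not a documented hook contract.
import datetime
import pathlib

def log_user_message(payload: dict, log_path: pathlib.Path) -> None:
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with log_path.open("a", encoding="utf-8") as f:
        # One markdown section per message keeps the log greppable.
        f.write(f"\n## {stamp}\n\n{payload.get('prompt', '')}\n")

# Example: simulate one hook invocation.
log_user_message({"prompt": "make it prettier"}, pathlib.Path("chat_log.md"))
```

Because only user text is logged (not agent output), the file stays small enough to re-inject at the start of a new task.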
Once every 10-20 tasks I run a retrospective task that inspects all task.md files since the last retro and judges how the harness is performing and how the project is going. This can detect things not apparent in task-level work, for example when multiple tasks implement a more complex feature, or when a subsystem is touched by multiple tasks. I think reflection is the one place where the harness itself, and how we use it, can be refined.
The judge finding is interesting though. Right now verification during build for each task in Ossature is command-based, compile, tests, that kind of thing. A judge checking spec-to-code fidelity rather than (or maybe in addition to?) runtime correctness is worth thinking about.
One interesting data point: I compared the word count of my chat messages vs the final code and they came out about 1:1, but in reality a programmer would type 10x the final code during development. From a different perspective, I've created 10x more projects since I started relying on Claude and my harness than before. So it looks like expressing user intent is about 10x more effective than manual coding now.
I'm using something similar-ish that I built for myself (much smaller, less interesting, not yet published and with prettier syntax). It's a much more compact way to write desired system properties than writing them out in English (or Allium), and it helps you reason better about what you actually want.
Ossature uses two markdown formats: SMD[1] for describing behavior and AMD[2] for structure (components, file paths, data models). AMDs link back to their parent SMD so behavior and structure stay connected. Both are meant to be written, reviewed, and/or owned by humans; the LLM only reads the relevant parts during generation. One thing I am thinking about for the future is making the template structure customizable per project, because "spec" means different things to different teams/projects. Right now the format is fixed, but I am considering a schema-based way to declare which sections are required, their order, and basic content constraints, so teams can adapt the spec structure to how they think about software without having to learn a grammar language to do it (though maybe PEG-based underneath anyway, not sure).
The formal approach you describe is probably more precise for expressing system properties. Would be interesting to see how practical it is to maintain it as a project grows.
1: https://docs.ossature.dev/specs/smd.html
2: https://docs.ossature.dev/specs/amd.html
How does the human intervention work out? Do you use a mix of spec and audit editing to get into the ready to generate state? How high is the success/error rate if you generate from tasks to code, do LLMs forget/mess up things or does it feel better?
The spec driven approach is potentially better for writing things from scratch, do you have any plans for existing code?
> How does the human intervention work out? Do you use a mix of spec and audit editing to get into the ready to generate state?
Yes, the flow is: you write specs, then you validate them with `ossature validate`, which parses them and checks they are structurally sound (no LLM involved); then you run `ossature audit`, which flags gaps or contradictions in the content as INFO, WARNING, or ERROR level findings. The audit has its own fixer loop that auto-resolves ERROR level findings, but you can also run it interactively, manually fix things yourself, address the INFO and WARNING findings as you see fit, and rerun until you are happy. From that it produces a TOML build plan that you can read and edit directly before anything is generated. You can reorder tasks, add notes for the LLM, adjust verification commands, or skip steps entirely. So when you run `ossature build` to generate, the structure is already something you have signed off on. There are more details under the hood; I wrote an intro post[1] about Ossature that might be useful.
> The spec driven approach is potentially better for writing things from scratch, do you have any plans for existing code?
Right now it is best for greenfield, as you said. I have been thinking about a workflow where you generate specs from existing code and then let Ossature work from those, but I am honestly not sure that is the right model either. The harder case is when engineers want to touch both the code and the specs, and keeping those in sync through that back and forth is something I want to support but have not figured out a clean answer for yet. It's on the list, if you have any thoughts please feel free to open an issue! I want to get through some of the issues I am seeing with just spec editing workflow (and re-audit/re-planning) first, specifically around how changes cascade through dependent tasks.
Regarding success rate: each task requires a verification command to run and pass after generation, and if it fails, a separate fixer agent tries to repair it using the error output. The number of retry attempts is configurable. I did notice that the more concise and clear the spec is, the more likely capable models are to generate code that works (obviously), but that's what auditing is supposed to help with. One interesting case in the CHIP-8 emulator I mentioned above: even naming the correct solution to a specific problem was not enough, I had to spell out the concrete algorithm in the spec (more details here[2]). But the full prompt and response for every task is saved to disk, so when something does go wrong you can read the exact prompt/response, and the fix-attempt prompts/responses, for each task.
1: https://ossature.dev/blog/introducing-ossature/
2: https://log.beshr.com/chip8-emulator-from-spec/
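The verify-then-fix loop described above can be sketched roughly like this; `generate` and `fix` are hypothetical stand-ins for the LLM calls, and only the retry structure is the point:

```python
# Sketch of a per-task verify-and-fix loop. The generate and fix
# callables are hypothetical stand-ins for LLM calls; only the retry
# structure is modeled here.
import subprocess

def build_task(task: dict, generate, fix, max_retries: int = 3) -> bool:
    generate(task)  # initial code generation from the task's spec sections
    for attempt in range(max_retries + 1):
        result = subprocess.run(
            task["verify"], shell=True, capture_output=True, text=True
        )
        if result.returncode == 0:
            return True  # verification command passed
        if attempt < max_retries:
            # A separate fixer agent sees only the error output.
            fix(task, result.stdout + result.stderr)
    return False  # retries exhausted; surface the saved prompts/responses
```

The key property is that the fixer gets the error output, not the whole conversation, which keeps each repair attempt focused.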
I notice you support ollama. Have you found it effective with any local models? Gemma 4?
I'm definitely going to play with this.
Any reason you’ve opted for custom markdown formats with the @ syntax rather than using something like frontmatter?
I'm very conscious that this would prevent markdown rendering on GitHub etc.
Yeah, I did briefly consider front matter, but ended up with inline @ tags because I thought it kept the entire document feeling like one coherent spec instead of header data + body; front matter felt like config to me. But this is 0.0.1, so things might change :)
Well, for the first problem, if an AI can generate the code in a day or a week, the world hasn't moved very much in that time. (In the future, if everything is moving at the speed of AI, that may no longer be true. For now it is.)
The second problem... if Ossature (or equivalent) warns you of gaps rather than just making stuff up, you could wind up with iterative development of the spec, with the backend code generation being the equivalent of a compiler pass. But at that point, I'm not sure it's fair to call it "waterfall". It's iterative development of the spec, but the spec is all there is - it's the "source code".
I agree, this is what makes it not waterfall. You're iterating on the spec, not backtracking from broken code. The spec is the "source code"; replanning and rebuilding is just "recompiling".
I suspect that more could be done in terms of translating semi-naive user requests into the steps that a senior developer would take to enact them, maybe including the tools needed to do so.
It's interesting that the author believes the best open source models may already be good enough to compete with the best closed source ones, given an optimized agent and maybe a bit of fine tuning. I guess the bar isn't really matching the SOTA model, but being close to competent human level - a fixed bar, not a moving one. Adding more developer expertise by having the agent translate/augment the user's request/intent into execution steps would certainly seem to have potential to lower the bar of what the model needs to be capable of one-shotting from the raw prompt.
Do you have a source? Claude Code is the only agentic system that seems to work well enough to be useful, and it's equipped with an absolutely absurd amount of testing and redundancy to make it useful.
For a preview of what it'd be like, just tell your AI chat app that you'll run bash commands for it, and ask it to change the app in your "current directory" to "sort the output before printing it", or some such request.
So, yes, it can work.
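A toy version of that loop might look like the sketch below. `call_llm` is a hypothetical stand-in for any chat-completion API, and the `BASH:` prefix convention is invented for this sketch:

```python
# Toy version of the "chat app + you run the bash commands" loop.
# call_llm is a hypothetical stand-in for any chat-completion API; the
# "BASH:" reply prefix convention is invented for this sketch.
import subprocess

def agent_step(messages: list, call_llm):
    reply = call_llm(messages)
    if reply.startswith("BASH:"):
        cmd = reply[len("BASH:"):].strip()
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        # Feed the command output back so the next turn can use it.
        messages.append({"role": "tool", "content": result.stdout + result.stderr})
        return None  # keep looping
    return reply  # no command requested: this is the final answer
```

Everything a real agent adds on top (context management, tool permissions, compaction) lives outside this core loop, which is the commenter's point.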
Context management, both within and across sessions, seems the bigger issue. Without the agent supporting this, you are at the mercy of the model compacting/purging the context as needed, in some generic fashion, as well as being smart enough to decide to create notes for itself tracking what it is doing, etc.
Apparently CC is 512K LOC, which seems massively bloated, but I do think that things like tools, skills, context management and subagents are all needed to effectively manage context and avoid the issues you might anticipate from just telling the model it's got a bash tool and letting it go figure things out.
I just asked Claude, and apparently CC makes its bash tool available on all platforms it runs on (Linux, macOS, Windows WSL, Git for Windows), and doesn't do platform-specific filtering of bash commands, which would seem to make for some interesting incompatibilities - GNU utils (sed, grep, find) on Linux and Windows, but BSD variants on macOS.
Okay sure it’s technically more than just bash, but my own for-fun coding agent and pi-coding-agent work this way. The latter is quite useful. You can get surprisingly far with it.
It's also a good example that you can turn any useful code component that requires 1k LOC into a mess of 500k LOC.
People have been doing that for over a year already? GLM officially recommends plugging into Claude Code https://docs.z.ai/devpack/tool/claude and any model can be plugged into Codex CLI (it's open source and can be set via config file).
Since my agent works over email, the core agent loop only processes one message, then hits the send_reply tool to craft a response. The next incoming email starts the loop again from scratch, injecting only the actual replies exchanged between user and agent. This naturally prunes the context, preventing the long-context-window problem.
I also had a challenge deciding what context to inject into the initial prompt vs what to put behind tools. It's a tradeoff between context bloat and the cost of tool lookups, which can get expensive paying per token. There's also caching to consider here.
Full writeup is here if anyone is interested: https://www.healthsharetech.com/blog/building-alice-an-empow...
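The one-message-per-loop shape described above can be sketched like this; `call_llm` and `send_reply` are hypothetical stand-ins:

```python
# Sketch of the one-message-per-loop email agent described above.
# call_llm and send_reply are hypothetical stand-ins. Only the actual
# exchanged replies survive into the next loop, which is what keeps
# the context from growing unboundedly.

def handle_email(incoming: str, history: list, call_llm, send_reply) -> list:
    context = history + [{"role": "user", "content": incoming}]
    reply = call_llm(context)   # fresh loop over pruned context
    send_reply(reply)           # terminates this loop iteration
    # Persist only the user/agent exchange, not any tool chatter.
    return history + [
        {"role": "user", "content": incoming},
        {"role": "assistant", "content": reply},
    ]
```

The design choice is that pruning happens by construction (the loop ends at `send_reply`) rather than by a compaction heuristic.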
If you want to look at some of the tooling and process for this, check out verifiers (https://github.com/PrimeIntellect-ai/verifiers), hermes (https://github.com/nousresearch/hermes-agent) and accompanying trace datasets (https://huggingface.co/datasets/kai-os/carnice-glm5-hermes-t...), and other open source tools and harnesses.
If you want to play with the basic building blocks of coding agents, check out https://github.com/OpenHands/software-agent-sdk
https://github.com/shareAI-lab/learn-claude-code/tree/main/a...
I found it excellent at explaining a CC-like coding agent in layers.