An agentic loop for deterministic spec-driven development
A problem with GenAI programming is that LLMs are non-deterministic by nature. The bottleneck has moved from writing code to reviewing code. This raises the question: how can we stop reviewing the code generated by LLMs?
I think the answer lies somewhere between having clear enough specs and doing “tools distillation” to achieve behavior that is deterministic “enough” that we don’t need to review the code (or at least not the code in current programming languages).
In this post I introduce an agentic architecture pattern to converge specs into close-to-deterministic code generation. In a follow-up post I will describe, and hopefully provide some proof of, the tools distillation pattern that complements this technique.
Disclaimer: it’s possible that this pattern has already been described by someone else. Our engineering practices are changing so fast that it is impossible to know everything. If you are aware of prior descriptions of this pattern as of March 23, 2026, please don’t hesitate to let me know at [email protected].
What is this about?
The idea of automatic spec convergence is simple: we create an agentic loop in which specs are iteratively improved by asking multiple agents to generate code from the current version of the specs, comparing the implementations, and feeding the differences between the coding agents’ implementations back into the specs.
We want to create a loop like this:
The agents that we need for this loop are the following:
High-Level Specs: this agent generates business-oriented specs. Importantly, we don’t want it writing tech specs or setting any premature constraints on the tech. Think of these as the business requirements documents that your typical product manager (PM) would write, which would be the input to a tech team in a non-agentic software organization.
Tech Specs: this agent is focused on writing abstract tech specifications of the software. These will include things like components, their responsibilities, APIs, key algorithms, data structures, etc. You don’t want this agent writing specs coupled to a specific tech stack; instead, you will likely benefit from platform- and language-independent specs. For example, in the tests I’m doing I direct this agent to write pseudocode instead of using a specific programming language. To guide the tech stack, I can provide a separate file detailing the stack I want, or I can leave that decision to the agent. More on this topic later.
Coding Agents: these are a set of agents; you need at least 3 of them. Each coding agent takes on the role of a software engineer implementing the specs produced by agents #1 and #2. You can make the agents work in separate branches or in separate folders. In my limited tests, I have found so far that it is less error-prone to have the agents work on different sub-folders within the repo, but your mileage may vary, and ideally you want to provide clean, isolated environments to these agents. Importantly, the code generated by these agents is completely deleted after each cycle in the convergence loop. You don’t want to contaminate an agent’s execution by allowing it to modify pre-existing code. Remember that the goal is to have specs clear enough that most agents will generate code precise enough not to require review.
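To make the isolation point concrete, here is a minimal sketch of per-coder workspaces that get wiped and recreated every cycle. The `convergence-run` root and `coder-N` folder names are hypothetical, not from my actual setup:

```python
import shutil
from pathlib import Path

# Hypothetical layout: one sub-folder per coding agent, deleted and
# recreated every cycle so no agent ever sees code from a previous generation.
REPO = Path("convergence-run")

def fresh_workspaces(n_coders: int = 3) -> list[Path]:
    """Delete and recreate one isolated folder per coding agent."""
    dirs = []
    for i in range(1, n_coders + 1):
        d = REPO / f"coder-{i}"
        shutil.rmtree(d, ignore_errors=True)  # discard the previous cycle's code
        d.mkdir(parents=True)
        dirs.append(d)
    return dirs

print([p.name for p in fresh_workspaces()])  # → ['coder-1', 'coder-2', 'coder-3']
```

In practice you would hand each folder to one coding agent as its working directory and never let it touch the others.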
Comparator Agent: once the coding agents have generated the code, this agent compares the solutions from the N coding agents (at least 3). A key property of this agent is that it has to use tools to get accurate, deterministic metrics about the source code. In my tests, I used a few metrics such as number of code lines, public API comparison, cyclomatic complexity of the implementation, folder and file structure, public classes, and text UI or graphical UI/layout comparison. Depending on your use case, and how closely you want the code to converge, you may want to use different metrics. The main thing to remember is that this agent uses “tools”, specifically scripts or code analysis tools that produce deterministic metrics. You don’t want this step to be non-deterministic.
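As an illustration of what “deterministic metrics” can mean, here is a small sketch of measurements a comparator could collect from Python sources using the standard `ast` module. The metric set and function names are my own illustration, not the exact tooling I used:

```python
import ast

def public_api(source: str) -> set[str]:
    """Names of top-level functions and classes not prefixed with '_'."""
    tree = ast.parse(source)
    return {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        and not node.name.startswith("_")
    }

def cyclomatic_complexity(source: str) -> int:
    """Crude McCabe-style approximation: 1 + number of branch points."""
    tree = ast.parse(source)
    branches = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp, ast.ExceptHandler)
    return 1 + sum(isinstance(n, branches) for n in ast.walk(tree))

def metrics(source: str) -> dict:
    """Deterministic report for one coder's implementation."""
    return {
        "loc": len([ln for ln in source.splitlines() if ln.strip()]),
        "public_api": sorted(public_api(source)),
        "complexity": cyclomatic_complexity(source),
    }

# Two hypothetical coder outputs: same public API, different internals.
a = "def add(x, y):\n    return x + y\n"
b = "def add(a, b):\n    if a is None:\n        a = 0\n    return a + b\n"
print(metrics(a)["public_api"] == metrics(b)["public_api"])  # → True
```

Running the same script over the same sources always yields the same numbers, which is exactly the property you want at this step.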
Feedback Loop: this agent ties everything together. Based on the report from the comparator agent, it first analyzes whether the convergence criteria have been met. The convergence criteria will most likely be parametrized: you will want consensus from all the coder agents on certain metrics, while others can have more variance. For example, the number of lines of code or private methods could vary, but you likely want the UI to be close to 100% similar between coders, and the same goes for invariants, security, the public API, and non-functional constraints. The feedback agent kicks off a new round of spec iteration and code generation until the evaluation thresholds are achieved and you end up with converged specs.
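The threshold idea above can be sketched as a tiny convergence check. The metric names, the must-match set, and the tolerance values below are illustrative assumptions, not a real configuration:

```python
# Metrics that require full consensus across coders (hypothetical names).
MUST_MATCH = {"public_api", "ui_layout"}
# Metrics allowed to vary, with a maximum relative spread (hypothetical values).
TOLERANCES = {"loc": 0.15, "complexity": 0.25}

def converged(reports: list[dict]) -> bool:
    """True when all coder reports agree within the configured thresholds."""
    for key in MUST_MATCH:
        if len({str(r[key]) for r in reports}) != 1:
            return False  # no consensus: feed the difference back into the specs
    for key, tol in TOLERANCES.items():
        values = [r[key] for r in reports]
        lo, hi = min(values), max(values)
        if lo == 0 or (hi - lo) / lo > tol:
            return False
    return True

# Example reports from three coders (values are made up).
reports = [
    {"public_api": ["add", "list"], "ui_layout": "table", "loc": 210, "complexity": 12},
    {"public_api": ["add", "list"], "ui_layout": "table", "loc": 225, "complexity": 13},
    {"public_api": ["add", "list"], "ui_layout": "table", "loc": 218, "complexity": 12},
]
print(converged(reports))  # → True
```

If the check fails, the feedback agent would turn the disagreements into spec edits and start another generation round.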
When the feedback loop ends successfully, you have “converged specs”, and technically you don’t even need to version control the code. You could just generate the code and be fairly certain that it is valid and implements the specs. As part of this, we would want to record in metadata the models and their configuration used for code generation. This helps ensure lower variance between multiple rounds of code generation. Obviously, because of the nature of how LLMs and even hardware work, it’s not possible to fully reproduce the code generation, but once specs are converged you do know that, as long as you use the same family and version of models, they should create code with very similar characteristics.
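Recording the generation setup could be as simple as a small JSON record attached to the converged specs. Every field name and value below is a placeholder, not from a real run:

```python
import json
from datetime import datetime, timezone

# Illustrative metadata for a converged-specs run; all values are placeholders.
run_metadata = {
    "spec_version": "v4",  # the spec revision that converged
    "generated_at": datetime.now(timezone.utc).isoformat(),
    "coders": [  # one entry per coding agent, same model/config for reproducibility
        {"model": "model-family-X", "version": "1.0", "temperature": 0.0}
        for _ in range(3)
    ],
}
print(json.dumps(run_metadata, indent=2))
```

Pinning the model family, version, and sampling configuration this way is what lets you regenerate code with similar characteristics later, even though exact reproduction isn’t possible.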
Results so far
I’m going to be posting my progress on validating this idea. So far, I have been able to validate it on a small project using Kiro. I’m working on replicating this with Claude Code, but I need a few days given that I’m only paying for the cheaper Pro subscription (yeah, sorry, but I do have unlimited Kiro at work).
For now, I can confirm that using Kiro with Claude models it is possible to achieve convergence, and there are interesting learnings.
Learnings so far
Once I built a simple version of this agentic architecture, I asked Kiro to build a simple expense management CLI application. Learnings so far:
I initially asked each coder agent to use a different branch, but that ended up creating conflicts, so I had to switch to different folders. I think it’s easier to use different folders, at least for initial exploration: it makes it easy to compare the code from the different coders, and it’s fun to see how each agent interprets ambiguous specs differently.
On the initial iterations, all 3 coding agents in Kiro selected Python as the implementation language (my prompt was intentionally vague on implementation details).
It required 4 iterations to converge the Python code to the point where all coding agents were producing essentially the same code with very small variations.
After the successful Python convergence, I deleted all the code and added a tech stack spec requiring an implementation in modern Java. Interestingly, this required a few more iterations to polish the specs to accommodate strongly typed languages; I think it was 3 more iterations.
Once the specs converged for both Python and Java, I deleted all the generated code again and changed the tech stack spec to require a C++ implementation. To my surprise, it worked on the first iteration; it didn’t need any further iteration of the specs. This signals that, at least for this small app, the specs were detailed enough to generate consistent code in two quite different strongly typed languages and one dynamically typed language.
I started testing the same pattern in Claude Code. So far, one of the 3 coding agents decided to use Go, while the other 2 decided to use Python. I ran out of tokens for today, so I will let you know about the progress in the next few days :D
Next steps
I will post here when I can publish the open source specs for this agentic architecture, and I will keep evolving it with more complex projects. Additionally, in its current form it consumes quite a lot of tokens, but I do have a few ideas on how to solve that problem using tools distillation. Details coming soon.