Six Principles for Agent Systems That Don't Hallucinate
Why this article exists
Agentic development with LLMs in 2026 is no longer an "interesting experiment". It is its own engineering discipline. By an "agent" I mean a program built on top of a language model that performs a structured task inside a product rather than merely replying to a user in chat: it reads code, makes decisions, writes files, calls external APIs, and returns a result. I join product teams where three to five such agents already work in parallel: code review bots, content classifiers, ticket routers, recommendation pipelines, internal documentation generators. A demo can be assembled in one evening. Production cannot.
The line between a demo-quality and a production-quality agent system is not where people usually look for it. The deciding factor is not the model, not the token budget, and not the quality of the prompts. The deciding factor is the architecture of the system in which the model operates – and that architecture does not come from "build your first agent" tutorials. It comes from failed attempts.
At that boundary, every agent system runs into the same three problems:
- Hallucinations – the agent invents facts that sound plausible but do not match reality.
- Non-reproducibility – the same prompt produces different results across runs, and errors cannot be debugged properly.
- No way to accumulate knowledge – every run starts from zero, and the mistakes of one run do not help the next one.
Of those three, the first two are discussed by almost everyone writing about multi-agent systems in 2025 – roles, JSON contracts, system prompts. The third one – how an agent system becomes cheaper and more accurate with each subsequent run – is barely discussed, because that conversation requires metrics across multiple runs, not a single production anecdote.
These three problems are not properties of the model. They are solved at the architecture level of the system the model operates inside. Below are six design principles that address them.
The principles are universal. They work equally well for code review, refactoring tools, security audit pipelines, migration tools, documentation generators, customer support routing, content moderation, and data pipelines with LLM stages. Wherever you have multiple roles, a rules-heavy domain, and a need for reproducible output, these six layers apply.
I distilled them while building one specific system – LLM agents for E2E testing. That took a month and a half of part-time iteration and produced measurable results: eight runs, eleven production bugs found automatically, and a first-try pass rate that rose from 14% to 95% as the knowledge base – hereafter "KB" – became more saturated. Each principle below is paired with one concrete E2E example and one or two applications in other domains.
Those eight runs are not a generic "it works for us" claim. They are the first trendline. On a single run, any architecture can sound plausible. Across eight runs, you start to see which principles actually deliver ROI and which ones are overhead without return.
The six principles work together as layers. Remove one, and the whole stack collapses.
Principle 1: An explicit contract
What it is. A document that describes the rules the agent operates under and that do not change between runs. Not code, not config, but text. Usually Markdown with five to ten numbered principles, around 500–800 words (~3–5 KB). In my E2E version it is seven principles, 83 lines, about 600 words.
Why it works. Without an explicit contract, the agent makes arbitrary choices every time it encounters ambiguity. "Should the database be clean for each test, or should it stay dirty?" The agent picks one answer today, another next week, and you end up with incompatible tests. With a contract ("the DB is dirty by default; demo data is reused"), the answer is predefined.
E2E example. My ENVIRONMENT.md contains seven principles: the container as an external dependency, a dirty database, sequential execution, a health check, seed/assertion separation, a host runner plus MCP browser, and session caching. Each one is a short paragraph plus a brief rationale.
Non-E2E application. A security-audit agent gets SCOPE.md: what is in scope (production code under src/) and what is not (test fixtures, vendor/, deprecated code). Without that contract, the agent will report vulnerabilities in demo files and waste your time on false positives. A code-review agent gets STYLE.md with an explicit instruction: "code style is already formalised in .eslintrc; do not comment on formatting." A refactoring agent gets BOUNDARIES.md: which modules it may not touch and which public APIs it may not break.
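The contract only works if it is wired in mechanically rather than left as something the agent is hoped to remember. A minimal Python sketch of that wiring, assuming the contract lives in a single Markdown file next to the pipeline; build_system_prompt and the fail-fast check are illustrative choices, not the article's implementation.

```python
from pathlib import Path

# File name taken from the E2E setup above; a security-audit pipeline would
# point this at SCOPE.md, a code-review pipeline at STYLE.md, and so on.
CONTRACT_FILES = ["ENVIRONMENT.md"]

def build_system_prompt(role_instructions: str, contract_dir: Path = Path(".")) -> str:
    """Prepend the contract verbatim, so every run sees exactly the same rules."""
    contracts = []
    for name in CONTRACT_FILES:
        path = contract_dir / name
        if not path.exists():
            # Fail fast: silently running without the contract is how drift starts.
            raise FileNotFoundError(f"Missing contract file: {path}")
        contracts.append(path.read_text(encoding="utf-8"))
    # The contract goes first, before any role-specific instructions,
    # so it wins whenever the two appear to conflict.
    return "\n\n".join(contracts + [role_instructions])
```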
What breaks without it. The agent acts on unstated assumptions, and in half the cases those assumptions will be the opposite of yours. Two weeks later, the team no longer understands why the agent behaves one way today and another way tomorrow. A month later, they stop trusting its output.
Principle 2: Role separation
What it is. A complex task involves multiple cognitive modes, and those modes cannot live inside one agent. You need separate roles with differentiated tools, context, and instructions.
Why it works. A single prompt cannot simultaneously demand "explore broadly" and "do not deviate from the plan." A single context cannot hold both architecture diagrams and specific code. A single toolset cannot be optimal both for browser automation and for editing text files.
E2E example. Four agents: e2e-analyzer (discovery), e2e-planner (strategy), e2e-generator (implementation), and e2e-healer (diagnostics). Each has its own MCP tools, its own context, and its own responsibility. The generator is not allowed to invent selectors; the healer is not allowed to expand coverage. Those constraints are what make the system predictable.
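One way to make the role boundaries explicit is to treat each role as data: its prompt file, its tool allowlist, and the artifacts it may read and write. A hedged sketch of such wiring; the tool names, prompt paths, and artifact names are placeholders, not a real MCP tool list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentRole:
    name: str
    instructions_file: str           # role-specific prompt, kept separate per this principle
    allowed_tools: tuple[str, ...]   # a hard allowlist, not a suggestion
    reads: tuple[str, ...] = ()      # persistent artifacts this role consumes (principle 3)
    writes: tuple[str, ...] = ()     # persistent artifacts this role produces

# Hypothetical wiring for the four E2E roles; every string below is illustrative.
PIPELINE = [
    AgentRole("e2e-analyzer",  "prompts/analyzer.md",  ("read_file", "list_dir"),
              writes=("modules.json", "schema-map.json")),
    AgentRole("e2e-planner",   "prompts/planner.md",   ("read_file",),
              reads=("modules.json",), writes=("test-plan.md",)),
    AgentRole("e2e-generator", "prompts/generator.md", ("read_file", "write_file"),
              reads=("test-plan.md", "schema-map.json")),
    AgentRole("e2e-healer",    "prompts/healer.md",    ("read_file", "write_file", "run_tests"),
              writes=("heal-findings/",)),
]
```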
Non-E2E application. In a refactoring agent pipeline, code-mapper builds the dependency graph and writes dependencies.json; refactor-planner reads the graph and writes refactor-plan.md with numbered steps; refactor-applier applies each step; verifier runs tests and checks the result. In a code-review pipeline, static-scanner looks for obvious anti-patterns; context-reader loads related files; reviewer writes comments; summarizer aggregates them into one review message.
What breaks without it. A monolithic agent with one giant prompt initially looks "simpler" – one file, one entry point. Two weeks later the prompt is 800 lines of contradictory instructions, the context is bloated, and the output is worse than that of a simple script.
Principle 3: Persistent state between phases
What it is. Artifacts the agent writes to disk and that survive between runs. Not RAM, not in-memory state, but files with structured data that can be read by both humans and downstream agents.
Why it works. Discovery is an expensive phase. If you rescan the codebase from scratch every time, you pay for that in both context and time. But discovery changes slowly: the list of modules, the database schema, the routes. Do it once, save it, and later phases can read the result.
An additional benefit is that persistent state enables idempotent skip logic. If modules.json is still fresh (by mtime, the file modification time), the analyze phase is skipped automatically. The pipeline becomes cheap on repeated runs.
E2E example. The analyzer writes modules.json (modules, routes, dependencies) and schema-map.json (database schema). On the second run for the same module, the analyze phase takes zero seconds. Those files are also useful in their own right: a new team member can read schema-map.json and understand in fifteen minutes what would otherwise take a full day.
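The mtime-based skip fits in a few lines. A sketch under stated assumptions: the freshness window is arbitrary, and run_analyze stands in for whatever actually performs discovery.

```python
import json
import time
from pathlib import Path
from typing import Callable

FRESHNESS_SECONDS = 24 * 3600   # assumption: one day; tune to how fast the codebase moves

def load_or_run_analyze(path: Path, run_analyze: Callable[[], dict]) -> dict:
    """Reuse modules.json if it is still fresh by mtime; otherwise re-run discovery and persist it."""
    if path.exists() and (time.time() - path.stat().st_mtime) < FRESHNESS_SECONDS:
        return json.loads(path.read_text(encoding="utf-8"))   # analyze phase skipped: ~0 s
    result = run_analyze()                                     # the expensive full scan
    path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result
```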
Non-E2E application. A migration tool uses mappings.json (old names → new names) and applied-steps.jsonl (what has already been done). .jsonl means JSON Lines: an append-only format with one JSON object per line. It is ideal for event logs: a new entry is simply appended to the end of the file, you never need to parse the whole file to add to it, and one corrupted line does not invalidate the rest. If a migration stops halfway through, the restart reads applied-steps.jsonl and continues from there. A customer-support pipeline can keep session-context.json for each conversation so a new request reads prior context instead of starting from zero. A documentation generator can rebuild module-graph.json only when the source files have changed, speeding repeated runs up by an order of magnitude.
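A minimal resume sketch for that append-only log, assuming each record carries a step_id field; the tolerance for corrupted lines mirrors the point above.

```python
import json
from pathlib import Path

STEPS_LOG = Path("applied-steps.jsonl")   # file name from the migration example above

def applied_step_ids() -> set[str]:
    """Read the append-only log; skip corrupted lines instead of failing the whole resume."""
    done = set()
    if STEPS_LOG.exists():
        for line in STEPS_LOG.read_text(encoding="utf-8").splitlines():
            try:
                done.add(json.loads(line)["step_id"])
            except (json.JSONDecodeError, KeyError):
                continue   # one bad line does not invalidate the rest
    return done

def record_step(step_id: str, detail: dict) -> None:
    """Append one JSON object per line; never rewrite existing entries."""
    with STEPS_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"step_id": step_id, **detail}) + "\n")
```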
What breaks without it. Every run is expensive. The pipeline cannot be stopped and resumed. Artifacts live in one agent's head and disappear as soon as the context is cleared.
Principle 4: Knowledge as a separate layer
What it is. Domain knowledge – platform patterns, known constraints, gotchas you only discover in real-world use – lives in separate files that agents read but do not import into the main code. Curated Markdown or YAML, not an embedding vector store where texts are pre-translated into numeric representations and retrieved by similarity.
Why it works. Domain knowledge changes on a different rhythm than the code itself. A UI framework might update once a year; your code changes every week. If the knowledge is baked into the code, a framework upgrade becomes a migration. If it lives in a separate layer, you change one YAML file and everything else stays intact.
A curated KB is also deterministic. RAG chooses top-k documents by embedding similarity, and if an important paragraph misses the retrieval cut, the agent runs without it. A flat KB is either entirely present in context or it is not – and that is immediately visible.
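The "entirely present or not" property can be enforced mechanically: load the whole KB in a fixed order and fail loudly when it no longer fits, instead of silently truncating. A sketch with assumed numbers; the characters-per-token estimate is crude, and the budget matches the rough threshold discussed further down.

```python
from pathlib import Path

TOKEN_BUDGET = 200_000   # assumption, in line with the ~200k-token flat-KB threshold below
CHARS_PER_TOKEN = 4      # crude estimate; good enough for a budget check

def load_flat_kb(kb_dir: Path) -> str:
    """Load every KB file in a fixed order: all of it is in context, or the run fails loudly."""
    parts = []
    for path in sorted(kb_dir.glob("*.yml")) + sorted(kb_dir.glob("*.md")):
        parts.append(f"### {path.name}\n{path.read_text(encoding='utf-8')}")
    kb_text = "\n\n".join(parts)
    if len(kb_text) / CHARS_PER_TOKEN > TOKEN_BUDGET:
        raise RuntimeError("KB no longer fits in context: consider RAG rather than silent truncation")
    return kb_text
```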
E2E example. On my ecommerce project, the local KB is 12 Markdown files (admin, classic-storefront, modern-storefront), plus 9 YAML files in a global cross-stack KB (tailwind-css, alpine-js, fastapi, nextjs, and so on). When I ported the method to FastAPI + NextJS, tailwind-css.yml, alpine-js.yml, and mailpit.yml just worked on the new stack without modification. That is cross-project KB reuse: platform knowledge isolated into its own layer travels across projects.
This is a rare kind of evidence in the current multi-agent literature – almost every public case study shows one system on one stack. Portability is what confirms that the split between code, KB, and agents is not cosmetic but architectural: the KB layer behaves like a self-contained component.
Non-E2E application. A security-audit KB can cover CVE categories, OWASP patterns, and framework-specific gotchas (XSS in template engines, SQL injection in ORM bypasses). A customer-support KB can encode ticket types, escalation rules, and refund policies. A documentation generator KB can define documentation formats (JSDoc, RST, OpenAPI) and conventions for each language.
What breaks without it. Knowledge gets smeared across prompts and code. Every agent ends up with its own copy of the rules, and those copies drift apart over time. When the platform changes, there is no single place to update.
When RAG is actually needed
A flat KB stops working at one of three thresholds: around 200k tokens (too expensive to load in full), uncurated sources (code, tickets, logs), or history-driven retrieval (when the agent benefits from the top-k most similar prior cases). At those thresholds, the KB evolves into RAG – but that is a change of tool, not of methodology. The contract, role separation, and persistent state still remain.
Principle 5: Closed-loop learning (knowledge compounding)
What it is. Every failure or error is turned into a structured artifact – not "fixed a selector," but a completed template with diagnosis, hypotheses, action taken, verification, KB candidates, and out-of-scope items. Those artifacts then feed back into the KB, so the next agent run already sees them.
Why it works. Without a closed loop, every run rediscovers the same failures. With one, you get knowledge compounding. The KB grows by the same logic as compound interest: the system becomes cheaper and more accurate on every pass.
E2E example. The healer writes per-run files under heal-findings/<date>-<module>.md with six sections: A (diagnosis), B (hypotheses), C (action), D (verification), E (KB candidates), F (out-of-scope siblings). Section E is the promotion path into the KB. On my project, across eight runs, the KB grew by 67% (from 25 gotchas to 42), and first_try_pass_rate rose from 14% (a new module) to 95% (the third run of the same module). That is the KB saturation curve: same agents, same prompts, different feed.
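The promotion path from section E into the KB can be a small script rather than a manual copy-paste. A sketch that assumes the sections are written as "## E." and "## F." headings with bullet items; that heading layout, the target file, and a human review of the resulting diff are illustrative assumptions, not the exact format used above.

```python
import re
from pathlib import Path

FINDINGS_DIR = Path("heal-findings")
KB_FILE = Path("kb/gotchas.md")   # hypothetical target; real KBs may be split per area

def extract_kb_candidates(finding_text: str) -> list[str]:
    """Pull bullet items out of section E (KB candidates) of a heal-findings file."""
    match = re.search(r"^## E\..*?$(.*?)(?=^## F\.|\Z)", finding_text, re.S | re.M)
    if not match:
        return []
    return [line[2:].strip() for line in match.group(1).splitlines() if line.startswith("- ")]

def promote(run_file: Path) -> None:
    """Append each candidate to the KB; a human still reviews the diff before it lands."""
    candidates = extract_kb_candidates(run_file.read_text(encoding="utf-8"))
    with KB_FILE.open("a", encoding="utf-8") as kb:
        for item in candidates:
            kb.write(f"- {item}  <!-- promoted from {run_file.name} -->\n")
```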
Non-E2E application. In a code-review pipeline, each rejected agent comment becomes structured feedback ("false positive: the agent flagged X, but X is allowed in this module under line N of CONTRACT.md") and is then promoted into the KB, so the next run sees the rule. In a migration tool, each failed migration becomes a markdown report with the root cause, then a rule in migration-gotchas.yml, so the next migration does not repeat the mistake. In a security audit, each false positive becomes a rule in audit-exceptions.yml, improving signal-to-noise.
What breaks without it. Agents do not learn between runs. The tenth run is as expensive as the first. Every failure requires manual diagnosis from scratch.
Principle 6: Additive instrumentation
What it is. Metrics after each run are written to a file with an evolving schema: new fields are added, old fields stay. v1 records remain valid after v2 fields are introduced. No breaking changes, no migrations.
Why it works. Without quantitative feedback, "is it getting better?" is an unanswerable question. The feeling that "it seems faster now" is not data. With metrics.jsonl, you can actually see the trendline.
There is a second benefit: an additive schema lets you learn gradually which metrics matter. I did not know in advance that first_try_pass_rate would become a key metric; it only appeared on the third run, when I noticed that the number of healing iterations was a proxy for KB maturity. If the schema had been rigid, I would have needed a migration for older records. With an additive schema, I simply added the field and the old records stayed valid.
E2E example. metrics.jsonl v1 (the first two runs) contains timestamp, target, stack, phases, kb_updates, and volume. v2 (from the third run onward) adds first_try_pass_rate, real_app_bugs_found[], test_churn, kb_hits, patterns_added, and wall_clock_ms. The v1 records remained valid, which lets me query across all eight runs.
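Appending to metrics.jsonl and querying across schema versions is a few lines of Python. A sketch; the field values in the usage example are made up for illustration, and the v1/v2 field lists follow the ones above.

```python
import json
from pathlib import Path

METRICS = Path("metrics.jsonl")

def record_run(**fields) -> None:
    """Append a run record; new fields may appear over time, old fields never change meaning."""
    with METRICS.open("a", encoding="utf-8") as f:
        f.write(json.dumps(fields) + "\n")

def first_try_pass_rates() -> list[float | None]:
    """Query across schema versions: v1 records simply lack the field and yield None."""
    rows = [json.loads(line) for line in METRICS.read_text(encoding="utf-8").splitlines() if line.strip()]
    return [row.get("first_try_pass_rate") for row in rows]

if __name__ == "__main__":
    # A v1-style record and a v2-style record coexist in the same file (placeholder values).
    record_run(timestamp="2026-01-10T12:00:00Z", target="checkout", stack="fastapi-nextjs",
               phases=4, kb_updates=2, volume=11)
    record_run(timestamp="2026-02-01T09:30:00Z", target="checkout", stack="fastapi-nextjs",
               phases=4, kb_updates=1, volume=9,
               first_try_pass_rate=0.95, kb_hits=17, wall_clock_ms=418_000)
```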
Non-E2E application. In an ML training pipeline, experiments.jsonl can record hyperparameters, dataset version, and metrics. In a refactoring tool, refactor-runs.jsonl can track the number of changed files, tests broken or restored, and review time. In customer support, tickets.jsonl can store time-to-first-response, escalation depth, and resolution type.
What breaks without it. You cannot say objectively whether the system is improving. Debates about whether it got better or worse get resolved by intuition instead of data. When a new agent introduces an unexpected regression, you do not see it until complaints accumulate.
What these principles give you together
Each principle on its own is a useful pattern. Together they produce a system with specific properties:
- Accuracy. Contract + source reading + role separation cut down the space for improvisation. The agent works from ground truth – what is actually in the code – rather than guesses about how it might be organised.
- Fewer hallucinations. Persistent state provides stable context; the KB provides deterministic rules; the closed loop catches hallucinations and prevents them from recurring.
- Reproducibility. The same input artifact plus the same KB snapshot should produce the same output. Different results across runs are treated as a bug to investigate, not as "the nature of LLMs."
- Knowledge accumulation. Closed-loop learning plus additive metrics turn every run into data. After ten runs, you know more about your system than after a hundred one-off GPT calls driven by a single prompt.
- Portability. The same six principles work for E2E testing, code review, refactoring, security audit, and migration tools. Only the KB and helpers are platform-specific; the architecture is not.
What these principles do not give you
I would not present this as a silver bullet. The principles solve a specific class of problems – accuracy and reproducibility in multi-agent systems – and do not solve others.
- They do not make the agent smarter. GPT does not turn into an expert just because you wrapped six layers around it. If the task requires creativity or deep understanding, the agent stays limited by the model.
- They do not work well for very short tasks. The payback starts after three to five runs. If you only run the system once, the overhead is not justified.
- They do not replace review. Closed-loop learning catches errors that the agent or the system itself already noticed. Errors nobody recognised as errors still stay in the code.
- They require discipline. Six-section heal findings, an explicit contract, persistent state – all of that is work. If the team is not willing to maintain those artifacts, the method turns into dead weight.
What comes next
I am now applying these six principles to a third independent domain – knowledge work (planning, learning, content), not software development. This is a deliberate attempt to eliminate the method's software bias: the first two validations were in E2E testing, and it is still unclear which principles are code-specific and which are truly domain-agnostic.
If you are applying a similar architecture in another domain – or, conversely, if you found where it stops working – I would love to hear about it. I am especially interested in cases where a principle did not work. Those cases show the boundaries of the method more clearly than successful implementations do.
P.S. In parallel, I am writing a more technical deep-dive series on one concrete application of these principles – E2E testing: a month and a half of iteration, eight runs, a six-section healing protocol, and a breakdown of KB-saturation metrics. I am also preparing an open-source companion repo with a reference implementation of the six principles – framework, four agents, metrics schema, and skeleton KBs. Announcements for new articles and the repo launch go out on LinkedIn; the articles themselves are published on the blog.