Stop Asking the LLM to Decide. Make It Write the Rules.
TL;DR. The common advice - "put an AI agent in the loop" - is most dangerous exactly where it sounds most useful: the money path. Tax, promo eligibility, discount stacking, cart rules, pricing. On those decisions a probabilistic model is the wrong final authority. The pattern that holds up is a division of labour: the LLM formalizes the natural-language rules into a small, auditable specification, and a sound deterministic engine executes it. You stop reviewing code you cannot read and start approving a contract you can. I ran this across 113 experiments and nine model families, and the shape of the result is consistent. Here is the evidence, the one hard condition, and where it actually pays off.
The decision you should never let a model guess at
Hand today's best AI models a stack of real tax returns and they get most of them wrong. That is not a thought experiment - it is a benchmark. TaxCalcBench took 51 real 2024 US tax returns with official IRS answers and asked the frontier models to file them. The best performer, Gemini 2.5 Pro, got 32% right under strict scoring. Claude Opus 4 managed 27%, Sonnet 4 just 23%. These were not rounding-error misses - they were wrong returns: wrong tax tables, arithmetic slips, eligibility mistakes, and, worst of all, different answers every time you asked. The benchmark's authors put it bluntly: there is a "continued need for deterministic tax calculation engines," because this output is "not acceptable for a task which needs consistently correct results with clear auditability."
Tax is the cleanest example because someone built the benchmark, but the shape generalizes to every money-path decision a merchant runs. Does this cart qualify for the promo? Can these two discounts stack? Which tax jurisdiction applies? What does this customer's loyalty tier unlock? These are not open-ended questions. Each has a correct answer defined by rules you already wrote down somewhere. Handing them to a model that pattern-matches probabilistically means accepting an error rate on decisions that move money and that a regulator, an auditor, or an angry customer can ask you to explain.
The reflex fix is "use a bigger model" or "prompt it harder." TaxCalcBench is the counter-argument: the frontier is already here and it is still at 23-32%. The problem is not model size. It is that you are using a probabilistic system as a deterministic one.
The pattern: formalize, don't decide
The fix is not to remove the LLM. It is to move it. Language models are extraordinary at one thing that traditional software is terrible at: turning messy natural language into structure. They are unreliable at the next thing: executing multi-step logic without drifting. So split the job along that exact seam.
Think of the LLM as a courtroom translator and the engine as the judge. The translator turns the contract from human language into something precise; the judge applies it the same way every time. You would never let the translator also decide the verdict - and you should not let the model that reads your promo terms also rule on whether this cart qualifies.
- The fuzzy front end is the LLM's job: read the rules as written - the policy doc, the promo terms, the tax guidance - and translate them into a formal specification.
- The strict back end is an engine's job: take that specification plus the instance data and compute the answer by mechanical, repeatable logic.
The payoff is not just accuracy. It is where the control point sits. When an LLM decides directly, the thing you would have to audit is a probability distribution over tokens - you cannot. When the LLM instead emits a small set of rules, the thing you audit is the rules. A human confirms "this specification matches our intent." The engine guarantees "the answer follows from the specification." Completeness of the spec does not vanish - it moves to a place a person can actually inspect.
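To make the seam concrete, here is a minimal Python sketch, not the Prolog setup used in the experiments. It assumes the LLM's output is a small data record (the hypothetical `PromoRule`) and that a plain function plays the role of the engine. Every name and value is illustrative.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical specification: the kind of small, reviewable artifact an LLM
# could emit after reading promo terms. Names and values are illustrative.
@dataclass(frozen=True)
class PromoRule:
    code: str
    min_cart_total: Decimal      # cart total must be at least this
    eligible_groups: frozenset   # customer groups allowed to use the promo
    percent_off: Decimal

@dataclass(frozen=True)
class Cart:
    total: Decimal
    customer_group: str

def applies(rule: PromoRule, cart: Cart) -> bool:
    """Deterministic check: the same rule and cart always give the same answer."""
    return (cart.total >= rule.min_cart_total
            and cart.customer_group in rule.eligible_groups)

def discount(rule: PromoRule, cart: Cart) -> Decimal:
    """Discount amount in currency units, rounded to cents."""
    if not applies(rule, cart):
        return Decimal("0.00")
    return (cart.total * rule.percent_off / Decimal("100")).quantize(Decimal("0.01"))

SPRING10 = PromoRule("SPRING10", Decimal("50.00"),
                     frozenset({"loyalty", "staff"}), Decimal("10"))

print(discount(SPRING10, Cart(Decimal("80.00"), "loyalty")))  # 8.00
print(discount(SPRING10, Cart(Decimal("80.00"), "guest")))    # 0.00
```

The part a human signs off on is the `SPRING10` record, four fields that can be read against the promo terms. The evaluator is written once and never changes per rule.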
This is not a new idea in the research literature, and the literature is worth knowing because it tells you the gain is real and not a quirk of one setup. Logic-LM - delegating the reasoning step to a symbolic solver - reported +39.2% over standard prompting and +18.4% over chain-of-thought. CLOVER, which translates to first-order logic and post-verifies with the Z3 solver, pushed harder still: 62.8% versus 42.4% on AR-LSAT, 75.4% versus 45.4% on ZebraLogic. Different teams, different tasks, same direction: when a sound engine does the executing, accuracy on rule-shaped problems jumps.
What is actually doing the reasoning - and why not just Python
If you delegate execution to an engine, which engine? In my lab the engine was Prolog - specifically SWI-Prolog, with a constraint solver (CLP(FD)) for the search-heavy problems.
If you have never touched it: Prolog is a language where you do not write how to find the answer - you state the facts and the rules and let the engine find it for you. You declare things like "a discount applies if the cart total is over X and the customer is in group Y," and the engine works out the rest, trying possibilities and throwing away the ones that break a rule. A program reads like a list of statements about the world, not a sequence of steps. It reads like a contract and runs like logic.
The obvious objection from any engineer is: I could write that in Python with a few if statements - why drag in a logic engine? The answer is not "Prolog the language beats Python the language." Python can express anything Prolog can - you could write the solver by hand. The point is what you are asking the LLM to produce, and that difference is concrete:
The search is the runtime, and over a finite domain it is exhaustive. Ask "is there a valid configuration?" In Python you hand-write the search - loops, recursion, your own backtracking - and every line is a place to introduce a bug, with no guarantee you explored the whole space. In Prolog, backtracking is the engine. The checkable consequence: when the search space is finite and the query terminates, the engine can return UNSAT with a guarantee - "no such schedule exists" - while a hand-rolled Python script that comes back empty can only tell you "I didn't find one." Absence of evidence is not evidence of absence, and on a refund or a tax edge case that gap is the whole game.
Correctness is not tied to execution order. A declarative rule holds or it does not. There is no mutation, no early return, no off-by-one, no "the rules were right but the code that applied them was wrong." As long as the generated rules stay pure - no cuts, no side effects - that entire bug class is gone.
A declarative ruleset is a verifiable target; imperative code is not. This is the one that matters. The LLM's output for a rule is a handful of lines that map one-to-one onto the business contract, and you can mechanically check it against probes. Arbitrary Python you can only test - which is precisely the stochastic, "hope the cases cover it" surface you were trying to escape. The Prolog artifact is the contract. Python is an implementation of a contract you still have to take on trust.
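The difference between "I didn't find one" and "none exists" can be shown with a deliberately naive sketch. It assumes a small finite domain and enumerates every assignment, so an empty result is a proof rather than a shrug. This brute-force loop is a stand-in for illustration only. A real constraint engine such as CLP(FD) reaches the same guarantee with pruning instead of full enumeration, and the scheduling instance below is hypothetical.

```python
from itertools import product

def solve(variables, domain, constraints):
    """Enumerate every assignment over a finite domain.

    Returns the first assignment that satisfies all constraints. Returns
    None only after the entire space has been checked, so None means
    "provably no solution", not "gave up".
    """
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(check(assignment) for check in constraints):
            return assignment
    return None

# Hypothetical instance: three shifts, each assigned a slot,
# and no two shifts may share a slot.
shifts = ["a", "b", "c"]
all_different = [lambda s: len(set(s.values())) == len(s)]

print(solve(shifts, [1, 2, 3], all_different))  # {'a': 1, 'b': 2, 'c': 3}
print(solve(shifts, [1, 2], all_different))     # None: 3 shifts cannot fit in 2 slots
```

The guarantee here comes from visibly covering the whole space. The argument for a declarative engine is that this coverage is a property of the runtime rather than of code someone has to review each time.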
One honest caveat: the engine does not have to be Prolog. Any sound declarative engine fits the same role - CLP(FD) or MiniZinc for scheduling and search, SAT/SMT solvers like Z3 for policy checks, Datalog or ASP for closed-world rules. There is even tooling (the MCP Solver) that wires LLMs to MiniZinc, PySAT, and Z3 directly. Prolog is what I validated on, not a requirement of the idea.
As for running it: the LLM generates the program, and it executes in an isolated SWI-Prolog subprocess with timeouts and a stack limit - no model in the loop at execution time, only the deterministic engine. That isolation matters for the auditability claim: the answer comes from the engine or it does not come at all.
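A minimal sketch of that kind of runner, under stated assumptions: `swipl_command` and `run_isolated` are hypothetical helper names, the flag choices (`--stack-limit`, `-q`, `-g`, `-t halt`) are one plausible SWI-Prolog invocation rather than the exact one used in the experiments, and a subprocess with a timeout is process separation, not a security sandbox. The demo line uses the Python interpreter as a stand-in so the snippet runs without SWI-Prolog installed.

```python
import subprocess
import sys

def swipl_command(program_path, goal, stack_limit="256m"):
    """Build an SWI-Prolog command line (flag choices are assumptions)."""
    return ["swipl", f"--stack-limit={stack_limit}", "-q",
            "-g", goal, "-t", "halt", program_path]

def run_isolated(cmd, timeout_s):
    """Run a command in a child process and return (status, text).

    status is "ok", "error" (non-zero exit) or "timeout". There is no
    fallback path that asks a model for the answer.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return ("timeout", "")
    if proc.returncode != 0:
        return ("error", proc.stderr.strip())
    return ("ok", proc.stdout.strip())

# Stand-in child process for demonstration; with SWI-Prolog installed the
# call would be run_isolated(swipl_command("rules.pl", "main"), timeout_s=10).
print(run_isolated([sys.executable, "-c", "print('yes')"], timeout_s=10))
```

The three statuses matter for the audit story: a timeout or a crash is reported as such and never silently replaced with a guessed answer.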
The evidence: a cheap model plus an engine reaches the frontier
I tested this across 113 experiments and nine model families - Claude Haiku, Sonnet, and Opus; llama-3.3-70b; mistral-large; deepseek; gpt-oss; gemini; qwen - logging every run, successful or not, to an append-only journal. The point of that breadth is to make sure I was looking at a pattern, not one lucky configuration.
The strongest, most replicated result is on constraint problems - the "zebra puzzle" family, where you satisfy a web of interlocking conditions. On these, the hybrid beat the LLM-only baseline by +63.3 percentage points with a cheap llama-3.3-70b and +19.2 points with Claude Haiku, at N=120, with 95% confidence intervals that exclude zero. I controlled for the obvious confound - that the gain came from leaking structure into the prompt - by re-running with a neutral, off-task example; the advantage held.
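For readers who want to see what "a confidence interval that excludes zero" means mechanically, here is a sketch of one common method: the unpaired Wald interval for a difference of two proportions. The article does not state which interval method was used, so this is an assumption, and the counts in the example are made up for illustration. They are not the experiment's data.

```python
import math

def diff_ci(successes_a, n_a, successes_b, n_b, z=1.96):
    """Wald interval for the difference of two proportions (a minus b).

    z=1.96 corresponds to a 95% interval under the normal approximation.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_a - p_b
    return diff - z * se, diff + z * se

# Made-up counts, NOT the article's results: 90/120 versus 60/120 correct.
low, high = diff_ci(90, 120, 60, 120)
print(low > 0)  # True: the whole interval sits above zero
```

If both arms are scored on the same puzzles, a paired method would be the more appropriate choice. The unpaired version is shown only because it is the simplest to read.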
The most striking version of the result: models that solve roughly 0% of these puzzles on their own reach roughly 100% once they are formalizing for an engine instead of answering directly. A cheap model with a sound back end performs like a frontier model on this class. That tracks an independent finding too - ChatLogic reported that the gain from delegating to a solver concentrates in weaker models and deeper inference, which is exactly what you would expect if the engine is supplying the rigor the model lacks.
There is a deeper lesson hiding in the failures. When the hybrid was wrong, it was never the engine that was wrong. The engine executes the contract exactly, every time. Every failure lived in the formalization step - the model wrote a specification that did not match the intent. Which leads straight to the one condition you have to respect.
The one condition: you need a capable formalizer
The honest boundary is this: the method works when a capable model does the formalizing. Below a certain capability floor, a weak model writes rules that look right and are wrong - and because the engine faithfully executes whatever contract it is handed, a wrong contract produces a confident wrong answer. In my runs, every hybrid failure traced to this step, not to execution.
That is not "it works sometimes." It is "it works given a strong enough translator" - which is a measurable, controllable condition rather than a roll of the dice. It also reframes the model-selection question. You are not buying a model to be the decision engine. You are buying it to write the rules once, correctly. That is a narrower, more testable bar, and it is the right place to spend capability budget.
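One way to make that condition measurable is to run every candidate formalization against human-verified probe cases before it is allowed near production. The sketch below assumes a hypothetical rule ("carts of 50 or more, loyalty members only") and a hypothetical probe set. `check_against_probes` is an illustrative helper, not part of any library.

```python
def check_against_probes(decide, probes):
    """Return the probes where the candidate rule disagrees with the oracle.

    Each probe is (inputs, expected). An empty list means the candidate
    matched every human-verified case.
    """
    failures = []
    for inputs, expected in probes:
        actual = decide(**inputs)
        if actual != expected:
            failures.append((inputs, expected, actual))
    return failures

# Hypothetical human-verified probes, including the boundary case.
PROBES = [
    ({"total": 80, "group": "loyalty"}, True),
    ({"total": 50, "group": "loyalty"}, True),   # exactly at the threshold
    ({"total": 80, "group": "guest"}, False),
    ({"total": 20, "group": "loyalty"}, False),
]

def candidate_wrong(total, group):
    return total > 50 and group == "loyalty"    # looks right, misses the boundary

def candidate_fixed(total, group):
    return total >= 50 and group == "loyalty"

print(check_against_probes(candidate_wrong, PROBES))
# [({'total': 50, 'group': 'loyalty'}, True, False)]
print(check_against_probes(candidate_fixed, PROBES))  # []
```

The failing probe is the useful output: it is a concrete counter-example that can be handed back to the formalizer, and whether the model can repair its rules from it is something you can test.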
When the method applies - a three-part gate
Not every task belongs in this pipeline. Before you reach for it, check three things. The method applies when:
- An independent gold oracle exists or can be derived. You need a source of truth for "correct" that is not the same model that produced the answer - a formal property, a human-verified set, or a separate tool. If your only checker is another LLM call, you get correlated errors and a system that confidently grades itself wrong.
- The rules are specifiable. They can be written down from a spec, not "guessed by meaning." Promo terms, tax tables, eligibility logic - specifiable. "Sounds like a complaint" - not.
- The task is finite and decidable. There is a bounded, computable answer.
Open-world tasks fail the gate: free-form natural language, forecasts with no spec, anything that turns on interpreting intent rather than applying a rule. That is not a weakness of the method - it is the edge of its class, and knowing the edge is what keeps you from misapplying it.
Where it pays off
Rank the opportunities by how badly the current approach hurts and how much auditability is demanded, and a clear order emerges.
Tax and payroll calculation sit at the top. The pain is measured (TaxCalcBench), the class is a clean fit, and people already build it this way: SARA, an academic system, translates statutes and facts into Prolog and runs them on SWI-Prolog so that auditors can inspect the exact reasoning path; OpenFisca is an open-source "rules as code" engine for tax and benefit law used by several national governments, sold explicitly on algorithmic transparency.
Credit underwriting and adverse-action decisions are the strongest case on the regulatory axis. Under the US Equal Credit Opportunity Act and its Regulation B, a lender must state the specific reasons for a denial - not "the algorithm said no." A verifiable rules engine satisfies that by construction; a black-box model does not.
E-commerce - promo, discounts, eligibility, cart rules, pricing, tax - is the same decidable class, and it is where most merchants will actually meet this problem as they wire agents into checkout. I will be honest about a gap here: there is no TaxCalcBench for e-commerce, no published benchmark that measures pure-LLM failure on promo logic specifically. The case rests on class-transfer from the measured domains, not on an external industry number. The architecture argument is strong; the industry-measured "pain" number, for now, is not there.
Below that sit KYC/AML policy checks, fee and commission calculation, regulatory reporting, and returns policy - the class fits, and there are industry signals (Amazon's VeRAFI work on "neurosymbolic policy generation") even if no one has published a clean measurement.
The economics: the model is a compiler, not a calculator
There is a cost objection: capable models are not free, and if you call one on every decision the bill adds up. The answer is to stop thinking per-call.
When the output you want is not a single answer but a solver for a whole class of decisions, the right cost model is amortization. The LLM runs once - as a one-time compiler that turns the natural-language spec into a single verified, parametrized engine. After that, the engine runs on CPU at essentially zero marginal token cost; you inject each instance's data as facts and it computes. In my experiments the total-cost break-even landed around two to three instances - past that, the compiled engine is cheaper and more accurate than calling the model each time. The certified engines generalized cleanly to held-out instances: 32 of 32 on a scheduling class (including correctly proving the unsatisfiable ones), 30 of 30 on a promo-stacking class. This is the same shape as the emerging "Compiled AI" idea - call the model once at build time, then run the workflow as static code with zero tokens per transaction.
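The amortization argument reduces to one line of arithmetic. The sketch below assumes a one-time compile cost, a per-decision cost for calling the model directly, and a near-zero per-run cost for the engine. The function name and the cost figures are hypothetical and in arbitrary units. They are chosen only to show the shape of the calculation, not taken from the experiments.

```python
import math

def break_even_instances(compile_cost, per_call_cost, engine_cost_per_run=0.0):
    """Smallest number of decisions at which compiling once is no more
    expensive than calling the model on every decision.

    Solves compile_cost + n * engine_cost_per_run <= n * per_call_cost for n.
    """
    saving_per_run = per_call_cost - engine_cost_per_run
    if saving_per_run <= 0:
        raise ValueError("no break-even: an engine run costs as much as a model call")
    return math.ceil(compile_cost / saving_per_run)

# Illustrative units only: compiling costs 2.5x one direct model call.
print(break_even_instances(compile_cost=2.5, per_call_cost=1.0))  # 3
```

The break-even point moves with how many repair rounds the compile step needs, which is why the compile cost is the number to measure for your own rule set.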
Two caveats keep this honest. First, certifying that engine is itself capability-gated: of the non-Anthropic model families I tested as the compiler, only one converged reliably; others started at the same place but could not act on the engine's counter-examples to repair their own rules. The gate is the ability to use feedback, not raw one-shot quality. Second, an LLM asked to re-solve the same problem on every call - rather than compile a solver once - is the wrong tool and the data says so. The win is the compiler framing, not the model as a repeated calculator.
The determinism layer agent commerce is missing
Most "add an AI agent to your store" advice quietly assumes the agent should make the decision. For the parts of commerce that move money, that is the one thing it should not do. The agent's real job there is to read the rules and write them down formally; a deterministic engine should decide. You get the model's fluency at the fuzzy edge and the engine's guarantees at the core - accuracy, completeness, and an answer you can defend line by line to an auditor or a customer.
That is a layer the agent-commerce conversation keeps skipping, and it is the one that decides whether any of this is safe to put near a checkout. If you are wiring agents into the money path, the question to ask is not "which model decides?" It is "what does the model formalize, and what executes the contract?" Get that seam right and the rest of the architecture has somewhere solid to stand.
I am currently building an e-commerce project of my own around cart-rule and discount logic - the money-path surface this whole article is about. Real promo rules are where the formalize-don't-decide seam gets stress-tested, and the first place I would scrutinize before letting any agent near a checkout. Designing and building this kind of deterministic layer for merchants is the work I do - if that is the seam you are wrestling with, that is where I can help.