e2e-llm-agents
Canonical methodology for E2E testing built around LLM agents: a four-layer code structure (lib / pages / tests / knowledge), a four-agent pipeline (analyzer → planner → generator → healer), an explicit environment contract, a curated knowledge layer, and the KB-saturation curve as the empirical signal that compounding has happened.
E2E with LLM agents — a methodology
What this page is. The canonical reference for a methodology of E2E testing built around LLM agents. Read it to understand the four-layer structure, the four-agent pipeline, the contract that makes the structure operational, the knowledge layer that lets the system learn between runs, and the doctrines that hold the pieces together.
What this page is not. A tutorial, a war story, or a launch announcement. The case-study evidence — the numbers that justify the choices below — lives in the companion article. The granular per-piece specifications live in the e2e-llm-agents repository.
1. Introduction
E2E testing is the part of the test pyramid where most teams quietly give up. The reasons cluster around the same three failures: tests are flaky, tests are slow, tests are written against assumptions the code no longer matches. Each of these has a familiar workaround — retries: 3, parallel workers, disable-in-CI — and each workaround makes the next round of failures harder to attribute.
This methodology takes a different starting point. It treats E2E not as a test framework problem but as an agent-systems problem. The premise is that LLM agents can write, run, and heal E2E tests reliably if and only if the surrounding system gives them the right kind of context: a small explicit contract about the environment, a small explicit structure about the code, a curated knowledge layer that grows with the project, and a healing loop that produces auditable findings rather than silent retries.
Under those conditions, the agent pipeline produces a measurable compounding effect — the KB-saturation curve — where each subsequent run on the same surface costs less than the last, because the system has learned what it didn't know before. The curve is the empirical signature that the methodology has paid for itself.
The companion article documents the curve across multiple independent applications. This page describes the methodology itself.
2. The contract
E2E tests live or die by their relationship to the environment they run against. Without an explicit contract, every flaky-test debate collapses into the same loop: is this a test bug, an application bug, or a CI quirk? — and nobody can tell, because there is no shared baseline.
The contract makes the baseline explicit. It is a markdown document — ENVIRONMENT.md — read by humans during onboarding and by LLM agents at every planning, generation, and healing invocation. It is not read by the test runner; the runner reads playwright.config.ts.
Seven principles
The contract is short. Seven principles, each one paragraph long, plus a brief why.
1. The container is an external dependency. Tests do not start, stop, or reset the application stack. They assume a running instance at $BASE_URL. If the instance is missing, the preflight check (Principle 4) fails before any spec runs. Lifecycle management is the slowest path to flakiness; decoupling test runs from container orchestration means iteration happens in seconds, not minutes.
2. The database is dirty by default. Demo data is reused across runs. A clean database per run is an anti-pattern for local development — it is expensive in wall-clock time and hides the kind of bugs that actually appear in production (where state has accumulated for months). Test-specific data is isolated by prefix (e2e_*), and seed specs are idempotent.
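As a sketch, the check-then-create shape that makes a seed idempotent (the API routes and the SKU here are hypothetical; only the shape and the e2e_ prefix are the point):

```ts
// _seed/_products.spec.ts: idempotent seed sketch. Routes and SKU are
// hypothetical; the point is check-then-create, plus the e2e_ prefix
// that isolates test data inside a dirty database.
import { test, expect } from '@playwright/test';

test('seed: e2e_ demo product exists', async ({ request }) => {
  // Look the record up first; only create on a miss, so re-runs are no-ops.
  const existing = await request.get('/api/products/e2e_demo_product');
  if (existing.status() === 404) {
    const created = await request.post('/api/products', {
      data: { sku: 'e2e_demo_product', name: 'E2E Demo Product', price: 10 },
    });
    expect(created.ok()).toBeTruthy();
  }
});
```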
3. Sequential execution. workers: 1, retries: 0, fullyParallel: false. Sequential isn't a performance compromise; it is a methodology commitment. Parallel execution against a shared database introduces ordering coupling that is invisible until it isn't. The retries: 0 half of this principle is the most consequential rule in the entire methodology — covered in depth in the no-retries doctrine.
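Expressed in playwright.config.ts (the file the runner actually reads), the principle is three settings. A minimal sketch; a real config also carries project bindings and reporters:

```ts
// playwright.config.ts: minimal sketch of Principle 3, plus the
// $BASE_URL assumption from Principle 1.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  fullyParallel: false, // no interleaving against the shared database
  workers: 1,           // one spec at a time, in lexicographic order
  retries: 0,           // a failure is a finding, never a statistic
  use: {
    baseURL: process.env.BASE_URL, // the running instance is external
  },
});
```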
4. Health check before everything. global-setup.ts makes one HTTP request to a health endpoint before any spec runs. If the application is not responding, no spec runs. Without this, the first failing test produces a cryptic timeout, and every remaining test repeats the same timeout as noise. With this, one clear error appears five seconds in.
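A sketch of that preflight, assuming a /health endpoint (the actual path is whatever the application exposes):

```ts
// global-setup.ts: one HTTP probe before any spec runs (Principle 4).
// The /health path is an assumption; substitute the app's real endpoint.
import type { FullConfig } from '@playwright/test';

export default async function globalSetup(_config: FullConfig) {
  const base = process.env.BASE_URL;
  if (!base) throw new Error('BASE_URL is not set; see ENVIRONMENT.md');
  const res = await fetch(`${base}/health`);
  if (!res.ok) {
    throw new Error(`Preflight failed: ${base}/health returned ${res.status}`);
  }
}
```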
5. Seed vs assertion separation. Test files come in two genres: seed specs (configure state, idempotent) and assertion specs (verify behavior). Seed specs live in a physically separate directory, prefixed with an underscore so they sort first lexicographically. When the two genres share a file, tests become order-dependent and a failure becomes ambiguous (did setup break, or did the behavior break?).
6. Host runner with MCP browser tools. Playwright runs on the host, not inside the application container. During test generation, the LLM agent has access to MCP browser tools — it opens the page, reads the DOM, finds stable selectors. Without this, agents hallucinate selectors based on guesses, and the pipeline breaks on the first run.
7. Session caching with TTL. Login is expensive and shared. The session is cached in a file, with a TTL chosen by the backend's nature — long enough that a single suite run doesn't re-authenticate, short enough that the platform's own session policy hasn't expired the credentials.
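One way to implement the cache, sketched under assumed paths and TTL (lib/session.ts and the 30-minute figure are illustrative choices):

```ts
// lib/session.ts: storage-state cache with a TTL (Principle 7).
// The file location and TTL value are illustrative.
import * as fs from 'node:fs';

export const STATE_FILE = '.auth/session.json';
const TTL_MS = 30 * 60 * 1000; // choose to match the backend's session policy

export function cachedSessionIsFresh(): boolean {
  if (!fs.existsSync(STATE_FILE)) return false;
  return Date.now() - fs.statSync(STATE_FILE).mtimeMs < TTL_MS;
}
```

When the cache is fresh, specs reuse it via Playwright's storageState option; when it is stale, a login helper re-authenticates once and rewrites the file.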
What the contract is for, what it isn't
The contract establishes what tests are entitled to assume about the environment, what tests are forbidden from doing to the environment, and what counts as a contract violation versus a test bug versus an application bug. It is the smallest possible commitment that makes the rest of the methodology coherent.
The contract deliberately does not address test data factories (structural), selector strategy (a generator concern), CI integration (orthogonal), or coverage targets (measurable, but not part of the contract). Adding to the contract is fine; expecting the contract to address every E2E concern is a category error.
The full version of each principle, with verification criteria and explicit forbids, lives in contract-spec.md. The deep version of Principle 3 — why retries: 0 is methodology rather than configuration — lives in no-retries-doctrine.md.
3. The four-layer structure
The contract says what tests do and do not do. The structure says where the artifacts of doing those things physically live. Four layers, one-way dependency direction, two extraction rules. The whole thing fits on one diagram:
┌──────────────┐
│  knowledge/  │  ← read by LLM agents,
│              │    NOT imported by tests
└──────────────┘
       ↑
┌──────────┐      ┌──────────┐      ┌────────┐
│  tests/  │ ──→  │  pages/  │ ──→  │  lib/  │
└──────────┘      └──────────┘      └────────┘
lib/ — stateless utilities
Functions like waitForLoadingMask, signInAsAdmin, addProductToCart. No business-module names. If a function in lib/ is called setupCheckoutTaxConfigForRegion, it does not belong in lib/ — it belongs in a Page Object or a flow. Stateless and module-agnostic is the constraint: everything else depends on lib/, and lib/ depends on nothing else.
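The flavor of a lib/ function, as a sketch (the .loading-mask selector is illustrative; the real selector is KB knowledge):

```ts
// lib/wait.ts: stateless and module-agnostic. Knows nothing about carts,
// checkouts, or admin grids; any spec on any surface may call it.
import type { Page } from '@playwright/test';

export async function waitForLoadingMask(page: Page): Promise<void> {
  // Resolve once no loading overlay is visible on the page.
  await page.waitForSelector('.loading-mask', { state: 'hidden' });
}
```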
pages/ — Page Objects
Stateful — they hold a page reference and offer methods like getCartTotal(). Page Objects depend on lib/ for low-level helpers, and on nothing else. They are read by tests, not by other Page Objects. Importantly, they are extracted only after the third real use of the same pattern (see Rule of Three).
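A sketch of the shape (the class and selector names are illustrative):

```ts
// pages/CartPage.ts: stateful, holds the page, depends only on lib/.
import type { Page } from '@playwright/test';
import { waitForLoadingMask } from '../lib/wait';

export class CartPage {
  constructor(private readonly page: Page) {}

  async open(): Promise<void> {
    await this.page.goto('/cart');
    await waitForLoadingMask(this.page); // lib/ helper, never another Page Object
  }

  async getCartTotal(): Promise<string> {
    return this.page.locator('[data-test="cart-total"]').innerText();
  }
}
```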
tests/ — the specs
Split into _seed/ (idempotent setup), modules/<feature>/ (per-feature assertion specs with a plan.md), smoke/ and flows/ (optional, cross-feature). Tests depend on pages/ and lib/. Tests never depend on other tests, and tests never depend on knowledge files — the knowledge layer is read by agents, not by runtime code.
knowledge/ — the fourth layer
Markdown reference for LLM agents: selectors, AJAX patterns, CSS-class quirks, gotchas. Read at planning, generation, and healing time. Never imported by tests. The firewall is what keeps this layer useful — it is data for agents, not code for the runtime.
Dependency direction and Rule of Three
tests/ ──→ pages/ ──→ lib/
Reverse edges are forbidden. A cycle is the moment reusability dies — within two or three iterations, lib/ and pages/ melt into one mass and the methodology can no longer answer "where does this helper go".
The Rule of Three governs extraction: one use, leave inline; two uses, leave duplicated; three uses, extract to lib/ (if stateless) or pages/ (if stateful). After two uses you don't yet know what is shared and what is coincidental; the third use shows the real shape.
A single playwright.config.ts can serve multiple orthogonal surfaces — not "different browsers", but different DOMs (a back-office surface, a legacy MVC storefront, a modern reactive storefront, an admin app in a different stack). The multi-project pattern keeps the structure flat: one set of layers, multiple project bindings.
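A sketch of the binding (surface names and match patterns are illustrative):

```ts
// playwright.config.ts (excerpt): one repo, several orthogonal DOMs.
// Each project routes to its own specs; the four layers stay shared.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    { name: 'admin',      testMatch: /modules\/admin\/.*\.spec\.ts$/ },
    { name: 'storefront', testMatch: /modules\/storefront\/.*\.spec\.ts$/ },
  ],
});
```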
The full version of this layer, with verification criteria and the multi-project routing pattern, lives in structure-spec.md.
4. The four-agent pipeline
The methodology runs as four agents in sequence: analyzer → planner → generator → healer. Each agent has one cognitive mode. Each agent ingests specific inputs and emits specific outputs. The four are orchestrated by four skills (e2e-plan, e2e-generate, e2e-heal, e2e-coverage), one per workflow.
Analyzer
The first agent. Discovery: scan the source tree, identify modules, routes, schema, dependencies, and write the results as persistent JSON artifacts in e2e/.state/. The analyzer is cheap and cacheable — on every invocation, it first checks whether its previous output is still fresh; if it is, the scan is skipped.
The analyzer is the only agent with this kind of caching layer. Most cycles operate on a stable codebase; re-scanning every time is waste. The persistent state (modules.json, schema-map.json, project-auth.yml, project-seed.yml) is read by the planner, the generator, and the healer — each consuming what it needs without re-doing the discovery.
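The methodology does not prescribe how freshness is checked; one plausible gate compares artifact and source timestamps:

```ts
// Hypothetical freshness gate for e2e/.state/ artifacts: re-scan only
// when some source file is newer than the cached analysis output.
import * as fs from 'node:fs';

export function stateIsFresh(artifact: string, newestSourceMtimeMs: number): boolean {
  if (!fs.existsSync(artifact)) return false;
  return fs.statSync(artifact).mtimeMs >= newestSourceMtimeMs;
}
```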
Planner
The second agent. Reasoning: turn what the analyzer discovered, plus what the knowledge base knows, into a concrete plan.md — a numbered list of test cases for one feature, written before any spec file is touched. The plan is human-readable, agent-readable, and reviewable. It is the artifact that answers whether the tests being generated are the right tests — not whether they're written well.
Planning is a distinct phase, not a step inside the generator, because planning and code-generation are different cognitive modes. Planning needs broad context (feature semantics, edge cases, KB-flagged hazards). Generation needs narrow context (the exact selector for one button on one page). Trying to do both in one prompt either over-prompts the generator or under-prompts the planner.
Generator
The third agent. Code emission: take plan.md and produce *.spec.ts files. The defining constraint on the generator is ground-truth selector discipline — every selector that appears in a generated spec must have been observed in the running application via MCP browser tools, not inferred from source code or guessed from a screenshot.
The generator does not invent test cases (that was the planner's job). It does not interpret failures (that's the healer's job). It operates close to the metal: per plan case, emit code that compiles and runs against the actual application.
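The shape of the output, sketched (the selectors stand in for values an agent would have observed in the live DOM; they are not from a real application):

```ts
// modules/cart/cart-total.spec.ts: one plan case, one spec.
// Every selector below represents a value observed via MCP browser
// tools during generation, never inferred from source code.
import { test, expect } from '@playwright/test';

test('plan case 3: adding a product updates the cart total', async ({ page }) => {
  await page.goto('/product/e2e_demo_product');
  await page.locator('[data-test="add-to-cart"]').click();
  await page.goto('/cart');
  await expect(page.locator('[data-test="cart-total"]')).toHaveText('$10.00');
});
```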
Healer
The fourth and most consequential agent. Diagnosis: run the generated specs, observe failures, attribute each failure to one of three categories — test-bug, app-bug, env-drift — and write a structured heal-finding that captures the audit trail.
The three-category attribution is what makes the no-retries doctrine actionable. Each category has a different remediation path: test-bug is fixed in the spec, app-bug is reported with root_cause_slug and the spec stays failing as a true positive, env-drift is surfaced for environment or contract repair. The healer applies or proposes the fix and writes the finding.
The healer is also the agent that proposes KB candidates — when a failure reveals a gotcha future tests should know about. The candidate is not auto-promoted; it sits in _inbox/ until an editorial gate decides whether it generalizes.
Why four, not three
Earlier versions of this methodology used three agents (planner / generator / healer) and folded discovery into the planner. The four-agent split was empirical: planning prompts that also did discovery were noticeably worse at both jobs. Separating discovery into the analyzer made each phase smaller, cheaper, and individually skippable (the analyzer caches; the planner skips when the plan exists; the healer skips when the run is green).
The four-agent pipeline produces measurable artifacts at every boundary: e2e/.state/*.json after analysis, plan.md after planning, *.spec.ts after generation, a six-section heal-finding after healing. Each is reviewable. Each is comparable across runs.
Per-agent depth lives in the four agent-role specs: agent-role-analyzer.md, agent-role-planner.md, agent-role-generator.md, agent-role-healer.md. Skill-level orchestration lives in skill-design.md.
5. The knowledge layer
The knowledge base is the fourth layer of the structure and the input that makes agents learn between runs. KB files are YAML documents read by LLM agents at planning, generation, and healing time. They are never imported by test code.
Two categories of entry
Every KB entry is one of two kinds. Gotchas are prose advisories — "when clicking a button inside a modal on the admin surface, wait for the loading-mask overlay or you will click the wrong element". Gotchas are advisory; an agent reads them as context. Lint patterns are machine-checkable rules with a severity — "if a spec calls page.click() on a .btn-primary selector without first awaiting waitForLoadingMask(), that's a warning". Lint patterns are runnable as a static analysis pass.
A piece of knowledge starts as a gotcha. It graduates to a lint pattern only after it has fired enough times to clearly generalize. Lint-as-KB is the discipline of this promotion path.
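Because lint patterns are machine-checkable, they can run as a plain static pass. A sketch of the rule quoted above (the file layout and the rule's exact shape are assumptions):

```ts
// Hypothetical static pass for one lint pattern: warn when a spec
// clicks .btn-primary without first awaiting waitForLoadingMask().
import * as fs from 'node:fs';

export function checkLoadingMaskRule(specPath: string): string[] {
  const src = fs.readFileSync(specPath, 'utf8');
  const clicksPrimary = src.includes('.btn-primary') && src.includes('.click(');
  const awaitsMask = src.includes('waitForLoadingMask(');
  return clicksPrimary && !awaitsMask
    ? [`${specPath}: warning: .btn-primary click without waitForLoadingMask()`]
    : [];
}
```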
Project-local vs cross-stack
KB has two homes. Project-local KB lives inside the project's e2e/knowledge/ — it is about this codebase: authentication patterns, seed fixtures, business-domain quirks. Cross-stack KB lives outside any single project, in a shared location — it is about a technology: a UI framework's selector patterns, a CSS framework's class quirks, a specific dev tool's interaction model. Cross-stack KB generalizes across every project that uses that technology.
The same agent reads both. The split is what produces cross-project knowledge transfer: when you start a new project on a familiar technology, the cross-stack KB applies on day one. The project-local KB starts empty and fills as the codebase reveals its specifics.
sources.yml routing
A sources.yml file describes which KBs apply to which surfaces:
universal:
  - project-auth.yml
  - project-seed.yml
by_surface:
  admin:
    - modules/admin.yml
    - platform/admin-framework.yml
  storefront:
    - modules/storefront.yml
    - platform/alpine-js.yml
    - platform/tailwind-css.yml
The planner reads sources.yml and loads only the KBs relevant to the test target. Multi-app projects extend the pattern with a kb_by_app dimension — a FastAPI backend and a Next.js admin in the same repository get disjoint KB sets without cross-contamination.
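How an agent-side loader can resolve that routing, as a sketch (the loader itself is not specified by the methodology; the yaml package and the types here are assumptions):

```ts
// Hypothetical resolver: the KB set for one surface is the universal
// files plus that surface's own list, read deterministically and in full.
import * as fs from 'node:fs';
import { parse } from 'yaml';

interface Sources {
  universal: string[];
  by_surface: Record<string, string[]>;
}

export function kbFilesFor(surface: string): string[] {
  const raw = fs.readFileSync('e2e/knowledge/sources.yml', 'utf8');
  const sources = parse(raw) as Sources;
  return [...sources.universal, ...(sources.by_surface[surface] ?? [])];
}
```

With the sources.yml above, kbFilesFor('storefront') yields the two universal files plus the three storefront entries; nothing from the admin set leaks in.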
Editorial promotion
The single most important rule in the KB:
Auto-promotion of healer candidates into the active KB is forbidden.
Auto-promotion optimizes for recall over precision. The resulting KB ends up describing the system's errors rather than what is true. The agent then retrieves contradictory advice (because every fix becomes a "principle"), and the compounding curve reverses — saturation goes down instead of up.
Promotion is editorial: the healer writes a candidate to _inbox/; a reviewer asks two questions (does this generalize? and is it not already covered?); the candidate is moved into the active KB only if the answers are yes. This editorial gate is what keeps the KB useful as the project grows.
The full schema and the promotion rules live in knowledge-format.md. The promotion path from gotcha to lint pattern is the subject of lint-as-kb.md.
6. The healing loop and saturation
The healer produces one artifact per run: a markdown file in e2e/.state/heal-findings/, structured in six sections. Together, the findings form the audit trail of how the agent pipeline learns.
Six sections per finding
A. Diagnosis matrix. A table: tests on one axis, surface projects on the other. Pass / fail / skip / N/A in each cell. The matrix is the first thing a reviewer sees — it answers "what failed and where" before any narrative starts.
B. Hypothesis on root causes. For each failure in the matrix, a working theory. Each hypothesis names an attribution category (test-bug / app-bug / env-drift).
C. Healing action + decision rationale. What the healer did. For test-bug failures, the action is to fix the spec. For app-bug failures, the action is to file the bug with root_cause_slug — the spec stays failing as a true positive. For env-drift, the action is to fix the environment or update the contract.
D. Verification checklist. How to confirm each action worked. A checklist, not prose. This is what makes the audit trail closeable.
E. KB candidates. Gotchas worth promoting (subject to the editorial gate). Candidates sit in _inbox/; they are reviewed before promotion.
F. Out-of-scope siblings. Failures or anomalies that appeared during the run but are not the focus of this finding. The explicit "see this, don't act on it now" channel — without it, observations either clutter the main narrative or get dropped entirely.
Even green runs get a six-section finding. A run with no finding is itself a gap in the evidence.
Why this produces compounding
Every heal-finding's Section E feeds the KB's _inbox/. The editorial gate either promotes or rejects each candidate. Promoted entries become available to the next run's planner and generator. The next run on the same surface starts from a richer KB, and first_try_pass_rate (the metric defined below) rises.
This is the compounding mechanism. It depends on three preconditions, each of which has its own dedicated spec:
- Three-category attribution (the no-retries doctrine) — without it, failures become "flaky", and the healer has nothing structured to record.
- Editorial promotion (the KB rule above) — without it, the KB becomes an error log, and the compounding curve reverses.
- Per-run findings (the six-section discipline) — without it, the audit trail is missing, and the next reviewer cannot follow the chain.
The KB-saturation curve
Charting first_try_pass_rate across runs of the same surface produces the curve:
first_try_pass_rate
       ↑
1.0   │        ───●─────●── ─ ─
      │        ●───
      │     ●
0.5   │
      │
0.0   │  ●
      └──┴──┴──┴──┴──┴──┴──┴──┴──→ run index
         1  2  3  4  5  6  7  8
The plateau is saturation — the point at which the methodology has paid for itself on that surface. Further runs no longer improve, because there is no more domain-specific knowledge left to discover. Saturation is the empirical signal that compounding has happened.
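Pending the full definition in metric-design.md, a plausible reading of the metric for run r over spec set S_r:

first_try_pass_rate(r) = |{ s ∈ S_r : s passes its first execution, with no healer intervention }| / |S_r|

On that reading, a spec that passes only after healing does not count toward the numerator, which is what makes the curve a measure of accumulated knowledge rather than of healing effort.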
The full definition of first_try_pass_rate, the root_cause_slug discipline, the metrics.jsonl schema, and the conditions under which the curve actually appears all live in metric-design.md. The six-section finding structure lives in heal-findings-structure.md.
7. Alignment with the Six Principles
This methodology is a specific application of a more general framework for agent systems that don't hallucinate — the Six Principles meta article. The connection is not decorative. Each load-bearing choice below has a direct counterpart in the meta principles.
Architecture, not prompt quality
The four-layer structure, the four-agent pipeline, the seven-principle contract — these are architectural commitments. The methodology does not depend on clever prompting to recover from structural ambiguity; it depends on giving each agent a small, explicit, repeated context. Prompt quality is downstream of architecture, not upstream.
Deterministic context over probabilistic retrieval
The knowledge layer is a curated flat KB, not a RAG system. Every agent reads the relevant KBs in full, deterministically, via sources.yml routing. The reason is debuggability: with a flat KB, you know exactly what the agent saw. With RAG, you have to reconstruct whether the critical paragraph made it through similarity scoring — and the reconstruction step is itself error-prone.
Closed-loop validation with a hard signal
The healer runs the generated tests and reads the runner's hard pass/fail signal. There is no judgment-based validator in the loop. A test passes or fails; a build compiles or doesn't; a query returns rows or doesn't. This deterministic validator is what makes the saturation curve appear — without it, the loop has no fixed point to converge to.
Three-category attribution
A failure is not a generic "the agent got it wrong". It is exactly one of three things — test-bug, app-bug, env-drift — and the remediation path differs by category. Without explicit attribution, two different fix loops collide through one interface, and the noise never settles.
Editorial gates over auto-promotion
The KB is not a log. Auto-promoting every gotcha would optimize for recall and produce a KB that explains the system's errors rather than what is true. The editorial gate (and the lint-as-KB promotion path) keeps precision high — and precision is what compounds.
Multi-run measurement
Single-run anecdotes are almost useless for evaluating whether an architecture is sound. The methodology measures across runs (the KB-saturation curve), and the only honest claim to effectiveness is the curve actually appearing — on multiple surfaces, on multiple stacks.
The companion article documents that curve appearing on more than one independent application. This page describes the methodology that produces the curve.
8. Read further
- The companion article — empirical case study. The numbers behind the curve, two independent platforms, what the no-retries doctrine surfaced in practice. (URL: pending publication)
- The principles repository — granular specs and light illustrative examples: https://github.com/webmaster-ramos/e2e-llm-agents.
- The Six Principles meta article — the general framework this methodology is an application of: https://webmaster-ramos.com/blog/six-principles-agent-systems.
The repository specs are the deepest layer. The companion article is the most concrete. This page is the binding — the canonical narrative that connects the two.