Your LLM eval suite is probably testing the wrong thing

Date: 2026-06-13
Post 4 of ~27 (build-in-public, intermittent cadence)

If you write code that calls an LLM, you've hit this fork: do your tests call the real model, or do they mock it? Both answers are bad, and the reason they're both bad is that they're answering two different questions as if they were one.

I built a small eval harness to draw the line where it belongs. It's open source: github.com/dmic23/agent-evals. This is the thinking behind it.

The bad fork

Call the real model on every test run. Now CI is flaky (the model is non-deterministic), slow (network plus inference), and costs money on every push. Worse: a green run only tells you the provider was up and behaved today. It doesn't tell you your code is correct, and it can go red for reasons that have nothing to do with your change.

Mock the model heavily. Now tests are fast and deterministic, but they assert against a fiction you wrote. You've stopped testing the integration. A real schema-shape bug or a parsing regression walks straight through.

The mistake underneath both is treating "is my code correct?" and "is the model good?" as one question. They're different questions, with different answers, different cadences, and different costs. They deserve different machinery.

Two kinds of eval

So the harness splits them:

Contract eval — does my code's contract around the model still hold? Given a known input, does my tool return the right shape: score is a number in range, severity is a valid enum, required fields present, no thrown error, no refusal stub leaking through, no banned phrase? This is deterministic and has nothing to do with whether the model is smart — only whether my handling of it is correct. It should run on every push: fast, offline, free.
Quality eval — is the output actually good? Is the reasoning grounded? Is the summary faithful? Inherently non-deterministic, judgment-laden, scored by an LLM judge. You run it on demand, spend tokens, accept variance. It must not gate CI, or CI becomes a flaky token meter.
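As a rough illustration of the contract side, here is a minimal sketch of a shape check. It is not the harness's actual code: the field names (score, severity, summary), the 0 to 1 score range, the severity values, and the refusal and banned-phrase lists are all assumptions made up for the example.

```typescript
// Hypothetical contract check; every name and threshold here is illustrative.
const SEVERITIES: readonly string[] = ["low", "medium", "high"]; // assumed enum
const REFUSAL_PREFIXES = ["i cannot", "i can't"]; // assumed refusal markers
const BANNED_PHRASES = ["as an ai language model"]; // assumed banned phrase

// Returns the list of contract violations; an empty list means the contract holds.
function checkContract(output: unknown): string[] {
  if (typeof output !== "object" || output === null) {
    return ["output is not an object"];
  }
  const fields = output as Record<string, unknown>;
  const score = fields["score"];
  const severity = fields["severity"];
  const summary = fields["summary"];
  const errors: string[] = [];

  // Score must be a real number inside the assumed range.
  if (typeof score !== "number" || !Number.isFinite(score) || score < 0 || score > 1) {
    errors.push("score must be a finite number in [0, 1]");
  }
  // Severity must be one of the known enum values.
  if (typeof severity !== "string" || !SEVERITIES.includes(severity)) {
    errors.push("severity must be one of: " + SEVERITIES.join(", "));
  }
  // Summary is required, and must not be a refusal or contain a banned phrase.
  if (typeof summary !== "string" || summary.trim() === "") {
    errors.push("summary is required");
  } else {
    const text = summary.trim().toLowerCase();
    if (REFUSAL_PREFIXES.some((p) => text.startsWith(p))) {
      errors.push("summary looks like a refusal stub");
    }
    if (BANNED_PHRASES.some((p) => text.includes(p))) {
      errors.push("summary contains a banned phrase");
    }
  }
  return errors;
}
```

Nothing in that check asks whether the summary is any good. It only asks whether the code around the model handed back something the rest of the system can consume, which is why it can run deterministically against a recorded response.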

Keeping these apart is the whole point. The contract gate gives you the thing unit tests give everyone else — a fast, trustworthy "did I break it?" — for code that happens to call a model. The quality eval stays available for the slower, fuzzier question, run when you actually want to ask it.

Three modes over one VCR

The split is enforced by three run modes over a single fetch-boundary recorder:

replay serves previously recorded provider responses from disk. No network, no API key, bit-stable. This is the CI gate. A stranger clones the repo and pnpm eval is green in seconds, no key required.
record makes the real call once and writes a fixture. You run it deliberately, when a prompt or schema change has legitimately changed the request.
live skips the recorder and calls the provider for real. The only mode that catches model drift — and the reason it's deliberately off the gate.
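The three modes can be sketched as a single dispatch function. This is a simplified illustration under stated assumptions, not the repo's implementation: fixtures live in an in-memory Map rather than on disk, the provider is an injected function instead of a patched fetch, and the names (throughRecorder, Fixture, Provider) are hypothetical.

```typescript
type Mode = "replay" | "record" | "live";

interface Fixture {
  status: number;
  body: string;
}

// Stand-in for the real provider call; injected so the sketch stays offline.
type Provider = (requestBody: string) => Promise<Fixture>;

async function throughRecorder(
  mode: Mode,
  key: string, // fixture key, e.g. a hash of the canonical request
  requestBody: string,
  store: Map<string, Fixture>, // assumption: in-memory store instead of files
  provider: Provider
): Promise<Fixture> {
  switch (mode) {
    case "replay": {
      // Never touches the provider. A missing fixture is an error, not a fallthrough.
      const hit = store.get(key);
      if (hit === undefined) {
        throw new Error(`no fixture for key ${key}; record one deliberately`);
      }
      return { status: 200, body: hit.body };
    }
    case "record": {
      // One real call, persisted only if it succeeded.
      const response = await provider(requestBody);
      if (response.status < 200 || response.status >= 300) {
        throw new Error(`refusing to record non-2xx response (${response.status})`);
      }
      store.set(key, response);
      return response;
    }
    case "live":
      // Bypasses the store entirely.
      return provider(requestBody);
  }
}
```

The property that matters is in the replay branch: there is no code path from replay to the provider, so a missing or stale fixture surfaces as a failure instead of a quiet network call.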

The fixture is keyed by a canonical, redacted hash of the request: key-sorted body, with caller-declared volatile fields (a timestamp, a request id) masked. So a clock tick doesn't bust the fixture, but a real prompt edit does — which is the signal to re-record. The API key travels as a header and is never hashed and never written to a fixture.
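To make the keying concrete, here is a minimal sketch of a canonical, redacted request hash. The details are assumptions for illustration: SHA-256 as the hash, masking by field name at any depth, and the function names canonicalize and fixtureKey. The harness may do any of these differently.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Recursively sorts object keys and replaces volatile fields with a fixed marker,
// so two requests that differ only in key order or volatile values look identical.
function canonicalize(value: Json, volatile: ReadonlySet<string>): Json {
  if (Array.isArray(value)) {
    return value.map((item) => canonicalize(item, volatile));
  }
  if (value !== null && typeof value === "object") {
    const sorted: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = volatile.has(key) ? "<masked>" : canonicalize(value[key], volatile);
    }
    return sorted;
  }
  return value;
}

// Hashes only the request body. Headers (and so the API key) never enter the key.
function fixtureKey(body: Json, volatileFields: string[] = []): string {
  const canonical = JSON.stringify(canonicalize(body, new Set(volatileFields)));
  return createHash("sha256").update(canonical).digest("hex");
}

// Same prompt, different key order and timestamp: same fixture key.
const first = fixtureKey({ model: "m", prompt: "hi", timestamp: 1 }, ["timestamp"]);
const second = fixtureKey({ timestamp: 2, prompt: "hi", model: "m" }, ["timestamp"]);
// An edited prompt: different fixture key, which is the cue to re-record.
const edited = fixtureKey({ model: "m", prompt: "hello", timestamp: 1 }, ["timestamp"]);
console.log(first === second, first === edited);
```

Masking with a fixed marker rather than deleting the field keeps the request's shape in the hash: adding or removing a volatile field still changes the key, while its value does not.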

A worked bug, because that's the point

A structured-output tool of mine asked a provider to return JSON against a generated schema. Five separate schema-shape incompatibilities made the call throw in production: an additionalProperties key the API rejects, a stray name key, an unsupported field, validation keywords that aren't allowed, and a schema long enough that it had to be truncated before validation, not after. Each one a silent 500.

A replay contract case now pins the contract that survived those fixes: a real input comes back as a parsed object with the right shape — never a thrown parse error, never an "I cannot" stub — and it stays green and offline on every push. A regression that reintroduced any of those bugs would either change the request hash (forcing an honest re-record) or fail the assertions. That's a class of bug a live eval catches only sometimes, slowly, and for money. A contract gate catches it every time, instantly, for free.

The unglamorous parts are the taste

The decisions I'd defend in a review are the boring, fail-closed ones:

Record refuses to persist a non-2xx response. Replay always serves a fixture as HTTP 200, so baking in a 429 or 500 body would disguise a real outage as a permanently green test. Record throws instead of writing the fixture.
Negative assertions fail closed. "Must not contain X" fails when the field doesn't resolve, rather than passing because the field was absent. An absent field satisfying a safety check is exactly the false comfort evals exist to kill.
No silent network fallthrough. An invalid mode, or a recorder nesting inside an active one, throws — because the dangerous failure would be quietly making a real call.
The judge runs through the same recorder, so a quality assertion is usable on a deterministic gate: on replay, the judge's verdict is itself a fixture.
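The fail-closed negative assertion is the easiest of these to get backwards, so here is a minimal sketch of it. The dot-path syntax and the names resolvePath and assertNotContains are hypothetical, chosen for the example rather than taken from the harness.

```typescript
interface AssertionResult {
  pass: boolean;
  reason: string;
}

// Walks a dot-separated path; returns undefined as soon as a segment is missing.
function resolvePath(root: unknown, path: string): unknown {
  let current: unknown = root;
  for (const segment of path.split(".")) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[segment];
  }
  return current;
}

// "Field must not contain X", failing closed when the field is absent or not a string.
function assertNotContains(output: unknown, path: string, banned: string): AssertionResult {
  const value = resolvePath(output, path);
  if (typeof value !== "string") {
    // The fail-open version would return pass: true here.
    return { pass: false, reason: `field "${path}" did not resolve to a string` };
  }
  if (value.toLowerCase().includes(banned.toLowerCase())) {
    return { pass: false, reason: `field "${path}" contains "${banned}"` };
  }
  return { pass: true, reason: "ok" };
}
```

With the fail-open version, renaming result.summary to result.text would turn every "must not contain" check on the old path into a permanent pass. Here the same rename turns them all red, which is the behavior you want from a safety check.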

What it isn't

This isn't a new category, and I'm not pretending it is. promptfoo and OpenAI Evals are richer declarative runners; Braintrust is a full hosted platform with datasets, scoring, and dashboards; Vitest snapshots cover deterministic output. What this harness commits to that those don't foreground: the contract/quality split as a first-class run-mode boundary, a key-free offline replay gate that drops into any CI, and fail-closed defaults everywhere. It deliberately omits dataset management, dashboards, and significance testing; adding those would turn it into the products it points you to. It stays a readable ~500-line pattern on purpose. When you outgrow it, the case format is boring, so the migration is easy.

Code, README, and a detailed explainer: github.com/dmic23/agent-evals.