
Your LLM eval suite is probably testing the wrong thing

Date: 2026-06-13. Post 4 of ~27 (build-in-public, intermittent cadence).

If you write code that calls an LLM, you've hit this fork: do your tests call the real model, or do they mock it? Both answers are bad, and for the same reason: each one answers two different questions as if they were one.

I built a small eval harness to draw the line where it belongs. It's open source: github.com/dmic23/agent-evals. This is the thinking behind it.

The bad fork

Call the real model on every test run. Now CI is flaky (the model is non-deterministic), slow (network plus inference), and costs money on every push. Worse: a green run only tells you the provider was up and behaved today. It doesn't tell you your code is correct, and it can go red for reasons that have nothing to do with your change.

Mock the model heavily. Now tests are fast and deterministic, but they assert against a fiction you wrote. You've stopped testing the integration. A real schema-shape bug or a parsing regression walks straight through.

The mistake underneath both is treating "is my code correct?" and "is the model good?" as one question. They're different questions, with different answers, different cadences, and different costs. They deserve different machinery.

Two kinds of eval

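So the harness splits them into a contract gate, which asks "is my code correct?", and a quality eval, which asks "is the model good?".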

Keeping these apart is the whole point. The contract gate gives you the thing unit tests give everyone else — a fast, trustworthy "did I break it?" — for code that happens to call a model. The quality eval stays available for the slower, fuzzier question, run when you actually want to ask it.

Three modes over one VCR

The split is enforced by three run modes over a single fetch-boundary recorder:

The fixture is keyed by a canonical, redacted hash of the request: key-sorted body, with caller-declared volatile fields (a timestamp, a request id) masked. So a clock tick doesn't bust the fixture, but a real prompt edit does — which is the signal to re-record. The API key travels as a header and is never hashed and never written to a fixture.
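A minimal sketch of that keying, to make the idea concrete. The names here (fixtureKey, canonicalize, the mask string) are hypothetical and not taken from the repo, and masking volatile fields by key name at any depth is an assumption; the real harness may scope them differently.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Placeholder written in place of any volatile value (hypothetical choice).
const MASK = "<volatile>";

// Recursively sort object keys and mask caller-declared volatile fields,
// so logically equal requests serialize to the same string.
function canonicalize(value: Json, volatile: ReadonlySet<string>): Json {
  if (Array.isArray(value)) {
    return value.map((item) => canonicalize(item, volatile));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) {
      out[key] = volatile.has(key) ? MASK : canonicalize(value[key], volatile);
    }
    return out;
  }
  return value;
}

// Only the request body is hashed. Headers, including the API key,
// never enter this function, so they cannot leak into a key or fixture.
function fixtureKey(body: Json, volatileFields: string[] = []): string {
  const canonical = JSON.stringify(canonicalize(body, new Set(volatileFields)));
  return createHash("sha256").update(canonical).digest("hex");
}
```

With this shape, two requests that differ only in key order or in a masked timestamp share a key, while any change to the prompt produces a new one.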

A worked bug, because that's the point

A structured-output tool of mine asked a provider to return JSON against a generated schema. Five separate schema-shape incompatibilities made the call throw in production: an additionalProperties key the API rejects, a stray name key, an unsupported field, validation keywords that aren't allowed, and a schema long enough that it had to be truncated before validation, not after. Each one surfaced as a silent 500.

A replay contract case now pins the contract that survived those fixes: a real input comes back as a parsed object with the right shape — never a thrown parse error, never an "I cannot" stub — and it stays green and offline on every push. A regression that reintroduced any of those bugs would either change the request hash (forcing an honest re-record) or fail the assertions. That's a class of bug a live eval catches only sometimes, slowly, and for money. A contract gate catches it every time, instantly, for free.
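The assertions in a case like that can be small. Here is a sketch of one possible check, under stated assumptions: checkContract, the ContractResult type, and the refusal regex are all hypothetical names for illustration, and the real harness's case format may look different. It takes the raw model output from a replayed fixture and reports a failure instead of throwing.

```typescript
// Hypothetical result type: either a parsed object or a reason for failure.
type ContractResult =
  | { ok: true; value: Record<string, unknown> }
  | { ok: false; reason: string };

// Assumed pattern for an "I cannot" style refusal stub.
const REFUSAL = /^\s*i\s+(cannot|can't|can not)\b/i;

function checkContract(raw: string, requiredKeys: string[]): ContractResult {
  // A bare refusal in place of JSON fails the contract.
  if (REFUSAL.test(raw)) {
    return { ok: false, reason: "refusal stub" };
  }

  // A parse failure is reported, never thrown.
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "parse error" };
  }

  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return { ok: false, reason: "not an object" };
  }
  const obj = parsed as Record<string, unknown>;

  // Shape check: every required key must be present.
  const missing = requiredKeys.filter((key) => !(key in obj));
  if (missing.length > 0) {
    return { ok: false, reason: `missing keys: ${missing.join(", ")}` };
  }

  // A refusal smuggled into a top-level string field also fails.
  for (const value of Object.values(obj)) {
    if (typeof value === "string" && REFUSAL.test(value)) {
      return { ok: false, reason: "refusal stub" };
    }
  }

  return { ok: true, value: obj };
}
```

Run against a recorded response, a check like this needs no network and no key, which is what lets it sit in the per-push gate.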

The unglamorous parts are the taste

The decisions I'd defend in a review are the boring, fail-closed ones:

What it isn't

This isn't a new category, and I'm not pretending it is. promptfoo and OpenAI Evals are richer declarative runners; Braintrust is a full hosted platform with datasets, scoring, and dashboards; vitest snapshots cover deterministic output. What this harness commits to that those don't foreground: the contract/quality split as a first-class run-mode boundary, a key-free offline replay gate that drops into any CI, and fail-closed defaults everywhere. It deliberately omits dataset management, dashboards, and significance testing, because adding those would turn it into the products it points you to. It stays a readable ~500-line pattern on purpose. When you outgrow it, the case format is boring, so the migration is easy.

Code, README, and a detailed explainer: github.com/dmic23/agent-evals.