Jahanzaib
Evaluation

Eval Harness

A test suite that runs the model against a fixed set of inputs and grades outputs automatically.

Last updated: April 26, 2026

Definition

An eval harness is the equivalent of unit tests for an LLM-powered system. You build a "golden dataset": input/expected-output pairs covering happy paths and edge cases. The harness runs each input through your agent and grades the outputs, often using another LLM as a judge. Good evals catch regressions when you change prompts or models. Without an eval harness, you tweak prompts and pray; with one, you A/B test rationally.
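A golden dataset can be as simple as a list of labeled cases. Here is a minimal sketch of one possible shape; the `EvalCase` name and its fields are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    id: str        # stable identifier, used to track scores over time
    input: str     # the prompt or user message fed to the agent
    expected: str  # reference answer the judge compares against

# A few hand-written cases covering a happy path and an edge case.
golden_dataset = [
    EvalCase(
        id="refund-basic",
        input="How do I request a refund?",
        expected="Direct the user to the refunds page and state the 30-day window.",
    ),
    EvalCase(
        id="empty-input",
        input="",
        expected="Ask the user to clarify their question.",
    ),
]
```

Stable `id` values matter more than they look: they let you diff per-case scores between two prompt versions instead of comparing one aggregate number.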

Code Example

python
import asyncio

async def run_evals():
    for case in golden_dataset:
        output = await agent.run(case.input)
        score = judge_llm.score(
            question=case.input,
            expected=case.expected,
            actual=output,
        )
        record(case.id, score)

asyncio.run(run_evals())

Run all golden cases on every prompt change, and track scores over time.
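The loop above depends on an external agent and judge. To make the pattern concrete end to end, here is a self-contained sketch where a placeholder agent and a toy token-overlap judge stand in for the real LLM calls; every name in it (`agent_run`, `judge`, `run_suite`, the cases) is illustrative, and in practice the judge would be another LLM call:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Case:
    id: str
    input: str
    expected: str

golden = [
    Case("greet", "say hello", "hello"),
    Case("sum", "add 2 and 3", "5"),
]

async def agent_run(prompt: str) -> str:
    # Placeholder agent: returns a canned answer per prompt.
    canned = {"say hello": "hello", "add 2 and 3": "5"}
    return canned.get(prompt, "")

def judge(expected: str, actual: str) -> float:
    # Toy judge: 1.0 on exact match, otherwise token-overlap ratio.
    # A production harness would call an LLM here instead.
    if expected == actual:
        return 1.0
    exp, act = set(expected.split()), set(actual.split())
    return len(exp & act) / max(len(exp), 1)

async def run_suite() -> dict[str, float]:
    # Run every golden case and record a 0.0-1.0 score per case id.
    scores = {}
    for case in golden:
        output = await agent_run(case.input)
        scores[case.id] = judge(case.expected, output)
    return scores

scores = asyncio.run(run_suite())
print(scores)
```

Swapping the toy judge for an LLM call changes nothing structurally: the harness stays a loop that maps case ids to scores, which is exactly what you chart between prompt versions.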

When To Use

Build one before your second prompt change. The first version takes 4 hours and pays for itself in the first regression.

Building with Eval Harness?

I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.