Jahanzaib
Evaluation

LLM-as-Judge

Using one LLM to evaluate the quality of another LLM's output, replacing or supplementing human review at scale.

Last updated: April 26, 2026

Definition

LLM-as-judge is the dominant approach to automated evaluation in 2026. You provide the judge model with the original prompt, the candidate response, and a rubric, and ask it to score the response (binary pass/fail, 1-5 scale, or a multi-dimensional rubric). For most subjective qualities (tone, helpfulness, relevance, format adherence) a frontier judge model agrees with human reviewers around 80 percent of the time, which is good enough for regression testing and most A/B test decisions. The pattern scales at a small fraction of the cost of human review.

Three common pitfalls. First, judge bias: GPT models tend to prefer GPT outputs, and Claude tends to prefer Claude. Use a different model family for judging than for generating where possible. Second, position bias: in pairwise comparisons, judges tend to prefer whichever option appears first. Randomize the order. Third, rubric drift: vague rubrics produce inconsistent scores. Make the rubric specific ("score 1 if the response misses any cited source, 5 if it cites every fact"). Calibrate every new judge prompt against a small set of human-labeled examples before relying on it.
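Two of these mitigations are mechanical enough to sketch: randomizing presentation order (and mapping the verdict back) for position bias, and computing agreement against human labels for calibration. The function names are illustrative, not from any particular library.

```python
import random

def pairwise_order(candidate_a: str, candidate_b: str, rng: random.Random):
    """Randomize presentation order to counter position bias.
    Returns (first, second, swapped) so the verdict can be mapped back."""
    if rng.random() < 0.5:
        return candidate_b, candidate_a, True
    return candidate_a, candidate_b, False

def map_verdict(verdict: str, swapped: bool) -> str:
    """Translate the judge's 'first'/'second' verdict back to 'a'/'b'."""
    if verdict == "first":
        return "b" if swapped else "a"
    return "a" if swapped else "b"

def agreement_rate(judge_labels, human_labels) -> float:
    """Fraction of examples where the judge matches the human label --
    the calibration number to check before trusting a new judge prompt."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

A reasonable workflow: run the judge on 50-100 human-labeled examples, and only promote the judge prompt if `agreement_rate` clears whatever threshold your use case demands (the ~80 percent figure above is a common bar for subjective qualities).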

When To Use

Use LLM-as-judge for any evaluation where you need to score thousands of outputs and human review is impractical. Validate the judge against human labels first.

Building with LLM-as-Judge?

I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.