Jahanzaib
Evaluation

LLM-as-Judge

Using one LLM to evaluate the quality of another LLM's output, replacing or supplementing human review at scale.

Last updated: April 26, 2026

Definition

LLM-as-judge is the dominant approach to automated evaluation in 2026. You provide the judge model with the original prompt, the candidate response, and a rubric, and ask it to score the response (binary pass/fail, 1-5 scale, or a multi-dimensional rubric). For most subjective qualities (tone, helpfulness, relevance, format adherence) a frontier judge model agrees with human reviewers around 80 percent of the time, which is good enough for regression testing and most A/B test decisions. The pattern scales at a small fraction of the cost of human review.

Three common pitfalls. First, judge bias: GPT models tend to prefer GPT outputs, and Claude tends to prefer Claude. Use a different model family for judging than for generating where possible. Second, position bias: in pairwise comparisons, judges tend to prefer whichever option appears first. Randomize the order. Third, rubric drift: vague rubrics produce inconsistent scores. Make the rubric specific ("score 1 if the response misses any cited source, 5 if it cites every fact"). Calibrate every new judge prompt against a small set of human-labeled examples before relying on it.
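Two of these mitigations are mechanical enough to sketch: randomizing presentation order (and mapping the verdict back) for position bias, and computing agreement against human labels for calibration. The function names are illustrative, not from any particular library.

```python
import random

def pairwise_order(candidate_a: str, candidate_b: str, rng: random.Random):
    """Randomize presentation order to counter position bias.
    Returns (first, second, swapped) so the verdict can be mapped back."""
    if rng.random() < 0.5:
        return candidate_b, candidate_a, True
    return candidate_a, candidate_b, False

def map_verdict(verdict: str, swapped: bool) -> str:
    """Translate the judge's 'first'/'second' verdict back to 'a'/'b'."""
    if verdict == "first":
        return "b" if swapped else "a"
    return "a" if swapped else "b"

def agreement_rate(judge_labels, human_labels) -> float:
    """Fraction of examples where the judge matches the human label --
    the calibration number to check before trusting a new judge prompt."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

A reasonable workflow: run the judge on 50-100 human-labeled examples, and only promote the judge prompt if `agreement_rate` clears whatever threshold your use case demands (the ~80 percent figure above is a common bar for subjective qualities).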

When To Use

Use LLM-as-judge for any evaluation where you need to score thousands of outputs and human review is impractical. Validate the judge against human labels first.

Building with LLM-as-Judge?

I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.