2026-06-10 · 11 min · llm evals

How to Evaluate LLM Outputs: Evals and LLM-as-Judge in Practice

You can't ship an LLM feature you can't measure. Here's how I build evals — from a labeled set to LLM-as-judge graders — so prompt changes stop being vibes and start being numbers.

By Harel Asaf · AI Builder · Tel Aviv

Here's the failure mode that kills LLM features before they ship: someone tweaks the prompt, eyeballs three outputs, declares it "better," and pushes. A week later it's worse on the cases nobody looked at, and there's no way to know when it broke or why. The feature doesn't have a bug. It has no scoreboard.

Evals are the scoreboard. An eval is a repeatable measurement of whether your LLM system does what it's supposed to, run against a fixed set of cases, producing a number you can compare across changes. Without one, every prompt edit is a coin flip you can't see land. With one, "is this better?" becomes a question you answer in thirty seconds instead of a debate you have in Slack.

I treat evals as non-negotiable infrastructure for anything LLM-powered that I intend to keep. Here's how I actually build them.

Start with cases, not metrics

The instinct is to reach for a fancy metric first. Wrong order. Start by collecting cases — real inputs your system will face, paired with what a good output looks like.

The best source is production (or your own dogfooding). Every time the system does something wrong, that's a case: the input, plus a note on what should have happened. Every time it does something impressively right, that's a case too — you want to protect those, not just fix the failures. Twenty to fifty good cases is enough to start. People stall forever trying to assemble a thousand-case gold set; you don't need it to begin. You need enough to catch regressions on the things you actually care about.

Crucially, include the hard cases on purpose. An eval set made of easy inputs scores 95% and tells you nothing — it can't distinguish a good system from a mediocre one because both pass. Load it with the edge cases, the ambiguous inputs, the ones that broke before. The eval should hurt a little. A set everything passes is a set that measures nothing.
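One way to hold cases like these is a small record per case. The sketch below is an illustration, not a prescribed format: `EvalCase`, its field names, the `"hard"` tag, and the `hard_fraction` helper are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One eval case: a real input plus a note on what a good output looks like."""
    case_id: str
    input_text: str
    expected: str  # a label, a reference answer, or a note on what should happen
    tags: list[str] = field(default_factory=list)  # hypothetical tags, e.g. "hard"


def hard_fraction(cases: list[EvalCase]) -> float:
    """Share of cases tagged "hard"; a quick check that the set is not all easy inputs."""
    if not cases:
        return 0.0
    return sum("hard" in case.tags for case in cases) / len(cases)


# Example set: one case that broke before, one easy case.
cases = [
    EvalCase("c1", "Refund for order 123?", "Routes to billing", ["hard", "regression"]),
    EvalCase("c2", "Hi", "Greets back"),
]
print(hard_fraction(cases))  # 0.5
```

Tagging cases at collection time makes it cheap to notice when a set has drifted toward inputs that everything passes.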

Match the grader to the task

Now the question becomes: given an output, how do you decide if it's right? This is the grader, and the right kind depends entirely on what you're measuring. There are three broad families, in increasing order of difficulty.

Exact / programmatic grading. If the task has a checkable answer — a classification label, a number, valid JSON conforming to a schema, a piece of code that compiles and passes tests — grade it in code. This is the gold standard: fast, free, deterministic, no judgment calls. Whenever you can phrase the success criterion as an assertion (output.label == expected, json.loads(output) succeeds, tests pass), do that. A huge amount of LLM work can be made checkable if you constrain the output format, which is one reason structured outputs are so useful — they turn "did it answer well" into "did it produce a valid object with the right fields."
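As a rough sketch of what "phrase the success criterion as an assertion" can look like, here are two small graders using only the standard library. The function names and the `required` field map are assumptions for illustration; a real schema check would likely use a dedicated validator.

```python
import json


def grade_label(output: str, expected: str) -> bool:
    """Exact-match grading for a classification label, ignoring case and whitespace."""
    return output.strip().lower() == expected.strip().lower()


def grade_json_fields(output: str, required: dict[str, type]) -> bool:
    """Pass only if output parses as a JSON object with every required field and type."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    # A missing field comes back as None from .get(), which fails the isinstance check.
    return all(isinstance(obj.get(name), kind) for name, kind in required.items())


print(grade_label(" Spam\n", "spam"))  # True
print(grade_json_fields('{"label": "spam", "score": 0.9}', {"label": str, "score": float}))  # True
print(grade_json_fields("not json", {"label": str}))  # False
```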

Reference-based grading. Sometimes there's a known-good answer but the output is free text that won't match byte-for-byte. Here you compare against a reference: does the output contain the required facts, and does it match the gold answer's meaning? This shades into the third family because "match the meaning" usually needs a model to judge.
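The "contains the required facts" half of this can be approximated in code. The sketch below assumes each fact is a short literal string (a figure, a team name) and uses plain substring matching after normalization; `missing_facts` is a hypothetical helper, and it does nothing for the "match the meaning" half, which still needs a judge.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
    return " ".join(text.lower().split())


def missing_facts(output: str, required_facts: list[str]) -> list[str]:
    """Return the required facts that do not appear (as substrings) in the output."""
    haystack = normalize(output)
    return [fact for fact in required_facts if normalize(fact) not in haystack]


print(missing_facts("The Payments team lost $4.2M in Q3.", ["$4.2M", "payments team"]))  # []
print(missing_facts("A team lost money.", ["$4.2M"]))  # ['$4.2M']
```

Returning the missing facts, rather than a bare pass or fail, makes the failure readable when you inspect the case later.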

LLM-as-judge. For open-ended outputs — a summary, an explanation, a tone, a piece of writing — there's no assertion that captures "good." So you use a second LLM call to grade the first. The judge gets the input, the output, and a rubric, and returns a score or a verdict. This is what makes evaluating subjective work tractable at all, and it's worth doing well.
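A minimal sketch of the judge call's shape, under stated assumptions: `call_model` is a placeholder for whatever client you use (any function that takes a prompt string and returns the reply text), and the prompt wording and JSON reply format are illustrative choices, not a standard.

```python
import json
from typing import Callable


def build_judge_prompt(task_input: str, output: str, rubric: list[str]) -> str:
    """Assemble the three things the judge needs: the input, the output, and the rubric."""
    criteria = "\n".join(f"{i}. {c}" for i, c in enumerate(rubric, start=1))
    return (
        "You are grading a model output against a rubric.\n\n"
        f"INPUT:\n{task_input}\n\n"
        f"OUTPUT:\n{output}\n\n"
        f"RUBRIC:\n{criteria}\n\n"
        "Reply with only a JSON list of true/false values, one per criterion."
    )


def judge(task_input: str, output: str, rubric: list[str],
          call_model: Callable[[str], str]) -> list[bool]:
    """Run the judge and parse its reply; reject anything that is not one bool per criterion."""
    reply = call_model(build_judge_prompt(task_input, output, rubric))
    verdicts = json.loads(reply)  # raises ValueError (JSONDecodeError) on non-JSON replies
    if (not isinstance(verdicts, list)
            or len(verdicts) != len(rubric)
            or not all(isinstance(v, bool) for v in verdicts)):
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return verdicts


# A stand-in model so the sketch runs without any API; swap in a real client call.
def fake_model(prompt: str) -> str:
    return "[true, false]"


print(judge("source text", "summary text", ["Mentions the dollar figure", "Under 80 words"], fake_model))
# [True, False]
```

Failing loudly on a malformed reply matters here: a judge response that silently parses to the wrong shape would corrupt the score without anyone noticing.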

Making LLM-as-judge actually reliable

LLM-as-judge has a reputation for being noisy, and it earns that reputation when people do it lazily. Done carefully, it's reliable enough to gate releases. The difference is almost entirely in the rubric.

Write explicit, gradeable criteria — not vibes. "Is this a good summary?" produces noise, because the judge has to invent its own standard each time. "Does the summary (1) mention the dollar figure, (2) name the responsible team, (3) stay under 80 words, (4) avoid speculation not in the source?" produces a stable signal, because each criterion is independently checkable. The rubric does the work. A vague rubric is the cause of almost every "LLM-as-judge is unreliable" complaint I've heard.

Prefer binary or low-cardinality judgments. Asking a judge to rate quality 1–10 invites it to cluster around 7 and wobble between 6 and 8 on identical inputs. Asking "does this meet each criterion: yes or no?" is far more stable. Decompose a quality score into several yes/no checks and aggregate them yourself, rather than asking for one holistic number.

Grade criteria independently. When a rubric has several criteria, score each one on its own and report per-criterion results. This tells you what is failing, not just that something failed — and per-criterion gaps are exactly what you feed back into the next prompt iteration.
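To make "aggregate them yourself" concrete, here is a small sketch that takes yes/no verdicts per case and per criterion and reports a pass rate for each criterion. The data shape (`results[case_id][criterion]`) and the criterion names are assumptions for illustration, loosely based on the summary rubric above.

```python
from collections import defaultdict


def per_criterion_pass_rate(results: dict[str, dict[str, bool]]) -> dict[str, float]:
    """results[case_id][criterion] -> yes/no verdict. Returns the pass rate per criterion."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for verdicts in results.values():
        for criterion, ok in verdicts.items():
            total[criterion] += 1
            passed[criterion] += int(ok)
    return {criterion: passed[criterion] / total[criterion] for criterion in total}


def case_passes(verdicts: dict[str, bool]) -> bool:
    """One possible aggregate: a case passes only if every criterion passes."""
    return all(verdicts.values())


results = {
    "c1": {"dollar_figure": True, "names_team": True, "under_80_words": True},
    "c2": {"dollar_figure": False, "names_team": True, "under_80_words": True},
}
print(per_criterion_pass_rate(results))
# {'dollar_figure': 0.5, 'names_team': 1.0, 'under_80_words': 1.0}
```

The all-criteria rule in `case_passes` is just one choice; a weighted sum works too, as long as the per-criterion numbers are kept alongside it.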

Validate the judge against humans, at least once. Before you trust a judge to gate releases, check that it agrees with you on a sample. Grade thirty cases yourself, have the judge grade the same thirty, and look at where they disagree. If they diverge badly, your rubric is underspecified — fix it before you rely on it. The judge is a measurement instrument; calibrate it before you trust the readings.
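The calibration check can be as simple as an agreement rate plus the list of cases to go read. This is a sketch assuming binary pass/fail labels keyed by case id; `judge_agreement` is a hypothetical helper, and what counts as "diverging badly" is left to you.

```python
def judge_agreement(human: dict[str, bool],
                    judge_labels: dict[str, bool]) -> tuple[float, list[str]]:
    """Compare human and judge verdicts on the cases both graded.

    Returns (agreement rate, sorted ids of the cases where they disagree).
    """
    shared = sorted(human.keys() & judge_labels.keys())
    if not shared:
        raise ValueError("no cases were graded by both the human and the judge")
    disagreements = [cid for cid in shared if human[cid] != judge_labels[cid]]
    return 1 - len(disagreements) / len(shared), disagreements


human = {"a": True, "b": False, "c": True, "d": True}
judge_labels = {"a": True, "b": True, "c": True, "d": True}
rate, to_review = judge_agreement(human, judge_labels)
print(rate, to_review)  # 0.75 ['b']
```

The disagreement list is the useful part: each id is a case where either the rubric is underspecified or your own grading was inconsistent.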

A practical trick when you don't have a rubric yet: take one output you consider excellent, have a model analyze why it's good, and turn that analysis into the rubric. You often know good when you see it before you can articulate the criteria, and this extracts them.

The iteration loop

Once you have cases and a grader, the workflow that replaces "eyeball three outputs and pray" looks like this:

1. Run the full eval set, record the baseline score. This is your starting number. Per-criterion, not just an aggregate — the aggregate hides which dimension is weak.

2. Make one change. A prompt edit, a model swap, a tool tweak. One at a time, so you can attribute the effect.

3. Re-run, compare. Did the number go up? On which criteria? Did anything regress? This is where evals earn their keep — you see the tradeoffs, not just the headline.

4. Inspect the cases that moved. The score tells you what changed; reading the specific outputs that flipped tells you why. Often a "win" on average hides a new failure on a case you cared about.
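Steps 3 and 4 can be sketched as a comparison of two runs: per-criterion deltas for the headline, and the list of flipped verdicts to go read. The `Results` shape and function names below are assumptions for illustration, and the sketch assumes both runs are non-empty.

```python
Results = dict[str, dict[str, bool]]  # results[case_id][criterion] -> pass/fail


def pass_rates(results: Results) -> dict[str, float]:
    """Pass rate per criterion across all cases (assumes at least one case)."""
    criteria = sorted({c for verdicts in results.values() for c in verdicts})
    return {
        c: sum(verdicts.get(c, False) for verdicts in results.values()) / len(results)
        for c in criteria
    }


def compare_runs(baseline: Results, candidate: Results):
    """Return (per-criterion score delta, list of verdicts that flipped between runs)."""
    base, cand = pass_rates(baseline), pass_rates(candidate)
    deltas = {c: cand[c] - base[c] for c in base if c in cand}
    flipped = [
        (case_id, criterion, before, candidate[case_id][criterion])
        for case_id, verdicts in baseline.items()
        for criterion, before in verdicts.items()
        if case_id in candidate
        and criterion in candidate[case_id]
        and candidate[case_id][criterion] != before
    ]
    return deltas, flipped


baseline = {"c1": {"facts": False, "length": True}, "c2": {"facts": True, "length": True}}
candidate = {"c1": {"facts": True, "length": True}, "c2": {"facts": True, "length": False}}
deltas, flipped = compare_runs(baseline, candidate)
print(deltas)   # {'facts': 0.5, 'length': -0.5}
print(flipped)  # [('c1', 'facts', False, True), ('c2', 'length', True, False)]
```

In this toy example the change fixed `facts` on one case and broke `length` on another, which is exactly the kind of tradeoff an aggregate score would hide.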

This is the difference between engineering and superstition. Without the loop, you're adjusting a prompt based on the last output you happened to look at. With it, you're optimizing against a representative set and watching the whole distribution move.

Some platforms formalize this into a grade-and-revise loop the system runs itself: state the success criteria as a rubric, and the harness iterates — produce, grade against the rubric, revise — until it passes or hits a cap. That's the same loop, automated, and it only works as well as the rubric you write. Vague criteria, noisy loop. Sharp criteria, the system converges.
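The shape of such a loop, as a minimal sketch: this is not any particular platform's implementation, and `produce`, `grade`, and `revise` are hypothetical callables you would back with model calls. Here `grade` returns the names of the failed criteria, so an empty list means the draft passes.

```python
from typing import Callable


def grade_and_revise(
    task: str,
    produce: Callable[[str], str],
    grade: Callable[[str], list[str]],         # returns failed criteria; empty means pass
    revise: Callable[[str, list[str]], str],   # gets the draft and what it failed on
    max_rounds: int = 3,
) -> tuple[str, int, bool]:
    """Produce, grade against the rubric, revise; stop on a pass or at the cap.

    Returns (final draft, grading rounds used, whether it passed).
    """
    draft = produce(task)
    for round_number in range(1, max_rounds + 1):
        failed = grade(draft)
        if not failed:
            return draft, round_number, True
        if round_number < max_rounds:
            draft = revise(draft, failed)
    return draft, max_rounds, False


# Toy stand-ins so the loop runs without a model: the only criterion is a minimum length.
result = grade_and_revise(
    "write something",
    produce=lambda task: "draft",
    grade=lambda draft: [] if len(draft) >= 10 else ["too_short"],
    revise=lambda draft, failed: draft + " more text",
)
print(result)  # ('draft more text', 2, True)
```

Passing the failed criteria into `revise` is what ties the loop's quality to the rubric: the revision step only knows what the grader can name.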

What I evaluate beyond "is the answer right"

For simple features, output quality is the whole game. For agents, it isn't — there's a trajectory, not just a final answer, and several things are worth measuring:

Task success. Did it accomplish the goal? Often the only thing that ultimately matters, but the hardest to grade automatically — frequently an LLM-as-judge against a rubric, or a programmatic check on the end state.
Tool use correctness. Did it call the right tools with sane arguments, or thrash? You can grade this from the trace.
Cost and efficiency. How many tokens and steps did it take? A run that succeeds in 40 steps when 8 would do is a regression even if the answer is right. Track this — it's a number, so it's free to measure.
Faithfulness. For anything retrieval-grounded, did the output stick to the sources or invent things? This is its own LLM-as-judge check and it's worth isolating.

The mistake is collapsing all of these into one "good/bad" verdict. Keep them separate. An agent can get more accurate and more expensive at the same time, and a single score hides that.
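One way to keep the dimensions separate is to record them as separate fields per run and summarize each on its own. The sketch below is illustrative only: `AgentRunMetrics`, its fields, and the sample numbers are assumptions, and the boolean fields would come from whatever graders you use (a judge, a trace check, an end-state check).

```python
from dataclasses import dataclass


@dataclass
class AgentRunMetrics:
    """One agent run, with each dimension recorded on its own."""
    task_success: bool      # from a judge or a programmatic end-state check
    tool_calls_valid: bool  # graded from the trace
    faithful: bool          # stuck to the sources (retrieval-grounded tasks)
    steps: int
    tokens: int


def summarize(runs: list[AgentRunMetrics]) -> dict[str, float]:
    """Per-dimension summary; deliberately no single combined score."""
    n = len(runs)
    if n == 0:
        raise ValueError("no runs to summarize")
    return {
        "task_success_rate": sum(r.task_success for r in runs) / n,
        "tool_use_rate": sum(r.tool_calls_valid for r in runs) / n,
        "faithfulness_rate": sum(r.faithful for r in runs) / n,
        "mean_steps": sum(r.steps for r in runs) / n,
        "mean_tokens": sum(r.tokens for r in runs) / n,
    }


runs = [
    AgentRunMetrics(True, True, True, steps=8, tokens=4000),
    AgentRunMetrics(False, True, True, steps=40, tokens=20000),
]
print(summarize(runs)["mean_steps"])  # 24.0
```

Comparing two such summaries side by side shows the case described above directly: success rate up, mean tokens up.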

The one-sentence version

An eval is a repeatable measurement against a fixed set of hard cases, graded programmatically where you can and by a carefully-rubriced LLM judge where you can't — and it's the thing that turns "I think this prompt is better" into a number you can actually defend.

FAQ

What is an LLM eval?

An LLM eval is a repeatable measurement of whether an LLM system does what it's supposed to, run against a fixed set of test cases and producing a comparable score. It replaces eyeballing a few outputs with a scoreboard you can track across prompt and model changes.

How many test cases do I need to start evaluating an LLM?

Far fewer than people think — twenty to fifty good cases is enough to begin catching regressions. Don't wait to assemble a thousand-case gold set. The key is that the cases include the hard, ambiguous, and previously-broken inputs, not just easy ones.

What is LLM-as-judge?

LLM-as-judge uses a second LLM call to grade the output of the first against a rubric. It's how you evaluate open-ended outputs — summaries, explanations, tone, writing — where there's no programmatic assertion that captures "good." Its reliability depends almost entirely on rubric quality.

How do I make LLM-as-judge reliable?

Write explicit, independently gradeable criteria instead of asking "is this good"; prefer binary yes/no judgments over 1–10 scales; grade each criterion separately; and validate the judge against your own human grading on a sample before trusting it to gate releases.

When should I use programmatic grading instead of an LLM judge?

Whenever the task has a checkable answer — a classification label, a number, schema-valid JSON, or code that passes tests. Programmatic grading is fast, free, and deterministic. Constraining the output format (e.g., with structured outputs) lets you make far more tasks programmatically checkable.

What should I measure when evaluating an LLM agent?

More than final-answer quality: task success, tool-use correctness, cost and efficiency (tokens and steps), and faithfulness to sources for retrieval-grounded work. Keep these separate rather than collapsing them into one verdict — an agent can get more accurate and more expensive at once.

How do evals fit into iterating on a prompt?

Run the full set for a baseline score, make one change, re-run and compare per-criterion, then read the specific cases that moved to understand why. This loop turns prompt tuning from guessing based on the last output you saw into optimizing against a representative set.
