Custom Evals and LLM-as-Judge

Public benchmarks measure general capability. What matters in production is whether a specific model handles your specific traffic. The correlation between the two is loose enough that a team shipping LLMs into production typically ends up building its own eval set within the first three months, and once that set exists, the next question is who or what grades it.

Distribution mismatch is why public scores mislead

The model that wins MMLU may lose your support-ticket triage task because your tickets are full of product nouns, terse customer phrasing, and a long tail of edge cases the pretraining distribution under-represents. You can only discover this by evaluating on your own data. A custom eval set is not a nice-to-have; it is the precondition for any defensible model selection or regression test.

The eval-set construction recipe

Sample real traffic. Pull 200-500 representative inputs from production logs. Stratify by intent, language, and customer segment so the tail is represented.
Get gold answers from experts. Have your most senior domain experts write or approve the correct outputs. This is slow and expensive; do not skip it. Single-expert labels are noisy - aim for two independent labels and adjudicate disagreements.
Categorise failure modes. Cluster the model's current errors into named buckets (hallucination, format break, refusal, off-topic, partial answer). Each bucket becomes a sub-score in the eval report.
Bootstrap with synthetic variation. Once you have the seed set, use an LLM to generate paraphrases, edge cases, and adversarial twists - then have humans accept or reject each one. This grows the set 5-10x at low cost without losing label quality.
Freeze the test split. Keep a held-out slice that neither the model nor the prompt is ever tuned against. Tuning against your own test slice recreates, in miniature, the contamination problem that undermines public benchmarks.
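The sampling and freezing steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the record fields (`id`, `intent`, `language`, `segment`), the function names, and the 20% test fraction are all assumptions made for the example. Hashing the example ID means the split assignment stays stable as the set grows, so an example never drifts from the held-out slice into the tuning slice.

```python
import hashlib
import random
from collections import defaultdict


def stratified_sample(records, key_fields, per_stratum, seed=0):
    """Draw up to `per_stratum` records from every combination of `key_fields`.

    Small strata are kept whole, so rare intents or languages are not
    drowned out by the head of the distribution.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[f] for f in key_fields)].append(rec)

    sample = []
    for key in sorted(strata):  # sorted for a deterministic iteration order
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


def is_frozen_test(example_id, test_fraction=0.2):
    """Assign an example to the held-out split by hashing its ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return bucket < test_fraction


# Hypothetical production log rows, heavily skewed toward one stratum.
pairs = [("refund", "en")] * 50 + [("refund", "de")] * 3 + [("login", "en")] * 20
logs = [
    {"id": f"t{i}", "intent": intent, "language": lang, "segment": "smb"}
    for i, (intent, lang) in enumerate(pairs)
]

sample = stratified_sample(logs, ["intent", "language", "segment"], per_stratum=5)
held_out = [rec for rec in sample if is_frozen_test(rec["id"])]
tuning = [rec for rec in sample if not is_frozen_test(rec["id"])]
```

With a cap per stratum, the three German refund tickets all survive sampling even though they are under 5% of the log rows.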

LLM-as-judge

For tasks where there is no single correct answer (open-ended generation, summarisation, dialogue), human grading does not scale. The mainstream solution is LLM-as-judge: prompt a frontier model to grade the output of the model under test.
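As a rough sketch of what "prompt a frontier model to grade the output" looks like in practice, the snippet below builds a rubric-based grading prompt and parses a numeric score from the reply. The prompt wording, the 1-5 scale, and the `call_model` parameter are all assumptions for illustration; `call_model` stands in for whatever client function sends a prompt to your judge model and returns its text, and no particular vendor API is implied.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; real rubrics are usually task-specific.
JUDGE_TEMPLATE = """You are grading a model's answer.

[Question]
{question}

[Answer]
{answer}

[Rubric]
{rubric}

Explain your reasoning briefly, then finish with a line of the form
"Score: N" where N is an integer from 1 to 5."""

SCORE_RE = re.compile(r"Score:\s*([1-5])\b")


def judge(
    question: str,
    answer: str,
    rubric: str,
    call_model: Callable[[str], str],
) -> Optional[int]:
    """Ask the judge model for a 1-5 score; return None if none can be parsed."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer, rubric=rubric)
    reply = call_model(prompt)
    matches = SCORE_RE.findall(reply)
    # Take the last match so a score quoted mid-reasoning does not win.
    return int(matches[-1]) if matches else None


# Stand-in for a real model call, used here only to exercise the parser.
def fake_judge_model(prompt: str) -> str:
    return "The answer covers the refund window but omits the fee.\nScore: 4"


score = judge(
    question="What is the refund policy?",
    answer="Refunds are available within 30 days.",
    rubric="Must state the refund window and any fees.",
    call_model=fake_judge_model,
)
```

Returning `None` for an unparseable reply, rather than a default score, keeps malformed judge output visible in the eval report instead of silently skewing the average.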

Framework	Approach	Notes
G-Eval	Chain-of-thought + form-filling, GPT-4-backed	High correlation with human judgments (Spearman ~0.51 on summarisation) but biased toward outputs from the judge's own model family
Prometheus	13B open-source judge trained on rubric-feedback pairs	Matches GPT-4 judge correlation when given a reference answer and rubric, far cheaper
AlpacaEval 2.0	Pairwise preference against a fixed baseline, GPT-4 judge	0.98 Spearman with Chatbot Arena (using length-controlled win rates), runs in under 3 minutes for under $10
MT-Bench	Multi-turn questions scored by GPT-4	The original LLM-as-judge methodology paper; 80%+ agreement with human preference
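The Spearman figures in the table are rank correlations between judge scores and human scores. To run the same check on your own eval set, a dependency-free sketch might look like the following; the function names and the sample scores are made up for illustration, and in practice a statistics library would do the same job.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / (var_x * var_y) ** 0.5


# Illustrative scores only: human grades vs. judge grades on six outputs.
human_scores = [5, 3, 4, 1, 2, 4]
judge_scores = [4, 3, 5, 1, 1, 4]
rho = spearman(human_scores, judge_scores)
```

A value near 1 means the judge orders outputs the way your experts do; it says nothing about whether the absolute scores match, which is why rank correlation is the usual yardstick for judges on different scales.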

Known judge biases

The MT-Bench paper catalogued the failure modes that every subsequent framework has had to mitigate:
