Evaluation measures how well AI systems perform. Benchmarking compares systems against standardised tests. Both are essential for understanding capabilities, limitations, and progress.

Why Evaluation Matters

  • Capability assessment — What can this model actually do?
  • Model selection — Which model is best for my use case?
  • Progress tracking — Is the field improving?
  • Safety — Does the model behave as intended?
  • Regression detection — Did changes break something?

LLM Benchmarks

General Knowledge

  • MMLU — 57 subjects from STEM to humanities
  • ARC — Grade-school science questions
  • HellaSwag — Commonsense reasoning
  • TruthfulQA — Avoiding generation of false claims
  • Winogrande — Pronoun resolution

Reasoning

  • GSM8K — Grade-school maths word problems
  • MATH — Competition mathematics
  • BIG-Bench — 200+ diverse reasoning tasks
  • GPQA — Graduate-level science questions
  • ARC-AGI — Novel reasoning patterns

Coding

  • HumanEval — Python function completion
  • MBPP — Mostly basic Python problems
  • SWE-bench — Real GitHub issues
  • LiveCodeBench — Continuously updated coding problems

Instruction Following

  • IFEval — Verifiable instruction constraints
  • MT-Bench — Multi-turn conversation quality
  • AlpacaEval — Instruction response quality

Multilingual

  • MGSM — Maths in multiple languages
  • XL-Sum — Multilingual summarisation
  • FLORES — Machine translation

Evaluation Methods

Exact Match

Does the output exactly match the expected answer?

  • Simple and objective
  • Brittle: fails on equivalent phrasings
  • Well suited to factual questions and code (see the sketch below)
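
A minimal sketch of exact-match and substring ("contains") scoring in Python; exact_match, contains_match, and model_fn are illustrative names, not any particular framework's API.

def exact_match(output: str, expected: str) -> bool:
    # Normalise whitespace and case before comparing.
    return output.strip().lower() == expected.strip().lower()

def contains_match(output: str, expected: str) -> bool:
    # Looser check: the expected answer appears anywhere in the output.
    return expected.strip().lower() in output.lower()

# Accuracy over a list of (question, expected) pairs and a model_fn under test:
# score = sum(contains_match(model_fn(q), a) for q, a in cases) / len(cases)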

Model-based (LLM-as-Judge)

Use another LLM to evaluate outputs.

Given the question and response, rate the quality 1-5:
Question: {question}
Response: {response}

Advantages:

  • Scales to open-ended tasks
  • Can evaluate nuance

Challenges:

  • The judge model has its own biases
  • Self-preference when the judge and the evaluated model share a family
  • Scores can vary between runs
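
A sketch of a judging loop built around the prompt template above, assuming a hypothetical call_llm(prompt) helper that returns the judge model's reply as text.

JUDGE_PROMPT = """Given the question and response, rate the quality 1-5.
Reply with a single digit.
Question: {question}
Response: {response}"""

def judge(question: str, response: str, call_llm) -> int:
    # call_llm(prompt) -> str is a hypothetical client for the judge model.
    reply = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    scores = [c for c in reply if c in "12345"]
    # Fall back to the lowest score if the judge returns nothing parseable.
    return int(scores[0]) if scores else 1

Running the judge at low temperature and averaging several calls per example helps with run-to-run consistency.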

Human Evaluation

Gold standard but expensive and slow.

  • Pairwise comparison (A vs B)
  • Absolute rating (1-5 scale)
  • Rubric-based scoring

Arena / Elo Ratings

Users compare anonymised model outputs side by side; their votes are aggregated into Elo-style ratings.
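
A minimal sketch of the standard Elo update applied to a single pairwise vote; the starting rating of 1000 and K-factor of 32 are illustrative.

def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    # Expected score of A against B under the Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    # Winner gains, loser loses, scaled by how surprising the result was.
    return (r_a + k * (score_a - expected_a),
            r_b + k * ((1.0 - score_a) - (1.0 - expected_a)))

print(elo_update(1000, 1000, a_wins=True))  # (1016.0, 984.0)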

Metrics

Classification

  • Accuracy — Correct / Total
  • Precision — True Positives / Predicted Positives
  • Recall — True Positives / Actual Positives
  • F1 — Harmonic mean of precision and recall
  • AUC-ROC — Area under ROC curve
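
A sketch computing the classification metrics above from binary labels (AUC-ROC is omitted because it needs predicted scores rather than hard labels).

def classification_metrics(y_true, y_pred):
    # Count the four outcomes of binary classification.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall) if precision + recall else 0.0,
    }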

Generation

  • BLEU — N-gram overlap (translation)
  • ROUGE — Recall-oriented (summarisation)
  • Perplexity — How surprised the model is by held-out text (lower is better)
  • Pass@K — Any of K samples correct (code); see the estimator sketch below
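
Pass@K is commonly reported with an unbiased estimator: generate n samples per problem, count the c that pass, and estimate 1 - C(n-c, k) / C(n, k). A minimal sketch:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # n samples generated, c of them passed, k is the budget being scored.
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    # One minus the probability that all k drawn samples fail.
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=1))  # 0.15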

Retrieval

  • Recall@K — Relevant items in top K
  • MRR — Mean reciprocal rank
  • NDCG — Normalised discounted cumulative gain
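
A minimal sketch of Recall@K and MRR over ranked result lists, assuming each query comes with a set of relevant document IDs (NDCG additionally needs graded relevance and is omitted here).

def recall_at_k(ranked_ids, relevant_ids, k: int) -> float:
    # Fraction of the relevant documents that appear in the top k results.
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mean_reciprocal_rank(ranked_lists, relevant_sets) -> float:
    # Average of 1 / rank of the first relevant result for each query.
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)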

Evaluation Challenges

Benchmark Saturation

Models approach the ceiling on a benchmark and it loses discriminative power; MMLU, for example, is close to saturation for frontier models.

Data Contamination

Training data may include benchmark questions, which inflates scores. Contamination is hard to verify and distorts comparisons.

Goodhart’s Law

“When a measure becomes a target, it ceases to be a good measure.” Optimising for benchmarks doesn’t guarantee real-world performance.

Task Coverage

Benchmarks test specific capabilities. Real use cases are diverse and open-ended.

Evaluation Validity

Does the benchmark measure what we think it measures?

Leaderboards

General

Specialised

Building Your Own Evaluations

When to Build Custom Evals

  • Domain-specific tasks
  • Proprietary data
  • Specific quality criteria
  • Production monitoring

Framework

  1. Define success criteria — What does “good” look like?
  2. Create test cases — Representative examples with expected outputs
  3. Choose metrics — Exact match, similarity, LLM judge
  4. Automate execution — CI/CD integration
  5. Track over time — Detect regressions

Tools

Example Test Case

- input: "What is the capital of France?"
  expected: "Paris"
  type: contains
 
- input: "Write a haiku about coding"
  criteria:
    - has_three_lines
    - syllable_count_valid
  type: custom_validator
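
A sketch of a runner for test cases in this format, assuming PyYAML for parsing, a hypothetical model_fn(prompt) under test, and a validators dict mapping criterion names (such as has_three_lines) to check functions; none of these names belong to a specific tool.

import yaml

def run_cases(path: str, model_fn, validators: dict) -> float:
    # Load the test cases and return the fraction that pass.
    with open(path) as f:
        cases = yaml.safe_load(f)
    passed = 0
    for case in cases:
        output = model_fn(case["input"])
        if case["type"] == "contains":
            ok = case["expected"].lower() in output.lower()
        else:  # custom_validator: every listed criterion must hold
            ok = all(validators[name](output) for name in case["criteria"])
        passed += ok
    return passed / len(cases)

Wrapping a call like this in a CI job and failing the build when the pass rate drops below a threshold covers the "automate execution" and "track over time" steps above.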

Production Evaluation

Online Metrics

  • User satisfaction (thumbs up/down)
  • Task completion rate
  • Time to completion
  • Regeneration rate

Monitoring

  • Output quality drift
  • Latency percentiles
  • Error rates
  • Cost per interaction
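
A small sketch of this kind of roll-up, assuming per-request records with latency_ms, error, and cost fields; the field names are illustrative.

from statistics import quantiles

def summarise(requests):
    latencies = sorted(r["latency_ms"] for r in requests)
    pct = quantiles(latencies, n=100)  # 1st..99th percentile cut points
    return {
        "p50_ms": pct[49],
        "p95_ms": pct[94],
        "p99_ms": pct[98],
        "error_rate": sum(r["error"] for r in requests) / len(requests),
        "cost_per_request": sum(r["cost"] for r in requests) / len(requests),
    }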

A/B Testing

Compare model versions on live traffic; a significance-test sketch follows the checklist below.

  • Statistical significance
  • Guardrail metrics
  • Gradual rollout
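
A sketch of a two-proportion z-test for comparing, say, task-completion rates between two model versions; a plain-Python approximation rather than a substitute for a statistics library.

from math import erf, sqrt

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    # Pooled z-test for the difference between two rates.
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Example: 520/1000 completions for the candidate vs 480/1000 for the baseline.
print(two_proportion_z(520, 1000, 480, 1000))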

Resources