Evaluation & Benchmarking

Evaluation measures how well AI systems perform. Benchmarking compares systems against standardised tests. Both are essential for understanding capabilities, limitations, and progress.

Why Evaluation Matters

Capability assessment — What can this model actually do?
Model selection — Which model is best for my use case?
Progress tracking — Is the field improving?
Safety — Does the model behave as intended?
Regression detection — Did changes break something?

LLM Benchmarks

General Knowledge

Benchmark	Description
MMLU	57 subjects from STEM to humanities
ARC	Grade-school science questions
HellaSwag	Commonsense reasoning
TruthfulQA	Whether the model avoids repeating common falsehoods and misconceptions
Winogrande	Pronoun resolution

Reasoning

Benchmark	Description
GSM8K	Grade-school maths word problems
MATH	Competition mathematics
BIG-Bench	200+ diverse reasoning tasks
GPQA	Graduate-level science
ARC-AGI	Novel reasoning patterns

Coding

Benchmark	Description
HumanEval	Python function completion
MBPP	Mostly basic Python problems
SWE-bench	Real GitHub issues
LiveCodeBench	Continuously updated coding problems

Instruction Following

Benchmark	Description
IFEval	Verifiable instruction constraints
MT-Bench	Multi-turn conversation quality
AlpacaEval	Instruction response quality

Multilingual

Benchmark	Description
MGSM	Maths in multiple languages
XL-Sum	Multilingual summarisation
FLORES	Machine translation

Evaluation Methods

Exact Match

Does the output exactly match the expected answer?

Simple, objective
Brittle (fails on equivalent phrasings)
Good for factual questions, code
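
A minimal sketch of exact match, with an optional normalisation step that softens some of the brittleness noted above. The normalisation rules (lowercasing, stripping punctuation and English articles) are an assumption chosen for illustration, not a standard.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str, normalized: bool = True) -> bool:
    """Return True when prediction and reference are identical."""
    if normalized:
        return normalize(prediction) == normalize(reference)
    return prediction == reference


print(exact_match("Paris.", "paris"))                # True after normalisation
print(exact_match("The capital is Paris", "Paris"))  # False: still brittle
```

Normalisation only removes surface differences. An equivalent answer phrased as a full sentence still fails, which is why exact match suits short factual answers best.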

Model-based (LLM-as-Judge)

Use another LLM to evaluate outputs.

Given the question and response, rate the quality 1-5:
Question: {question}
Response: {response}

Advantages:

Scales to open-ended tasks
Can evaluate nuance

Challenges:

Judge models have their own biases (for example, favouring longer answers)
Self-preference when the judge and the evaluated model share a model family
Scores can vary between runs
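
One way to wire up the judge prompt above is sketched here. The `judge` argument is a hypothetical callable that sends a prompt to whichever LLM you use and returns its text reply; no particular provider API is assumed. Averaging over several runs is one simple way to reduce run-to-run variance, not a complete fix for judge bias.

```python
import re
from statistics import mean
from typing import Callable, Optional

JUDGE_TEMPLATE = (
    "Given the question and response, rate the quality 1-5:\n"
    "Question: {question}\n"
    "Response: {response}\n"
)


def parse_score(raw: str) -> Optional[int]:
    """Extract the first standalone digit 1-5 from the judge's reply."""
    match = re.search(r"\b([1-5])\b", raw)
    return int(match.group(1)) if match else None


def judge_score(
    question: str,
    response: str,
    judge: Callable[[str], str],  # hypothetical: prompt in, reply text out
    runs: int = 3,
) -> Optional[float]:
    """Ask the judge several times and average the scores that parse."""
    prompt = JUDGE_TEMPLATE.format(question=question, response=response)
    scores = []
    for _ in range(runs):
        score = parse_score(judge(prompt))
        if score is not None:
            scores.append(score)
    return mean(scores) if scores else None
```

Replies that contain no usable score are dropped rather than guessed. In practice it is worth logging how often that happens.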

Human Evaluation

Gold standard but expensive and slow.

Pairwise comparison (A vs B)
Absolute rating (1-5 scale)
Rubric-based scoring

Arena / Elo Ratings

Users compare the outputs of two models side by side, and their votes are aggregated into Elo-style ratings.

LMSYS Chatbot Arena
Reflects real-world preference
Continuously updated
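
The classic Elo update for a single pairwise vote can be sketched as follows. The K-factor of 32, the 400-point scale and the starting rating of 1000 are conventional defaults assumed here; public arenas may fit ratings with a different statistical model.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Update both ratings in place after one vote for `winner`."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)
    ratings[winner] += delta
    ratings[loser] -= delta


ratings = {"model_a": 1000.0, "model_b": 1000.0}
update_elo(ratings, winner="model_a", loser="model_b")
print(ratings)  # model_a gains 16 points, model_b loses 16
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected result, because the update is proportional to how surprising the outcome was.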

Metrics

Classification

Accuracy — Correct / Total
Precision — True Positives / Predicted Positives
Recall — True Positives / Actual Positives
F1 — Harmonic mean of precision and recall
AUC-ROC — Area under ROC curve
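
As a worked illustration, the first four metrics can be computed from binary labels as below. This is a sketch that assumes labels are 0 or 1 and returns 0.0 when a denominator is empty. AUC-ROC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true: list, y_pred: list) -> dict:
    """Accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


print(classification_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0]))
```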

Generation

BLEU — N-gram overlap (translation)
ROUGE — Recall-oriented (summarisation)
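Perplexity — Exponential of the average negative log-likelihood per token (lower means the model is less surprised by the text)
Pass@K — Probability that at least one of K sampled solutions passes the tests (code)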
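
Two of these metrics reduce to short formulas. The sketch below assumes that you have generated `n` code samples per problem, of which `c` pass the tests, and that per-token natural-log probabilities are available for perplexity. The pass@k function uses the combinatorial estimator 1 - C(n-c, k) / C(n, k) rather than literally drawing k samples.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample.
    all_wrong = math.comb(n - c, k) / math.comb(n, k)
    return 1.0 - all_wrong


def perplexity(token_logprobs: list) -> float:
    """exp of the mean negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


print(pass_at_k(n=10, c=1, k=1))
print(perplexity([math.log(0.25)] * 4))
```

A model that assigns probability 0.25 to every token has a perplexity of about 4, which matches the intuition of choosing uniformly among four options at each step.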

Retrieval

Recall@K — Fraction of relevant items that appear in the top K results
MRR — Mean reciprocal rank
NDCG — Normalised discounted cumulative gain
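
The three retrieval metrics can be sketched for a single query as follows. MRR is then the mean of `reciprocal_rank` over all queries. The document IDs and the `gains` mapping (document ID to graded relevance) are illustrative assumptions.

```python
import math


def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of relevant items found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list, relevant: set) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list, gains: dict, k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(
        gains.get(item, 0.0) / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


ranked = ["d3", "d1", "d7", "d2"]
print(recall_at_k(ranked, {"d1", "d2"}, k=3))  # 0.5
print(reciprocal_rank(ranked, {"d1", "d2"}))   # 0.5
```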

Evaluation Challenges

Benchmark Saturation

As models approach the ceiling on a benchmark, the benchmark loses its power to discriminate between them. MMLU is nearly saturated.

Data Contamination

Training data may include benchmark questions, which inflates scores. Contamination is hard to verify because training corpora are often undisclosed.
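
When you do have access to a corpus, a crude word-level n-gram overlap check can flag likely contamination. This is a heuristic sketch: the n-gram length of 8 and the whitespace tokenisation are assumptions, and paraphrased or translated copies will slip through.

```python
def ngrams(text: str, n: int) -> set:
    """Set of word-level n-grams in lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Share of the item's n-grams that also occur in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


item = "what is the capital of france and why"
doc = "quiz dump: what is the capital of france and why is it famous"
print(overlap_ratio(item, doc))  # 1.0: the item appears verbatim
```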

Goodhart’s Law

“When a measure becomes a target, it ceases to be a good measure.” Optimising for benchmarks doesn’t guarantee real-world performance.

Task Coverage

Benchmarks test specific capabilities. Real use cases are diverse and open-ended.

Evaluation Validity

Does the benchmark measure what we think it measures?

Leaderboards

General

Open LLM Leaderboard — Open models
Chatbot Arena — Human preference
Artificial Analysis — Speed + quality

Specialised

SWE-bench — Coding agents
MTEB — Embeddings
Big Code Models Leaderboard — Code models

Building Your Own Evaluations

When to Build Custom Evals

Domain-specific tasks
Proprietary data
Specific quality criteria
Production monitoring

Framework

Define success criteria — What does “good” look like?
Create test cases — Representative examples with expected outputs
Choose metrics — Exact match, similarity, LLM judge
Automate execution — CI/CD integration
Track over time — Detect regressions
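
The steps above can be sketched as a tiny harness suitable for a CI job. Here `model` and `metric` are hypothetical callables that you supply, and the 0.02 tolerance is an arbitrary assumption to be tuned against the noise in your own scores.

```python
from typing import Callable


def run_eval(
    cases: list,
    model: Callable[[str], str],          # hypothetical: prompt in, output out
    metric: Callable[[str, str], float],  # hypothetical: (output, expected) -> score
) -> float:
    """Mean metric score over all test cases."""
    scores = [metric(model(case["input"]), case["expected"]) for case in cases]
    return sum(scores) / len(scores)


def is_regression(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Flag a drop below the stored baseline by more than the tolerance."""
    return current < baseline - tolerance


cases = [
    {"input": "2+2", "expected": "4"},
    {"input": "3+3", "expected": "6"},
]
canned = {"2+2": "4", "3+3": "7"}  # stand-in for a real model
score = run_eval(cases, canned.get, lambda out, exp: float(out == exp))
print(score, is_regression(score, baseline=0.9))  # 0.5 True
```

In CI, a regression would typically fail the build. The baseline itself should be updated only when a change is accepted.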

Tools

promptfoo — Prompt testing framework
OpenAI Evals — Evaluation framework
Inspect — UK AISI’s eval framework
DeepEval — LLM evaluation
RAGAS — RAG evaluation
LangSmith — Tracing and evaluation

Example Test Case

- input: "What is the capital of France?"
  expected: "Paris"
  type: contains
 
- input: "Write a haiku about coding"
  criteria:
    - has_three_lines
    - syllable_count_valid
  type: custom_validator
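
A sketch of how a runner might interpret these two case types. It is not tied to any of the tools listed above: the `VALIDATORS` registry and `check_case` are hypothetical names, the `contains` check is assumed to be case-insensitive, and only `has_three_lines` is implemented because syllable counting needs a pronunciation dictionary or a heuristic.

```python
from typing import Callable, Dict


def has_three_lines(output: str) -> bool:
    """True when the output has exactly three non-empty lines."""
    return len([line for line in output.splitlines() if line.strip()]) == 3


# Hypothetical registry mapping criterion names to validator functions.
VALIDATORS: Dict[str, Callable[[str], bool]] = {
    "has_three_lines": has_three_lines,
}


def check_case(case: dict, output: str) -> bool:
    """Evaluate one model output against one test case."""
    if case["type"] == "contains":
        return case["expected"].lower() in output.lower()
    if case["type"] == "custom_validator":
        return all(VALIDATORS[name](output) for name in case["criteria"])
    raise ValueError(f"unknown case type: {case['type']}")


case = {"input": "What is the capital of France?", "expected": "Paris", "type": "contains"}
print(check_case(case, "The capital of France is Paris."))  # True
```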

Production Evaluation

Online Metrics

User satisfaction (thumbs up/down)
Task completion rate
Time to completion
Regeneration rate

Monitoring

Output quality drift
Latency percentiles
Error rates
Cost per interaction
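
Latency percentiles, for instance, can be computed with the standard library alone. This sketch assumes that latencies are collected in milliseconds and uses the default interpolation of `statistics.quantiles`; monitoring systems may define percentiles slightly differently.

```python
from statistics import quantiles


def latency_percentiles(latencies_ms: list) -> dict:
    """p50, p95 and p99 from a list of request latencies."""
    cuts = quantiles(latencies_ms, n=100)  # 99 cut points: p1 .. p99
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


samples = [float(ms) for ms in range(1, 101)]  # illustrative data only
print(latency_percentiles(samples))
```

Tail percentiles such as p95 and p99 often reveal regressions that the median hides, since a few slow requests barely move p50.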

A/B Testing

Compare model versions with real traffic.

Statistical significance
Guardrail metrics
Gradual rollout
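
For a binary outcome such as thumbs up or task completion, statistical significance can be checked with a two-proportion z-test. This is a sketch under a normal approximation, which assumes reasonably large samples. The counts in the example are made up for illustration.

```python
import math


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> tuple:
    """Return (z, two-sided p-value) for rate B versus rate A."""
    rate_a, rate_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # both rates are exactly 0 or exactly 1
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative counts: 500/1000 successes for A, 560/1000 for B.
z, p = two_proportion_z_test(500, 1000, 560, 1000)
print(round(z, 2), p < 0.05)
```

A significant improvement in the primary metric is not sufficient on its own. Guardrail metrics such as latency and error rate should also stay within bounds before the rollout widens.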
