Evaluation measures how well AI systems perform. Benchmarking compares systems against standardised tests. Both are essential for understanding capabilities, limitations, and progress.
Why Evaluation Matters
- Capability assessment — What can this model actually do?
- Model selection — Which model is best for my use case?
- Progress tracking — Is the field improving?
- Safety — Does the model behave as intended?
- Regression detection — Did changes break something?
LLM Benchmarks
General Knowledge
| Benchmark | Description |
|---|---|
| MMLU | 57 subjects from STEM to humanities |
| ARC | Grade-school science questions |
| HellaSwag | Commonsense reasoning |
| TruthfulQA | Tests whether models avoid common falsehoods |
| Winogrande | Pronoun resolution |
Reasoning
| Benchmark | Description |
|---|---|
| GSM8K | Grade-school maths word problems |
| MATH | Competition mathematics |
| BIG-Bench | 200+ diverse reasoning tasks |
| GPQA | Graduate-level science |
| ARC-AGI | Novel reasoning patterns |
Coding
| Benchmark | Description |
|---|---|
| HumanEval | Python function completion |
| MBPP | Mostly basic Python problems |
| SWE-bench | Real GitHub issues |
| LiveCodeBench | Continuously updated coding problems |
Instruction Following
| Benchmark | Description |
|---|---|
| IFEval | Verifiable instruction constraints |
| MT-Bench | Multi-turn conversation quality |
| AlpacaEval | Instruction response quality |
Multilingual
| Benchmark | Description |
|---|---|
| MGSM | Maths in multiple languages |
| XL-Sum | Multilingual summarisation |
| FLORES | Machine translation |
Evaluation Methods
Exact Match
Does the output exactly match the expected answer?
- Simple, objective
- Brittle (fails on equivalent phrasings)
- Good for factual questions and code (scoring sketch below)
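A minimal scoring sketch; lowercasing and whitespace normalisation are assumptions here, since some benchmarks compare raw strings:
```python
def exact_match(prediction: str, reference: str, normalise: bool = True) -> bool:
    """Return True if the prediction matches the reference answer."""
    if normalise:
        # Case- and whitespace-insensitive comparison; stricter setups skip this.
        prediction = " ".join(prediction.lower().split())
        reference = " ".join(reference.lower().split())
    return prediction == reference


def exact_match_score(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their references."""
    matches = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return matches / len(references)
```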
Model-based (LLM-as-Judge)
Use another LLM to evaluate outputs, typically with a rubric prompt like the one below; a minimal wiring sketch follows the challenges list.
Given the question and response, rate the quality 1-5:
Question: {question}
Response: {response}
Advantages:
- Scales to open-ended tasks
- Can evaluate nuance
Challenges:
- Judge model has biases
- Self-preference if same model family
- Consistency across runs
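A minimal sketch of wiring the rubric prompt above to a judge model; `call_judge_model` is a hypothetical placeholder for whichever client you use, and the parsing assumes the judge replies with a bare 1-5 score:
```python
import re

JUDGE_PROMPT = """Given the question and response, rate the quality 1-5:
Question: {question}
Response: {response}
Reply with a single integer."""


def judge_response(question: str, response: str, call_judge_model) -> int | None:
    """Ask a judge LLM to score a response; returns 1-5, or None if unparseable."""
    prompt = JUDGE_PROMPT.format(question=question, response=response)
    reply = call_judge_model(prompt)    # placeholder: your API call goes here
    match = re.search(r"[1-5]", reply)  # tolerate extra words around the score
    return int(match.group()) if match else None
```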
Human Evaluation
Gold standard but expensive and slow.
- Pairwise comparison (A vs B)
- Absolute rating (1-5 scale)
- Rubric-based scoring
Arena / ELO Ratings
Users compare model outputs side by side; the votes are aggregated into Elo-style ratings (update rule sketched below).
- LMSYS Chatbot Arena
- Reflects real-world preference
- Continuously updated
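The classic Elo update is easy to sketch; the K-factor of 32 is an arbitrary choice here, and production leaderboards such as Chatbot Arena fit Bradley-Terry-style models over all votes rather than updating sequentially:
```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after one pairwise vote."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b
```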
Metrics
Classification
- Accuracy — Correct / Total
- Precision — True Positives / Predicted Positives
- Recall — True Positives / Actual Positives
- F1 — Harmonic mean of precision and recall
- AUC-ROC — Area under ROC curve
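A sketch of the label-based metrics above for binary classification (AUC-ROC needs prediction scores rather than hard labels, so it is left out):
```python
def classification_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```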
Generation
- BLEU — N-gram overlap (translation)
- ROUGE — Recall-oriented (summarisation)
- Perplexity — How well the model predicts held-out text (lower is better)
- Pass@K — Probability that at least one of K samples is correct (code); estimator sketched below
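Pass@K is usually computed with the unbiased estimator popularised by the HumanEval paper rather than by literally drawing K samples; n is the number of generations per problem and c the number that pass the tests:
```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k draws from n samples
    (of which c are correct) passes, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failing samples to fill k draws, so one must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```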
Retrieval
- Recall@K — Relevant items in top K
- MRR — Mean reciprocal rank
- NDCG — Normalised discounted cumulative gain (all three sketched below)
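Sketches of all three retrieval metrics, assuming `ranked_ids` is already sorted by retrieval score and per-query relevance judgements are available:
```python
import math


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result (0 if none); average over queries for MRR."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, float], k: int) -> float:
    """NDCG@k with graded relevance; `relevance` maps doc id to a gain (0 if absent)."""
    dcg = sum(relevance.get(d, 0.0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```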
Evaluation Challenges
Benchmark Saturation
Models approach the ceiling on established benchmarks, which then lose discriminative power; MMLU is close to saturated for frontier models.
Data Contamination
Training data may include benchmark questions, which inflates scores; contamination is hard to verify and distorts comparisons.
Goodhart’s Law
“When a measure becomes a target, it ceases to be a good measure.” Optimising for benchmarks doesn’t guarantee real-world performance.
Task Coverage
Benchmarks test specific capabilities. Real use cases are diverse and open-ended.
Evaluation Validity
Does the benchmark measure what we think it measures?
Leaderboards
General
- Open LLM Leaderboard — Open models
- Chatbot Arena — Human preference
- Artificial Analysis — Speed + quality
Specialised
- SWE-bench — Coding agents
- MTEB — Embeddings
- Big Code Leaderboard — Code models
Building Your Own Evaluations
When to Build Custom Evals
- Domain-specific tasks
- Proprietary data
- Specific quality criteria
- Production monitoring
Framework
- Define success criteria — What does “good” look like?
- Create test cases — Representative examples with expected outputs
- Choose metrics — Exact match, similarity, LLM judge
- Automate execution — CI/CD integration (see the harness sketch after this list)
- Track over time — Detect regressions
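A minimal harness sketch under the assumption that test cases live in a YAML file shaped like the example below; `run_model`, the file path, and the pass-rate threshold are placeholders, and only the `contains` check is implemented:
```python
import yaml  # pip install pyyaml


def check(output: str, case: dict) -> bool:
    """Score one test case; only the 'contains' type is sketched here."""
    if case.get("type") == "contains":
        return case["expected"].lower() in output.lower()
    raise NotImplementedError(f"unsupported case type: {case.get('type')}")


def run_suite(cases_path: str, run_model) -> float:
    """Run every case through the model and return the pass rate."""
    with open(cases_path) as f:
        cases = yaml.safe_load(f)
    passed = sum(check(run_model(case["input"]), case) for case in cases)
    return passed / len(cases)


# Wire this into CI and fail the build on regression, e.g.:
# assert run_suite("evals/cases.yaml", run_model) >= 0.9
```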
Tools
- promptfoo — Prompt testing framework
- OpenAI Evals — Evaluation framework
- Inspect — UK AISI’s eval framework
- DeepEval — LLM evaluation
- RAGAS — RAG evaluation
- LangSmith — Tracing and evaluation
Example Test Case
```yaml
- input: "What is the capital of France?"
  expected: "Paris"
  type: contains
- input: "Write a haiku about coding"
  criteria:
    - has_three_lines
    - syllable_count_valid
  type: custom_validator
```
Production Evaluation
Online Metrics
- User satisfaction (thumbs up/down)
- Task completion rate
- Time to completion
- Regeneration rate
Monitoring
- Output quality drift
- Latency percentiles
- Error rates
- Cost per interaction
A/B Testing
Compare model versions with real traffic.
- Statistical significance (a two-proportion test is sketched below)
- Guardrail metrics
- Gradual rollout
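A sketch of a two-proportion z-test for comparing a metric such as task completion rate between the incumbent and candidate models; the 0.05 threshold is a conventional choice, not a requirement:
```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two completion rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Example: 820/1000 completions for the incumbent vs 855/1000 for the candidate.
p_value = two_proportion_z_test(820, 1000, 855, 1000)
significant = p_value < 0.05
```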