Reference notes.
Evaluation measures how well AI systems perform. Benchmarking compares systems against standardised tests. Both are essential for understanding capabilities, limitations, and progress.
Why Evaluation Matters
- Capability assessment — What can this model actually do?
- Model selection — Which model is best for my use case?
- Progress tracking — Is the field improving?
- Safety — Does the model behave as intended?
- Regression detection — Did changes break something?
LLM Benchmarks
General Knowledge
| Benchmark | Description |
|---|---|
| MMLU | 57 subjects from STEM to humanities |
| ARC | Grade-school science questions |
| HellaSwag | Commonsense reasoning |
| TruthfulQA | Whether the model avoids repeating common falsehoods |
| Winogrande | Pronoun resolution |
Reasoning
| Benchmark | Description |
|---|---|
| GSM8K | Grade-school maths word problems |
| MATH | Competition mathematics |
| BIG-Bench | 200+ diverse reasoning tasks |
| GPQA | Graduate-level science |
| ARC-AGI | Novel reasoning patterns |
Coding
| Benchmark | Description |
|---|---|
| HumanEval | Python function completion |
| MBPP | Mostly basic Python problems |
| SWE-bench | Real GitHub issues |
| LiveCodeBench | Continuously updated coding problems |
| AutoCodeBench | 3,920 problems across 20 languages |
AutoCodeBench Language Rankings
Pass@1 scores for Claude Opus 4 (reasoning mode). The upper bound is the best score achieved by any combination of models.
| Rank | Language | Opus 4 Pass@1 | Upper Bound |
|---|---|---|---|
| 1 | Elixir | 80.3% | 97.5% |
| 2 | C# | 74.9% | 88.4% |
| 3 | Kotlin | 72.5% | 89.5% |
| 4 | Racket | 68.9% | 88.3% |
| 5 | Ruby | 61.0% | 79.5% |
| 6 | Java | 55.9% | 78.7% |
| 7 | Julia | 55.5% | 78.0% |
| 8 | Dart | 54.0% | 78.0% |
| 9 | R | 52.5% | 74.2% |
| 10 | Shell | 51.6% | 70.7% |
| 11 | Scala | 50.3% | 77.4% |
| 12 | Swift | 50.0% | 78.0% |
| 13 | TypeScript | 50.0% | 61.3% |
| 14 | Perl | 44.5% | 64.5% |
| 15 | C++ | 44.1% | 74.7% |
| 16 | Python | 40.3% | 63.3% |
| 17 | Rust | 38.7% | 61.3% |
| 18 | JavaScript | 38.6% | 59.2% |
| 19 | Go | 37.2% | 69.1% |
| 20 | PHP | 28.1% | 52.8% |
Instruction Following
| Benchmark | Description |
|---|---|
| IFEval | Verifiable instruction constraints |
| MT-Bench | Multi-turn conversation quality |
| AlpacaEval | Instruction response quality |
Multilingual
| Benchmark | Description |
|---|---|
| MGSM | Maths in multiple languages |
| XL-Sum | Multilingual summarisation |
| FLORES | Machine translation |
Evaluation Methods
Exact Match
Does the output exactly match the expected answer?
- Simple, objective
- Brittle (fails on equivalent phrasings)
- Good for factual questions, code
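A minimal sketch of exact-match scoring, assuming a simple normalisation step (lowercasing, stripping punctuation, collapsing whitespace). The function names are illustrative and not taken from any particular framework. Normalisation removes some of the brittleness, but an equivalent phrasing still fails.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction: str, reference: str, normalized: bool = True) -> bool:
    """Compare two strings, optionally after normalisation."""
    if normalized:
        return normalize(prediction) == normalize(reference)
    return prediction == reference


def exact_match_rate(predictions, references, normalized: bool = True) -> float:
    """Fraction of prediction/reference pairs that match."""
    pairs = list(zip(predictions, references))
    if not pairs:
        return 0.0
    return sum(exact_match(p, r, normalized) for p, r in pairs) / len(pairs)


if __name__ == "__main__":
    print(exact_match("Paris.", "paris"))
    print(exact_match("The capital is Paris", "Paris"))
```

The second call shows the brittleness: the answer is correct but is phrased differently, so it does not match.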
Model-based (LLM-as-Judge)
Use another LLM to evaluate outputs.
Given the question and response, rate the quality 1-5:
Question: {question}
Response: {response}
Advantages:
- Scales to open-ended tasks
- Can evaluate nuance
Challenges:
- Judge model has biases
- Self-preference when the judge and the evaluated model share a model family
- Scores can be inconsistent across runs
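A hedged sketch of wiring the prompt above into a scoring function. The `judge` argument is a hypothetical callable that takes a prompt string and returns the judge model's text reply; no specific provider API is assumed. Averaging over several runs is one simple way to dampen run-to-run inconsistency, but it does not address judge bias.

```python
import re
from statistics import mean
from typing import Callable, Optional

# Prompt template copied from the notes above.
JUDGE_TEMPLATE = (
    "Given the question and response, rate the quality 1-5:\n"
    "Question: {question}\n"
    "Response: {response}\n"
)


def parse_score(judge_output: str) -> Optional[int]:
    """Return the first standalone digit 1-5 in the reply, or None if absent."""
    match = re.search(r"\b([1-5])\b", judge_output)
    return int(match.group(1)) if match else None


def judge_score(
    question: str,
    response: str,
    judge: Callable[[str], str],  # hypothetical: prompt in, reply text out
    runs: int = 3,
) -> Optional[float]:
    """Query the judge several times and average the parseable scores."""
    prompt = JUDGE_TEMPLATE.format(question=question, response=response)
    scores = []
    for _ in range(runs):
        score = parse_score(judge(prompt))
        if score is not None:
            scores.append(score)
    return mean(scores) if scores else None
```

Unparseable replies are skipped rather than counted as zero, so a high skip rate should be tracked separately.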
Human Evaluation
Gold standard but expensive and slow.
- Pairwise comparison (A vs B)
- Absolute rating (1-5 scale)
- Rubric-based scoring
Arena / Elo Ratings
Users compare model outputs side by side, and their votes are aggregated into Elo scores.
- LMSYS Chatbot Arena
- Reflects real-world preference
- Continuously updated
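A minimal sketch of the classic online Elo update applied to pairwise votes. The starting rating of 1000 and K-factor of 32 are assumptions chosen for illustration, and production leaderboards may fit ratings differently (for example with a Bradley-Terry model).

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new ratings; score_a is 1.0 (A wins), 0.5 (tie) or 0.0 (B wins)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def rate(votes, initial: float = 1000.0, k: float = 32.0) -> dict:
    """Process (model_a, model_b, score_a) votes in order and return ratings."""
    ratings = {}
    for model_a, model_b, score_a in votes:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings


if __name__ == "__main__":
    votes = [("model-x", "model-y", 1.0), ("model-y", "model-z", 0.5)]
    print(rate(votes))
```

Because updates are sequential, the result depends on vote order, which is one reason batch fitting is sometimes preferred.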
Metrics
Classification
- Accuracy — Correct / Total
- Precision — True Positives / Predicted Positives
- Recall — True Positives / Actual Positives
- F1 — Harmonic mean of precision and recall
- AUC-ROC — Area under ROC curve
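The definitions above can be sketched in plain Python for the binary case. Function names are illustrative; the AUC here uses the pairwise interpretation (the probability that a random positive is scored above a random negative, ties counting half), which is quadratic in the number of examples and only suitable for small sets.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_report(y_true, y_pred, positive=1) -> dict:
    """Accuracy, precision, recall and F1, with 0.0 for undefined ratios."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc_roc(y_true, scores, positive=1) -> float:
    """Pairwise AUC: share of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```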
Generation
- BLEU — N-gram overlap (translation)
- ROUGE — Recall-oriented (summarisation)
- Perplexity — How surprised is the model by text
- Pass@K — Any of K samples correct (code)
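Two of these metrics reduce to short formulas. The sketch below uses the unbiased Pass@K estimator introduced alongside HumanEval, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct, and computes perplexity as the exponential of the mean negative log-likelihood per token. The input format for perplexity (a list of natural-log token probabilities) is an assumption for illustration.

```python
from math import comb, exp


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct).

    n: samples generated per problem, c: samples that passed, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def perplexity(token_logprobs) -> float:
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return exp(-sum(token_logprobs) / len(token_logprobs))


if __name__ == "__main__":
    # 10 samples, 3 correct: estimated chance that a single draw passes.
    print(pass_at_k(n=10, c=3, k=1))
```

A benchmark-level Pass@K is then the mean of `pass_at_k` over all problems.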
Retrieval
- Recall@K — Relevant items in top K
- MRR — Mean reciprocal rank
- NDCG — Normalised discounted cumulative gain
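A compact sketch of the three retrieval metrics. It assumes each query yields a ranked list of document ids plus either a set of relevant ids (Recall@K, MRR) or a dict of graded gains (NDCG). The log2 discount is the common formulation; other gain and discount variants exist.

```python
from math import log2


def recall_at_k(ranked, relevant, k: int) -> float:
    """Share of relevant items that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries) -> float:
    """queries: iterable of (ranked, relevant) pairs."""
    queries = list(queries)
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def ndcg_at_k(ranked, gains: dict, k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(
        gains.get(item, 0.0) / log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```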
Evaluation Challenges
Benchmark Saturation
As models approach the ceiling on a benchmark, it loses discriminative power: the remaining gaps between models become too small to rank them reliably. MMLU is nearly saturated.
Data Contamination
Training data may include benchmark questions. This is hard to verify and distorts results.
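One crude heuristic, sketched here under strong assumptions, is to measure word n-gram overlap between a benchmark item and whatever sample of training text is available. It only catches near-verbatim copies, requires access to the corpus, and the window size of 8 words is an arbitrary illustrative choice rather than an established threshold.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_docs, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur somewhere in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n tokens, so nothing can be compared.
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


if __name__ == "__main__":
    item = "which planet in the solar system has the most moons"
    corpus = ["quiz: which planet in the solar system has the most moons answer saturn"]
    print(overlap_ratio(item, corpus, n=5))
```

A low ratio does not prove the item is clean, since paraphrased or translated copies leave no n-gram trace.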
Goodhart’s Law
“When a measure becomes a target, it ceases to be a good measure.” Optimising for benchmarks doesn’t guarantee real-world performance.
Task Coverage
Benchmarks test specific capabilities. Real use cases are diverse and open-ended.
Evaluation Validity
Does the benchmark measure what we think it measures?
Benchmark Gaming and Reward Hacking
UC Berkeley’s RDI Center showed in April 2026 that a single automated scanning agent broke all eight major agent benchmarks (including SWE-bench, GAIA, WebArena, OSWorld) by reward hacking — achieving near-perfect scores without genuinely solving tasks. Lesson: any single agent benchmark number should be treated as an upper bound, not a measurement. Cross-benchmark consistency and bespoke private evals matter more than topping a public leaderboard.
Evaluation Awareness
Apollo Research found that GPT-5 explicitly mentions “this looks like an alignment evaluation” in 10-20% of its scheming evals. Models increasingly notice they are being tested, which contaminates measurement of behaviour under deployment conditions.
Leaderboards
General
- Open LLM Leaderboard — Open models
- Chatbot Arena — Human preference
- Artificial Analysis — Speed + quality
Specialised
- SWE-bench — Coding agents
- MTEB — Embeddings
- Big Code Leaderboard — Code models
- METR Time Horizons — Length of task an agent can finish 50% of the time; the cleanest trendline for agentic capability growth
- OSWorld — Computer-use agents on a real desktop
- Tau²-Bench — Multi-turn tool-using agents with policy adherence
- Terminal-Bench — Hard, expert-curated CLI tasks
- LongBench v2 — Bilingual long-context evaluation up to 2M words across six task families. Claude Opus 4.5 led at ~64% as of May 2026.
- ARC-AGI 2 — Successor to ARC-AGI, harder novel-reasoning grid puzzles
Building Your Own Evaluations
When to Build Custom Evals
- Domain-specific tasks
- Proprietary data
- Specific quality criteria
- Production monitoring
Framework
- Define success criteria — What does “good” look like?
- Create test cases — Representative examples with expected outputs
- Choose metrics — Exact match, similarity, LLM judge
- Automate execution — CI/CD integration
- Track over time — Detect regressions
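The steps above can be sketched as a tiny harness. Everything here is hypothetical scaffolding rather than the API of any listed tool: `EvalCase`, `contains`, `has_three_lines` and `run_suite` are illustrative names, and `model` is any callable that maps an input string to an output string. In CI, the `passed` flag would decide whether the job fails.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One test case: an input and a check applied to the model output."""
    input: str
    check: Callable[[str], bool]
    name: str = ""


def contains(expected: str) -> Callable[[str], bool]:
    """Check that the output contains the expected text, ignoring case."""
    return lambda output: expected.lower() in output.lower()


def has_three_lines(output: str) -> bool:
    """Check that the output has exactly three non-empty lines."""
    return len([line for line in output.splitlines() if line.strip()]) == 3


def run_suite(model: Callable[[str], str], cases: List[EvalCase],
              threshold: float = 1.0) -> dict:
    """Run every case and report pass rate, failures and a gate result."""
    results = [(c.name or c.input, bool(c.check(model(c.input)))) for c in cases]
    pass_rate = sum(ok for _, ok in results) / len(results) if results else 0.0
    return {
        "pass_rate": pass_rate,
        "failures": [name for name, ok in results if not ok],
        "passed": pass_rate >= threshold,
    }
```

Storing each run's `pass_rate` alongside the model and prompt version gives the time series needed to spot regressions.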
Tools
- promptfoo — Prompt testing framework
- OpenAI Evals — Evaluation framework
- Inspect — UK AI Security Institute’s eval framework
- DeepEval — LLM evaluation
- RAGAS — RAG evaluation
- LangSmith — Tracing and evaluation
Example Test Case
- input: "What is the capital of France?"
  expected: "Paris"
  type: contains
- input: "Write a haiku about coding"
  criteria:
    - has_three_lines
    - syllable_count_valid
  type: custom_validator
Production Evaluation
Online Metrics
- User satisfaction (thumbs up/down)
- Task completion rate
- Time to completion
- Regeneration rate
Monitoring
- Output quality drift
- Latency percentiles
- Error rates
- Cost per interaction
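A minimal sketch of computing some of these figures from logged interactions. The record fields (`latency_ms`, `error`, `cost_usd`) are assumed names for illustration, and the percentile uses the nearest-rank definition, which is one of several conventions.

```python
import math


def percentile(values, p: float):
    """Nearest-rank percentile for 0 < p <= 100."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(interactions) -> dict:
    """Aggregate latency percentiles, error rate and mean cost."""
    if not interactions:
        raise ValueError("no interactions")
    latencies = [i["latency_ms"] for i in interactions]
    n = len(interactions)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "error_rate": sum(1 for i in interactions if i["error"]) / n,
        "cost_per_interaction": sum(i["cost_usd"] for i in interactions) / n,
    }


if __name__ == "__main__":
    log = [
        {"latency_ms": 120, "error": False, "cost_usd": 0.002},
        {"latency_ms": 900, "error": True, "cost_usd": 0.004},
    ]
    print(summarize(log))
```

Comparing these summaries across time windows is a simple way to surface drift.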
A/B Testing
Compare model versions with real traffic.
- Statistical significance
- Guardrail metrics
- Gradual rollout
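For a binary outcome such as task completion, statistical significance can be checked with a two-proportion z-test. This is a sketch under the usual assumptions (independent samples, large enough counts for the normal approximation); the function name and the 0.05 threshold in the usage line are illustrative choices.

```python
from math import erf, sqrt


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in success rates."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one observation")
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # All outcomes identical in both arms: no evidence of a difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


if __name__ == "__main__":
    z, p = two_proportion_z_test(500, 1000, 600, 1000)
    print(f"z={z:.2f}, significant={p < 0.05}")
```

Guardrail metrics would be tested the same way, with the rollout paused if any of them degrades significantly.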
See Also
- Foundation Models — Models being evaluated
- Embeddings — MTEB leaderboard for embedding models
- AI Agents — Agent-specific benchmarks (SWE-bench, WebArena)