Reference notes.

Evaluation measures how well AI systems perform. Benchmarking compares systems against standardised tests. Both are essential for understanding capabilities, limitations, and progress.

Why Evaluation Matters

  • Capability assessment — What can this model actually do?
  • Model selection — Which model is best for my use case?
  • Progress tracking — Is the field improving?
  • Safety — Does the model behave as intended?
  • Regression detection — Did changes break something?

LLM Benchmarks

General Knowledge

| Benchmark | Description |
|---|---|
| MMLU | 57 subjects from STEM to humanities |
| ARC | Grade-school science questions |
| HellaSwag | Commonsense reasoning |
| TruthfulQA | Avoids generating false claims |
| Winogrande | Pronoun resolution |

Reasoning

| Benchmark | Description |
|---|---|
| GSM8K | Grade-school maths word problems |
| MATH | Competition mathematics |
| BIG-Bench | 200+ diverse reasoning tasks |
| GPQA | Graduate-level science |
| ARC-AGI | Novel reasoning patterns |

Coding

| Benchmark | Description |
|---|---|
| HumanEval | Python function completion |
| MBPP | Mostly basic Python problems |
| SWE-bench | Real GitHub issues |
| LiveCodeBench | Continuously updated coding problems |
| AutoCodeBench | 3,920 problems across 20 languages |

AutoCodeBench Language Rankings

Pass@1 scores using Claude Opus 4 (reasoning mode). Upper bound is the best score achieved by any model combination.

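| Rank | Language | Opus 4 Pass@1 | Upper Bound |
|---|---|---|---|
| 1 | Elixir | 80.3% | 97.5% |
| 2 | C# | 74.9% | 88.4% |
| 3 | Kotlin | 72.5% | 89.5% |
| 4 | Racket | 68.9% | 88.3% |
| 5 | Ruby | 61.0% | 79.5% |
| 6 | Java | 55.9% | 78.7% |
| 7 | Julia | 55.5% | 78.0% |
| 8 | Dart | 54.0% | 78.0% |
| 9 | R | 52.5% | 74.2% |
| 10 | Shell | 51.6% | 70.7% |
| 11 | Scala | 50.3% | 77.4% |
| 12 | Swift | 50.0% | 78.0% |
| 13 | TypeScript | 50.0% | 61.3% |
| 14 | Perl | 44.5% | 64.5% |
| 15 | C++ | 44.1% | 74.7% |
| 16 | Python | 40.3% | 63.3% |
| 17 | Rust | 38.7% | 61.3% |
| 18 | JavaScript | 38.6% | 59.2% |
| 19 | Go | 37.2% | 69.1% |
| 20 | PHP | 28.1% | 52.8% |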

Instruction Following

| Benchmark | Description |
|---|---|
| IFEval | Verifiable instruction constraints |
| MT-Bench | Multi-turn conversation quality |
| AlpacaEval | Instruction response quality |

Multilingual

| Benchmark | Description |
|---|---|
| MGSM | Maths in multiple languages |
| XL-Sum | Multilingual summarisation |
| FLORES | Machine translation |

Evaluation Methods

Exact Match

Does the output exactly match the expected answer?

  • Simple, objective
  • Brittle (fails on equivalent phrasings)
  • Good for factual questions, code
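
A minimal sketch of exact-match scoring, with an optional normalisation step that softens some of the brittleness noted above. The function names and the normalisation rules (lowercasing, stripping punctuation, collapsing whitespace) are illustrative assumptions, not a standard.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction: str, expected: str, normalized: bool = True) -> bool:
    """Return True if the prediction equals the expected answer.

    With normalized=True, both strings are normalised first (an assumed,
    illustrative policy); with normalized=False the comparison is strict.
    """
    if normalized:
        return normalize(prediction) == normalize(expected)
    return prediction == expected
```

Normalisation only removes surface differences. An equivalent phrasing such as "The capital is Paris" still fails against "Paris", which is why exact match suits short factual answers better than free-form text.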

Model-based (LLM-as-Judge)

Use another LLM to evaluate outputs.

Given the question and response, rate the quality 1-5:
Question: {question}
Response: {response}

Advantages:

  • Scales to open-ended tasks
  • Can evaluate nuance

Challenges:

  • Judge model has biases
  • Self-preference if same model family
  • Consistency across runs
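
The judging loop can be sketched as follows, assuming a hypothetical `call_judge` callable that sends a prompt to whichever judge model is in use and returns its text reply. Averaging several runs is one simple way to dampen run-to-run inconsistency; it does not address judge bias or self-preference.

```python
import re
from statistics import mean
from typing import Callable, Optional

# Same template as the prompt above.
JUDGE_PROMPT = (
    "Given the question and response, rate the quality 1-5:\n"
    "Question: {question}\n"
    "Response: {response}\n"
)


def parse_rating(judge_output: str) -> Optional[int]:
    """Return the first standalone digit 1-5 in the judge's reply, or None."""
    match = re.search(r"\b([1-5])\b", judge_output)
    return int(match.group(1)) if match else None


def judge_score(
    question: str,
    response: str,
    call_judge: Callable[[str], str],  # hypothetical: prompt in, reply text out
    runs: int = 3,
) -> Optional[float]:
    """Average the parsed rating over several judge calls.

    Replies that contain no parsable rating are skipped; returns None if
    every reply was unparsable.
    """
    prompt = JUDGE_PROMPT.format(question=question, response=response)
    ratings = [parse_rating(call_judge(prompt)) for _ in range(runs)]
    valid = [r for r in ratings if r is not None]
    return mean(valid) if valid else None
```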

Human Evaluation

Gold standard but expensive and slow.

  • Pairwise comparison (A vs B)
  • Absolute rating (1-5 scale)
  • Rubric-based scoring

Arena / Elo Ratings

Users compare model outputs side by side, and their votes are aggregated into Elo ratings.
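
As a rough illustration, the classic Elo update after a single pairwise vote looks like this. The K-factor of 32 and the 400-point scale are the conventional chess defaults, used here as assumptions; public leaderboards may fit ratings differently.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

With two models both rated 1000, a win for A moves the ratings to 1016 and 984. An upset win against a higher-rated model moves the ratings further than an expected win.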

Metrics

Classification

  • Accuracy — Correct / Total
  • Precision — True Positives / Predicted Positives
  • Recall — True Positives / Actual Positives
  • F1 — Harmonic mean of precision and recall
  • AUC-ROC — Area under ROC curve
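
The first four metrics can be computed directly from label lists, as in the sketch below. The function name and the convention of returning 0.0 when a denominator is zero are assumptions. AUC-ROC is omitted because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true: list, y_pred: list, positive=1) -> dict:
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # TP / predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # TP / actual positives
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```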

Generation

  • BLEU — N-gram overlap (translation)
  • ROUGE — Recall-oriented (summarisation)
  • Perplexity — How surprised is the model by text
  • Pass@K — Any of K samples correct (code)
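
Two of these are easy to compute by hand. The sketch below uses the unbiased Pass@K estimator introduced with HumanEval (generate n samples per problem, count the c that pass, and estimate the chance that at least one of k drawn samples passes), plus perplexity from per-token log-probabilities. It assumes natural-log probabilities are already available from the model.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n samples, of which c were correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def perplexity(token_logprobs: list[float]) -> float:
    """Exponential of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

For a benchmark score, `pass_at_k` is averaged over all problems. A model that assigns every token probability 0.25 has a perplexity of 4, which matches the intuition of choosing uniformly among four options.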

Retrieval

  • Recall@K — Relevant items in top K
  • MRR — Mean reciprocal rank
  • NDCG — Normalised discounted cumulative gain
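
A compact sketch of the three retrieval metrics follows. The function signatures are illustrative assumptions. NDCG here uses the linear-gain form with a log2 position discount; some implementations use an exponential gain instead.

```python
import math


def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(rankings: list[list], relevants: list[set]) -> float:
    """Mean over queries of 1 / rank of the first relevant result."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevants):
        for position, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / position
                break  # only the first relevant hit counts
    return total / len(rankings) if rankings else 0.0


def ndcg_at_k(ranked: list, gains: dict, k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(
        gains.get(item, 0.0) / math.log2(position + 1)
        for position, item in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(position + 1) for position, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```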

Evaluation Challenges

Benchmark Saturation

As models approach the score ceiling on a benchmark, the benchmark loses its power to discriminate between them. MMLU is nearly saturated.

Data Contamination

Training data may include benchmark questions. Contamination is hard to verify and distorts results.
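
One common heuristic, where the training corpus is accessible, is to look for long word n-grams shared between a benchmark item and the corpus. The sketch below is a simplified illustration: the 8-word window and whitespace tokenisation are assumptions, and a low overlap does not prove that an item is clean (paraphrased or translated copies would be missed).

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams in the text (whitespace tokenised)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the corpus text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    corpus_grams = ngrams(corpus_text, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

Items with a high ratio are candidates for manual review or exclusion. At real corpus scale the n-gram set would need an index or hashing scheme rather than an in-memory set.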

Goodhart’s Law

“When a measure becomes a target, it ceases to be a good measure.” Optimising for benchmarks doesn’t guarantee real-world performance.

Task Coverage

Benchmarks test specific capabilities. Real use cases are diverse and open-ended.

Evaluation Validity

Does the benchmark measure what we think it measures?

Benchmark Gaming and Reward Hacking

UC Berkeley’s RDI Center showed in April 2026 that a single automated scanning agent broke all eight major agent benchmarks (including SWE-bench, GAIA, WebArena, OSWorld) by reward hacking — achieving near-perfect scores without genuinely solving tasks. Lesson: any single agent benchmark number should be treated as an upper bound, not a measurement. Cross-benchmark consistency and bespoke private evals matter more than topping a public leaderboard.

Evaluation Awareness

Apollo Research found that GPT-5 explicitly mentions “this looks like an alignment evaluation” in 10-20% of its scheming evals. Models increasingly notice they are being tested, which contaminates measurement of behaviour under deployment conditions.

Leaderboards

Specialised

  • SWE-bench — Coding agents
  • MTEB — Embeddings
  • Big Code Leaderboard — Code models
  • METR Time Horizons — Length of task an agent can finish 50% of the time; the cleanest trendline for agentic capability growth
  • OSWorld — Computer-use agents on a real desktop
  • Tau²-Bench — Multi-turn tool-using agents with policy adherence
  • Terminal-Bench — Hard, expert-curated CLI tasks
  • LongBench v2 — Bilingual long-context evaluation up to 2M words across six task families. Claude Opus 4.5 led at ~64% as of May 2026.
  • ARC-AGI 2 — Successor to ARC-AGI, harder novel-reasoning grid puzzles

Building Your Own Evaluations

When to Build Custom Evals

  • Domain-specific tasks
  • Proprietary data
  • Specific quality criteria
  • Production monitoring

Framework

  1. Define success criteria — What does “good” look like?
  2. Create test cases — Representative examples with expected outputs
  3. Choose metrics — Exact match, similarity, LLM judge
  4. Automate execution — CI/CD integration
  5. Track over time — Detect regressions
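
Steps 4 and 5 can be as simple as comparing the current run's scores against a stored baseline and failing the build on a drop. The sketch below assumes higher is better for every metric; the metric names and the 0.01 tolerance are hypothetical.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, float]:
    """Return each metric whose score dropped by more than `tolerance`.

    Metrics missing from the current run are skipped rather than flagged.
    """
    regressions = {}
    for name, old_score in baseline.items():
        new_score = current.get(name)
        if new_score is None:
            continue
        drop = old_score - new_score
        if drop > tolerance:
            regressions[name] = drop
    return regressions
```

In a CI job, a non-empty result would typically cause a non-zero exit code so that the change is blocked until someone reviews the drop.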

Example Test Case

- input: "What is the capital of France?"
  expected: "Paris"
  type: contains
 
- input: "Write a haiku about coding"
  criteria:
    - has_three_lines
    - syllable_count_valid
  type: custom_validator
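
A runner for test cases shaped like the ones above might look like this sketch. It assumes the cases have already been loaded into Python dicts and that `model` is a hypothetical callable mapping an input string to an output string. Only `has_three_lines` is implemented; `syllable_count_valid` is left out because reliable syllable counting needs a pronunciation dictionary or a heuristic. The case-insensitive `contains` check is also an assumption.

```python
from typing import Callable


def has_three_lines(output: str) -> bool:
    """True if the output has exactly three non-blank lines."""
    return len([line for line in output.splitlines() if line.strip()]) == 3


# Registry mapping criterion names to validator functions (illustrative).
VALIDATORS: dict[str, Callable[[str], bool]] = {
    "has_three_lines": has_three_lines,
}


def run_case(case: dict, model: Callable[[str], str]) -> bool:
    """Run one test case and return whether it passed."""
    output = model(case["input"])
    if case["type"] == "contains":
        return case["expected"].lower() in output.lower()
    if case["type"] == "custom_validator":
        return all(VALIDATORS[name](output) for name in case["criteria"])
    raise ValueError(f"unknown test type: {case['type']}")


def pass_rate(cases: list[dict], model: Callable[[str], str]) -> float:
    """Fraction of cases that pass."""
    return sum(run_case(c, model) for c in cases) / len(cases) if cases else 0.0
```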

Production Evaluation

Online Metrics

  • User satisfaction (thumbs up/down)
  • Task completion rate
  • Time to completion
  • Regeneration rate

Monitoring

  • Output quality drift
  • Latency percentiles
  • Error rates
  • Cost per interaction

A/B Testing

Compare model versions with real traffic.

  • Statistical significance
  • Guardrail metrics
  • Gradual rollout
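
For a binary outcome such as task completion or thumbs-up rate, significance can be checked with a two-proportion z-test. This is a minimal sketch under the usual assumptions (independent samples, counts large enough for the normal approximation); the function name is illustrative.

```python
import math


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates.

    A positive z means variant B has the higher success rate.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # all successes or all failures in both arms
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With 500 of 1,000 successes for the current model and 600 of 1,000 for the candidate, z is about 4.5 and the p-value is far below 0.05. Guardrail metrics such as latency or error rate would be checked separately before widening the rollout.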
