Reference notes.
Large Language Models (LLMs) are neural networks trained on vast text corpora to predict the next token in a sequence. Despite this simple objective, they exhibit emergent capabilities including reasoning, code generation, and general knowledge.
Architecture
Transformers
The transformer architecture (Vaswani et al., 2017) revolutionised NLP by replacing recurrence with self-attention, enabling parallel processing and better modelling of long-range dependencies.
Key components:
- Self-attention — Allows each token to attend to all other tokens, computing relevance weights
- Multi-head attention — Multiple attention heads capture different relationship types
- Positional encoding — Injects sequence order information (sinusoidal or learned)
- Feed-forward layers — Process each position independently after attention
- Layer normalisation — Stabilises training
- Residual connections — Skip connections around each sub-layer, enabling gradient flow through deep networks
Decoder-only vs Encoder-Decoder
- Decoder-only (GPT, Llama, Claude) — Autoregressive, predicts next token, used for generation
- Encoder-decoder (T5, BART) — Encoder processes input, decoder generates output, used for translation/summarisation
- Encoder-only (BERT) — Bidirectional, used for classification and embeddings
Attention Mechanism — The Maths
Scaled Dot-Product Attention
Each token produces three vectors from its embedding via learned weight matrices $W_Q$, $W_K$, $W_V$:
- Query (Q) — “What am I looking for?”
- Key (K) — “What do I contain?”
- Value (V) — “What information do I provide?”
The attention function computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors. The dot product $QK^\top$ produces an $n \times n$ matrix of raw attention scores between every pair of tokens.
Why scale by $\sqrt{d_k}$? — As $d_k$ grows, dot products grow in magnitude, pushing softmax into regions with extremely small gradients (saturation). Dividing by $\sqrt{d_k}$ keeps the variance of the dot products roughly at 1 regardless of dimension, ensuring softmax produces useful gradients during training.
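For intuition, the whole operation fits in a few lines of NumPy (a minimal sketch — single sequence, no masking or batching; names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) raw attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d_v) weighted values

# Toy example: 3 tokens, d_k = d_v = 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value vectors.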
Multi-Head Attention (MHA)
Rather than computing a single attention function, MHA runs $h$ attention heads in parallel, each with its own learned projections:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O \quad \text{where} \quad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
Each head learns to attend to different aspects — one head might track syntactic relationships, another semantic similarity, another positional patterns.
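The split–project–concatenate pattern can be sketched as follows (assumes the model dimension is divisible by $h$; no masking; all names illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, h):
    n, d = x.shape
    d_h = d // h
    # Project, then split the model dimension into h heads: (h, n, d_h)
    Q = (x @ W_q).reshape(n, h, d_h).transpose(1, 0, 2)
    K = (x @ W_k).reshape(n, h, d_h).transpose(1, 0, 2)
    V = (x @ W_v).reshape(n, h, d_h).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)  # (h, n, n)
    heads = softmax(scores) @ V                       # (h, n, d_h)
    concat = heads.transpose(1, 0, 2).reshape(n, d)   # Concat(head_1..h)
    return concat @ W_o                               # final output projection

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))                           # 5 tokens, d_model = 8
W = [rng.normal(size=(8, 8)) * 0.1 for _ in range(4)]
y = multi_head_attention(x, *W, h=2)
```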
Efficiency Variants
Standard MHA requires storing separate K and V for every head, which dominates memory during inference (the KV cache). Modern models use variants that reduce this cost:
- Multi-Query Attention (MQA) — All query heads share a single K and V head. Drastically reduces KV cache size (~8x for 8-head models) at a small quality cost. Used in PaLM, Falcon.
- Grouped-Query Attention (GQA) — Compromise between MHA and MQA. Query heads are divided into groups, each sharing one K/V head. Achieves most of MQA’s memory savings with near-MHA quality. Now the default in Llama 3, Mistral, and most modern open-weight models.
- Multi-Head Latent Attention (MLA) — DeepSeek’s approach. Compresses K and V into a low-rank latent representation shared across heads, then projects back to per-head keys and values. Achieves aggressive KV cache compression while retaining full MHA expressiveness. Used in DeepSeek-V2 and V3.
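The MHA/MQA/GQA spectrum reduces to how many KV heads are stored. A GQA sketch, under the simplifying assumption that consecutive query heads share one KV head — note K and V carry only `n_kv_heads` heads, which is exactly where the cache saving comes from:

```python
import numpy as np

def grouped_query_attention(Q, K, V, n_q_heads, n_kv_heads):
    """Q: (n_q_heads, n, d); K, V: (n_kv_heads, n, d).
    Each group of n_q_heads // n_kv_heads query heads shares one K/V head."""
    group = n_q_heads // n_kv_heads
    K = np.repeat(K, group, axis=0)   # broadcast each KV head to its query group
    V = np.repeat(V, group, axis=0)
    d = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(2)
Q = rng.normal(size=(8, 4, 16))       # 8 query heads
K = rng.normal(size=(2, 4, 16))       # only 2 KV heads -> 4x smaller KV cache
V = rng.normal(size=(2, 4, 16))
out = grouped_query_attention(Q, K, V, 8, 2)
```

With `n_kv_heads == n_q_heads` this is MHA; with `n_kv_heads == 1` it is MQA.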
Cross-Attention
Used in encoder-decoder models. The decoder’s queries attend to the encoder’s keys and values, allowing the decoder to “look at” the input sequence while generating output. Q comes from the decoder, K and V come from the encoder.
Forgetting Attention (FoX)
The Forgetting Transformer (Lin et al., ICLR 2025) adds a learned forget gate that down-weights attention scores in a data-dependent way. Outperforms standard transformers on long-context language modelling and length extrapolation, requires no positional embeddings, and is compatible with FlashAttention.
Positional Encoding
Transformers process all tokens in parallel — unlike RNNs, they have no inherent notion of sequence order. Positional encodings inject this information.
- Sinusoidal (original transformer) — Fixed sine and cosine functions of different frequencies. Position $pos$ at dimension pair $2i$, $2i{+}1$: $PE_{(pos, 2i)} = \sin\left(pos/10000^{2i/d_{\text{model}}}\right)$, $PE_{(pos, 2i+1)} = \cos\left(pos/10000^{2i/d_{\text{model}}}\right)$. Allows the model to learn relative positions through linear combinations.
- Learned embeddings — Trainable position vectors added to token embeddings. Simple but limited to the maximum length seen during training.
- RoPE (Rotary Position Embedding) — Rotates Q and K vectors by an angle proportional to their position. The dot product between Q and K then naturally depends on relative position. Preserves properties of the original attention mechanism. Used by Llama, Mistral, Qwen, and most modern models.
- ALiBi (Attention with Linear Biases) — Doesn’t modify embeddings at all. Instead, adds a position-dependent linear penalty to attention scores: tokens further apart get a larger negative bias. Simple and effective for length generalisation beyond training length.
- PaTH Attention (MIT, 2025) — Makes positional information adaptive and context-aware rather than static, modelling how meaning changes along the path between tokens. Outperforms RoPE on reasoning benchmarks.
Why it matters for length generalisation: Models trained at one context length often degrade at longer lengths. RoPE and ALiBi generalise better than learned embeddings because they encode relative rather than absolute positions.
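The original sinusoidal scheme is simple enough to sketch directly (NumPy; `sinusoidal_pe` is an illustrative name):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]           # even dimensions
    angles = pos / np.power(10000.0, i / d_model)   # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(128, 64)   # added to token embeddings before the first layer
```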
Mixture of Experts (MoE)
MoE replaces the standard feed-forward network (FFN) in each transformer layer with multiple parallel “expert” FFNs, of which only a few are activated per token. This decouples total parameter count from compute cost.
Architecture
Token → Router/Gate → Top-k experts selected → Expert FFNs → Weighted sum of outputs
The router (or gating network) is a small learned linear layer that produces a probability distribution over experts. For each token, it selects the top-$k$ experts and weights their outputs by the gating probabilities.
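A single-token routing sketch (illustrative names; real routers operate on batches and add noise or bias terms for load balancing, and the "experts" here are stand-in linear maps rather than full FFNs):

```python
import numpy as np

def moe_layer(x, router_W, experts, k=2):
    """x: (d,). Route one token to its top-k experts, weight by gate probs."""
    logits = x @ router_W                        # (n_experts,) router scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over experts
    top = np.argsort(probs)[-k:]                 # indices of the top-k experts
    gates = probs[top] / probs[top].sum()        # renormalise over selected
    return sum(g * experts[i](x) for g, i in zip(gates, top))

rng = np.random.default_rng(3)
d, n_experts = 8, 4
router_W = rng.normal(size=(d, n_experts))
Ws = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: x @ W for W in Ws]     # stand-ins for expert FFNs
y = moe_layer(rng.normal(size=d), router_W, experts, k=2)
```

Only `k` of the `n_experts` weight matrices are touched per token — that is the compute saving.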
Load Balancing
A naive router tends to collapse — sending all tokens to the same few experts while others go unused. Solutions:
- Auxiliary load-balancing loss — An extra loss term penalising uneven expert utilisation. Added to the main training loss.
- DeepSeek’s auxiliary-loss-free balancing — Adds learned bias terms to expert selection, adjusted during training when experts become overloaded. Fiddlier but performs better.
- Expert capacity limits — Cap the number of tokens each expert can process. Overflow tokens are dropped or routed to a fallback.
“Super Experts”
Research has identified a small subset of experts (“super experts”) that are disproportionately critical. Pruning them causes catastrophic performance collapse, while other experts can be removed with minimal impact. These super experts are the primary source of outlier activations in transformers.
Real Examples
| Model | Total Params | Active Params | Experts | Top-k |
|---|---|---|---|---|
| Mixtral 8x7B | 47B | 13B | 8 | 2 |
| DeepSeek-V3 | 671B | 37B | 256 | 8 |
| Llama 4 Scout | 109B | 17B | 16 | 1 |
| Llama 4 Maverick | 400B | 17B | 128 | 1 |
| Qwen3-235B | 235B | 22B | 128 | 8 |
| Kimi K2 | 1T | 32B | — | — |
Trade-offs
- More total parameters for the same active compute → higher capacity
- ~70% compute reduction versus dense models of similar capability
- Harder to train — load balancing, expert collapse, communication overhead
- Harder to serve — expert parallelism across GPUs, irregular memory access patterns
- Now dominant: over 60% of frontier model releases in 2025 use MoE
Alternative Architectures
While Transformers dominate, alternative architectures are gaining traction to address the quadratic scaling of attention:
- State Space Models (SSMs) (e.g., Mamba, Jamba) — Linear time complexity, constant memory inference, better handling of infinite context.
- Hybrid Models (e.g., Jamba, Qwen3-Next, Nemotron 3) — Interleave transformer layers with SSM or linear attention layers. State-space layers handle long-range structure efficiently; transformer layers handle precise recall. Jamba achieved a 256K context window on a single GPU.
- Diffusion LLMs (LLaDA, TiDAR) — Treat text generation as a denoising process, predicting multiple tokens per step in parallel rather than left-to-right.
- Linear Attention Hybrids (Gated DeltaNets) — Replace softmax attention with gated linear recurrences, scaling linearly with sequence length. Used in Qwen3-Next and Kimi Linear.
Despite these innovations, no sub-quadratic or hybrid model currently scores in the top 10 on LMSys — the “Transformer++” remains the default when compute is not a constraint.
Tokenisation
Text must be converted to tokens before processing. Common approaches:
- Byte-Pair Encoding (BPE) — Iteratively merges frequent character pairs. Start with individual characters, count adjacent pair frequencies, merge the most common pair into a new token, repeat until vocabulary size is reached. The merge order becomes the tokeniser’s ruleset.
- WordPiece — Similar to BPE, used by BERT
- SentencePiece — Language-agnostic; operates directly on raw text with no pre-tokenisation, treating whitespace as an ordinary symbol. Wraps BPE and unigram algorithms.
- Tiktoken — OpenAI’s fast BPE implementation
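The BPE merge loop described above, as a toy sketch on a four-word corpus (illustrative names; production tokenisers work on word frequencies and byte-level alphabets):

```python
from collections import Counter

def merge_pair(w, a, b):
    """Replace every adjacent (a, b) in the token list w with the merged token."""
    out, i = [], 0
    while i < len(w):
        if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
            out.append(a + b); i += 2
        else:
            out.append(w[i]); i += 1
    return out

def bpe_train(words, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent pair."""
    vocab = [list(w) for w in words]              # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in vocab:
            pairs.update(zip(w, w[1:]))           # count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]       # most frequent pair
        merges.append((a, b))
        vocab = [merge_pair(w, a, b) for w in vocab]
    return merges

merges = bpe_train(["lower", "lowest", "newer", "wider"], num_merges=3)
```

The recorded merge order is the tokeniser's ruleset: to encode new text, apply the same merges in the same order.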
Vocabulary size trade-offs — Typical range: 32K–128K tokens. Larger vocabularies mean more common words are single tokens (efficient context usage, better multilingual coverage) but increase the embedding table size and may cause rare tokens to be poorly learned. Smaller vocabularies are more token-efficient for common patterns but waste context on character-level tokenisation of uncommon words.
Tokenisation affects:
- Context window utilisation (more efficient = more content)
- Multilingual performance (some tokenisers favour English)
- Cost (APIs charge per token)
Context Windows
The context window is the maximum number of tokens a model can process at once.
- GPT-5.2: 400K tokens
- Claude Opus/Sonnet 4.6: 1M tokens
- Gemini 3 Pro: 1M tokens
- Llama 4 Scout: 10M tokens (largest available)
Longer contexts enable:
- Processing entire codebases or documents
- Extended conversations with memory
- Complex reasoning chains
Trade-offs:
- Attention is O(n²) — longer contexts are computationally expensive
- “Lost in the middle” — models may struggle with information in the middle of long contexts
- Cost scales with context length
Training
See Fine-tuning for adapting models to specific tasks after pre-training.
Data
Training data quality matters as much as scale. The pipeline:
- Collection — Web crawls (Common Crawl), books, code (GitHub, Stack), academic papers, Wikipedia, curated datasets
- Cleaning — Remove boilerplate, HTML, duplicates, low-quality pages, toxic content
- Deduplication — MinHash or exact dedup at document and paragraph level. Duplicates waste compute and can cause memorisation
- Quality filtering — Classifier-based scoring, perplexity filtering (remove text a language model finds too easy or too hard), heuristic rules
- Data mix — The ratio of web/books/code/academic/multilingual data. Heavily impacts downstream capabilities. Meta’s AutoMixer (2025) showed that checkpoint artefacts during training encode information about optimal data mixtures.
A 1B model trained on curated data can match a 3B model trained on raw web scrapes.
Pre-training
Self-supervised learning on massive text corpora. The model learns to predict the next token, acquiring grammar, world knowledge, and reasoning patterns.
Objective: Minimise cross-entropy loss between the predicted token distribution and the actual next token:

$$\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$$
Perplexity is the standard evaluation metric: $\text{PPL} = e^{\mathcal{L}}$. Lower is better. Intuition: “how many tokens is the model equally uncertain between?” A perplexity of 10 means the model is, on average, as uncertain as if choosing uniformly among 10 tokens.
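A worked check of the loss–perplexity relationship (hypothetical per-token probabilities):

```python
import math

# Cross-entropy per token is -log P(correct token); perplexity = exp(mean loss)
token_probs = [0.5, 0.25, 0.125, 0.5]            # model's P(actual next token)
losses = [-math.log(p) for p in token_probs]
ppl = math.exp(sum(losses) / len(losses))        # ≈ 3.36 here

# A model uniformly uncertain over 10 tokens has perplexity exactly 10
assert abs(math.exp(-math.log(1 / 10)) - 10) < 1e-9
```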
Post-training
Refining the base model for specific behaviours:
- Supervised Fine-Tuning (SFT) — Training on curated instruction-response pairs
- RLHF/DPO/GRPO — Alignment via reinforcement learning (see the RL for LLMs section below)
- RLVR (RL from Verifiable Rewards) — Training against automatically verifiable rewards (maths, code). The dominant new training stage in 2025 — models spontaneously develop reasoning strategies. Unlike RLHF, the reward is objective and non-gameable, allowing much longer optimisation runs. Most capability progress in 2025 came from longer RL runs rather than larger models.
See Fine-tuning for practical details on SFT, LoRA, QLoRA, and alignment techniques.
Distributed Training
Training large models requires splitting work across many GPUs. Key strategies:
- Data parallelism — Each GPU holds a full model copy, processes different data batches, gradients are averaged. Simple but memory-limited by model size.
- Tensor parallelism (TP) — Splits individual layers across GPUs on the same node. Each GPU computes part of a matrix multiplication. Low latency but requires fast interconnect (NVLink).
- Pipeline parallelism (PP) — Splits layers across GPUs. GPU 1 runs layers 1-20, GPU 2 runs 21-40, etc. Creates pipeline bubbles (idle time) but lower communication overhead.
- Expert parallelism (EP) — For MoE models, distributes experts across GPUs. Tokens are routed to the GPU holding their selected expert.
- ZeRO (Zero Redundancy Optimiser) — Partitions optimiser states (Stage 1), gradients (Stage 2), and parameters (Stage 3) across GPUs, eliminating redundant copies. Developed by DeepSpeed.
- FSDP (Fully Sharded Data Parallelism) — PyTorch’s implementation of ZeRO Stage 3. Shards model parameters, gradients, and optimiser states across all GPUs, only gathering them when needed for computation.
In practice, frontier models combine all of these. DeepSeek-V3 used TP within nodes, PP across nodes, and EP for expert distribution.
Compute Requirements
FLOPs estimation — A common approximation for a forward pass through a dense transformer:

$$C_{\text{forward}} \approx 2ND$$

where $N$ is parameter count and $D$ is the number of training tokens. The backward pass costs roughly twice the forward pass, so total training FLOPs $\approx 6ND$.
| Model | Parameters | Tokens | Estimated FLOPs | GPU Hours (approx) |
|---|---|---|---|---|
| Llama 3 8B | 8B | 15T | ~7×10²³ | ~30K H100 hours |
| Llama 3 70B | 70B | 15T | ~6×10²⁴ | ~250K H100 hours |
| Llama 3 405B | 405B | 15T | ~4×10²⁵ | ~1.5M H100 hours |
Frontier model training runs cost $500M+ in compute.
Training Stability
- Learning rate scheduling — Warmup phase (linear ramp from near-zero) followed by cosine decay. Warmup prevents early instability from large gradients; cosine decay helps convergence.
- Gradient clipping — Cap gradient norms to prevent explosion. Typical max norm: 1.0.
- Loss spikes — Sudden increases in loss during training. Can be caused by bad data batches, numerical instability, or learning rate issues. Often require rolling back to a checkpoint and skipping the problematic data.
- Mixed precision (bf16/fp16) — Train with lower precision to halve memory usage and increase throughput. bf16 (bfloat16) is preferred over fp16 because it has the same exponent range as fp32, avoiding overflow/underflow. Loss scaling is needed for fp16 but not bf16.
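The warmup-plus-cosine schedule can be written in a few lines (illustrative function name; the 3e-4 peak below is a hypothetical value):

```python
import math

def lr_schedule(step, warmup_steps, total_steps, peak_lr, min_lr=0.0):
    """Linear warmup from near-zero to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))       # decays 1 -> 0
    return min_lr + (peak_lr - min_lr) * cosine

# Peak at the end of warmup, fully decayed by the final step
assert lr_schedule(999, 1000, 10000, 3e-4) == 3e-4
assert abs(lr_schedule(10000, 1000, 10000, 3e-4)) < 1e-9
```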
Scaling Laws
Power-law relationships govern how model performance (loss) scales with compute, data, and parameters.
Kaplan et al. (2020) — Showed that loss decreases as a power law of compute, data size, and parameter count. Concluded that scaling parameters matters most — train large models on “enough” data.
Chinchilla (Hoffmann et al., 2022) — Challenged Kaplan by showing data and parameters should scale equally for compute-optimal training. For a given compute budget, the optimal model is smaller but trained on more data than Kaplan suggested. Implication: many existing models (GPT-3, PaLM) were undertrained on data.
Chinchilla’s impact: Led directly to Llama’s approach — smaller models (7B, 13B) trained on far more data (1-2T tokens) than prior practice, achieving disproportionately strong results.
Inference scaling laws — Test-time compute scaling (o1, R1). Performance improves logarithmically with the number of reasoning tokens generated before answering. This opened a second axis of scaling beyond training compute.
Densing Law (Xiao et al., 2025) — Capability density (capability per parameter) doubles roughly every 3.5 months. Equivalent performance can be achieved with exponentially fewer parameters over time, driven by better data, architectures (MoE), and training techniques.
Data exhaustion — Current consumption patterns suggest exhaustion of public text data by 2026–2028. Synthetic data generation is the primary proposed solution.
Inference
Temperature
Controls randomness in token selection:
- 0 — Deterministic, always picks highest probability token (greedy)
- 0.1–0.7 — Balanced creativity and coherence
- 1.0+ — More random, creative, potentially incoherent
Top-p (Nucleus Sampling)
Samples from the smallest set of most-likely tokens whose cumulative probability reaches p. Top-p of 0.9 means only the most likely tokens summing to 90% probability are considered.
Top-k
Only considers the k most likely tokens at each step.
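The three controls combine as in this sketch (simplified — real implementations typically mask logits to −∞ before the softmax, which is equivalent to the post-softmax filtering and renormalisation used here; names are illustrative):

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Greedy when temperature == 0; otherwise filter the distribution, then sample."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))                  # greedy decoding
    probs = np.exp(logits / temperature - np.max(logits / temperature))
    probs /= probs.sum()
    if top_k is not None:                              # keep the k most likely
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:                              # nucleus sampling
        order = np.argsort(probs)[::-1]
        keep = np.cumsum(probs[order]) <= top_p
        keep[0] = True                                 # always keep the top token
        mask = np.zeros_like(probs)
        mask[order[keep]] = 1.0
        probs *= mask
    probs /= probs.sum()                               # renormalise survivors
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))

token = sample([2.0, 1.0, 0.1], temperature=0.0)       # greedy -> index 0
```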
Key Optimisations
See Model Serving for deployment details, quantisation formats, and inference server comparison.
KV Cache — In autoregressive generation, each new token’s attention computation needs the keys and values of all previous tokens. Rather than recomputing them, we cache them. Memory cost: $2 \times n_{\text{layers}} \times n_{\text{kv heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{batch} \times \text{bytes per value}$. For a 70B model at fp16 with 32K context, the KV cache alone can exceed 10GB — often the primary memory bottleneck during inference.
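A quick sizing helper (the 80-layer / 8-KV-head / 128-dim config below is a hypothetical GQA 70B-class layout for illustration, not any model's published numbers):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """2x covers K and V; one entry per layer, per KV head, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical 70B-class GQA layout: 80 layers, 8 KV heads, head_dim 128,
# fp16 (2 bytes per value), 32K context, batch size 1
gb = kv_cache_bytes(80, 8, 128, 32_768, 1) / 1e9   # ≈ 10.7 GB
```

Note the linear dependence on `seq_len` and `n_kv_heads`: doubling context doubles the cache, and GQA/MQA shrink it by the ratio of query heads to KV heads.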
Flash Attention — Standard attention materialises the full attention matrix in HBM (GPU main memory), which is slow and O(n²) in memory. FlashAttention (Dao et al., 2022) tiles the computation into blocks that fit in SRAM (fast on-chip memory), computing attention without ever materialising the full matrix. Reduces memory from O(n²) to O(n) while being 2-4x faster. FlashAttention-2 improved thread work partitioning. FlashAttention-3 (NeurIPS 2024) targets Hopper GPUs (H100) with asynchronous execution and FP8 support, achieving 85% of theoretical max FLOPs.
PagedAttention — vLLM’s innovation. Manages KV cache like virtual memory pages instead of contiguous blocks. Eliminates memory fragmentation when serving many concurrent requests with different sequence lengths. Enables near-zero waste of GPU memory for KV cache.
Speculative Decoding — A small, fast “draft” model generates several candidate tokens. The main model then verifies all of them in a single forward pass (which is parallelisable, unlike sequential generation). Accepted tokens are kept; on rejection, fall back to the main model’s token. Delivers up to 2-3x speedup with identical output quality. Intel & Weizmann (ICML 2025) showed any draft model works regardless of vocabulary differences.
Continuous Batching — Instead of waiting for all requests in a batch to finish, new requests are dynamically inserted as others complete. Maximises GPU utilisation and throughput.
Prefix Caching — Cache KV states for common prefixes (system prompts) across requests. SGLang’s RadixAttention stores these in a radix tree, enabling up to 5x speedup for workloads with shared system prompts.
Reinforcement Learning for LLMs
Pre-training learns to predict text. Post-training uses RL to make models follow instructions, be helpful, and be safe. See Fine-tuning for practical implementation details.
Why RL?
SFT teaches format and style by imitating demonstrations, but it can’t easily optimise for outcomes. RL optimises a policy (the model) against a reward signal, enabling the model to discover strategies not present in the training data — including novel reasoning chains and self-correction behaviours.
RLHF Pipeline
- SFT — Fine-tune the base model on curated instruction-response pairs
- Reward model — Train a separate model on human pairwise comparisons (“output A is better than output B”). The reward model scores arbitrary outputs by how much a human would prefer them.
- RL optimisation — Use PPO to optimise the SFT model against the reward model, with a KL divergence penalty to prevent the model from drifting too far from the SFT policy
PPO (Proximal Policy Optimisation)
The standard RL algorithm for RLHF. Clips the policy update ratio to a small range (typically $[1-\epsilon, 1+\epsilon]$ where $\epsilon \approx 0.2$), preventing catastrophically large updates that could destabilise training. Uses a value model (critic) to estimate expected future reward.
DPO (Direct Preference Optimisation)
Bypasses the reward model entirely. Directly optimises the policy from preference pairs by reparameterising the RLHF objective into a classification loss over preferred vs dispreferred outputs. Simpler, cheaper (no reward model or RL loop), increasingly popular.
GRPO (Group Relative Policy Optimisation)
DeepSeek’s approach. Samples a group of $G$ outputs per prompt, scores them with a reward function, then computes advantages relative to the group mean. No separate critic/value model needed — the group itself serves as the baseline, heavily reducing VRAM. Key enabler of DeepSeek-R1’s reasoning capabilities.
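The group-relative advantage is essentially one line (a common formulation also divides by the group standard deviation, as here):

```python
import numpy as np

def grpo_advantages(rewards):
    """Advantage of each sampled output relative to its group:
    A_i = (r_i - mean(r)) / std(r). The group mean replaces a learned critic."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# G = 4 outputs sampled for one prompt, scored by a verifier (1 = correct)
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])   # correct outputs get positive advantage
```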
Beyond GRPO (2025+)
- DAPO (ByteDance) — Decoupled clipping and dynamic sampling. Relaxes GRPO’s upper clip bound to maintain exploration, resamples extreme-reward prompts, and penalises verbose wrong answers.
- Dr. GRPO — Removes response-length normalisation in advantage computation, preventing the model from being rewarded for long wrong answers.
- REINFORCE++ — Efficient RLHF with improved robustness to prompt and reward model variations.
- RLVR — RL from Verifiable Rewards. Uses programmatically verifiable rewards (maths proofs, code tests) rather than a learned reward model. Non-gameable, enabling much longer training runs. The dominant post-training technique of 2025.
Key Challenges
- Reward hacking — Models learn to exploit the reward model rather than genuinely improving. e.g., producing verbose, confident-sounding but incorrect outputs that score highly.
- Constitutional AI — Anthropic’s approach: the model self-critiques against a set of principles, then revises its output. Reduces reliance on human labellers while maintaining alignment.
- Process vs outcome reward — Outcome reward models judge only the final answer. Process reward models (PRMs) score each reasoning step, providing denser training signal. PRMs significantly improve maths and reasoning tasks but are expensive to annotate.
The Mathematics
Key mathematical foundations. See Machine Learning for gradient descent and backpropagation fundamentals.
Softmax
Converts a vector of logits into a probability distribution:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Properties: outputs are positive and sum to 1. Amplifies differences — larger logits get disproportionately more probability mass. Temperature $T$ controls this sharpness: $\text{softmax}(z_i/T)$, where higher $T$ flattens the distribution and lower $T$ sharpens it.
Cross-Entropy Loss
Measures divergence between the model’s predicted distribution $q$ and the true distribution $p$ (a one-hot vector in next-token prediction):

$$H(p, q) = -\sum_i p_i \log q_i$$

For next-token prediction, this simplifies to $-\log q_c$ where $c$ is the correct token — the negative log probability of the right answer.
Perplexity
The exponential of the average cross-entropy loss. Intuition: “how surprised is the model?” A perplexity of 1 means perfect prediction. Lower is better.
Layer Normalisation
Normalises activations across the feature dimension (not the batch dimension, unlike batch normalisation):

$$\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$ and $\sigma^2$ are the mean and variance across features, and $\gamma$, $\beta$ are learned scale and shift parameters. Critical for training stability in deep transformers. Modern models use RMSNorm (root mean square normalisation), which drops the mean centring for efficiency.
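Both normalisations fit in a few lines (NumPy sketch; the exact placement of `eps` in RMSNorm varies between implementations):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)            # mean over features
    var = x.var(axis=-1, keepdims=True)            # variance over features
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # Drops the mean centring: normalise by root mean square only
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([1.0, 2.0, 3.0, 4.0])
y = layer_norm(x, 1.0, 0.0)   # ~zero mean, ~unit variance across features
z = rms_norm(x, 1.0)
```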
Residual Connections
Skip connections around each sub-layer (attention and FFN). Without them, gradients vanish in deep networks. They also enable each layer to learn a “delta” — a modification to the representation — rather than a complete transformation, making optimisation easier.
Capabilities & Limitations
Emergent Capabilities
- In-context learning — Learning from examples in the prompt
- Test-Time Compute (System 2 Reasoning) — Allocating extra inference time for deep reasoning, search, and validation (e.g., OpenAI o3, DeepSeek R1). This unlocks “inference scaling laws” where performance scales logarithmically with tokens generated before answering.
- Chain-of-thought reasoning — Step-by-step problem solving
- Code generation and execution
- Multi-turn dialogue
- Tool use and function calling
Limitations
- Hallucinations — Generating plausible but false information
- Knowledge cutoff — Training data has a temporal boundary
- Context limitations — Can’t process arbitrarily long inputs
- Reasoning failures — Struggles with novel logical problems
- Bias — Reflects biases in training data
See AI Safety for mitigations and alignment techniques.
See Also
- Foundation Models — Current commercial and open-source models
- Model Serving — Inference, quantisation, deployment
- Machine Learning — Neural network fundamentals, backpropagation, gradient descent
- Fine-tuning — SFT, LoRA, QLoRA, alignment techniques
Resources
- The Illustrated Transformer
- Let’s build GPT — Andrej Karpathy’s walkthrough
- LLM Visualization — Interactive 3D visualisation
- Anthropic’s Core Views on AI Safety
Seminal Papers
- Attention Is All You Need (2017) — The original transformer
- BERT (2018) — Bidirectional pre-training
- GPT-2 (2019) — Showed scaling emergent capabilities
- Scaling Laws for Neural Language Models (2020) — Kaplan et al.
- Training language models to follow instructions with human feedback (InstructGPT) (2022) — RLHF for LLMs
- Training Compute-Optimal LLMs (Chinchilla) (2022) — Hoffmann et al.
- FlashAttention (2022) — Dao et al., IO-aware exact attention
- LLaMA (2023) — Open-weight model that changed the field
- Direct Preference Optimization (DPO) (2023) — RL without reward models
- Mixture of Experts Meets Instruction Tuning (2023)
- DeepSeekMath: GRPO (2024) — Group Relative Policy Optimisation
- FlashAttention-3 (2024) — Asynchrony and FP8 on Hopper GPUs
- DeepSeek-V3 Technical Report (2024) — MLA, MoE at scale
- Forgetting Transformer (FoX) (2025) — Softmax attention with a forget gate