Reference notes.
Large Language Models (LLMs) are neural networks trained on vast text corpora to predict the next token in a sequence. Despite this simple objective, they exhibit emergent capabilities including reasoning, code generation, and general knowledge.
Architecture
Transformers
The transformer architecture (Vaswani et al., 2017) revolutionised NLP by replacing recurrence with self-attention, enabling parallel processing and better modelling of long-range dependencies.
Key components:
- Self-attention — Allows each token to attend to all other tokens, computing relevance weights
- Multi-head attention — Multiple attention heads capture different relationship types
- Positional encoding — Injects sequence order information (sinusoidal or learned)
- Feed-forward layers — Process each position independently after attention
- Layer normalisation — Stabilises training
- Residual connections — Skip connections around each sub-layer, enabling gradient flow through deep networks
Decoder-only vs Encoder-Decoder
- Decoder-only (GPT, Llama, Claude) — Autoregressive, predicts next token, used for generation
- Encoder-decoder (T5, BART) — Encoder processes input, decoder generates output, used for translation/summarisation
- Encoder-only (BERT) — Bidirectional, used for classification and embeddings
Attention Mechanism — The Maths
Scaled Dot-Product Attention
Each token produces three vectors from its embedding via learned weight matrices $W_Q$, $W_K$, $W_V$:
- Query (Q) — “What am I looking for?”
- Key (K) — “What do I contain?”
- Value (V) — “What information do I provide?”
The attention function computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors. The dot product $QK^\top$ produces an $n \times n$ matrix of raw attention scores between every pair of tokens.
Why scale by $\sqrt{d_k}$? — As $d_k$ grows, dot products grow in magnitude, pushing softmax into regions with extremely small gradients (saturation). Dividing by $\sqrt{d_k}$ keeps the variance of the dot products roughly at 1 regardless of dimension, ensuring softmax produces useful gradients during training.
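For intuition, the whole operation fits in a few lines of NumPy (a minimal sketch — single sequence, no masking or batching; names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) raw attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d_v) weighted values

# Toy example: 3 tokens, d_k = d_v = 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value vectors.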
Multi-Head Attention (MHA)
Rather than computing a single attention function, MHA runs $h$ attention heads in parallel, each with its own learned projections:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O \quad \text{where} \quad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
Each head learns to attend to different aspects — one head might track syntactic relationships, another semantic similarity, another positional patterns.
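The split–project–concatenate pattern can be sketched as follows (assumes the model dimension is divisible by $h$; no masking; all names illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, h):
    n, d = x.shape
    d_h = d // h
    # Project, then split the model dimension into h heads: (h, n, d_h)
    Q = (x @ W_q).reshape(n, h, d_h).transpose(1, 0, 2)
    K = (x @ W_k).reshape(n, h, d_h).transpose(1, 0, 2)
    V = (x @ W_v).reshape(n, h, d_h).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)  # (h, n, n)
    heads = softmax(scores) @ V                       # (h, n, d_h)
    concat = heads.transpose(1, 0, 2).reshape(n, d)   # Concat(head_1..h)
    return concat @ W_o                               # final output projection

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))                           # 5 tokens, d_model = 8
W = [rng.normal(size=(8, 8)) * 0.1 for _ in range(4)]
y = multi_head_attention(x, *W, h=2)
```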
Efficiency Variants
Standard MHA requires storing separate K and V for every head, which dominates memory during inference (the KV cache). Modern models use variants that reduce this cost:
- Multi-Query Attention (MQA) — All query heads share a single K and V head. Drastically reduces KV cache size (~8x for 8-head models) at a small quality cost. Used in PaLM, Falcon.
- Grouped-Query Attention (GQA) — Compromise between MHA and MQA. Query heads are divided into groups, each sharing one K/V head. Achieves most of MQA’s memory savings with near-MHA quality. Now the default in Llama 3, Mistral, and most modern open-weight models.
- Multi-Head Latent Attention (MLA) — DeepSeek’s approach. Compresses K and V into a low-rank latent representation shared across heads, then projects back to per-head keys and values. Achieves aggressive KV cache compression while retaining full MHA expressiveness. Used in DeepSeek-V2 and V3.
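The MHA/MQA/GQA spectrum reduces to how many KV heads are stored. A GQA sketch, under the simplifying assumption that consecutive query heads share one KV head — note K and V carry only `n_kv_heads` heads, which is exactly where the cache saving comes from:

```python
import numpy as np

def grouped_query_attention(Q, K, V, n_q_heads, n_kv_heads):
    """Q: (n_q_heads, n, d); K, V: (n_kv_heads, n, d).
    Each group of n_q_heads // n_kv_heads query heads shares one K/V head."""
    group = n_q_heads // n_kv_heads
    K = np.repeat(K, group, axis=0)   # broadcast each KV head to its query group
    V = np.repeat(V, group, axis=0)
    d = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(2)
Q = rng.normal(size=(8, 4, 16))       # 8 query heads
K = rng.normal(size=(2, 4, 16))       # only 2 KV heads -> 4x smaller KV cache
V = rng.normal(size=(2, 4, 16))
out = grouped_query_attention(Q, K, V, 8, 2)
```

With `n_kv_heads == n_q_heads` this is MHA; with `n_kv_heads == 1` it is MQA.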
Cross-Attention
Used in encoder-decoder models. The decoder’s queries attend to the encoder’s keys and values, allowing the decoder to “look at” the input sequence while generating output. Q comes from the decoder, K and V come from the encoder.
Forgetting Attention (FoX)
The Forgetting Transformer (Lin et al., ICLR 2025) adds a learned forget gate that down-weights attention scores in a data-dependent way. Outperforms standard transformers on long-context language modelling and length extrapolation, requires no positional embeddings, and is compatible with FlashAttention.
Positional Encoding
Transformers process all tokens in parallel — unlike RNNs, they have no inherent notion of sequence order. Positional encodings inject this information.
- Sinusoidal (original transformer) — Fixed sine and cosine functions of different frequencies. Position $pos$ at dimension pair $2i$, $2i{+}1$: $PE_{(pos, 2i)} = \sin\left(pos/10000^{2i/d_{\text{model}}}\right)$, $PE_{(pos, 2i+1)} = \cos\left(pos/10000^{2i/d_{\text{model}}}\right)$. Allows the model to learn relative positions through linear combinations.
- Learned embeddings — Trainable position vectors added to token embeddings. Simple but limited to the maximum length seen during training.
- RoPE (Rotary Position Embedding) — Rotates Q and K vectors by an angle proportional to their position. The dot product between Q and K then naturally depends on relative position. Preserves properties of the original attention mechanism. Used by Llama, Mistral, Qwen, and most modern models.
- ALiBi (Attention with Linear Biases) — Doesn’t modify embeddings at all. Instead, adds a position-dependent linear penalty to attention scores: tokens further apart get a larger negative bias. Simple and effective for length generalisation beyond training length.
- PaTH Attention (MIT, 2025) — Makes positional information adaptive and context-aware rather than static, modelling how meaning changes along the path between tokens. Outperforms RoPE on reasoning benchmarks.
Why it matters for length generalisation: Models trained at one context length often degrade at longer lengths. RoPE and ALiBi generalise better than learned embeddings because they encode relative rather than absolute positions.
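The original sinusoidal scheme is simple enough to sketch directly (NumPy; `sinusoidal_pe` is an illustrative name):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]           # even dimensions
    angles = pos / np.power(10000.0, i / d_model)   # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(128, 64)   # added to token embeddings before the first layer
```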
Mixture of Experts (MoE)
MoE replaces the standard feed-forward network (FFN) in each transformer layer with multiple parallel “expert” FFNs, of which only a few are activated per token. This decouples total parameter count from compute cost.
Architecture
Token → Router/Gate → Top-k experts selected → Expert FFNs → Weighted sum of outputs
The router (or gating network) is a small learned linear layer that produces a probability distribution over experts. For each token, it selects the top-$k$ experts and weights their outputs by the gating probabilities.
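A single-token routing sketch (illustrative names; real routers operate on batches and add noise or bias terms for load balancing, and the "experts" here are stand-in linear maps rather than full FFNs):

```python
import numpy as np

def moe_layer(x, router_W, experts, k=2):
    """x: (d,). Route one token to its top-k experts, weight by gate probs."""
    logits = x @ router_W                        # (n_experts,) router scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over experts
    top = np.argsort(probs)[-k:]                 # indices of the top-k experts
    gates = probs[top] / probs[top].sum()        # renormalise over selected
    return sum(g * experts[i](x) for g, i in zip(gates, top))

rng = np.random.default_rng(3)
d, n_experts = 8, 4
router_W = rng.normal(size=(d, n_experts))
Ws = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: x @ W for W in Ws]     # stand-ins for expert FFNs
y = moe_layer(rng.normal(size=d), router_W, experts, k=2)
```

Only `k` of the `n_experts` weight matrices are touched per token — that is the compute saving.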
Load Balancing
A naive router tends to collapse — sending all tokens to the same few experts while others go unused. Solutions:
- Auxiliary load-balancing loss — An extra loss term penalising uneven expert utilisation. Added to the main training loss.
- DeepSeek’s auxiliary-loss-free balancing — Adds learned bias terms to expert selection, adjusted during training when experts become overloaded. Fiddlier but performs better.
- Expert capacity limits — Cap the number of tokens each expert can process. Overflow tokens are dropped or routed to a fallback.
“Super Experts”
Research has identified a small subset of experts (“super experts”) that are disproportionately critical. Pruning them causes catastrophic performance collapse, while other experts can be removed with minimal impact. These super experts are the primary source of outlier activations in transformers.
Real Examples
| Model | Total Params | Active Params | Experts | Top-k |
|---|---|---|---|---|
| Mixtral 8x7B | 47B | 13B | 8 | 2 |
| DeepSeek-V3 | 671B | 37B | 256 | 8 |
| Llama 4 Scout | 109B | 17B | 16 | 1 |
| Llama 4 Maverick | 400B | 17B | 128 | 1 |
| Qwen3-235B | 235B | 22B | 128 | 8 |
| Kimi K2 | 1T | 32B | — | — |
Trade-offs
- More total parameters for the same active compute → higher capacity
- ~70% compute reduction versus dense models of similar capability
- Harder to train — load balancing, expert collapse, communication overhead
- Harder to serve — expert parallelism across GPUs, irregular memory access patterns
- Now dominant: over 60% of frontier model releases in 2025 use MoE
Alternative Architectures
While Transformers dominate, alternative architectures are gaining traction to address the quadratic scaling of attention:
- State Space Models (SSMs) (e.g., Mamba, Jamba) — Linear time complexity, constant memory inference, better handling of infinite context.
- Hybrid Models (e.g., Jamba, Qwen3-Next, Nemotron 3) — Interleave transformer layers with SSM or linear attention layers. State-space layers handle long-range structure efficiently; transformer layers handle precise recall. Jamba achieved a 256K context window on a single GPU.
- Diffusion LLMs (LLaDA, TiDAR) — Treat text generation as a denoising process, predicting multiple tokens per step in parallel rather than left-to-right.
- Linear Attention Hybrids (Gated DeltaNets) — Replace softmax attention with gated linear recurrences, scaling linearly with sequence length. Used in Qwen3-Next and Kimi Linear.
Despite these innovations, no sub-quadratic or hybrid model currently scores in the top 10 on LMSys — the “Transformer++” remains the default when compute is not a constraint.
Tokenisation
Text must be converted to tokens before processing. Common approaches:
- Byte-Pair Encoding (BPE) — Iteratively merges frequent character pairs. Start with individual characters, count adjacent pair frequencies, merge the most common pair into a new token, repeat until vocabulary size is reached. The merge order becomes the tokeniser’s ruleset.
- WordPiece — Similar to BPE, used by BERT
- SentencePiece — Language-agnostic; operates directly on raw text with no pre-tokenisation, treating whitespace as an ordinary symbol. Wraps BPE and unigram algorithms.
- Tiktoken — OpenAI’s fast BPE implementation
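The BPE merge loop described above, as a toy sketch on a four-word corpus (illustrative names; production tokenisers work on word frequencies and byte-level alphabets):

```python
from collections import Counter

def merge_pair(w, a, b):
    """Replace every adjacent (a, b) in the token list w with the merged token."""
    out, i = [], 0
    while i < len(w):
        if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
            out.append(a + b); i += 2
        else:
            out.append(w[i]); i += 1
    return out

def bpe_train(words, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent pair."""
    vocab = [list(w) for w in words]              # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in vocab:
            pairs.update(zip(w, w[1:]))           # count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]       # most frequent pair
        merges.append((a, b))
        vocab = [merge_pair(w, a, b) for w in vocab]
    return merges

merges = bpe_train(["lower", "lowest", "newer", "wider"], num_merges=3)
```

The recorded merge order is the tokeniser's ruleset: to encode new text, apply the same merges in the same order.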
Vocabulary size trade-offs — Typical range: 32K–128K tokens. Larger vocabularies mean more common words are single tokens (efficient context usage, better multilingual coverage) but increase the embedding table size and may cause rare tokens to be poorly learned. Smaller vocabularies are more token-efficient for common patterns but waste context on character-level tokenisation of uncommon words.
Tokenisation affects:
- Context window utilisation (more efficient = more content)
- Multilingual performance (some tokenisers favour English)
- Cost (APIs charge per token)
Context Windows
The context window is the maximum number of tokens a model can process at once.
- GPT-5.2: 400K tokens
- Claude Opus/Sonnet 4.6: 1M tokens
- Gemini 3 Pro: 1M tokens
- Llama 4 Scout: 10M tokens (largest available)
Longer contexts enable:
- Processing entire codebases or documents
- Extended conversations with memory
- Complex reasoning chains
Trade-offs:
- Attention is O(n²) — longer contexts are computationally expensive
- “Lost in the middle” — models may struggle with information in the middle of long contexts
- Cost scales with context length
Training
See Fine-tuning for adapting models to specific tasks after pre-training.
Data
Training data quality matters as much as scale. The pipeline:
- Collection — Web crawls (Common Crawl), books, code (GitHub, Stack), academic papers, Wikipedia, curated datasets
- Cleaning — Remove boilerplate, HTML, duplicates, low-quality pages, toxic content
- Deduplication — MinHash or exact dedup at document and paragraph level. Duplicates waste compute and can cause memorisation
- Quality filtering — Classifier-based scoring, perplexity filtering (remove text a language model finds too easy or too hard), heuristic rules
- Data mix — The ratio of web/books/code/academic/multilingual data. Heavily impacts downstream capabilities. Meta’s AutoMixer (2025) showed that checkpoint artefacts during training encode information about optimal data mixtures.
A 1B model trained on curated data can match a 3B model trained on raw web scrapes.
Pre-training
Self-supervised learning on massive text corpora. The model learns to predict the next token, acquiring grammar, world knowledge, and reasoning patterns.
Objective: Minimise cross-entropy loss between the predicted token distribution and the actual next token:

$$\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$$
Perplexity is the standard evaluation metric: $\text{PPL} = e^{\mathcal{L}}$. Lower is better. Intuition: “how many tokens is the model equally uncertain between?” A perplexity of 10 means the model is, on average, as uncertain as if choosing uniformly among 10 tokens.
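A worked check of the loss–perplexity relationship (hypothetical per-token probabilities):

```python
import math

# Cross-entropy per token is -log P(correct token); perplexity = exp(mean loss)
token_probs = [0.5, 0.25, 0.125, 0.5]            # model's P(actual next token)
losses = [-math.log(p) for p in token_probs]
ppl = math.exp(sum(losses) / len(losses))        # ≈ 3.36 here

# A model uniformly uncertain over 10 tokens has perplexity exactly 10
assert abs(math.exp(-math.log(1 / 10)) - 10) < 1e-9
```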
Post-training
Refining the base model for specific behaviours:
- Supervised Fine-Tuning (SFT) — Training on curated instruction-response pairs
- RLHF/DPO/GRPO — Alignment via reinforcement learning (see the RL for LLMs section below)
- RLVR (RL from Verifiable Rewards) — Training against automatically verifiable rewards (maths, code). The dominant new training stage in 2025 — models spontaneously develop reasoning strategies. Unlike RLHF, the reward is objective and non-gameable, allowing much longer optimisation runs. Most capability progress in 2025 came from longer RL runs rather than larger models.
See Fine-tuning for practical details on SFT, LoRA, QLoRA, and alignment techniques.
Distributed Training
Training large models requires splitting work across many GPUs. Key strategies:
- Data parallelism — Each GPU holds a full model copy, processes different data batches, gradients are averaged. Simple but memory-limited by model size.
- Tensor parallelism (TP) — Splits individual layers across GPUs on the same node. Each GPU computes part of a matrix multiplication. Low latency but requires fast interconnect (NVLink).
- Pipeline parallelism (PP) — Splits layers across GPUs. GPU 1 runs layers 1-20, GPU 2 runs 21-40, etc. Creates pipeline bubbles (idle time) but lower communication overhead.
- Expert parallelism (EP) — For MoE models, distributes experts across GPUs. Tokens are routed to the GPU holding their selected expert.
- ZeRO (Zero Redundancy Optimiser) — Partitions optimiser states (Stage 1), gradients (Stage 2), and parameters (Stage 3) across GPUs, eliminating redundant copies. Developed by DeepSpeed.
- FSDP (Fully Sharded Data Parallelism) — PyTorch’s implementation of ZeRO Stage 3. Shards model parameters, gradients, and optimiser states across all GPUs, only gathering them when needed for computation.
In practice, frontier models combine all of these. DeepSeek-V3 used TP within nodes, PP across nodes, and EP for expert distribution.
Compute Requirements
FLOPs estimation — A common approximation for a forward pass through a dense transformer:

$$C_{\text{forward}} \approx 2ND$$

where $N$ is parameter count and $D$ is the number of training tokens. The backward pass costs roughly twice the forward pass, so total training FLOPs $\approx 6ND$.
| Model | Parameters | Tokens | Estimated FLOPs | GPU Hours (approx) |
|---|---|---|---|---|
| Llama 3 8B | 8B | 15T | ~7×10²³ | ~30K H100 hours |
| Llama 3 70B | 70B | 15T | ~6×10²⁴ | ~250K H100 hours |
| Llama 3 405B | 405B | 15T | ~4×10²⁵ | ~1.5M H100 hours |
Frontier model training runs cost $500M+ in compute.
Training Stability
- Learning rate scheduling — Warmup phase (linear ramp from near-zero) followed by cosine decay. Warmup prevents early instability from large gradients; cosine decay helps convergence.
- Gradient clipping — Cap gradient norms to prevent explosion. Typical max norm: 1.0.
- Loss spikes — Sudden increases in loss during training. Can be caused by bad data batches, numerical instability, or learning rate issues. Often require rolling back to a checkpoint and skipping the problematic data.
- Mixed precision (bf16/fp16) — Train with lower precision to halve memory usage and increase throughput. bf16 (bfloat16) is preferred over fp16 because it has the same exponent range as fp32, avoiding overflow/underflow. Loss scaling is needed for fp16 but not bf16.
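The warmup-plus-cosine schedule can be written in a few lines (illustrative function name; the 3e-4 peak below is a hypothetical value):

```python
import math

def lr_schedule(step, warmup_steps, total_steps, peak_lr, min_lr=0.0):
    """Linear warmup from near-zero to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))       # decays 1 -> 0
    return min_lr + (peak_lr - min_lr) * cosine

# Peak at the end of warmup, fully decayed by the final step
assert lr_schedule(999, 1000, 10000, 3e-4) == 3e-4
assert abs(lr_schedule(10000, 1000, 10000, 3e-4)) < 1e-9
```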
Scaling Laws
Power-law relationships govern how model performance (loss) scales with compute, data, and parameters.
Kaplan et al. (2020) — Showed that loss decreases as a power law of compute, data size, and parameter count. Concluded that scaling parameters matters most — train large models on “enough” data.
Chinchilla (Hoffmann et al., 2022) — Challenged Kaplan by showing data and parameters should scale equally for compute-optimal training. For a given compute budget, the optimal model is smaller but trained on more data than Kaplan suggested. Implication: many existing models (GPT-3, PaLM) were undertrained on data.
Chinchilla’s impact: Led directly to Llama’s approach — smaller models (7B, 13B) trained on far more data (1-2T tokens) than prior practice, achieving disproportionately strong results.
Inference scaling laws — Test-time compute scaling (o1, R1). Performance improves logarithmically with the number of reasoning tokens generated before answering. This opened a second axis of scaling beyond training compute.
Densing Law (Xiao et al., 2025) — Capability density (capability per parameter) doubles roughly every 3.5 months. Equivalent performance can be achieved with exponentially fewer parameters over time, driven by better data, architectures (MoE), and training techniques.
Data exhaustion — Current consumption patterns suggest exhaustion of public text data by 2026–2028. Synthetic data generation is the primary proposed solution.
Inference
Temperature
Controls randomness in token selection:
- 0 — Deterministic, always picks highest probability token (greedy)
- 0.1–0.7 — Balanced creativity and coherence
- 1.0+ — More random, creative, potentially incoherent
Top-p (Nucleus Sampling)
Samples from the smallest set of most-likely tokens whose cumulative probability reaches p. Top-p of 0.9 means only the most likely tokens summing to 90% probability are considered.
Top-k
Only considers the k most likely tokens at each step.
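The three controls combine as in this sketch (simplified — real implementations typically mask logits to −∞ before the softmax, which is equivalent to the post-softmax filtering and renormalisation used here; names are illustrative):

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Greedy when temperature == 0; otherwise filter the distribution, then sample."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))                  # greedy decoding
    probs = np.exp(logits / temperature - np.max(logits / temperature))
    probs /= probs.sum()
    if top_k is not None:                              # keep the k most likely
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:                              # nucleus sampling
        order = np.argsort(probs)[::-1]
        keep = np.cumsum(probs[order]) <= top_p
        keep[0] = True                                 # always keep the top token
        mask = np.zeros_like(probs)
        mask[order[keep]] = 1.0
        probs *= mask
    probs /= probs.sum()                               # renormalise survivors
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))

token = sample([2.0, 1.0, 0.1], temperature=0.0)       # greedy -> index 0
```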
Key Optimisations
See Model Serving for deployment details, quantisation formats, and inference server comparison.
KV Cache — In autoregressive generation, each new token’s attention computation needs the keys and values of all previous tokens. Rather than recomputing them, we cache them. Memory cost: $2 \times n_{\text{layers}} \times n_{\text{kv heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{batch} \times \text{bytes per value}$. For a 70B model at fp16 with 32K context, the KV cache alone can exceed 10GB — often the primary memory bottleneck during inference.
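A quick sizing helper (the 80-layer / 8-KV-head / 128-dim config below is a hypothetical GQA 70B-class layout for illustration, not any model's published numbers):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """2x covers K and V; one entry per layer, per KV head, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical 70B-class GQA layout: 80 layers, 8 KV heads, head_dim 128,
# fp16 (2 bytes per value), 32K context, batch size 1
gb = kv_cache_bytes(80, 8, 128, 32_768, 1) / 1e9   # ≈ 10.7 GB
```

Note the linear dependence on `seq_len` and `n_kv_heads`: doubling context doubles the cache, and GQA/MQA shrink it by the ratio of query heads to KV heads.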
Flash Attention — Standard attention materialises the full attention matrix in HBM (GPU main memory), which is slow and O(n²) in memory. FlashAttention (Dao et al., 2022) tiles the computation into blocks that fit in SRAM (fast on-chip memory), computing attention without ever materialising the full matrix. Reduces memory from O(n²) to O(n) while being 2-4x faster. FlashAttention-2 improved thread work partitioning. FlashAttention-3 (NeurIPS 2024) targets Hopper GPUs (H100) with asynchronous execution and FP8 support, achieving 85% of theoretical max FLOPs.
PagedAttention — vLLM’s innovation. Manages KV cache like virtual memory pages instead of contiguous blocks. Eliminates memory fragmentation when serving many concurrent requests with different sequence lengths. Enables near-zero waste of GPU memory for KV cache.
Speculative Decoding — A small, fast “draft” model generates several candidate tokens. The main model then verifies all of them in a single forward pass (which is parallelisable, unlike sequential generation). Accepted tokens are kept; on rejection, fall back to the main model’s token. Delivers up to 2-3x speedup with identical output quality. Intel & Weizmann (ICML 2025) showed any draft model works regardless of vocabulary differences.
Continuous Batching — Instead of waiting for all requests in a batch to finish, new requests are dynamically inserted as others complete. Maximises GPU utilisation and throughput.
Prefix Caching — Cache KV states for common prefixes (system prompts) across requests. SGLang’s RadixAttention stores these in a radix tree, enabling up to 5x speedup for workloads with shared system prompts.
Reinforcement Learning for LLMs
Pre-training learns to predict text. Post-training uses RL to make models follow instructions, be helpful, and be safe. See Fine-tuning for practical implementation details.
Why RL?
SFT teaches format and style by imitating demonstrations, but it can’t easily optimise for outcomes. RL optimises a policy (the model) against a reward signal, enabling the model to discover strategies not present in the training data — including novel reasoning chains and self-correction behaviours.
RLHF Pipeline
- SFT — Fine-tune the base model on curated instruction-response pairs
- Reward model — Train a separate model on human pairwise comparisons (“output A is better than output B”). The reward model scores arbitrary outputs by how much a human would prefer them.
- RL optimisation — Use PPO to optimise the SFT model against the reward model, with a KL divergence penalty to prevent the model from drifting too far from the SFT policy
PPO (Proximal Policy Optimisation)
The standard RL algorithm for RLHF. Clips the policy update ratio to a small range (typically $[1-\epsilon, 1+\epsilon]$ where $\epsilon \approx 0.2$), preventing catastrophically large updates that could destabilise training. Uses a value model (critic) to estimate expected future reward.
DPO (Direct Preference Optimisation)
Bypasses the reward model entirely. Directly optimises the policy from preference pairs by reparameterising the RLHF objective into a classification loss over preferred vs dispreferred outputs. Simpler, cheaper (no reward model or RL loop), increasingly popular.
GRPO (Group Relative Policy Optimisation)
DeepSeek’s approach. Samples a group of $G$ outputs per prompt, scores them with a reward function, then computes advantages relative to the group mean. No separate critic/value model needed — the group itself serves as the baseline, heavily reducing VRAM. Key enabler of DeepSeek-R1’s reasoning capabilities.
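The group-relative advantage is essentially one line (a common formulation also divides by the group standard deviation, as here):

```python
import numpy as np

def grpo_advantages(rewards):
    """Advantage of each sampled output relative to its group:
    A_i = (r_i - mean(r)) / std(r). The group mean replaces a learned critic."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# G = 4 outputs sampled for one prompt, scored by a verifier (1 = correct)
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])   # correct outputs get positive advantage
```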
Beyond GRPO (2025+)
- DAPO (ByteDance) — Decoupled clipping and dynamic sampling. Relaxes GRPO’s upper clip bound to maintain exploration, resamples extreme-reward prompts, and penalises verbose wrong answers.
- Dr. GRPO — Removes response-length normalisation in advantage computation, preventing the model from being rewarded for long wrong answers.
- REINFORCE++ — Efficient RLHF with improved robustness to prompt and reward model variations.
- RLVR — RL from Verifiable Rewards. Uses programmatically verifiable rewards (maths proofs, code tests) rather than a learned reward model. Non-gameable, enabling much longer training runs. The dominant post-training technique of 2025.
Key Challenges
- Reward hacking — Models learn to exploit the reward model rather than genuinely improving. e.g., producing verbose, confident-sounding but incorrect outputs that score highly.
- Constitutional AI — Anthropic’s approach: the model self-critiques against a set of principles, then revises its output. Reduces reliance on human labellers while maintaining alignment.
- Process vs outcome reward — Outcome reward models judge only the final answer. Process reward models (PRMs) score each reasoning step, providing denser training signal. PRMs significantly improve maths and reasoning tasks but are expensive to annotate.
The Mathematics
Key mathematical foundations. See Machine Learning for gradient descent and backpropagation fundamentals.
Softmax
Converts a vector of logits into a probability distribution:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Properties: outputs are positive and sum to 1. Amplifies differences — larger logits get disproportionately more probability mass. Temperature $T$ controls this sharpness: $\text{softmax}(z_i/T)$, where higher $T$ flattens the distribution and lower $T$ sharpens it.
Cross-Entropy Loss
Measures divergence between the model’s predicted distribution $q$ and the true distribution $p$ (a one-hot vector in next-token prediction):

$$H(p, q) = -\sum_i p_i \log q_i$$

For next-token prediction, this simplifies to $-\log q_c$ where $c$ is the correct token — the negative log probability of the right answer.
Perplexity
The exponential of the average cross-entropy loss. Intuition: “how surprised is the model?” A perplexity of 1 means perfect prediction. Lower is better.
Layer Normalisation
Normalises activations across the feature dimension (not the batch dimension, unlike batch normalisation):

$$\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$ and $\sigma^2$ are the mean and variance across features, and $\gamma$, $\beta$ are learned scale and shift parameters. Critical for training stability in deep transformers. Modern models use RMSNorm (root mean square normalisation), which drops the mean centring for efficiency.
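Both normalisations fit in a few lines (NumPy sketch; the exact placement of `eps` in RMSNorm varies between implementations):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)            # mean over features
    var = x.var(axis=-1, keepdims=True)            # variance over features
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # Drops the mean centring: normalise by root mean square only
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([1.0, 2.0, 3.0, 4.0])
y = layer_norm(x, 1.0, 0.0)   # ~zero mean, ~unit variance across features
z = rms_norm(x, 1.0)
```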
Residual Connections
Skip connections around each sub-layer (attention and FFN). Without them, gradients vanish in deep networks. They also enable each layer to learn a “delta” — a modification to the representation — rather than a complete transformation, making optimisation easier.
Capabilities & Limitations
Emergent Capabilities
- In-context learning — Learning from examples in the prompt
- Test-Time Compute (System 2 Reasoning) — Allocating extra inference time for deep reasoning, search, and validation (e.g., OpenAI o3, DeepSeek R1). This unlocks “inference scaling laws” where performance scales logarithmically with tokens generated before answering.
- Chain-of-thought reasoning — Step-by-step problem solving
- Code generation and execution
- Multi-turn dialogue
- Tool use and function calling
Limitations
- Hallucinations — Generating plausible but false information
- Knowledge cutoff — Training data has a temporal boundary
- Context limitations — Can’t process arbitrarily long inputs
- Reasoning failures — Struggles with novel logical problems
- Bias — Reflects biases in training data
See AI Safety for mitigations and alignment techniques.
See Also
- Foundation Models — Current commercial and open-source models
- Model Serving — Inference, quantisation, deployment
- Machine Learning — Neural network fundamentals, backpropagation, gradient descent
- Fine-tuning — SFT, LoRA, QLoRA, alignment techniques
Resources
- The Illustrated Transformer
- Let’s build GPT — Andrej Karpathy’s walkthrough
- LLM Visualization — Interactive 3D visualisation
- Anthropic’s Core Views on AI Safety
Seminal Papers
- Attention Is All You Need (2017) — The original transformer
- BERT (2018) — Bidirectional pre-training
- GPT-2 (2019) — Showed scaling emergent capabilities
- Scaling Laws for Neural Language Models (2020) — Kaplan et al.
- Training language models to follow instructions with human feedback (InstructGPT) (2022) — RLHF for LLMs
- Training Compute-Optimal LLMs (Chinchilla) (2022) — Hoffmann et al.
- FlashAttention (2022) — Dao et al., IO-aware exact attention
- LLaMA (2023) — Open-weight model that changed the field
- Direct Preference Optimization (DPO) (2023) — RL without reward models
- Mixture of Experts Meets Instruction Tuning (2023)
- DeepSeekMath: GRPO (2024) — Group Relative Policy Optimisation
- FlashAttention-3 (2024) — Asynchrony and FP8 on Hopper GPUs
- DeepSeek-V3 Technical Report (2024) — MLA, MoE at scale
- Forgetting Transformer (FoX) (2025) — Softmax attention with a forget gate