Large Language Models (LLMs) are neural networks trained on vast text corpora to predict the next token in a sequence. Despite this simple objective, they exhibit emergent capabilities including reasoning, code generation, and general knowledge.
Architecture
Transformers
The transformer architecture (Vaswani et al., 2017) revolutionised NLP by replacing recurrence with self-attention, enabling parallel processing and better modelling of long-range dependencies.
Key components:
- Self-attention — Allows each token to attend to all other tokens, computing relevance weights (see the sketch after this list)
- Multi-head attention — Multiple attention heads capture different relationship types
- Positional encoding — Injects sequence order information (sinusoidal or learned)
- Feed-forward layers — Process each position independently after attention
- Layer normalisation — Stabilises training
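A minimal single-head sketch of scaled dot-product self-attention in NumPy, with an optional causal mask of the kind decoder-only models apply. Production layers add multiple heads, learned output projections, and batching; the dimensions and weights here are toy values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v, causal=False):
    """Single-head scaled dot-product self-attention.
    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise relevance weights
    if causal:                                # mask future positions (decoder-only)
        future = np.triu(np.ones_like(scores, dtype=bool), 1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v                # weighted sum of value vectors

# Toy usage: 4 tokens, model width 8, head width 4
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v, causal=True).shape)  # (4, 4)
```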
Decoder-only vs Encoder-Decoder
- Decoder-only (GPT, Llama, Claude) — Autoregressive, predicts next token, used for generation
- Encoder-decoder (T5, BART) — Encoder processes input, decoder generates output, used for translation/summarisation
- Encoder-only (BERT) — Bidirectional, used for classification and embeddings
Tokenisation
Text must be converted to tokens before processing. Common approaches:
- Byte-Pair Encoding (BPE) — Iteratively merges frequent character pairs
- WordPiece — Similar to BPE, used by BERT
- SentencePiece — Language-agnostic; operates on raw text without requiring pre-tokenisation (whitespace is treated as an ordinary symbol), used by T5 and Llama
- Tiktoken — OpenAI’s fast BPE implementation
Tokenisation affects:
- Context window utilisation (a more efficient tokeniser fits more text into the same window)
- Multilingual performance (some tokenisers favour English)
- Cost (APIs charge per token; see the counting sketch below)
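As a concrete illustration, tiktoken (listed above) can count the tokens an input will consume; the price used below is a made-up figure, not a real quote.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a BPE vocabulary used by OpenAI models

text = "Tokenisation affects context usage and cost."
tokens = enc.encode(text)
print(len(tokens), "tokens:", tokens)  # integer token ids; the count is what APIs bill on

price_per_1k = 0.01  # hypothetical $/1K tokens, purely illustrative
print(f"~${len(tokens) / 1000 * price_per_1k:.5f} to send this input")
```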
Context Windows
The context window is the maximum number of tokens a model can process at once, covering both the input and the generated output.
- GPT-4: 8K–128K tokens
- Claude: 200K tokens
- Gemini: 1M+ tokens
Longer contexts enable:
- Processing entire codebases or documents
- Extended conversations with memory
- Complex reasoning chains
Trade-offs:
- Attention is O(n²) — longer contexts are computationally expensive (illustrated below)
- “Lost in the middle” — models may struggle with information in the middle of long contexts
- Cost scales with context length
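A back-of-the-envelope view of the quadratic term: the attention score matrix has n² entries per head per layer, so a 16× longer context means roughly 256× more attention compute.

```python
# Growth of the attention score matrix: n^2 entries per head, per layer
for n in (8_000, 32_000, 128_000):
    print(f"n = {n:>7,}: {n * n:>18,} score entries")

# 128K vs 8K context: (128_000 / 8_000) ** 2 = 256x more attention work
```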
Training
Pre-training
Self-supervised learning on massive text corpora (internet, books, code). The model learns to predict the next token (sketched after this list), acquiring:
- Grammar and syntax
- World knowledge
- Reasoning patterns
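Concretely, the pre-training objective is cross-entropy between the model's predicted next-token distribution and the token that actually follows. A NumPy sketch with made-up logits:

```python
import numpy as np

def next_token_loss(logits, targets):
    """Mean cross-entropy of predicting targets[t] from logits[t].
    logits: (seq_len, vocab_size); targets: (seq_len,) token ids."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 100))       # model outputs for 5 positions, vocab of 100
targets = rng.integers(0, 100, size=5)   # the tokens that actually came next
print(next_token_loss(logits, targets))  # high: random logits know nothing about the targets
```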
Post-training
Refining the base model for specific behaviours:
- Supervised Fine-Tuning (SFT) — Training on curated instruction-response pairs
- RLHF (Reinforcement Learning from Human Feedback) — Using human preferences to align model outputs
- DPO (Direct Preference Optimisation) — Simpler alternative to RLHF that optimises directly on preference pairs, with no separate reward model (loss sketched below)
- Constitutional AI — Self-critique and revision based on principles
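As an example of how lightweight DPO is compared with a full RLHF pipeline, its per-pair loss needs only four log-probabilities: the chosen and rejected responses scored under the policy and under a frozen reference model. A sketch with illustrative toy numbers:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair; each argument is the summed
    log-probability of a complete response under the given model."""
    chosen_margin = policy_chosen - ref_chosen        # log-ratio for the preferred response
    rejected_margin = policy_rejected - ref_rejected  # log-ratio for the dispreferred one
    margin = beta * (chosen_margin - rejected_margin)
    return -math.log(1 / (1 + math.exp(-margin)))     # -log sigmoid(margin)

# Toy log-probs: the policy already leans towards the chosen response,
# so the loss is below ln 2 (the value at zero margin)
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))  # ~0.598
```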
Inference
Temperature
Controls randomness in token selection (see the sampling sketch after this list):
- 0 — Deterministic, always picks highest probability token (greedy)
- 0.1–0.7 — Balanced creativity and coherence
- 1.0+ — More random, creative, potentially incoherent
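A sketch of how temperature reshapes the distribution before sampling; the logits are made up and the vocabulary has four tokens. Dividing by T < 1 sharpens the softmax, T > 1 flattens it, and T = 0 is treated as greedy argmax.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Sample one token id from temperature-scaled logits."""
    if temperature == 0:
        return int(np.argmax(logits))          # greedy decoding
    scaled = logits / temperature
    scaled -= scaled.max()                     # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0])       # made-up scores for a 4-token vocabulary
for t in (0, 0.3, 1.0, 2.0):
    draws = [sample_with_temperature(logits, t, rng) for _ in range(10)]
    print(t, draws)  # higher t -> draws spread across more of the vocabulary
```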
Top-p (Nucleus Sampling)
Samples from the smallest set of tokens whose cumulative probability reaches p. Top-p of 0.9 means sampling is restricted to the most likely tokens whose probabilities sum to at least 90%.
Top-k
Only considers the k most likely tokens at each step.
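Both filters can be sketched over a toy distribution; in practice they operate on the model's logits, usually combined with temperature, and the numbers below are illustrative.

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalise."""
    keep = np.argsort(probs)[-k:]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]              # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # first position where the sum reaches p
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_k_filter(probs, 2))    # only the two most likely tokens survive
print(top_p_filter(probs, 0.8))  # 0.5 + 0.25 < 0.8, adding 0.15 crosses it: three survive
```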
Capabilities & Limitations
Emergent Capabilities
- In-context learning — Learning from examples given in the prompt (example after this list)
- Chain-of-thought reasoning — Step-by-step problem solving
- Code generation and execution
- Multi-turn dialogue
- Tool use and function calling
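In-context learning, for instance, requires no weight updates: the task is demonstrated entirely inside the prompt. A hypothetical few-shot prompt for sentiment labelling (the format and examples are illustrative, not tied to any particular API):

```python
# Few-shot prompt: the model infers the task from the examples alone
examples = [
    ("The film was a masterpiece.", "positive"),
    ("I want those two hours back.", "negative"),
    ("Solid, if unremarkable.", "neutral"),
]
query = "An absolute joy from start to finish."

prompt = "Label the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"  # the model continues with a label

print(prompt)
```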
Limitations
- Hallucinations — Generating plausible but false information
- Knowledge cutoff — Training data has a temporal boundary
- Context limitations — Can’t process arbitrarily long inputs
- Reasoning failures — Struggles with novel logical problems
- Bias — Reflects biases in training data
Resources
- Attention Is All You Need — Original transformer paper
- The Illustrated Transformer
- Let’s build GPT — Andrej Karpathy’s walkthrough
- LLM Visualization — Interactive 3D visualisation
- Anthropic’s Core Views on AI Safety