Large Language Models (LLMs) are neural networks trained on vast text corpora to predict the next token in a sequence. Despite this simple objective, they exhibit emergent capabilities including reasoning, code generation, and general knowledge.

Architecture

Transformers

The transformer architecture (Vaswani et al., 2017) revolutionised NLP by replacing recurrence with self-attention, enabling parallel processing and better modelling of long-range dependencies.

Key components:

  • Self-attention — Allows each token to attend to all other tokens, computing relevance weights (see the sketch after this list)
  • Multi-head attention — Multiple attention heads capture different relationship types
  • Positional encoding — Injects sequence order information (sinusoidal or learned)
  • Feed-forward layers — Process each position independently after attention
  • Layer normalisation — Stabilises training
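
A minimal single-head self-attention sketch in NumPy. In a real transformer, Q, K, and V come from learned linear projections of the input; here the raw embeddings stand in for all three for brevity:

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # token-to-token relevance
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

# Toy example: 4 tokens with 8-dimensional embeddings
x = np.random.default_rng(0).normal(size=(4, 8))
out = self_attention(x, x, x)                        # self-attention: Q = K = V
```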

Decoder-only vs Encoder-Decoder

  • Decoder-only (GPT, Llama, Claude) — Autoregressive, predicts the next token, used for generation (a causal-mask sketch follows this list)
  • Encoder-decoder (T5, BART) — Encoder processes input, decoder generates output, used for translation/summarisation
  • Encoder-only (BERT) — Bidirectional, used for classification and embeddings
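
Decoder-only models enforce autoregression by adding a causal mask to the attention scores before the softmax, so each position can only see earlier tokens. A minimal sketch:

```python
import numpy as np

def causal_mask(n):
    """Additive mask: position i may attend only to positions j <= i."""
    return np.triu(np.full((n, n), -np.inf), k=1)  # -inf above the diagonal

print(causal_mask(3))
# [[  0. -inf -inf]
#  [  0.   0. -inf]
#  [  0.   0.   0.]]
```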

Tokenisation

Text must be converted to tokens before processing. Common approaches:

  • Byte-Pair Encoding (BPE) — Iteratively merges the most frequent adjacent symbol pairs, starting from characters (toy sketch after this list)
  • WordPiece — Similar to BPE, used by BERT
  • SentencePiece — Language-agnostic, operates on raw text without whitespace pre-tokenisation; supports BPE and unigram models
  • Tiktoken — OpenAI’s fast BPE implementation
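
A toy sketch of the BPE merge loop. Production tokenisers add byte-level fallback, pre-tokenisation, and special tokens, but the core idea is repeated pair merging:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in vocab:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        for w in vocab:                      # apply the merge everywhere
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

print(bpe_merges(["lower", "lowest", "newer", "wider"], 4))
```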

Tokenisation affects:

  • Context window utilisation (a more efficient tokeniser fits more content in the same window)
  • Multilingual performance (some tokenisers favour English)
  • Cost (APIs charge per token)
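
As a concrete illustration, token counting with tiktoken (assuming the package is installed; cl100k_base is the encoding used by GPT-4-era OpenAI models):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Tokenisation affects context usage and cost.")
print(len(tokens))         # number of billable tokens
print(enc.decode(tokens))  # lossless round-trip to the original text
```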

Context Windows

The context window is the maximum number of tokens a model can process at once.

  • GPT-4 family: 8K–128K tokens, depending on variant
  • Claude: 200K tokens
  • Gemini: 1M+ tokens

Longer contexts enable:

  • Processing entire codebases or documents
  • Extended conversations with memory
  • Complex reasoning chains

Trade-offs:

  • Attention is O(n²) in sequence length — longer contexts are computationally expensive (see the estimate after this list)
  • “Lost in the middle” — models may struggle with information in the middle of long contexts
  • Cost scales with context length
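
A back-of-the-envelope estimate of the quadratic cost, assuming the full attention score matrix is materialised at fp16. Optimised kernels such as FlashAttention avoid storing this matrix, but compute still scales with n²:

```python
# Memory for one naive fp16 attention score matrix (2 bytes per entry)
for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {2 * n * n / 1e9:,.1f} GB per head per layer")
```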

Training

Pre-training

Self-supervised learning on massive text corpora (web text, books, code). The model learns to predict the next token, acquiring:

  • Grammar and syntax
  • World knowledge
  • Reasoning patterns
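
The pre-training objective itself is ordinary cross-entropy over next-token predictions. A minimal NumPy sketch:

```python
import numpy as np

def next_token_loss(logits, targets):
    """Mean cross-entropy of predicting targets[t] from the logits at step t.

    logits:  (seq_len, vocab_size) raw model outputs
    targets: (seq_len,) integer ids of the actual next tokens
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```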

Post-training

Refining the base model for specific behaviours:

  • Supervised Fine-Tuning (SFT) — Training on curated instruction-response pairs
  • RLHF (Reinforcement Learning from Human Feedback) — Trains a reward model on human preference data, then optimises the policy against it with RL (typically PPO)
  • DPO (Direct Preference Optimisation) — Optimises directly on preference pairs, removing RLHF's separate reward model and RL loop (loss sketch after this list)
  • Constitutional AI — Self-critique and revision based on principles
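
To make the DPO idea concrete, here is its core loss from Rafailov et al. (2023), sketched over per-response log-probabilities. A full trainer would compute these with the policy and a frozen reference model; beta is the preference temperature:

```python
import numpy as np

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio minus rejected log-ratio))."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))
```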

Inference

Temperature

Controls randomness in token selection:

  • 0 — Deterministic, always picks the highest-probability token (greedy decoding)
  • 0.1–0.7 — Balanced creativity and coherence
  • 1.0+ — More random, creative, potentially incoherent
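
A minimal sketch of temperature scaling applied to raw logits before sampling:

```python
import numpy as np

def sample(logits, temperature=1.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))       # greedy decoding
    scaled = logits / temperature           # <1 sharpens, >1 flattens
    probs = np.exp(scaled - scaled.max())   # stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```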

Top-p (Nucleus Sampling)

Samples only from the smallest set of tokens whose cumulative probability exceeds p. With top-p = 0.9, sampling is restricted to the most likely tokens whose probabilities sum to at least 90%.

Top-k

Only considers the k most likely tokens at each step.
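
Minimal sketches of both filters over a probability vector, renormalising after truncation:

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                     # most likely first
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    return kept / kept.sum()

def top_k_filter(probs, k=50):
    """Keep only the k most likely tokens."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[::-1][:k]
    kept = np.zeros_like(probs)
    kept[keep] = probs[keep]
    return kept / kept.sum()
```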

Capabilities & Limitations

Emergent Capabilities

  • In-context learning — Learning from examples in the prompt
  • Chain-of-thought reasoning — Step-by-step problem solving
  • Code generation and execution
  • Multi-turn dialogue
  • Tool use and function calling

Limitations

  • Hallucinations — Generating plausible but false information
  • Knowledge cutoff — Training data has a temporal boundary
  • Context limitations — Can’t process arbitrarily long inputs
  • Reasoning failures — Struggles with novel logical problems
  • Bias — Reflects biases in training data

Resources