Fine-tuning adapts a pre-trained model to specific tasks, domains, or behaviours by training it further on curated datasets. Unlike RAG or Prompt Engineering, which supply context at inference time, fine-tuning modifies the model's weights.

When to Fine-tune

Good candidates:

  • Consistent style or tone requirements
  • Domain-specific terminology or jargon
  • Structured output formats
  • Task-specific behaviour patterns
  • Reducing prompt length for common tasks

Consider alternatives first:

  • Prompt Engineering — Cheaper, faster iteration
  • RAG — For knowledge that changes frequently
  • Few-shot examples — If you have limited training data

Fine-tuning Approaches

Full Fine-tuning

Update all model parameters. This typically gives the strongest results, but:

  • Requires significant compute (multiple GPUs)
  • Risk of catastrophic forgetting
  • Creates a complete model copy

Parameter-Efficient Fine-tuning (PEFT)

Train only a small subset of parameters, keeping most weights frozen.

LoRA (Low-Rank Adaptation)

Injects trainable low-rank matrices into transformer layers. The original weights stay frozen.

Original: W (frozen, d×k)
LoRA:     W + BA, where B (d×r) and A (r×k) are small trainable matrices and the rank r is much smaller than d and k
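
To make the decomposition concrete, here is a minimal sketch of a LoRA-wrapped linear layer in PyTorch. The class name, rank, and scaling values are illustrative; real implementations such as Hugging Face's peft add dropout and per-module targeting:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update W + BA."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # W (and bias) stay frozen
        # A projects down to rank r, B projects back up. B starts at zero,
        # so training begins from the unmodified base model.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling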

Benefits:

  • Orders of magnitude fewer trainable parameters (often under 1% of the model)
  • Can train on consumer GPUs
  • Merge adapters back into the base model (sketched after this list)
  • Stack multiple adapters for different tasks
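
Merging an adapter back into the base model (the third benefit above) is a single call in Hugging Face's peft; the model name and adapter path are placeholders:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")          # placeholder

merged = model.merge_and_unload()      # folds BA into W; result is a plain model
merged.save_pretrained("merged-model")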

QLoRA

Combines LoRA with quantisation:

  1. Quantise the base model to 4-bit
  2. Train LoRA adapters in higher precision on top of it

This dramatically reduces memory requirements: a 65B parameter model can be fine-tuned on a single 48GB GPU.
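
A minimal sketch of this recipe using transformers, bitsandbytes, and peft. The model name and LoRA settings are placeholders, and exact arguments vary across library versions:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Step 1: quantise the base model to 4-bit NF4 as it is loaded
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder model name
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Step 2: attach LoRA adapters, which are trained in higher precision
lora_config = LoraConfig(r=16, lora_alpha=32,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of all weights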

Other PEFT Methods

  • Prefix tuning — Learn soft prompt vectors prepended to the hidden states at each layer
  • Adapters — Small bottleneck layers inserted between transformer layers
  • IA³ — Rescales activations with learned vectors
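
In Hugging Face's peft these methods are drop-in replacements for LoraConfig; a brief sketch, with illustrative values:

from peft import PrefixTuningConfig, IA3Config

# Prefix tuning: 20 trainable soft-prompt vectors per layer
prefix_config = PrefixTuningConfig(num_virtual_tokens=20, task_type="CAUSAL_LM")

# IA³: learned rescaling vectors over attention and FFN activations
ia3_config = IA3Config(task_type="CAUSAL_LM")

# Either config is passed to get_peft_model() exactly like a LoraConfig.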

Data Preparation

Dataset Quality

Quality matters more than quantity. A few hundred high-quality examples often beat thousands of mediocre ones.

Guidelines:

  • Diverse examples covering edge cases
  • Consistent format and style
  • Accurate labels/responses
  • Representative of real usage

Dataset Format

Most fine-tuning uses an instruction (chat) format:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
  ]
}
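
In practice the training file is usually JSONL, one record per line, and the messages are rendered with the model's chat template at training time. A sketch assuming Hugging Face transformers; the file and model names are placeholders:

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # placeholder

# Each line of train.jsonl is one {"messages": [...]} record like the one above.
with open("train.jsonl") as f:
    examples = [json.loads(line) for line in f]

# Render the messages into the model's expected prompt string.
text = tokenizer.apply_chat_template(examples[0]["messages"], tokenize=False)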

Dataset Size Guidelines

  • Minimum viable: 50-100 examples
  • Good baseline: 500-1000 examples
  • Production quality: 1000-10000+ examples

More data helps, but with diminishing returns.

Training Process

Hyperparameters

  • Learning rate — Typically 1e-5 to 5e-5 (lower than pre-training)
  • Epochs — Usually 1-5 (overfitting risk with more)
  • Batch size — As large as memory allows
  • LoRA rank — 8-64 typical (higher = more capacity)
  • LoRA alpha — Often 2x the rank
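
These settings map directly onto the usual config objects; a sketch with peft and transformers, using mid-range values from the list above:

from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,            # LoRA rank: 8-64 typical
    lora_alpha=32,   # often 2x the rank
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="checkpoints",
    learning_rate=2e-5,               # within the 1e-5 to 5e-5 range
    num_train_epochs=3,               # 1-5; more risks overfitting
    per_device_train_batch_size=4,    # as large as memory allows
    gradient_accumulation_steps=8,    # simulates a larger effective batch
)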

Avoiding Catastrophic Forgetting

The model may lose general capabilities when specialised. Mitigations:

  • Mix in general instruction data (see the sketch after this list)
  • Use lower learning rates
  • Early stopping based on validation loss
  • LoRA (inherently preserves base knowledge)
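
The first mitigation can be as simple as interleaving the task data with a general instruction dataset. A sketch with Hugging Face datasets; the file names and the 80/20 ratio are illustrative, and both files are assumed to share the same schema:

from datasets import load_dataset, interleave_datasets

task_ds = load_dataset("json", data_files="task.jsonl", split="train")
general_ds = load_dataset("json", data_files="general.jsonl", split="train")

# Draw ~80% task examples and ~20% general examples during training.
mixed = interleave_datasets([task_ds, general_ds],
                            probabilities=[0.8, 0.2], seed=42)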

Evaluation

  • Hold out a test set never seen during training (sketched after this list)
  • Compare against base model on both task-specific and general benchmarks
  • Human evaluation for subjective quality
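
Creating the held-out split is one call with Hugging Face datasets; the file name and split size are placeholders:

from datasets import load_dataset

ds = load_dataset("json", data_files="all_examples.jsonl", split="train")

# 90/10 split; only look at the test set after training is finished.
splits = ds.train_test_split(test_size=0.1, seed=42)
train_ds, test_ds = splits["train"], splits["test"]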

Platforms & Tools

Cloud Services

  • OpenAI fine-tuning API — Managed fine-tuning of GPT models
  • Google Vertex AI — Managed model tuning on Google Cloud
  • AWS Bedrock — Model customisation on AWS

Self-hosted

  • Axolotl — Config-driven fine-tuning built on the Hugging Face stack
  • Unsloth — Memory-efficient LoRA/QLoRA training
  • LLaMA-Factory — Fine-tuning toolkit for many open models

Frameworks

  • PEFT — Hugging Face's parameter-efficient fine-tuning library (LoRA, prefix tuning, IA³, and more)
  • TRL — Hugging Face's training library for SFT, reward modelling, RLHF, and DPO
  • DeepSpeed — Distributed training optimisation

Advanced Topics

Alignment Fine-tuning

Beyond task-specific tuning, alignment makes models helpful, harmless, and honest.

  • SFT (Supervised Fine-tuning) — Train on human-written responses
  • RLHF (Reinforcement Learning from Human Feedback) — Use human preferences to train a reward model, then optimise the LLM against it
  • DPO (Direct Preference Optimisation) — Skip the reward model and optimise directly on preference pairs
  • ORPO (Odds Ratio Preference Optimisation) — Combines SFT and preference alignment in one step
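
Preference-based methods (RLHF, DPO, ORPO) train on paired comparisons rather than single responses. A typical record, using the prompt/chosen/rejected field names that TRL expects:

{
  "prompt": "What is the capital of France?",
  "chosen": "The capital of France is Paris.",
  "rejected": "France's capital is Lyon."
}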

Continued Pre-training

Extend pre-training on domain-specific corpora before instruction fine-tuning. Useful for:

  • Specialised domains (legal, medical, scientific)
  • Non-English languages
  • Proprietary knowledge bases

Merging Models

Combine multiple fine-tuned models:

  • Linear interpolation — Average the models' weights (sketched after this list)
  • TIES — Trim redundant deltas, elect a sign, and merge
  • DARE — Randomly drop delta parameters and rescale the rest
  • mergekit — Open-source toolkit implementing these and other merge methods
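
Linear interpolation is the simplest of these. A sketch that averages two checkpoints of the same architecture with PyTorch; the file paths and 50/50 ratio are placeholders, and mergekit automates this and the other methods above:

import torch

a = torch.load("model_a.bin", map_location="cpu")  # state dicts of two fine-tunes
b = torch.load("model_b.bin", map_location="cpu")  # of the same base model

# Weighted average of every parameter tensor.
merged = {name: 0.5 * a[name] + 0.5 * b[name] for name in a}
torch.save(merged, "merged.bin")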
