Fine-tuning adapts a pre-trained model to specific tasks, domains, or behaviours by training on curated datasets. It modifies model weights rather than just providing context (like RAG or Prompt Engineering).
When to Fine-tune
Good candidates:
- Consistent style or tone requirements
- Domain-specific terminology or jargon
- Structured output formats
- Task-specific behaviour patterns
- Reducing prompt length for common tasks
Consider alternatives first:
- Prompt Engineering — Cheaper, faster iteration
- RAG — For knowledge that changes frequently
- Few-shot examples — If you have limited training data
Fine-tuning Approaches
Full Fine-tuning
Update all model parameters. Typically gives the best quality, but:
- Requires significant compute (multiple GPUs)
- Risk of catastrophic forgetting
- Creates a complete model copy
Parameter-Efficient Fine-tuning (PEFT)
Train only a small subset of parameters, keeping most weights frozen.
LoRA (Low-Rank Adaptation)
Injects trainable low-rank matrices into transformer layers. The original weights stay frozen.
Original: W (frozen)
LoRA: W + BA where B and A are small trainable matrices
Benefits:
- 10-100x fewer trainable parameters
- Can train on consumer GPUs
- Merge adapters back into base model
- Stack multiple adapters for different tasks
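A minimal sketch of attaching LoRA adapters with Hugging Face's peft library; the base model name and target module names are illustrative and depend on the architecture:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model is illustrative; target_modules differ between architectures.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=16,                                # rank of the low-rank matrices B and A
    lora_alpha=32,                       # scaling factor, often 2x the rank
    target_modules=["q_proj", "v_proj"], # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # wraps the frozen base with trainable adapters
model.print_trainable_parameters()          # shows how small the trainable fraction is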
QLoRA
Combines LoRA with quantisation:
- Quantise base model to 4-bit
- Train LoRA adapters in higher precision
- Dramatically reduces memory requirements
A 65B parameter model can be fine-tuned on a single 48GB GPU.
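A sketch of the QLoRA loading step with a 4-bit bitsandbytes configuration (model name illustrative); the LoRA adapters are then attached exactly as above:

import torch
from peft import prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantisation from the QLoRA paper
    bnb_4bit_use_double_quant=True,        # also quantise the quantisation constants
    bnb_4bit_compute_dtype=torch.bfloat16, # adapters and activations in higher precision
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # illustrative
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # stabilises training on the quantised base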
Other PEFT Methods
- Prefix tuning — Learn trainable prefix vectors prepended to the keys and values of each attention layer
- Adapters — Small bottleneck layers inserted between transformer layers
- IA³ — Rescales activations with learned vectors
Data Preparation
Dataset Quality
Quality matters more than quantity. A few hundred high-quality examples often beat thousands of mediocre ones.
Guidelines:
- Diverse examples covering edge cases
- Consistent format and style
- Accurate labels/responses
- Representative of real usage
Dataset Format
Most fine-tuning uses a chat-style instruction format:
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
  ]
}
Dataset Size Guidelines
- Minimum viable: 50-100 examples
- Good baseline: 500-1000 examples
- Production quality: 1000-10000+ examples
More data helps, but with diminishing returns.
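Before training, a quick sanity check over the JSONL file catches most formatting problems; the file path and the assertion rules below are illustrative:

import json

VALID_ROLES = {"system", "user", "assistant"}

with open("train.jsonl") as f:                  # illustrative path, one JSON object per line
    for i, line in enumerate(f, start=1):
        record = json.loads(line)               # fails loudly on malformed JSON
        messages = record["messages"]
        assert all(m["role"] in VALID_ROLES for m in messages), f"bad role on line {i}"
        assert messages[-1]["role"] == "assistant", f"line {i} must end with an assistant turn"
print("dataset looks well-formed")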
Training Process
Hyperparameters
- Learning rate — Typically 1e-5 to 5e-5 (lower than pre-training)
- Epochs — Usually 1-5 (overfitting risk with more)
- Batch size — As large as memory allows (use gradient accumulation to simulate larger batches)
- LoRA rank — 8-64 typical (higher = more capacity)
- LoRA alpha — Often 2x the rank
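As a rough illustration, these settings map onto Hugging Face training configuration along these lines; the values sit in the ranges above and are not recommendations for any particular model:

from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
training_args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-5,             # low, so the model stays close to the base weights
    num_train_epochs=3,             # 1-5; monitor validation loss for overfitting
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # simulates a larger effective batch size
    warmup_ratio=0.03,
    logging_steps=10,
)
# Hand both to a trainer, e.g. TRL's SFTTrainer (exact arguments vary by version).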
Avoiding Catastrophic Forgetting
The model may lose general capabilities when specialised. Mitigations:
- Mix in general instruction data
- Use lower learning rates
- Early stopping based on validation loss
- LoRA (frozen base weights limit how much can be forgotten)
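For the first mitigation above, a simple recipe is to blend a slice of a general instruction dataset into the task data; the dataset files and the 4:1 mixing ratio are illustrative:

from datasets import concatenate_datasets, load_dataset

task_ds = load_dataset("json", data_files="train.jsonl", split="train")
general_ds = load_dataset("json", data_files="general_instructions.jsonl", split="train")

n_general = min(len(general_ds), len(task_ds) // 4)   # roughly 4:1 task-to-general
mixed = concatenate_datasets([
    task_ds,
    general_ds.shuffle(seed=42).select(range(n_general)),
]).shuffle(seed=42)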
Evaluation
- Hold out a test set (never seen during training)
- Compare against base model on both task-specific and general benchmarks
- Human evaluation for subjective quality
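Carving out the held-out set before training is a one-liner with the datasets library; the file name and 10% split are illustrative:

from datasets import load_dataset

ds = load_dataset("json", data_files="all_examples.jsonl", split="train")
splits = ds.train_test_split(test_size=0.1, seed=42)  # 10% never seen during training
train_ds, test_ds = splits["train"], splits["test"]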
Platforms & Tools
Cloud Services
- OpenAI Fine-tuning — GPT-3.5/4
- Anthropic Fine-tuning — Claude
- Google Vertex AI — Gemini
- Together AI — Open-source models
- Fireworks AI — Hosted fine-tuning for open models
Self-hosted
- Hugging Face Transformers — Industry standard
- Axolotl — Streamlined fine-tuning
- LLaMA-Factory — GUI and CLI for fine-tuning
- Unsloth — 2x faster, 50% less memory
Frameworks
- PEFT — Hugging Face’s PEFT library
- TRL — Transformer Reinforcement Learning
- DeepSpeed — Distributed training optimisation
Advanced Topics
Alignment Fine-tuning
Beyond task-specific tuning, alignment makes models helpful, harmless, and honest.
- SFT (Supervised Fine-tuning) — Train on human-written responses
- RLHF (Reinforcement Learning from Human Feedback) — Use human preferences to train a reward model, then optimise the LLM against it
- DPO (Direct Preference Optimisation) — Skip the reward model, directly optimise on preferences
- ORPO — Combines SFT and preference alignment in one step
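Preference-based methods (RLHF reward modelling, DPO, ORPO) train on paired comparisons rather than single target responses. A typical record, using the prompt/chosen/rejected field names expected by TRL's preference trainers, looks like this (the example itself is illustrative):

{
  "prompt": "What is the capital of France?",
  "chosen": "The capital of France is Paris.",
  "rejected": "I believe it might be Lyon."
}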
Continued Pre-training
Extend pre-training on domain-specific corpora before instruction fine-tuning. Useful for:
- Specialised domains (legal, medical, scientific)
- Non-English languages
- Proprietary knowledge bases
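A minimal sketch of continued pre-training on raw domain text with the standard causal language-modelling objective; the corpus file, model name, and sequence length are illustrative:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"        # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

raw = load_dataset("text", data_files="domain_corpus.txt", split="train")
tokenised = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM, no masking
trainer = Trainer(model=model, args=TrainingArguments(output_dir="out"),
                  train_dataset=tokenised, data_collator=collator)
trainer.train()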
Merging Models
Combine multiple fine-tuned models:
- Linear interpolation — Average weights
- TIES — Trim, elect sign, and merge
- DARE — Drop and rescale
- Mergekit — Toolkit that implements these and other merge methods
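As a toy illustration of the simplest case, linear interpolation just averages matching tensors from two checkpoints of the same architecture; the paths and 50/50 weighting are illustrative, and tools like mergekit implement this and the more robust methods above:

import torch

# Two fine-tuned checkpoints saved from the same base architecture (paths illustrative).
state_a = torch.load("model_a.pt", map_location="cpu")
state_b = torch.load("model_b.pt", map_location="cpu")

alpha = 0.5  # interpolation weight between the two models
merged = {name: alpha * state_a[name] + (1 - alpha) * state_b[name] for name in state_a}
torch.save(merged, "merged_model.pt")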