Fine-tuning adapts a pre-trained model to specific tasks, domains, or behaviours by training on curated datasets. It modifies model weights rather than just providing context (like RAG or Prompt Engineering).
When to Fine-tune
Good candidates:
- Consistent style or tone requirements
- Domain-specific terminology or jargon
- Structured output formats
- Task-specific behaviour patterns
- Reducing prompt length for common tasks
Consider alternatives first:
- Prompt Engineering — Cheaper, faster iteration
- RAG — For knowledge that changes frequently
- Few-shot examples — If you have limited training data
Fine-tuning Approaches
Full Fine-tuning
Update all model parameters. Typically gives the best quality, but:
- Requires significant compute (multiple GPUs)
- Risk of catastrophic forgetting
- Creates a complete model copy
Parameter-Efficient Fine-tuning (PEFT)
Train only a small subset of parameters, keeping most weights frozen.
LoRA (Low-Rank Adaptation)
Injects trainable low-rank matrices into transformer layers. The original weights stay frozen.
Original: W (frozen)
LoRA: W + BA where B and A are small trainable matrices
Benefits:
- 10-100x fewer trainable parameters
- Can train on consumer GPUs
- Merge adapters back into base model
- Stack multiple adapters for different tasks
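A minimal sketch of attaching LoRA adapters with Hugging Face's peft library; the base model name and target module names are illustrative and depend on the architecture:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model is illustrative; target_modules differ between architectures.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=16,                                # rank of the low-rank matrices B and A
    lora_alpha=32,                       # scaling factor, often 2x the rank
    target_modules=["q_proj", "v_proj"], # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # wraps the frozen base with trainable adapters
model.print_trainable_parameters()          # shows how small the trainable fraction is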
QLoRA
Combines LoRA with quantisation:
- Quantise base model to 4-bit
- Train LoRA adapters in higher precision
- Dramatically reduces memory requirements
A 65B parameter model can be fine-tuned on a single 48GB GPU.
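A sketch of the QLoRA loading step with a 4-bit bitsandbytes configuration (model name illustrative); the LoRA adapters are then attached exactly as above:

import torch
from peft import prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantisation from the QLoRA paper
    bnb_4bit_use_double_quant=True,        # also quantise the quantisation constants
    bnb_4bit_compute_dtype=torch.bfloat16, # adapters and activations in higher precision
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # illustrative
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # stabilises training on the quantised base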
Other PEFT Methods
- Prefix tuning — Learn trainable prefix vectors prepended to the keys and values of each attention layer
- Adapters — Small bottleneck layers inserted between transformer layers
- IA³ — Rescales activations with learned vectors
Data Preparation
Dataset Quality
Quality matters more than quantity. A few hundred high-quality examples often beat thousands of mediocre ones.
Guidelines:
- Diverse examples covering edge cases
- Consistent format and style
- Accurate labels/responses
- Representative of real usage
Dataset Format
Most fine-tuning uses a chat-style instruction format:
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
  ]
}
Dataset Size Guidelines
- Minimum viable: 50-100 examples
- Good baseline: 500-1000 examples
- Production quality: 1000-10000+ examples
More data helps, but with diminishing returns.
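Before training, a quick sanity check over the JSONL file catches most formatting problems; the file path and the assertion rules below are illustrative:

import json

VALID_ROLES = {"system", "user", "assistant"}

with open("train.jsonl") as f:                  # illustrative path, one JSON object per line
    for i, line in enumerate(f, start=1):
        record = json.loads(line)               # fails loudly on malformed JSON
        messages = record["messages"]
        assert all(m["role"] in VALID_ROLES for m in messages), f"bad role on line {i}"
        assert messages[-1]["role"] == "assistant", f"line {i} must end with an assistant turn"
print("dataset looks well-formed")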
Training Process
Hyperparameters
- Learning rate — Typically 1e-5 to 5e-5 (lower than pre-training)
- Epochs — Usually 1-5 (overfitting risk with more)
- Batch size — As large as memory allows (use gradient accumulation to simulate larger batches)
- LoRA rank — 8-64 typical (higher = more capacity)
- LoRA alpha — Often 2x the rank
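As a rough illustration, these settings map onto Hugging Face training configuration along these lines; the values sit in the ranges above and are not recommendations for any particular model:

from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
training_args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-5,             # low, so the model stays close to the base weights
    num_train_epochs=3,             # 1-5; monitor validation loss for overfitting
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # simulates a larger effective batch size
    warmup_ratio=0.03,
    logging_steps=10,
)
# Hand both to a trainer, e.g. TRL's SFTTrainer (exact arguments vary by version).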
Avoiding Catastrophic Forgetting
The model may lose general capabilities when specialised. Mitigations:
- Mix in general instruction data
- Use lower learning rates
- Early stopping based on validation loss
- LoRA (frozen base weights limit how much can be forgotten)
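For the first mitigation above, a simple recipe is to blend a slice of a general instruction dataset into the task data; the dataset files and the 4:1 mixing ratio are illustrative:

from datasets import concatenate_datasets, load_dataset

task_ds = load_dataset("json", data_files="train.jsonl", split="train")
general_ds = load_dataset("json", data_files="general_instructions.jsonl", split="train")

n_general = min(len(general_ds), len(task_ds) // 4)   # roughly 4:1 task-to-general
mixed = concatenate_datasets([
    task_ds,
    general_ds.shuffle(seed=42).select(range(n_general)),
]).shuffle(seed=42)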
Evaluation
- Hold out a test set (never seen during training)
- Compare against base model on both task-specific and general benchmarks
- Human evaluation for subjective quality
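Carving out the held-out set before training is a one-liner with the datasets library; the file name and 10% split are illustrative:

from datasets import load_dataset

ds = load_dataset("json", data_files="all_examples.jsonl", split="train")
splits = ds.train_test_split(test_size=0.1, seed=42)  # 10% never seen during training
train_ds, test_ds = splits["train"], splits["test"]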
Platforms & Tools
Cloud Services
- OpenAI Fine-tuning — GPT-3.5/4
- Anthropic Fine-tuning — Claude
- Google Vertex AI — Gemini
- Together AI — Open-source models
- Fireworks AI — Hosted fine-tuning for open models
Self-hosted
- Hugging Face Transformers — Industry standard
- Axolotl — Streamlined fine-tuning
- LLaMA-Factory — GUI and CLI for fine-tuning
- Unsloth — 2x faster, 50% less memory
Frameworks
- PEFT — Hugging Face’s PEFT library
- TRL — Transformer Reinforcement Learning
- DeepSpeed — Distributed training optimisation
Advanced Topics
Alignment Fine-tuning
Beyond task-specific tuning, alignment makes models helpful, harmless, and honest.
- SFT (Supervised Fine-tuning) — Train on human-written responses
- RLHF (Reinforcement Learning from Human Feedback) — Use human preferences to train a reward model, then optimise the LLM against it
- DPO (Direct Preference Optimisation) — Skip the reward model, directly optimise on preferences
- ORPO — Combines SFT and preference alignment in one step
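Preference-based methods (RLHF reward modelling, DPO, ORPO) train on paired comparisons rather than single target responses. A typical record, using the prompt/chosen/rejected field names expected by TRL's preference trainers, looks like this (the example itself is illustrative):

{
  "prompt": "What is the capital of France?",
  "chosen": "The capital of France is Paris.",
  "rejected": "I believe it might be Lyon."
}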
Continued Pre-training
Extend pre-training on domain-specific corpora before instruction fine-tuning. Useful for:
- Specialised domains (legal, medical, scientific)
- Non-English languages
- Proprietary knowledge bases
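A minimal sketch of continued pre-training on raw domain text with the standard causal language-modelling objective; the corpus file, model name, and sequence length are illustrative:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"        # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

raw = load_dataset("text", data_files="domain_corpus.txt", split="train")
tokenised = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM, no masking
trainer = Trainer(model=model, args=TrainingArguments(output_dir="out"),
                  train_dataset=tokenised, data_collator=collator)
trainer.train()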
Merging Models
Combine multiple fine-tuned models:
- Linear interpolation — Average weights
- TIES — Trim, elect sign, and merge
- DARE — Drop and rescale
- Mergekit — Toolkit that implements these and other merge methods
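As a toy illustration of the simplest case, linear interpolation just averages matching tensors from two checkpoints of the same architecture; the paths and 50/50 weighting are illustrative, and tools like mergekit implement this and the more robust methods above:

import torch

# Two fine-tuned checkpoints saved from the same base architecture (paths illustrative).
state_a = torch.load("model_a.pt", map_location="cpu")
state_b = torch.load("model_b.pt", map_location="cpu")

alpha = 0.5  # interpolation weight between the two models
merged = {name: alpha * state_a[name] + (1 - alpha) * state_b[name] for name in state_a}
torch.save(merged, "merged_model.pt")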