Model serving covers the infrastructure and techniques for deploying ML models to production and handling inference requests efficiently at scale.
Deployment Options
Managed APIs
Use a provider’s hosted models via API.
Advantages:
- No infrastructure management
- Always latest models
- Scales automatically
Disadvantages:
- Cost at scale
- Data leaves your environment
- Vendor lock-in
- Rate limits
Providers: OpenAI, Anthropic, Google, Cohere, AWS Bedrock, Azure OpenAI
Self-hosted
Run models on your own infrastructure.
Advantages:
- Data privacy
- Cost control at scale
- Customisation
- No rate limits
Disadvantages:
- GPU infrastructure required
- Operational complexity
- Model updates are your responsibility
Inference Servers
vLLM
High-throughput LLM serving with PagedAttention.
- Continuous batching
- Tensor parallelism
- OpenAI-compatible API
- Best for: High throughput production
```bash
vllm serve meta-llama/Llama-3-8B-Instruct --port 8000
```
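Because vLLM exposes an OpenAI-compatible API, any OpenAI client can talk to the server started above. A minimal sketch using the `openai` Python package, assuming the server is running locally on port 8000; the API key value is a placeholder (vLLM ignores it unless one is configured):

```python
# Query a locally running vLLM server through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="not-needed",                 # placeholder; vLLM does not require a key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",  # must match the model the server loaded
    messages=[{"role": "user", "content": "Summarise what continuous batching does."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```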
Text Generation Inference (TGI)
Hugging Face’s production-ready server.
- Continuous batching
- Flash attention
- Quantisation support
- Best for: Hugging Face ecosystem
Ollama
Simple tool for running models locally.
- Easy model management
- Automatic quantisation
- REST API
- Best for: Local development, prototyping
```bash
ollama run llama3
```
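Ollama also serves a local REST API (on port 11434 by default). A minimal non-streaming sketch with `requests`, assuming the `llama3` model has already been pulled:

```python
# Call Ollama's local REST API for a single, non-streaming generation.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={
        "model": "llama3",                  # any model already pulled with `ollama pull`
        "prompt": "Explain KV caching in one sentence.",
        "stream": False,                    # return one JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```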
LM Studio
Desktop app for running local models.
- GUI interface
- One-click downloads
- Local OpenAI-compatible server
llama.cpp
Highly optimised C++ inference.
- CPU and GPU support
- Minimal dependencies
- GGUF format
- Best for: Edge deployment, low resources
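A common route to llama.cpp from Python is the `llama-cpp-python` bindings. A minimal sketch, assuming a GGUF file is already downloaded (the file path is a placeholder and the exact constructor options can vary by version):

```python
# Run a GGUF model locally through the llama-cpp-python bindings.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path to a GGUF file
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers to GPU if one is available; 0 = CPU only
)

out = llm("Q: Why use 4-bit quantisation? A:", max_tokens=64, stop=["\n"])
print(out["choices"][0]["text"])
```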
Others
- SGLang — Fast structured generation
- TensorRT-LLM — NVIDIA-optimised
- MLC LLM — Universal deployment
Quantisation
Reduce model precision to decrease memory and increase speed.
Formats
| Format | Bits | Memory Reduction | Quality Loss |
|---|---|---|---|
| FP32 | 32 | Baseline | None |
| FP16 | 16 | 50% | Minimal |
| INT8 | 8 | 75% | Small |
| INT4 | 4 | 87.5% | Moderate |
Methods
Post-training quantisation (PTQ)
Quantise after training. Fast but may lose quality.
- GPTQ — Popular 4-bit method
- AWQ — Activation-aware, better quality
- GGUF — llama.cpp format, various bit-widths
- bitsandbytes — Easy integration with Transformers
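As a concrete PTQ example, here is a minimal sketch of loading a model in 4-bit with bitsandbytes through Transformers. The model name is a placeholder and the config options shown are common defaults, not the only valid choices:

```python
# Load a model in 4-bit with bitsandbytes (quantised at load time, after training).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantise weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4, a common default
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 to limit quality loss
)

model_name = "meta-llama/Llama-3-8B-Instruct"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Quantisation reduces memory because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```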
Quantisation-aware training (QAT)
Train with quantisation in mind. Better quality but expensive.
Practical Guidelines
- 8-bit: Negligible quality loss for most tasks
- 4-bit: Good for constrained environments, test quality
- Below 4-bit: Significant degradation, use carefully
Batching
Static Batching
Wait for N requests, process together. Simple but adds latency.
Dynamic Batching
Set max wait time, batch whatever arrives. Balances throughput and latency.
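To make the trade-off concrete, here is a toy sketch of a dynamic batcher: requests queue up, and a batch is flushed either when it is full or when the oldest request has waited long enough. All names and limits are illustrative, not taken from any particular server:

```python
# Toy dynamic batcher: flush when the batch is full or the max wait time expires.
import asyncio
import time

MAX_BATCH_SIZE = 8
MAX_WAIT_SECONDS = 0.05

async def run_model(batch):
    # Stand-in for a real batched forward pass.
    await asyncio.sleep(0.02)
    return [f"result for {prompt!r}" for prompt in batch]

async def batcher(queue: asyncio.Queue):
    while True:
        prompt, fut = await queue.get()           # block until the first request arrives
        batch, futures = [prompt], [fut]
        deadline = time.monotonic() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break                             # waited long enough: flush what we have
            try:
                prompt, fut = await asyncio.wait_for(queue.get(), timeout)
            except asyncio.TimeoutError:
                break
            batch.append(prompt)
            futures.append(fut)
        for fut, result in zip(futures, await run_model(batch)):
            fut.set_result(result)

async def submit(queue, prompt):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batcher(queue))
    results = await asyncio.gather(*(submit(queue, f"prompt {i}") for i in range(20)))
    print(results[:3])
    worker.cancel()

asyncio.run(main())
```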
Continuous Batching
Requests leave the batch as soon as they finish, and new ones join the running batch at the next step. Best throughput.
vLLM and TGI use continuous batching by default.
Scaling
Horizontal Scaling
Multiple model replicas behind a load balancer.
- Stateless inference enables easy scaling
- Use Kubernetes for orchestration
- Consider GPU availability constraints
Tensor Parallelism
Split model across multiple GPUs on same node.
- Required for models exceeding single GPU memory
- Significant communication overhead (all-reduce per layer); benefits from fast intra-node interconnect such as NVLink
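In vLLM, for example, tensor parallelism is a single argument. A minimal sketch using the offline Python API, assuming a node with four GPUs (the model name is a placeholder):

```python
# Shard one model across 4 GPUs on the same node with vLLM tensor parallelism.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-70B-Instruct",  # placeholder; pick a model that needs sharding
    tensor_parallel_size=4,                   # split the weights across 4 GPUs
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```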
Pipeline Parallelism
Split model layers across GPUs.
- Lower communication overhead
- Creates pipeline bubbles (idle GPU time between stages)
Expert Parallelism
For Mixture-of-Experts models, distribute experts across GPUs.
Hardware
GPUs
| GPU | VRAM | Use Case |
|---|---|---|
| RTX 4090 | 24GB | Development, small models |
| A10G | 24GB | Cloud inference (AWS) |
| L4 | 24GB | Cloud inference (GCP) |
| A100 40GB | 40GB | Medium models |
| A100 80GB | 80GB | Large models |
| H100 | 80GB | Latest, highest throughput |
Memory Estimation
Rough formula for fp16:
Memory (GB) ≈ Parameters (B) × 2
With 4-bit quantisation:
Memory (GB) ≈ Parameters (B) × 0.5
7B model: ~14GB fp16, ~3.5GB 4-bit
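The same arithmetic as a small helper. This is a rough sketch of weight memory only; it ignores the KV cache, activations, and framework overhead:

```python
# Rough weight-memory estimate: parameters (in billions) x bytes per parameter.
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param  # 1B params at 1 byte each is roughly 1 GB

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
# 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB
```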
Cloud GPU Providers
- AWS (p4d, p5 instances)
- GCP (A100, H100)
- Azure
- Lambda Labs
- RunPod
- Vast.ai
- Together AI
- Modal
Optimisation Techniques
KV Cache
Cache key-value pairs from previous tokens. Essential for autoregressive generation.
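The cache itself is a major memory consumer: two tensors (K and V) per layer, per token. A rough sketch of the per-request size, using illustrative numbers that roughly match an 8B-class model with grouped-query attention:

```python
# Rough per-request KV cache size: 2 tensors (K and V) per layer, per token.
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value / 1e9

# Illustrative numbers for an 8B-class model with grouped-query attention, fp16 cache.
print(f"~{kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192):.2f} GB per 8K-token request")
```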
Flash Attention
Memory-efficient attention computation. Now standard in most servers.
Speculative Decoding
Use a smaller draft model to propose several tokens, then verify them with the main model in a single parallel pass.
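A toy sketch of the core accept/reject loop with greedy verification. The "models" here are stand-in functions over integer tokens, not real networks, and a real system would score all proposed positions in one forward pass rather than in a Python loop:

```python
# Toy speculative decoding with greedy verification (stand-in "models", not real networks).

def draft_next(tokens):   # cheap draft model: guesses the next token
    return (tokens[-1] + 1) % 7

def target_next(tokens):  # expensive target model: the answer we must match
    return (tokens[-1] + 1) % 5

def speculative_step(tokens, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap).
    draft = list(tokens)
    for _ in range(k):
        draft.append(draft_next(draft))
    proposed = draft[len(tokens):]

    # 2. Target model checks the proposals (one parallel pass in a real system).
    accepted = []
    for tok in proposed:
        expected = target_next(tokens + accepted)
        if tok == expected:
            accepted.append(tok)       # draft matched the target: token accepted for free
        else:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            break
    return tokens + accepted

seq = [0]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)
```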
Prefix Caching
Cache common prefixes (system prompts) across requests.
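In vLLM, for example, prefix caching is a single switch on the engine. A brief sketch via the offline API; the model name and prompts are placeholders:

```python
# Reuse cached KV blocks for a shared prompt prefix across requests (vLLM).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8B-Instruct", enable_prefix_caching=True)

system = "You are a terse assistant. "  # shared prefix, cached once
params = SamplingParams(max_tokens=32)
for question in ["What is TTFT?", "What is a KV cache?"]:
    out = llm.generate([system + question], params)
    print(out[0].outputs[0].text)
```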
Monitoring
Key Metrics
- Latency — Time to first token (TTFT), inter-token latency, total latency
- Throughput — Requests/second, tokens/second
- GPU utilisation — Memory, compute
- Queue depth — Pending requests
Tools
- Prometheus + Grafana
- Built-in server metrics endpoints
- LangSmith
- Arize Phoenix
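A minimal sketch of exposing the latency and throughput metrics above with the Prometheus Python client, to be scraped by Prometheus and charted in Grafana. Metric names, buckets, and the request handler are illustrative:

```python
# Expose basic serving metrics for Prometheus to scrape (illustrative names and buckets).
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

TTFT = Histogram("ttft_seconds", "Time to first token", buckets=(0.05, 0.1, 0.25, 0.5, 1, 2, 5))
TOKENS = Counter("generated_tokens_total", "Total tokens generated")
INFLIGHT = Gauge("inflight_requests", "Requests currently being processed")

def handle_request(prompt: str) -> str:
    INFLIGHT.inc()
    start = time.monotonic()
    try:
        # Stand-in for real generation: pretend the first token arrives after 100 ms.
        time.sleep(0.1)
        TTFT.observe(time.monotonic() - start)
        output_tokens = ["hello", "world"]  # stand-in output
        TOKENS.inc(len(output_tokens))
        return " ".join(output_tokens)
    finally:
        INFLIGHT.dec()

if __name__ == "__main__":
    start_http_server(9100)  # metrics served at http://localhost:9100/metrics
    while True:
        handle_request("ping")
        time.sleep(1)
```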
Production Checklist
- Load testing under expected traffic
- Graceful degradation under overload
- Health checks and auto-restart
- Request timeouts
- Rate limiting
- Logging and tracing
- Cost monitoring
- Model versioning strategy