Model serving covers the infrastructure and techniques for deploying ML models to production and handling inference requests efficiently at scale.
Deployment Options
Managed APIs
Use a provider’s hosted models via API.
Advantages:
- No infrastructure management
- Always latest models
- Scales automatically
Disadvantages:
- Cost at scale
- Data leaves your environment
- Vendor lock-in
- Rate limits
Providers: OpenAI, Anthropic, Google, Cohere, AWS Bedrock, Azure OpenAI
Self-hosted
Run models on your own infrastructure.
Advantages:
- Data privacy
- Cost control at scale
- Customisation
- No rate limits
Disadvantages:
- GPU infrastructure required
- Operational complexity
- Model updates are your responsibility
Inference Servers
vLLM
High-throughput LLM serving with PagedAttention.
- Continuous batching
- Tensor parallelism
- OpenAI-compatible API
- Best for: High throughput production
```bash
vllm serve meta-llama/Llama-3-8B-Instruct --port 8000
```
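Because vLLM exposes an OpenAI-compatible API, any OpenAI client can talk to the server started above. A minimal sketch using the `openai` Python package, assuming the server is running locally on port 8000; the API key value is a placeholder (vLLM ignores it unless one is configured):

```python
# Query a locally running vLLM server through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="not-needed",                 # placeholder; vLLM does not require a key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",  # must match the model the server loaded
    messages=[{"role": "user", "content": "Summarise what continuous batching does."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```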
Text Generation Inference (TGI)
Hugging Face’s production-ready server.
- Continuous batching
- Flash attention
- Quantisation support
- Best for: Hugging Face ecosystem
Ollama
Simple tool for running models locally.
- Easy model management
- Automatic quantisation
- REST API
- Best for: Local development, prototyping
```bash
ollama run llama3
```
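Ollama also serves a local REST API (on port 11434 by default). A minimal non-streaming sketch with `requests`, assuming the `llama3` model has already been pulled:

```python
# Call Ollama's local REST API for a single, non-streaming generation.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={
        "model": "llama3",                  # any model already pulled with `ollama pull`
        "prompt": "Explain KV caching in one sentence.",
        "stream": False,                    # return one JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```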
LM Studio
Desktop app for running local models.
- GUI interface
- One-click downloads
- Local OpenAI-compatible server
llama.cpp
Highly optimised C++ inference.
- CPU and GPU support
- Minimal dependencies
- GGUF format
- Best for: Edge deployment, low resources
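A common route to llama.cpp from Python is the `llama-cpp-python` bindings. A minimal sketch, assuming a GGUF file is already downloaded (the file path is a placeholder and the exact constructor options can vary by version):

```python
# Run a GGUF model locally through the llama-cpp-python bindings.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path to a GGUF file
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers to GPU if one is available; 0 = CPU only
)

out = llm("Q: Why use 4-bit quantisation? A:", max_tokens=64, stop=["\n"])
print(out["choices"][0]["text"])
```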
Others
- SGLang — Fast structured generation
- TensorRT-LLM — NVIDIA-optimised
- MLC LLM — Universal deployment
Quantisation
Reduce model precision to decrease memory and increase speed.
Formats
| Format | Bits | Memory Reduction | Quality Loss |
|---|---|---|---|
| FP32 | 32 | Baseline | None |
| FP16 | 16 | 50% | Minimal |
| INT8 | 8 | 75% | Small |
| INT4 | 4 | 87.5% | Moderate |
Methods
Post-training quantisation (PTQ)
Quantise after training. Fast but may lose quality.
- GPTQ — Popular 4-bit method
- AWQ — Activation-aware, better quality
- GGUF — llama.cpp format, various bit-widths
- bitsandbytes — Easy integration with Transformers
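As a concrete PTQ example, here is a minimal sketch of loading a model in 4-bit with bitsandbytes through Transformers. The model name is a placeholder and the config options shown are common defaults, not the only valid choices:

```python
# Load a model in 4-bit with bitsandbytes (quantised at load time, after training).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantise weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4, a common default
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 to limit quality loss
)

model_name = "meta-llama/Llama-3-8B-Instruct"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Quantisation reduces memory because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```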
Quantisation-aware training (QAT)
Train with quantisation in mind. Better quality but expensive.
Practical Guidelines
- 8-bit: Negligible quality loss for most tasks
- 4-bit: Good for constrained environments, test quality
- Below 4-bit: Significant degradation, use carefully
Batching
Static Batching
Wait for N requests, process together. Simple but adds latency.
Dynamic Batching
Set max wait time, batch whatever arrives. Balances throughput and latency.
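To make the trade-off concrete, here is a toy sketch of a dynamic batcher: requests queue up, and a batch is flushed either when it is full or when the oldest request has waited long enough. All names and limits are illustrative, not taken from any particular server:

```python
# Toy dynamic batcher: flush when the batch is full or the max wait time expires.
import asyncio
import time

MAX_BATCH_SIZE = 8
MAX_WAIT_SECONDS = 0.05

async def run_model(batch):
    # Stand-in for a real batched forward pass.
    await asyncio.sleep(0.02)
    return [f"result for {prompt!r}" for prompt in batch]

async def batcher(queue: asyncio.Queue):
    while True:
        prompt, fut = await queue.get()           # block until the first request arrives
        batch, futures = [prompt], [fut]
        deadline = time.monotonic() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break                             # waited long enough: flush what we have
            try:
                prompt, fut = await asyncio.wait_for(queue.get(), timeout)
            except asyncio.TimeoutError:
                break
            batch.append(prompt)
            futures.append(fut)
        for fut, result in zip(futures, await run_model(batch)):
            fut.set_result(result)

async def submit(queue, prompt):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batcher(queue))
    results = await asyncio.gather(*(submit(queue, f"prompt {i}") for i in range(20)))
    print(results[:3])
    worker.cancel()

asyncio.run(main())
```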
Continuous Batching
Requests leave the batch as soon as they finish, and new ones join the running batch at the next step. Best throughput.
vLLM and TGI use continuous batching by default.
Scaling
Horizontal Scaling
Multiple model replicas behind a load balancer.
- Stateless inference enables easy scaling
- Use Kubernetes for orchestration
- Consider GPU availability constraints
Tensor Parallelism
Split model across multiple GPUs on same node.
- Required for models exceeding single GPU memory
- Significant communication overhead (all-reduce per layer); benefits from fast intra-node interconnect such as NVLink
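In vLLM, for example, tensor parallelism is a single argument. A minimal sketch using the offline Python API, assuming a node with four GPUs (the model name is a placeholder):

```python
# Shard one model across 4 GPUs on the same node with vLLM tensor parallelism.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-70B-Instruct",  # placeholder; pick a model that needs sharding
    tensor_parallel_size=4,                   # split the weights across 4 GPUs
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```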
Pipeline Parallelism
Split model layers across GPUs.
- Lower communication overhead
- Creates pipeline bubbles (idle GPU time between stages)
Expert Parallelism
For Mixture-of-Experts models, distribute experts across GPUs.
Hardware
GPUs
| GPU | VRAM | Use Case |
|---|---|---|
| RTX 4090 | 24GB | Development, small models |
| A10G | 24GB | Cloud inference (AWS) |
| L4 | 24GB | Cloud inference (GCP) |
| A100 40GB | 40GB | Medium models |
| A100 80GB | 80GB | Large models |
| H100 | 80GB | Latest, highest throughput |
Memory Estimation
Rough formula for fp16:
Memory (GB) ≈ Parameters (B) × 2
With 4-bit quantisation:
Memory (GB) ≈ Parameters (B) × 0.5
7B model: ~14GB fp16, ~3.5GB 4-bit
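The same arithmetic as a small helper. This is a rough sketch of weight memory only; it ignores the KV cache, activations, and framework overhead:

```python
# Rough weight-memory estimate: parameters (in billions) x bytes per parameter.
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param  # 1B params at 1 byte each is roughly 1 GB

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
# 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB
```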
Cloud GPU Providers
- AWS (p4d, p5 instances)
- GCP (A100, H100)
- Azure
- Lambda Labs
- RunPod
- Vast.ai
- Together AI
- Modal
Optimisation Techniques
KV Cache
Cache key-value pairs from previous tokens. Essential for autoregressive generation.
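The cache itself is a major memory consumer: two tensors (K and V) per layer, per token. A rough sketch of the per-request size, using illustrative numbers that roughly match an 8B-class model with grouped-query attention:

```python
# Rough per-request KV cache size: 2 tensors (K and V) per layer, per token.
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value / 1e9

# Illustrative numbers for an 8B-class model with grouped-query attention, fp16 cache.
print(f"~{kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192):.2f} GB per 8K-token request")
```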
Flash Attention
Memory-efficient attention computation. Now standard in most servers.
Speculative Decoding
Use a smaller draft model to propose several tokens, then verify them with the main model in a single parallel pass.
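A toy sketch of the core accept/reject loop with greedy verification. The "models" here are stand-in functions over integer tokens, not real networks, and a real system would score all proposed positions in one forward pass rather than in a Python loop:

```python
# Toy speculative decoding with greedy verification (stand-in "models", not real networks).

def draft_next(tokens):   # cheap draft model: guesses the next token
    return (tokens[-1] + 1) % 7

def target_next(tokens):  # expensive target model: the answer we must match
    return (tokens[-1] + 1) % 5

def speculative_step(tokens, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap).
    draft = list(tokens)
    for _ in range(k):
        draft.append(draft_next(draft))
    proposed = draft[len(tokens):]

    # 2. Target model checks the proposals (one parallel pass in a real system).
    accepted = []
    for tok in proposed:
        expected = target_next(tokens + accepted)
        if tok == expected:
            accepted.append(tok)       # draft matched the target: token accepted for free
        else:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            break
    return tokens + accepted

seq = [0]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)
```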
Prefix Caching
Cache common prefixes (system prompts) across requests.
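In vLLM, for example, prefix caching is a single switch on the engine. A brief sketch via the offline API; the model name and prompts are placeholders:

```python
# Reuse cached KV blocks for a shared prompt prefix across requests (vLLM).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8B-Instruct", enable_prefix_caching=True)

system = "You are a terse assistant. "  # shared prefix, cached once
params = SamplingParams(max_tokens=32)
for question in ["What is TTFT?", "What is a KV cache?"]:
    out = llm.generate([system + question], params)
    print(out[0].outputs[0].text)
```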
Monitoring
Key Metrics
- Latency — Time to first token (TTFT), inter-token latency, total latency
- Throughput — Requests/second, tokens/second
- GPU utilisation — Memory, compute
- Queue depth — Pending requests
Tools
- Prometheus + Grafana
- Built-in server metrics endpoints
- LangSmith
- Arize Phoenix
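A minimal sketch of exposing the latency and throughput metrics above with the Prometheus Python client, to be scraped by Prometheus and charted in Grafana. Metric names, buckets, and the request handler are illustrative:

```python
# Expose basic serving metrics for Prometheus to scrape (illustrative names and buckets).
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

TTFT = Histogram("ttft_seconds", "Time to first token", buckets=(0.05, 0.1, 0.25, 0.5, 1, 2, 5))
TOKENS = Counter("generated_tokens_total", "Total tokens generated")
INFLIGHT = Gauge("inflight_requests", "Requests currently being processed")

def handle_request(prompt: str) -> str:
    INFLIGHT.inc()
    start = time.monotonic()
    try:
        # Stand-in for real generation: pretend the first token arrives after 100 ms.
        time.sleep(0.1)
        TTFT.observe(time.monotonic() - start)
        output_tokens = ["hello", "world"]  # stand-in output
        TOKENS.inc(len(output_tokens))
        return " ".join(output_tokens)
    finally:
        INFLIGHT.dec()

if __name__ == "__main__":
    start_http_server(9100)  # metrics served at http://localhost:9100/metrics
    while True:
        handle_request("ping")
        time.sleep(1)
```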
Production Checklist
- Load testing under expected traffic
- Graceful degradation under overload
- Health checks and auto-restart
- Request timeouts
- Rate limiting
- Logging and tracing
- Cost monitoring
- Model versioning strategy