Model serving covers the infrastructure and techniques for deploying ML models to production and handling inference requests efficiently at scale.

Deployment Options

Managed APIs

Use a provider’s hosted models via API.

Advantages:

  • No infrastructure management
  • Always latest models
  • Scales automatically

Disadvantages:

  • Cost at scale
  • Data leaves your environment
  • Vendor lock-in
  • Rate limits

Providers: OpenAI, Anthropic, Google, Cohere, AWS Bedrock, Azure OpenAI

Self-hosted

Run models on your own infrastructure.

Advantages:

  • Data privacy
  • Cost control at scale
  • Customisation
  • No rate limits

Disadvantages:

  • GPU infrastructure required
  • Operational complexity
  • Model updates are your responsibility

Inference Servers

vLLM

High-throughput LLM serving with PagedAttention.

  • Continuous batching
  • Tensor parallelism
  • OpenAI-compatible API
  • Best for: High-throughput production

  vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8000
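
A minimal client sketch for the server started above, using its OpenAI-compatible API (the base URL, placeholder API key, and model id are assumptions matching the command above):

  from openai import OpenAI

  # Point the standard OpenAI client at the local vLLM server.
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

  resp = client.chat.completions.create(
      model="meta-llama/Meta-Llama-3-8B-Instruct",
      messages=[{"role": "user", "content": "Summarise PagedAttention in two sentences."}],
      max_tokens=128,
  )
  print(resp.choices[0].message.content)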

Text Generation Inference (TGI)

Hugging Face’s production-ready server.

  • Continuous batching
  • Flash attention
  • Quantisation support
  • Best for: Hugging Face ecosystem

Ollama

Simple tool for running models locally.

  • Easy model management
  • Automatic quantisation
  • REST API
  • Best for: Local development, prototyping

  ollama run llama3
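
Ollama also exposes a local REST API (port 11434 by default); a small sketch, assuming the non-streaming JSON response format:

  import requests

  # Call the model started by `ollama run llama3` via Ollama's REST API.
  resp = requests.post(
      "http://localhost:11434/api/generate",
      json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
  )
  print(resp.json()["response"])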

LM Studio

Desktop app for running local models.

  • GUI interface
  • One-click downloads
  • Local OpenAI-compatible server

llama.cpp

Highly optimised C++ inference.

  • CPU and GPU support
  • Minimal dependencies
  • GGUF format
  • Best for: Edge deployment, low resources

Others

Quantisation

Reduce model precision to decrease memory and increase speed.

Formats

Format   Bits   Memory Reduction   Quality Loss
FP32     32     Baseline           None
FP16     16     50%                Minimal
INT8     8      75%                Small
INT4     4      87.5%              Moderate

Methods

Post-training quantisation (PTQ)
Quantise after training. Fast but may lose quality.

  • GPTQ — Popular 4-bit method
  • AWQ — Activation-aware, better quality
  • GGUF — llama.cpp format, various bit-widths
  • bitsandbytes — Easy integration with Transformers
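
As an example of the bitsandbytes route above, a sketch of loading a model in 4-bit with Transformers (the model id and NF4/fp16 settings are illustrative defaults, not requirements):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

  # 4-bit NF4 quantisation with fp16 compute; weights are quantised at load time.
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.float16,
  )
  model = AutoModelForCausalLM.from_pretrained(
      "meta-llama/Meta-Llama-3-8B-Instruct",
      quantization_config=bnb_config,
      device_map="auto",
  )
  tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")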

Quantisation-aware training (QAT)
Train with quantisation in mind. Better quality but expensive.

Practical Guidelines

  • 8-bit: Negligible quality loss for most tasks
  • 4-bit: Good for constrained environments, test quality
  • Below 4-bit: Significant degradation, use carefully

Batching

Static Batching

Wait for N requests, process together. Simple but adds latency.

Dynamic Batching

Set max wait time, batch whatever arrives. Balances throughput and latency.
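
A toy sketch of the idea (illustrative only; run_model, the batch size, and the wait budget are made up):

  import asyncio
  import time

  MAX_BATCH_SIZE = 8
  MAX_WAIT_MS = 20

  queue: asyncio.Queue = asyncio.Queue()

  async def run_model(prompts):
      await asyncio.sleep(0.05)            # stand-in for one batched forward pass
      return [f"echo: {p}" for p in prompts]

  async def submit(prompt):
      fut = asyncio.get_running_loop().create_future()
      await queue.put((prompt, fut))
      return await fut                     # resolves once its batch has run

  async def batcher():
      while True:
          batch = [await queue.get()]      # block until the first request arrives
          deadline = time.monotonic() + MAX_WAIT_MS / 1000
          while len(batch) < MAX_BATCH_SIZE:
              remaining = deadline - time.monotonic()
              if remaining <= 0:
                  break
              try:
                  batch.append(await asyncio.wait_for(queue.get(), remaining))
              except asyncio.TimeoutError:
                  break                    # wait budget spent: run what we have
          outputs = await run_model([p for p, _ in batch])
          for (_, fut), out in zip(batch, outputs):
              fut.set_result(out)

  async def main():
      task = asyncio.create_task(batcher())
      print(await asyncio.gather(*(submit(f"req {i}") for i in range(5))))
      task.cancel()

  asyncio.run(main())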

Continuous Batching

As individual sequences finish they leave the batch and new requests join the running batch at token granularity. Best throughput.

vLLM and TGI use continuous batching by default.

Scaling

Horizontal Scaling

Multiple model replicas behind a load balancer.

  • Stateless inference enables easy scaling
  • Use Kubernetes for orchestration
  • Consider GPU availability constraints

Tensor Parallelism

Split model across multiple GPUs on same node.

  • Required for models exceeding single GPU memory
  • High per-layer communication overhead; needs a fast interconnect (e.g. NVLink)

Pipeline Parallelism

Split model layers across GPUs.

  • Lower communication overhead
  • Creates pipeline bubbles

Expert Parallelism

For Mixture-of-Experts models, distribute experts across GPUs.

Hardware

GPUs

GPU         VRAM   Use Case
RTX 4090    24GB   Development, small models
A10G        24GB   Cloud inference (AWS)
L4          24GB   Cloud inference (GCP)
A100 40GB   40GB   Medium models
A100 80GB   80GB   Large models
H100        80GB   Latest, highest throughput

Memory Estimation

Rough weights-only formula for fp16:

Memory (GB) ≈ parameters (billions) × 2

With 4-bit quantisation:

Memory (GB) ≈ parameters (billions) × 0.5

7B model: ~14GB fp16, ~3.5GB 4-bit (weights only; leave headroom for KV cache and activations)
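
The same rule of thumb as a tiny helper (weights only, as above):

  def estimate_weight_memory_gb(params_billion: float, bits: int = 16) -> float:
      """Weights-only estimate; real deployments also need headroom for the
      KV cache, activations and runtime overhead."""
      return params_billion * bits / 8

  print(estimate_weight_memory_gb(7, 16))  # 14.0 GB (fp16)
  print(estimate_weight_memory_gb(7, 4))   # 3.5 GB (4-bit)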

Cloud GPU Providers

Optimisation Techniques

KV Cache

Cache key-value pairs from previous tokens. Essential for autoregressive generation.
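
A toy single-head decode step with a KV cache (conceptual sketch, not how any particular server implements it): each step computes K/V only for the newest token and appends them, instead of recomputing the whole sequence.

  import torch

  d = 64
  Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

  def decode_step(x_new, k_cache, v_cache):
      """x_new: (1, d) embedding of the newest token; caches are lists of (1, d) rows."""
      q = x_new @ Wq                      # query only for the new token
      k_cache.append(x_new @ Wk)          # cached keys grow by one row
      v_cache.append(x_new @ Wv)          # cached values grow by one row
      K, V = torch.cat(k_cache), torch.cat(v_cache)
      attn = torch.softmax(q @ K.T / d ** 0.5, dim=-1)
      return attn @ V                     # (1, d) attention output

  k_cache, v_cache = [], []
  for _ in range(5):                      # five autoregressive steps
      out = decode_step(torch.randn(1, d), k_cache, v_cache)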

Flash Attention

Memory-efficient attention computation. Now standard in most servers.

Speculative Decoding

Use a smaller draft model to propose several tokens, then verify them with the main model in a single parallel pass.
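
A conceptual sketch with greedy verification (real systems verify with rejection sampling over probabilities, and the verification loop below would be a single batched forward pass; draft_model and target_model are hypothetical stand-ins returning a next-token id):

  def speculative_step(prefix, draft_model, target_model, k=4):
      # 1. The cheap draft model proposes k tokens autoregressively.
      draft, ctx = [], list(prefix)
      for _ in range(k):
          t = draft_model(ctx)
          draft.append(t)
          ctx.append(t)

      # 2. The target model's greedy choice after prefix + draft[:i], for each i.
      targets = [target_model(list(prefix) + draft[:i]) for i in range(k + 1)]

      # 3. Accept draft tokens while they match; on the first mismatch keep the
      #    target's token and stop. If all match, the last choice is a bonus token.
      accepted = []
      for i, t in enumerate(draft):
          if t == targets[i]:
              accepted.append(t)
          else:
              accepted.append(targets[i])
              break
      else:
          accepted.append(targets[k])
      return prefix + accepted

  # Toy integer "models": the draft agrees with the target for short contexts.
  target = lambda ctx: (sum(ctx) + 1) % 100
  draft = lambda ctx: target(ctx) if len(ctx) < 6 else (target(ctx) + 1) % 100
  print(speculative_step([1, 2, 3], draft, target, k=4))  # [1, 2, 3, 7, 14, 28, 56]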

Prefix Caching

Cache the KV state of common prefixes (e.g. shared system prompts) across requests.

Monitoring

Key Metrics

  • Latency — Time to first token (TTFT), inter-token latency, total latency
  • Throughput — Requests/second, tokens/second
  • GPU utilisation — Memory, compute
  • Queue depth — Pending requests
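
A quick way to eyeball TTFT and decode throughput (a sketch, assuming the OpenAI-compatible local server from the vLLM example above; streamed chunks are used as a rough proxy for tokens):

  import time
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

  start = time.monotonic()
  first_token_at, chunks = None, 0

  stream = client.chat.completions.create(
      model="meta-llama/Meta-Llama-3-8B-Instruct",
      messages=[{"role": "user", "content": "Explain KV caching briefly."}],
      max_tokens=256,
      stream=True,
  )
  for chunk in stream:
      if chunk.choices and chunk.choices[0].delta.content:
          first_token_at = first_token_at or time.monotonic()  # first content chunk
          chunks += 1

  total = time.monotonic() - start
  ttft = first_token_at - start
  print(f"TTFT: {ttft:.3f}s")
  print(f"~{chunks / max(total - ttft, 1e-6):.1f} chunks/s after first token")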

Tools

Production Checklist

  • Load testing under expected traffic
  • Graceful degradation under overload
  • Health checks and auto-restart
  • Request timeouts
  • Rate limiting
  • Logging and tracing
  • Cost monitoring
  • Model versioning strategy

Resources