A service mesh is a dedicated infrastructure layer for managing service-to-service communication in distributed applications. It provides observability, security, and traffic management capabilities transparently, without requiring changes to application code.
Why Service Meshes Exist
In microservices architectures, applications are decomposed into dozens or hundreds of services, each potentially running many instances dynamically scheduled by Kubernetes. This creates challenges:
- Reliability: Network calls fail; services need retries, timeouts, and circuit breaking
- Security: Services must authenticate each other and encrypt traffic
- Observability: Understanding request flows across services is difficult
- Traffic control: Canary deployments, A/B testing, and traffic shifting require sophisticated routing
Historically, these concerns were handled by libraries embedded in applications (Twitter’s Finagle, Netflix’s Hystrix). Service meshes extract this logic into the infrastructure layer, making it language-agnostic and consistent across all services.
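To make that contrast concrete, here is a minimal Go sketch of the kind of retry-and-timeout logic such libraries embedded in every service; the endpoint, attempt count, and backoff are illustrative assumptions, not any particular library's API. With a mesh, this logic disappears from the application and becomes proxy configuration.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// callWithRetry issues an HTTP GET with a per-attempt timeout and a
// bounded number of retries, retrying on transport errors and 5xx responses.
func callWithRetry(ctx context.Context, url string, attempts int, perTry time.Duration) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		err := func() error {
			attemptCtx, cancel := context.WithTimeout(ctx, perTry)
			defer cancel()
			req, err := http.NewRequestWithContext(attemptCtx, http.MethodGet, url, nil)
			if err != nil {
				return err
			}
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				return err
			}
			defer resp.Body.Close()
			io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
			if resp.StatusCode >= 500 {
				return fmt.Errorf("server error: %d", resp.StatusCode)
			}
			return nil
		}()
		if err == nil {
			return nil
		}
		lastErr = err
		time.Sleep(time.Duration(i+1) * 100 * time.Millisecond) // linear backoff
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}

func main() {
	// Hypothetical endpoint; with a mesh, retries and timeouts move into
	// the proxy and this call becomes a plain HTTP request.
	err := callWithRetry(context.Background(), "http://localhost:8080/health", 3, 2*time.Second)
	fmt.Println("result:", err)
}
```

Every service in every language needs an equivalent of this, which is exactly the duplication the mesh removes.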
Architecture
Data Plane
The data plane consists of lightweight network proxies deployed alongside each service instance. These proxies intercept all inbound and outbound traffic, applying policies and collecting telemetry. The most common deployment pattern is the sidecar, where a proxy container runs in the same pod as the application.
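As a rough illustration of the sidecar role, the Go sketch below forwards inbound requests to a co-located application and records per-request telemetry. The ports and upstream address are assumptions for the sketch; production proxies such as Envoy are far more capable and typically intercept traffic transparently via iptables rules rather than being addressed directly.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// The local application container, reachable inside the same pod.
	app, err := url.Parse("http://127.0.0.1:8080")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(app)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		proxy.ServeHTTP(w, r) // forward to the app
		// Telemetry hook: a real proxy would export metrics and traces here.
		log.Printf("%s %s took %s", r.Method, r.URL.Path, time.Since(start))
	})

	// Inbound traffic for the pod arrives at the proxy, not the app.
	log.Fatal(http.ListenAndServe(":15001", handler))
}
```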
Some newer implementations (Cilium, Istio ambient mode) use sidecar-less approaches, running proxies per-node or using eBPF to reduce resource overhead.
Control Plane
The control plane manages and configures the proxies. It:
- Distributes configuration and routing rules (illustrated after this list)
- Manages certificates for mTLS
- Aggregates telemetry data
- Provides APIs for operators to define policies
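A minimal sketch of the configuration-distribution pattern, assuming an in-memory fan-out in place of a real protocol such as Envoy's xDS: the control plane holds the desired routing rules and pushes each change to every subscribed proxy. The types are simplified stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// RouteRule is a simplified routing policy, e.g. "send 10% of traffic
// for service checkout to version v2".
type RouteRule struct {
	Service string
	Version string
	Weight  int // percentage of traffic
}

// ControlPlane fans configuration out to subscribed data-plane proxies.
type ControlPlane struct {
	mu     sync.Mutex
	subs   []chan []RouteRule
	latest []RouteRule
}

// Subscribe registers a proxy and immediately sends it the current config.
func (cp *ControlPlane) Subscribe() <-chan []RouteRule {
	ch := make(chan []RouteRule, 1)
	cp.mu.Lock()
	cp.subs = append(cp.subs, ch)
	ch <- cp.latest
	cp.mu.Unlock()
	return ch
}

// Apply records a new desired state and pushes it to all proxies.
func (cp *ControlPlane) Apply(rules []RouteRule) {
	cp.mu.Lock()
	cp.latest = rules
	for _, ch := range cp.subs {
		select {
		case ch <- rules: // push the update
		default: // proxy is slow; real systems buffer or resynchronise
		}
	}
	cp.mu.Unlock()
}

func main() {
	cp := &ControlPlane{}
	proxy := cp.Subscribe()
	<-proxy // initial (empty) config

	cp.Apply([]RouteRule{{Service: "checkout", Version: "v2", Weight: 10}})
	fmt.Println("proxy received:", <-proxy)
}
```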
Key Features
Mutual TLS (mTLS)
Automatic encryption and authentication between services using certificates. The mesh handles certificate issuance, rotation, and validation transparently.
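To show what is being automated, here is a Go sketch of a server that requires and verifies client certificates, which is the essence of mTLS; in a mesh, the sidecar terminates connections like this so the application itself can speak plain HTTP. The certificate file paths are placeholders for material the mesh would issue and rotate.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	// Trust only the mesh's certificate authority for client identities.
	caPEM, err := os.ReadFile("ca.pem") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	server := &http.Server{
		Addr: ":8443",
		TLSConfig: &tls.Config{
			ClientCAs:  pool,
			ClientAuth: tls.RequireAndVerifyClientCert, // this makes the TLS *mutual*
			MinVersion: tls.VersionTLS13,
		},
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// The verified client certificate carries the peer's identity.
			peer := r.TLS.PeerCertificates[0]
			w.Write([]byte("hello, " + peer.Subject.CommonName))
		}),
	}
	// Server certificate and key issued (and rotated) by the mesh CA.
	log.Fatal(server.ListenAndServeTLS("server.pem", "server-key.pem"))
}
```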
Traffic Management
- Load balancing: Latency-aware, round-robin, or weighted distribution
- Traffic splitting: Route percentages of traffic to different versions for canary deployments (sketched after this list)
- Retries and timeouts: Automatic retry with configurable budgets
- Circuit breaking: Prevent cascade failures by stopping requests to unhealthy services
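The traffic-splitting sketch referenced above: a weighted random choice over backend versions, which is essentially the decision a proxy makes for each request. The service names and weights are illustrative.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Backend is one version of a service with a routing weight.
type Backend struct {
	Name   string
	Weight int // relative share of traffic
}

// pick chooses a backend in proportion to its weight: with the weights
// below, roughly 90% of requests go to v1 and 10% to the canary.
func pick(backends []Backend) Backend {
	total := 0
	for _, b := range backends {
		total += b.Weight
	}
	n := rand.Intn(total)
	for _, b := range backends {
		n -= b.Weight
		if n < 0 {
			return b
		}
	}
	return backends[len(backends)-1] // unreachable with positive weights
}

func main() {
	backends := []Backend{
		{Name: "reviews-v1", Weight: 90},
		{Name: "reviews-v2", Weight: 10}, // canary
	}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pick(backends).Name]++
	}
	fmt.Println(counts) // roughly 9000 / 1000
}
```

Real meshes apply the same idea per request or per connection and update the weights live from the control plane, which is what makes progressive rollouts possible without redeploying anything.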
Observability
Service meshes provide observability without code changes:
- Metrics: Request rates, latencies, error rates (the golden signals; see the sketch after this list)
- Distributed tracing: Request flow visualisation across services
- Access logging: Detailed logs of all service communication
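The metrics sketch referenced in the list shows what explicit in-app instrumentation of these signals looks like, using the Prometheus Go client; a mesh proxy exports equivalent series with no application changes. The metric and label names are assumptions.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var requestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Request latency by path and status code.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"path", "code"},
)

// statusRecorder captures the status code written by the handler.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.code = code
	s.ResponseWriter.WriteHeader(code)
}

// instrument wraps a handler to record rate, errors, and duration.
func instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, r)
		requestDuration.
			WithLabelValues(r.URL.Path, strconv.Itoa(rec.code)).
			Observe(time.Since(start).Seconds())
	})
}

func main() {
	prometheus.MustRegister(requestDuration)
	http.Handle("/hello", instrument(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello")
	})))
	http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```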
Security Policies
- Authorisation policies controlling which services can communicate
- Rate limiting to prevent abuse
- Identity-based access control, not just network-based (example after this list)
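The example referenced above, assuming SPIFFE-style identities carried as certificate URI SANs: authorisation keys off the caller's verified identity rather than its network address. The paths and identity strings are hypothetical.

```go
package authz

import (
	"net/http"
)

// allowedCallers maps protected paths to the caller identities permitted
// to reach them. The SPIFFE-style URIs are assumptions for the example.
var allowedCallers = map[string]map[string]bool{
	"/orders": {
		"spiffe://cluster.local/ns/shop/sa/checkout": true,
	},
}

// Authorize rejects requests whose verified certificate identity is not
// on the allow-list for the requested path.
func Authorize(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.TLS == nil || len(r.TLS.PeerCertificates) == 0 {
			http.Error(w, "client certificate required", http.StatusUnauthorized)
			return
		}
		// SPIFFE identities are carried as URI SANs on the certificate.
		for _, uri := range r.TLS.PeerCertificates[0].URIs {
			if allowedCallers[r.URL.Path][uri.String()] {
				next.ServeHTTP(w, r)
				return
			}
		}
		http.Error(w, "caller not authorised for this path", http.StatusForbidden)
	})
}
```

In practice this middleware would sit behind an mTLS-terminating listener like the one sketched under Mutual TLS, since r.TLS is only populated on TLS connections; a mesh enforces the same decision in the proxy, driven by declarative policy.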
Popular Implementations
| Mesh | Proxy | Key Characteristics |
|---|---|---|
| Istio | Envoy | Feature-rich, complex, strong ecosystem. Supports sidecar and ambient (sidecar-less) modes. CNCF graduated. |
| Linkerd | linkerd2-proxy (Rust) | Lightweight, simple, low resource overhead. Focused on simplicity and performance. CNCF graduated. |
| Cilium | eBPF + Envoy | Uses eBPF for kernel-level networking, reducing proxy overhead. Part of Cilium CNI. CNCF graduated. |
| Consul Connect | Envoy | Integrates with HashiCorp ecosystem. Works across Kubernetes, VMs, and Nomad. |
Choosing Between Them
- Istio: When you need comprehensive features and can handle complexity
- Linkerd: When simplicity and low overhead are priorities
- Cilium: When already using Cilium CNI and want unified networking/mesh
- Consul Connect: When running hybrid (Kubernetes + VMs) or using HashiCorp stack
When to Use a Service Mesh
Good candidates:
- Large microservices deployments (dozens of services or more)
- Strict security requirements (zero-trust, compliance)
- Need for traffic management (canary, blue-green deployments)
- Multi-cluster or hybrid cloud deployments
- Teams lacking expertise to implement these features in-app
Probably overkill:
- Monolithic applications
- Small number of services (<10)
- Simple request patterns without complex routing needs
- Teams that already have robust observability and security in place
Trade-offs and Complexity
Costs
- Resource overhead: Sidecar proxies consume CPU and memory (typically 50-100 MB of memory per pod)
- Latency: Additional network hops add milliseconds of latency per request
- Operational complexity: Another system to configure, upgrade, and debug
- Learning curve: New abstractions and debugging patterns
Mitigation
- Start with observability features before enabling complex traffic management
- Use sidecar-less modes where available (Cilium, Istio ambient)
- Ensure team understands the mesh before production deployment
- Monitor proxy resource usage and tune accordingly
Related Concepts
- Kubernetes - Most service meshes are designed for Kubernetes
- Microservices - The architecture pattern that drives service mesh adoption
- Observability - A core capability provided by service meshes
Resources
- Service Mesh Primer - Comprehensive book on service mesh concepts
- Istio Documentation
- Linkerd Documentation
- Cilium Service Mesh
- Consul Connect
- The Service Mesh Manifesto - Detailed explanation from Linkerd creators