Monitoring and alerting provide visibility into your systems’ health and behaviour. Good observability practices help you detect problems early, diagnose issues quickly, and understand system performance over time.
See also: Reliability Metrics, Incident Management, Cost Optimisation.
The Three Pillars of Observability
Logs
Discrete events with structured or unstructured text describing what happened.
Characteristics:
- High cardinality — can capture any detail
- Immutable records of events
- Essential for debugging and forensics
- Can be expensive to store and query at scale
Best practices:
- Use structured logging (JSON) for machine-parseable logs
- Include correlation IDs to trace requests across services
- Log at appropriate levels (DEBUG, INFO, WARN, ERROR)
- Avoid logging sensitive data (PII, credentials)
- Implement log rotation and retention policies
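A minimal sketch of the first two practices, using only Python's standard `logging` module. The `JsonFormatter` class and `build_logger` helper are illustrative names rather than part of any library, and the field set simply mirrors the example log in this section.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": timestamp.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname,
            # `service` and `trace_id` are expected to arrive via `extra=`.
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("payment-api")
# The correlation ID travels with the record so it can be searched across services.
logger.error("Payment failed", extra={"service": "payment-api", "trace_id": "abc123"})
```

Passing the correlation ID through `extra` keeps call sites short; a real service would more likely attach it automatically from request context.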
Example structured log:
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123",
  "message": "Payment failed",
  "error": "Card declined",
  "user_id": "user_456",
  "amount": 99.99
}
Metrics
Numeric measurements aggregated over time, describing system state.
Characteristics:
- Highly efficient to store and query
- Good for dashboards and alerts
- Limited cardinality (by design)
- Show trends and patterns
Metric types:
- Counter — Monotonically increasing value (requests, errors, bytes sent)
- Gauge — Current value that can go up or down (temperature, queue depth, active connections)
- Histogram — Distribution of values (latency percentiles, request sizes)
- Summary — Pre-calculated quantiles (similar to histogram, different trade-offs)
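To make the difference between the first three types concrete, here is a rough in-memory sketch. These classes are simplified stand-ins written for illustration, not the API of any metrics library; the histogram stores per-bucket counts, whereas systems such as Prometheus expose cumulative buckets.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonically increasing total; it can only go up."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that may rise or fall."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations into fixed buckets keyed by upper bound."""
    bounds: list[float]  # sorted upper bounds, e.g. in seconds
    counts: list[int] = field(default_factory=list)
    total: float = 0.0

    def __post_init__(self) -> None:
        # One extra slot holds observations above the largest bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value


requests_total = Counter()
requests_total.inc()

queue_depth = Gauge()
queue_depth.set(7)
queue_depth.set(3)  # gauges may go down

latency = Histogram(bounds=[0.1, 0.5, 1.0])
for seconds in (0.05, 0.5, 2.0):
    latency.observe(seconds)
```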
Naming conventions:
# Format: <namespace>_<name>_<unit>
http_requests_total
http_request_duration_seconds
process_memory_bytes
queue_depth_messages
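A convention like this is easier to keep if it is checked automatically. The sketch below is one possible check, assuming a small, hypothetical list of accepted suffixes (`KNOWN_SUFFIXES`); a real project would maintain its own list.

```python
import re

# Assumed suffix vocabulary for this sketch; extend it for your own metrics.
KNOWN_SUFFIXES = ("total", "seconds", "bytes", "messages")

# Lowercase snake_case with at least two underscore-separated parts.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")


def is_conventional(name: str) -> bool:
    """True when the name is snake_case and ends in a known suffix."""
    suffix = name.rsplit("_", 1)[-1]
    return bool(NAME_PATTERN.match(name)) and suffix in KNOWN_SUFFIXES
```

A check like this could run in CI against the metric names a service registers.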
Traces
Records of requests as they flow through distributed systems.
Characteristics:
- Show the path of a request across services
- Reveal latency contributions from each component
- Essential for debugging distributed systems
- Require instrumentation across all services
Key concepts:
- Trace — The complete journey of a request
- Span — A single operation within a trace
- Context propagation — Passing trace IDs between services
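Context propagation can be sketched in a few lines. The example below passes the trace ID in a `traceparent` HTTP header in the style of the W3C Trace Context format (`version-traceid-spanid-flags`); the `SpanContext` class and helper functions are illustrative names, and a real service would normally rely on an instrumentation library rather than hand-rolled code.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # 32 hex chars, shared by every span in the trace
    span_id: str   # 16 hex chars, unique to one operation


def start_trace() -> SpanContext:
    """Begin a new trace with a root span."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8))


def child_of(parent: SpanContext) -> SpanContext:
    """New span in the same trace."""
    return SpanContext(parent.trace_id, secrets.token_hex(8))


def inject(ctx: SpanContext, headers: dict[str, str]) -> None:
    """Write the context into outgoing request headers."""
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-01"


def extract(headers: dict[str, str]) -> Optional[SpanContext]:
    """Read the caller's context from incoming headers, if present and well-formed."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None
    return SpanContext(parts[1], parts[2])
```

The calling service runs `inject` before sending a request; the receiving service runs `extract` and then `child_of`, so both spans share one trace ID.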
The Four Golden Signals
Google’s SRE book recommends monitoring these four signals for any service:
1. Latency
The time it takes to serve a request.
What to measure:
- Percentiles (p50, p95, p99) — not averages
- Distinguish successful vs failed request latency
- Latency by endpoint/operation
Why percentiles matter:
Average latency can hide problems. If 99% of requests take 10ms and 1% take 10 seconds, the average is ~110ms — which hides the fact that 1 in 100 users has a terrible experience.
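The arithmetic above can be checked with a short script. This sketch uses a simple nearest-rank percentile (one of several common definitions) over a synthetic sample of 1,000 requests with the same 99%/1% split.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# 1,000 synthetic requests: 99% take 10 ms, 1% take 10 s (10,000 ms).
latencies_ms = [10.0] * 990 + [10_000.0] * 10

average_ms = mean(latencies_ms)             # about 110 ms, matches neither group
p50_ms = percentile(latencies_ms, 50)       # the typical request
p999_ms = percentile(latencies_ms, 99.9)    # the slow tail
worst_ms = max(latencies_ms)
```

With the slow group at exactly 1% of traffic, a nearest-rank p99 sits right at the edge of the fast group, so it is worth tracking p99.9 or the maximum alongside p99.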
2. Traffic
The amount of demand being placed on the system.
What to measure:
- Requests per second
- Concurrent users/connections
- Messages processed per second
- Transactions per second
3. Errors
The rate of failed requests.
What to measure:
- HTTP 5xx errors (server errors)
- HTTP 4xx errors (client errors — track separately)
- Application-specific error codes
- Timeout errors
- Failed dependency calls
4. Saturation
How “full” your service is — how close to capacity.
What to measure:
- CPU utilisation
- Memory utilisation
- Disk I/O and space
- Network bandwidth
- Connection pool usage
- Queue depth
RED and USE Methods
RED Method (for services)
- Rate — Requests per second
- Errors — Failed requests per second
- Duration — Time per request (latency distribution)
USE Method (for resources)
- Utilisation — Percentage of time the resource is busy
- Saturation — Amount of work queued
- Errors — Count of error events
Use RED for request-driven services; USE for infrastructure resources.
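As a rough illustration of the RED method, the sketch below derives the three numbers from a window of request records. The `Request` record shape and the choice to count only HTTP 5xx as failures are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float
    status: int


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Rate, errors, and duration for one window of request records."""
    failed = [r for r in requests if r.status >= 500]
    durations = sorted(r.duration_s for r in requests)
    # Middle element as a simple median; production systems usually export
    # a histogram and compute percentiles from that instead.
    median = durations[len(durations) // 2] if durations else 0.0
    return {
        "rate_per_s": len(requests) / window_s,
        "errors_per_s": len(failed) / window_s,
        "median_duration_s": median,
    }


window = [Request(0.1, 200), Request(0.2, 200), Request(0.3, 500), Request(0.4, 200)]
summary = red_summary(window, window_s=2.0)
```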
Alerting Strategies
Alert on Symptoms, Not Causes
Alert on what affects users, not on potential causes.
Bad: Alert when CPU > 80%
Good: Alert when request latency p99 > 500ms
High CPU might be fine if users aren’t affected. Focus alerts on user-facing symptoms; investigate causes during incident response.
Alert Severity
| Level | Description | Response |
|---|---|---|
| Page | User-facing impact, requires immediate action | Wake someone up |
| Ticket | Needs attention but not urgent | Address during business hours |
| Log | For investigation, no immediate action | Review in dashboards |
Only page for things that require immediate human intervention.
Avoiding Alert Fatigue
Alert fatigue is dangerous — when people receive too many alerts, they start ignoring them.
Symptoms of alert fatigue:
- Alerts are routinely ignored or snoozed
- On-call engineers feel burned out
- Real issues get lost in noise
Solutions:
- Delete alerts that never require action
- Tune thresholds to reduce false positives
- Combine related alerts
- Add context and runbook links to alerts
- Regularly review alert frequency and value
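A periodic review can start from something as simple as the sketch below, which counts how often each alert fired and how often anyone acted on it. The `(alert_name, action_taken)` record shape is hypothetical; in practice the data would come from your paging or incident tool.

```python
from collections import defaultdict


def review_alerts(history: list[tuple[str, bool]]) -> dict[str, dict[str, float]]:
    """Summarise how often each alert fired and what fraction led to action."""
    fired: dict[str, int] = defaultdict(int)
    acted: dict[str, int] = defaultdict(int)
    for name, action_taken in history:
        fired[name] += 1
        acted[name] += int(action_taken)
    return {
        name: {"fired": fired[name], "actionable_ratio": acted[name] / fired[name]}
        for name in fired
    }


def deletion_candidates(history: list[tuple[str, bool]]) -> list[str]:
    """Alerts that fired but never required action."""
    stats = review_alerts(history)
    return sorted(name for name, s in stats.items() if s["actionable_ratio"] == 0)


history = [
    ("HighCPU", False),
    ("HighCPU", False),
    ("LatencyP99", True),
    ("LatencyP99", False),
]
```

Alerts with a low but non-zero ratio are candidates for threshold tuning rather than deletion.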
SLO-Based Alerting
Alert on error budget burn rate rather than instantaneous values. Multi-window, multi-burn-rate alerting (fast burn for incidents, slow burn for chronic issues) dramatically reduces noise while catching real problems.
See SLO burn-rate alerting for the full pattern and example queries.
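For orientation, the core calculation is small. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and a multi-window rule fires only when both a long and a short window exceed the same threshold. The function names here are illustrative, and the threshold of 14.4 is only an example: it corresponds to spending 2% of a 30-day budget in one hour (0.02 × 720 hours).

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is burning."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget


def should_page(long_window_errors: float, short_window_errors: float,
                slo_target: float, threshold: float) -> bool:
    """Fire only when both windows are burning faster than the threshold."""
    # The long window shows the burn is significant; the short window shows
    # it is still happening, which lets the alert reset quickly after recovery.
    return (burn_rate(long_window_errors, slo_target) > threshold
            and burn_rate(short_window_errors, slo_target) > threshold)


# Example: 99.9% SLO, 2% of requests failing over both the last hour and the last 5 minutes.
page_now = should_page(0.02, 0.02, slo_target=0.999, threshold=14.4)
# Example: the hour looks bad, but the last 5 minutes have recovered.
page_after_recovery = should_page(0.02, 0.001, slo_target=0.999, threshold=14.4)
```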
Dashboards
Dashboard Design Principles
- Start with user impact — Top-level dashboards should show what users experience
- Progressive disclosure — High-level overview → detailed drill-down
- Consistency — Use the same colours, layouts, and conventions across dashboards
- Context — Show baselines, thresholds, and time-shifted comparisons
- Reduce cognitive load — Don’t overwhelm with too many charts
Dashboard Hierarchy
1. Executive/Service Overview
└── Shows: Availability, latency, error rate, traffic
2. Service Detail
└── Shows: All golden signals, broken down by endpoint/operation
3. Infrastructure
└── Shows: Compute, database, network, storage metrics
4. Debug/Investigation
└── Shows: Detailed metrics for troubleshooting
Common Dashboard Patterns
- Single stat panels — Current value of key metrics (big numbers)
- Time series graphs — Trends over time
- Heatmaps — Distribution of values (great for latency)
- Tables — Top-N lists, current state of multiple items
Instrumentation
What to Instrument
Application layer:
- Incoming requests (count, latency, errors)
- Outgoing requests (to databases, APIs, caches)
- Business metrics (signups, purchases, uploads)
- Queue processing (depth, processing time, failures)
Infrastructure layer:
- CPU, memory, disk, network
- Container/pod metrics
- Database connection pools
- Cache hit rates
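For the application-layer items above, a decorator is one common way to capture count, latency, and errors without touching each function body. This is a sketch with in-memory dictionaries standing in for a metrics backend; the `instrumented` decorator and the `get_user` handler are hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-ins for a metrics backend; names are illustrative.
request_count: dict[str, int] = defaultdict(int)
error_count: dict[str, int] = defaultdict(int)
latency_seconds: dict[str, list[float]] = defaultdict(list)


def instrumented(operation: str):
    """Record count, errors, and latency for every call to the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            request_count[operation] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                error_count[operation] += 1
                raise  # instrumentation must not swallow the error
            finally:
                # Runs on both success and failure, so failed calls are timed too.
                latency_seconds[operation].append(time.perf_counter() - start)
        return wrapper
    return decorator


@instrumented("get_user")
def get_user(user_id: int) -> dict:
    if user_id < 0:
        raise ValueError("invalid user id")
    return {"id": user_id}
```

The operation name is a fixed string rather than anything derived from the arguments, which keeps label cardinality bounded.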
Instrumentation Libraries
- OpenTelemetry — Vendor-neutral standard for metrics, logs, and traces
- Prometheus client libraries — For Prometheus-style metrics
- StatsD — Simple UDP-based metrics protocol
Cardinality Considerations
High-cardinality labels (user IDs, request IDs, IP addresses) can explode metric storage costs. Use labels for things with bounded, low cardinality:
- Good: status_code="200",method="GET",endpoint="/api/users"
- Bad: user_id="12345",request_id="abc-123",ip="1.2.3.4"
For high-cardinality data, use logs or traces instead of metrics.
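The reason for the explosion is multiplicative: in the worst case, one metric produces a time series for every combination of label values. A quick estimate, using made-up cardinalities purely for illustration:

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical counts of distinct values per label.
bounded = max_series({"status_code": 5, "method": 4, "endpoint": 20})
unbounded = max_series({"status_code": 5, "method": 4, "user_id": 100_000})
```

Here the bounded label set tops out at 400 series, while swapping `endpoint` for `user_id` raises the ceiling to 2,000,000.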
Log Aggregation
Centralised log aggregation is essential for distributed systems.
Pipeline:
Application → Log shipper (Fluent Bit) → Message queue (Kafka) → Processor (Logstash) → Storage (Elasticsearch) → Query interface (Kibana)
Considerations:
- Log volume can be enormous — plan for storage costs
- Implement sampling or filtering for high-volume, low-value logs
- Set retention policies based on compliance and debugging needs
- Index strategically for query performance
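Sampling can be applied at the source, before logs enter the pipeline. The sketch below uses a standard `logging.Filter` to keep every warning and error but only a fraction of lower-level records; the `SamplingFilter` class is an illustrative name, and counter-based sampling is chosen here only because it is deterministic and easy to reason about.

```python
import logging


class SamplingFilter(logging.Filter):
    """Keep every WARNING-and-above record, but only 1 in `keep_every` below that."""

    def __init__(self, keep_every: int) -> None:
        super().__init__()
        self.keep_every = keep_every
        self._seen = 0

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings or errors
        self._seen += 1
        # Keep the 1st, (keep_every + 1)th, ... low-level record.
        return (self._seen - 1) % self.keep_every == 0


handler = logging.StreamHandler()
handler.addFilter(SamplingFilter(keep_every=10))
```

Attaching the filter to the handler rather than the logger means other handlers, such as a local debug file, can still receive the full stream.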
Tools
Metrics and Dashboards
- Prometheus — Open-source metrics and alerting
- Grafana — Visualisation and dashboards
- Datadog — Full-stack monitoring platform
- New Relic — Application performance monitoring
- CloudWatch (AWS), Cloud Monitoring (GCP), Azure Monitor
Logging
- Elasticsearch + Logstash + Kibana (ELK) — Open-source log aggregation
- Loki — Log aggregation by Grafana (Prometheus-style)
- Splunk — Enterprise log management
- Papertrail, Logtail
Tracing
- Jaeger — Open-source distributed tracing
- Zipkin — Open-source distributed tracing
- Tempo — Distributed tracing backend by Grafana
All-in-One Observability
- OpenTelemetry — Vendor-neutral instrumentation standard
- Honeycomb — Observability for distributed systems
- Lightstep — Observability platform
- Grafana Cloud — Managed Grafana, Prometheus, Loki, Tempo
Observability Culture
- Instrument by default — Make observability part of the development process
- Dashboards in code — Version control your dashboards alongside your services
- On-call reviews — Regularly review what alerts fired and why
- Observability as a feature — Budget time for improving observability
- Share knowledge — Document what metrics mean and how to interpret them