Monitoring and alerting provide visibility into your systems’ health and behaviour. Good observability practices help you detect problems early, diagnose issues quickly, and understand system performance over time.
The Three Pillars of Observability
Logs
Discrete events with structured or unstructured text describing what happened.
Characteristics:
- High cardinality — can capture any detail
- Immutable records of events
- Essential for debugging and forensics
- Can be expensive to store and query at scale
Best practices:
- Use structured logging (JSON) for machine-parseable logs
- Include correlation IDs to trace requests across services
- Log at appropriate levels (DEBUG, INFO, WARN, ERROR)
- Avoid logging sensitive data (PII, credentials)
- Implement log rotation and retention policies
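To make these practices concrete, here is a minimal Python sketch using only the standard `json` and `logging` modules; the formatter and field names are illustrative, chosen to match the example log line below.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single machine-parseable JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payment-api",
            "message": record.getMessage(),
        }
        # Merge structured fields passed via `extra=` (trace_id, user_id, ...).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The correlation/trace ID lets this event be joined with logs from other services.
logger.error(
    "Payment failed",
    extra={"fields": {"trace_id": "abc123", "error": "Card declined",
                      "user_id": "user_456", "amount": 99.99}},
)
```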
Example structured log:
```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123",
  "message": "Payment failed",
  "error": "Card declined",
  "user_id": "user_456",
  "amount": 99.99
}
```
Metrics
Numeric measurements aggregated over time, describing system state.
Characteristics:
- Highly efficient to store and query
- Good for dashboards and alerts
- Limited cardinality (by design)
- Show trends and patterns
Metric types:
- Counter — Monotonically increasing value (requests, errors, bytes sent)
- Gauge — Current value that can go up or down (temperature, queue depth, active connections)
- Histogram — Distribution of values (latency percentiles, request sizes)
- Summary — Quantiles pre-calculated on the client (similar to a histogram, but the quantiles cannot be re-aggregated across instances)
Naming conventions:
```
# Format: <namespace>_<name>_<unit>
http_requests_total
http_request_duration_seconds
process_memory_bytes
queue_depth_messages
```
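A brief sketch of these metric types using the Prometheus Python client (assuming the `prometheus_client` package is available); the metric names follow the conventions above, while the labels and bucket boundaries are illustrative.

```python
from prometheus_client import Counter, Gauge, Histogram

# Counter: monotonically increasing; only resets when the process restarts.
REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["method", "endpoint", "status_code"])
REQUESTS.labels(method="GET", endpoint="/api/users", status_code="200").inc()

# Gauge: a current value that can go up or down.
QUEUE_DEPTH = Gauge("queue_depth_messages", "Messages waiting in the work queue")
QUEUE_DEPTH.set(42)

# Histogram: observations are bucketed so percentiles (p50/p95/p99) can be
# derived at query time, e.g. with PromQL's histogram_quantile().
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds",
                    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5))
LATENCY.observe(0.087)

# Note: the Python client's Summary tracks only count and sum; it does not
# export pre-calculated quantiles.
```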
Traces
Records of requests as they flow through distributed systems.
Characteristics:
- Show the path of a request across services
- Reveal latency contributions from each component
- Essential for debugging distributed systems
- Require instrumentation across all services
Key concepts:
- Trace — The complete journey of a request
- Span — A single operation within a trace
- Context propagation — Passing trace IDs between services
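A rough sketch of these concepts with the OpenTelemetry Python SDK (assuming the `opentelemetry-api` and `opentelemetry-sdk` packages); the service name, span names, and console exporter are placeholders.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints finished spans to stdout (swap in an OTLP
# exporter to ship them to Jaeger, Tempo, etc.).
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

# One trace = the whole request; each operation inside it is a span.
with tracer.start_as_current_span("handle_checkout") as parent:
    parent.set_attribute("user_id", "user_456")
    with tracer.start_as_current_span("charge_card") as child:
        child.set_attribute("amount", 99.99)
        # In a real system, the current trace context would be injected into
        # outgoing request headers here so downstream services continue the trace.
```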
The Four Golden Signals
Google’s SRE book recommends monitoring these four signals for any service:
1. Latency
The time it takes to serve a request.
What to measure:
- Percentiles (p50, p95, p99) — not averages
- Distinguish successful vs failed request latency
- Latency by endpoint/operation
Why percentiles matter:
Average latency can hide problems. If 99% of requests take 10ms and 1% take 10 seconds, the average is ~110ms — which hides the fact that 1 in 100 users has a terrible experience.
2. Traffic
The amount of demand being placed on the system.
What to measure:
- Requests per second
- Concurrent users/connections
- Messages processed per second
- Transactions per second
3. Errors
The rate of failed requests.
What to measure:
- HTTP 5xx errors (server errors)
- HTTP 4xx errors (client errors — track separately)
- Application-specific error codes
- Timeout errors
- Failed dependency calls
4. Saturation
How “full” your service is — how close to capacity.
What to measure:
- CPU utilisation
- Memory utilisation
- Disk I/O and space
- Network bandwidth
- Connection pool usage
- Queue depth
RED and USE Methods
RED Method (for services)
- Rate — Requests per second
- Errors — Failed requests per second
- Duration — Time per request (latency distribution)
USE Method (for resources)
- Utilisation — Percentage of resource busy
- Saturation — Amount of work queued
- Errors — Count of error events
Use RED for request-driven services; USE for infrastructure resources.
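Applied to a single handler, the RED method reduces to two counters and a histogram. A hedged sketch, reusing the Prometheus client from the earlier example; the metric names and the process() stub are illustrative.

```python
import time
from prometheus_client import Counter, Histogram

REQS = Counter("payments_requests_total", "Requests handled")             # Rate
ERRS = Counter("payments_request_errors_total", "Requests that failed")   # Errors
DUR = Histogram("payments_request_duration_seconds", "Request latency")   # Duration

def process(request):
    ...  # hypothetical business logic

def handle(request):
    start = time.perf_counter()
    REQS.inc()
    try:
        return process(request)
    except Exception:
        ERRS.inc()
        raise
    finally:
        DUR.observe(time.perf_counter() - start)
```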
Alerting Strategies
Alert on Symptoms, Not Causes
Alert on what affects users, not on potential causes.
Bad: Alert when CPU > 80%
Good: Alert when request latency p99 > 500ms
High CPU might be fine if users aren’t affected. Focus alerts on user-facing symptoms; investigate causes during incident response.
Alert Severity
| Level | Description | Response |
|---|---|---|
| Page | User-facing impact, requires immediate action | Wake someone up |
| Ticket | Needs attention but not urgent | Address during business hours |
| Log | For investigation, no immediate action | Review in dashboards |
Only page for things that require immediate human intervention.
Avoiding Alert Fatigue
Alert fatigue is dangerous — when people receive too many alerts, they start ignoring them.
Symptoms of alert fatigue:
- Alerts are routinely ignored or snoozed
- On-call engineers feel burned out
- Real issues get lost in noise
Solutions:
- Delete alerts that never require action
- Tune thresholds to reduce false positives
- Combine related alerts
- Add context and runbook links to alerts
- Regularly review alert frequency and value
SLO-Based Alerting
Alert based on error budget burn rate rather than instantaneous values.
Multi-window, multi-burn-rate alerting:
- Fast burn: Alert when the error budget is being consumed at roughly 14x the sustainable rate over a short window (catches major incidents quickly)
- Slow burn: Alert when it is being consumed at around 1x the sustainable rate over a long window (catches chronic issues)
This approach dramatically reduces alert noise while catching real problems.
Example:
```
# Fast burn: short-window error rate exceeds 14x the budget rate (0.1% for a 99.9% SLO)
rate(errors[5m]) / rate(requests[5m]) > 14 * (1 - 0.999)

# Slow burn: long-window error rate exceeds the budget rate (chronic, low-grade errors)
rate(errors[6h]) / rate(requests[6h]) > 1 * (1 - 0.999)
```
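The arithmetic behind those thresholds, sketched in Python: burn rate is simply the observed error rate divided by the error budget rate implied by the SLO. The 99.9% target and 1.5% error rate below are illustrative.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'sustainable' the error budget is being spent.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window;
    14x spends it roughly 14 times faster.
    """
    budget_rate = 1.0 - slo          # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget_rate

# Example: 1.5% of requests failing against a 99.9% SLO.
print(burn_rate(0.015, 0.999))       # 15.0 -> trips the fast-burn alert (> 14)
```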
See Reliability Metrics for more on SLOs and error budgets.
Dashboards
Dashboard Design Principles
- Start with user impact — Top-level dashboards should show what users experience
- Progressive disclosure — High-level overview → detailed drill-down
- Consistency — Use the same colours, layouts, and conventions across dashboards
- Context — Show baselines, thresholds, and time-shifted comparisons
- Reduce cognitive load — Don’t overwhelm with too many charts
Dashboard Hierarchy
```
1. Executive/Service Overview
   └── Shows: Availability, latency, error rate, traffic
2. Service Detail
   └── Shows: All golden signals, broken down by endpoint/operation
3. Infrastructure
   └── Shows: Compute, database, network, storage metrics
4. Debug/Investigation
   └── Shows: Detailed metrics for troubleshooting
```
Common Dashboard Patterns
Single stat panels — Current value of key metrics (big numbers)
Time series graphs — Trends over time
Heatmaps — Distribution of values (great for latency)
Tables — Top-N lists, current state of multiple items
Instrumentation
What to Instrument
Application layer:
- Incoming requests (count, latency, errors)
- Outgoing requests (to databases, APIs, caches)
- Business metrics (signups, purchases, uploads)
- Queue processing (depth, processing time, failures)
Infrastructure layer:
- CPU, memory, disk, network
- Container/pod metrics
- Database connection pools
- Cache hit rates
Instrumentation Libraries
- OpenTelemetry — Vendor-neutral standard for metrics, logs, and traces
- Prometheus client libraries — For Prometheus-style metrics
- StatsD — Simple UDP-based metrics protocol
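To show how simple the StatsD wire protocol is, here is a sketch that emits counter, gauge, and timer lines over UDP using only the standard library; the address and metric names are assumptions.

```python
import socket

# StatsD line format: <name>:<value>|<type>, where type is c (counter),
# g (gauge), or ms (timer), with an optional |@<rate> sample-rate suffix.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
STATSD_ADDR = ("127.0.0.1", 8125)    # default StatsD port; adjust as needed

def send(line: str) -> None:
    sock.sendto(line.encode("ascii"), STATSD_ADDR)

send("http_requests_total:1|c")            # increment a counter
send("queue_depth_messages:42|g")          # set a gauge
send("http_request_duration_ms:231|ms")    # record a timing
send("cache_lookups:1|c|@0.1")             # counter sampled at 10%
```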
Cardinality Considerations
High-cardinality labels (user IDs, request IDs, IP addresses) can explode metric storage costs. Use labels for things with bounded, low cardinality:
- Good: `status_code="200",method="GET",endpoint="/api/users"`
- Bad: `user_id="12345",request_id="abc-123",ip="1.2.3.4"`
For high-cardinality data, use logs or traces instead of metrics.
Log Aggregation
Centralised log aggregation is essential for distributed systems.
Pipeline:
Application → Log shipper (Fluent Bit) → Message queue (Kafka) → Processor (Logstash) → Storage (Elasticsearch) → Query interface (Kibana)
Considerations:
- Log volume can be enormous — plan for storage costs
- Implement sampling or filtering for high-volume, low-value logs (see the sketch after this list)
- Set retention policies based on compliance and debugging needs
- Index strategically for query performance
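One possible shape for the sampling mentioned above, sketched with the standard `logging` module; the 1% sample rate and logger name are assumptions.

```python
import logging
import random

class SampleDebugLogs(logging.Filter):
    """Drop all but a random sample of DEBUG records; pass everything else."""

    def __init__(self, sample_rate: float = 0.01) -> None:
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True                       # keep INFO and above untouched
        return random.random() < self.sample_rate

logger = logging.getLogger("ingest-worker")   # hypothetical high-volume service
logger.addFilter(SampleDebugLogs(sample_rate=0.01))
```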
Tools
Metrics and Dashboards
- Prometheus — Open-source metrics and alerting
- Grafana — Visualisation and dashboards
- Datadog — Full-stack monitoring platform
- New Relic — Application performance monitoring
- CloudWatch (AWS), Cloud Monitoring (GCP), Azure Monitor
Logging
- Elasticsearch + Kibana (ELK) — Open-source log aggregation
- Loki — Log aggregation by Grafana (Prometheus-style)
- Splunk — Enterprise log management
- Papertrail, Logtail
Tracing
- Jaeger — Open-source distributed tracing
- Zipkin — Open-source distributed tracing
- Tempo — Distributed tracing backend by Grafana
All-in-One Observability
- OpenTelemetry — Vendor-neutral instrumentation standard
- Honeycomb — Observability for distributed systems
- Lightstep — Observability platform
- Grafana Cloud — Managed Grafana, Prometheus, Loki, Tempo
Observability Culture
- Instrument by default — Make observability part of the development process
- Dashboards in code — Version control your dashboards alongside your services
- On-call reviews — Regularly review what alerts fired and why
- Observability as a feature — Budget time for improving observability
- Share knowledge — Document what metrics mean and how to interpret them