Monitoring and alerting provide visibility into your systems’ health and behaviour. Good observability practices help you detect problems early, diagnose issues quickly, and understand system performance over time.

The Three Pillars of Observability

Logs

Discrete events with structured or unstructured text describing what happened.

Characteristics:

  • High cardinality — can capture any detail
  • Immutable records of events
  • Essential for debugging and forensics
  • Can be expensive to store and query at scale

Best practices:

  • Use structured logging (JSON) for machine-parseable logs
  • Include correlation IDs to trace requests across services
  • Log at appropriate levels (DEBUG, INFO, WARN, ERROR)
  • Avoid logging sensitive data (PII, credentials)
  • Implement log rotation and retention policies

Example structured log:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123",
  "message": "Payment failed",
  "error": "Card declined",
  "user_id": "user_456",
  "amount": 99.99
}
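
One way to produce such a line from application code, sketched with only Python's standard json and logging modules (the JsonFormatter class and the hard-coded "payment-api" service name are illustrative, not a prescribed library):

import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line (fields follow the example above)."""
    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "payment-api",            # illustrative service name
            "message": record.getMessage(),
        }
        # Anything passed via logging's extra= appears as attributes on the record
        for key in ("trace_id", "error", "user_id", "amount"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment failed", extra={
    "trace_id": "abc123", "error": "Card declined",
    "user_id": "user_456", "amount": 99.99,
})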

Metrics

Numeric measurements aggregated over time, describing system state.

Characteristics:

  • Highly efficient to store and query
  • Good for dashboards and alerts
  • Limited cardinality (by design)
  • Show trends and patterns

Metric types:

  • Counter — Monotonically increasing value (requests, errors, bytes sent)
  • Gauge — Current value that can go up or down (temperature, queue depth, active connections)
  • Histogram — Distribution of values (latency percentiles, request sizes)
  • Summary — Pre-calculated quantiles (similar to histogram, different trade-offs)

Naming conventions:

# Format: <namespace>_<name>_<unit>
http_requests_total
http_request_duration_seconds
process_memory_bytes
queue_depth_messages
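
As a sketch of how the four types look in practice, here they are with the Prometheus Python client (assumes the prometheus_client package; metric and label names follow the convention above but are otherwise illustrative):

from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: monotonically increasing
REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["method", "endpoint", "status_code"])

# Gauge: current value that can go up or down
QUEUE_DEPTH = Gauge("queue_depth_messages", "Messages currently waiting in the queue")

# Histogram: bucketed observations, enabling percentile queries on the server
LATENCY = Histogram("http_request_duration_seconds", "HTTP request latency in seconds")

# Summary: count and sum of observations (quantile support varies by client)
REQUEST_SIZE = Summary("http_request_size_bytes", "HTTP request payload size in bytes")

def handle_request():
    with LATENCY.time():   # records the elapsed time as one observation on exit
        QUEUE_DEPTH.set(42)
        REQUEST_SIZE.observe(512)
        REQUESTS.labels(method="GET", endpoint="/api/users", status_code="200").inc()

start_http_server(8000)    # exposes the metrics at :8000/metrics for scraping
handle_request()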

Traces

Records of requests as they flow through distributed systems.

Characteristics:

  • Show the path of a request across services
  • Reveal latency contributions from each component
  • Essential for debugging distributed systems
  • Require instrumentation across all services

Key concepts:

  • Trace — The complete journey of a request
  • Span — A single operation within a trace
  • Context propagation — Passing trace IDs between services
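
A minimal sketch of these three concepts with the OpenTelemetry Python SDK (assumes the opentelemetry-api and opentelemetry-sdk packages; the span names and headers dict are illustrative):

from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints finished spans to stdout
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("handle_checkout"):      # a span within the trace
    with tracer.start_as_current_span("charge_card"):      # child span: one operation
        headers = {}
        inject(headers)  # context propagation: adds a traceparent header for the next service
        # e.g. requests.post(payment_url, headers=headers) would carry the trace ID downstream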

The Four Golden Signals

Google’s SRE book recommends monitoring these four signals for any service:

1. Latency

The time it takes to serve a request.

What to measure:

  • Percentiles (p50, p95, p99) — not averages
  • Distinguish successful vs failed request latency
  • Latency by endpoint/operation

Why percentiles matter:
Average latency can hide problems. If 99% of requests take 10ms and 1% take 10 seconds, the average is ~110ms — which hides the fact that 1 in 100 users has a terrible experience.
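
The arithmetic is easy to reproduce; a small standard-library Python illustration using the same 10 ms / 10 s figures:

# 99 requests at 10 ms and 1 request at 10 s, as in the example above
latencies_ms = [10] * 99 + [10_000]
ordered = sorted(latencies_ms)

mean = sum(ordered) / len(ordered)
p50 = ordered[len(ordered) // 2]
p99 = ordered[int(len(ordered) * 0.99)]

print(mean)   # 109.9  -> looks healthy
print(p50)    # 10     -> the typical request is fast
print(p99)    # 10000  -> 1 in 100 users waits ten seconds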

2. Traffic

The amount of demand being placed on the system.

What to measure:

  • Requests per second
  • Concurrent users/connections
  • Messages processed per second
  • Transactions per second

3. Errors

The rate of failed requests.

What to measure:

  • HTTP 5xx errors (server errors)
  • HTTP 4xx errors (client errors — track separately)
  • Application-specific error codes
  • Timeout errors
  • Failed dependency calls

4. Saturation

How “full” your service is — how close to capacity.

What to measure:

  • CPU utilisation
  • Memory utilisation
  • Disk I/O and space
  • Network bandwidth
  • Connection pool usage
  • Queue depth

RED and USE Methods

RED Method (for services)

  • Rate — Requests per second
  • Errors — Failed requests per second
  • Duration — Time per request (latency distribution)

USE Method (for resources)

  • Utilisation — Percentage of resource busy
  • Saturation — Amount of work queued
  • Errors — Count of error events

Use RED for request-driven services; USE for infrastructure resources.
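
As a sketch, RED for a single operation needs only a counter and a histogram; the decorator below (again assuming prometheus_client; the metric names are illustrative) records rate, errors, and duration around any handler:

import functools
import time

from prometheus_client import Counter, Histogram

REQUESTS = Counter("myservice_requests_total", "Requests handled",
                   ["operation", "outcome"])
DURATION = Histogram("myservice_request_duration_seconds", "Request latency in seconds",
                     ["operation"])

def observe_red(operation):
    """Record Rate (request count), Errors (outcome label), and Duration for one operation."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                REQUESTS.labels(operation=operation, outcome="success").inc()
                return result
            except Exception:
                REQUESTS.labels(operation=operation, outcome="error").inc()
                raise
            finally:
                DURATION.labels(operation=operation).observe(time.monotonic() - start)
        return wrapper
    return decorator

@observe_red("get_user")
def get_user(user_id):
    ...   # handler body goes here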

Alerting Strategies

Alert on Symptoms, Not Causes

Alert on what affects users, not on potential causes.

Bad: Alert when CPU > 80%
Good: Alert when request latency p99 > 500ms

High CPU might be fine if users aren’t affected. Focus alerts on user-facing symptoms; investigate causes during incident response.

Alert Severity

Level     Description                                      Response
Page      User-facing impact, requires immediate action    Wake someone up
Ticket    Needs attention but not urgent                   Address during business hours
Log       For investigation, no immediate action           Review in dashboards

Only page for things that require immediate human intervention.

Avoiding Alert Fatigue

Alert fatigue is dangerous — when people receive too many alerts, they start ignoring them.

Symptoms of alert fatigue:

  • Alerts are routinely ignored or snoozed
  • On-call engineers feel burned out
  • Real issues get lost in noise

Solutions:

  • Delete alerts that never require action
  • Tune thresholds to reduce false positives
  • Combine related alerts
  • Add context and runbook links to alerts
  • Regularly review alert frequency and value

SLO-Based Alerting

Alert based on error budget burn rate rather than instantaneous values.

Multi-window, multi-burn-rate alerting:

  • Fast burn: alert when the error budget is being consumed at roughly 14x the sustainable rate over a short window (catches major incidents quickly)
  • Slow burn: alert when consumption stays at around the sustainable rate (1x) over a long window (catches chronic, low-grade issues)

This approach dramatically reduces alert noise while catching real problems.

Example (Prometheus-style expressions; errors and requests are assumed to be request counters):

# Fast burn: error rate sustained at ~14x the budget rate for a 99.9% SLO
# (at this pace a 30-day error budget lasts only about two days)
rate(errors[5m]) / rate(requests[5m]) > 14 * (1 - 0.999)

# Slow burn: error rate at roughly the budget rate itself, over a 6-hour window
# (catches chronic issues that would quietly exhaust the budget over the month)
rate(errors[6h]) / rate(requests[6h]) > 1 * (1 - 0.999)
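
The arithmetic behind these thresholds is straightforward; a small Python sketch, assuming a 30-day window and the 99.9% SLO used above:

def burn_rate(observed_error_rate, slo=0.999):
    """How many times faster than allowed the error budget is being spent."""
    return observed_error_rate / (1 - slo)

def days_until_budget_exhausted(observed_error_rate, slo=0.999, window_days=30):
    return window_days / burn_rate(observed_error_rate, slo)

# A sustained 1.4% error rate against a 99.9% SLO burns the budget at ~14x:
print(round(burn_rate(0.014)))                        # 14
print(round(days_until_budget_exhausted(0.014), 1))   # ~2.1 days until a 30-day budget is gone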

See Reliability Metrics for more on SLOs and error budgets.

Dashboards

Dashboard Design Principles

  • Start with user impact — Top-level dashboards should show what users experience
  • Progressive disclosure — High-level overview → detailed drill-down
  • Consistency — Use the same colours, layouts, and conventions across dashboards
  • Context — Show baselines, thresholds, and time-shifted comparisons
  • Reduce cognitive load — Don’t overwhelm with too many charts

Dashboard Hierarchy

1. Executive/Service Overview
   └── Shows: Availability, latency, error rate, traffic
   
2. Service Detail
   └── Shows: All golden signals, broken down by endpoint/operation
   
3. Infrastructure
   └── Shows: Compute, database, network, storage metrics
   
4. Debug/Investigation
   └── Shows: Detailed metrics for troubleshooting

Common Dashboard Patterns

Single stat panels — Current value of key metrics (big numbers)
Time series graphs — Trends over time
Heatmaps — Distribution of values (great for latency)
Tables — Top-N lists, current state of multiple items

Instrumentation

What to Instrument

Application layer:

  • Incoming requests (count, latency, errors)
  • Outgoing requests (to databases, APIs, caches)
  • Business metrics (signups, purchases, uploads)
  • Queue processing (depth, processing time, failures)

Infrastructure layer:

  • CPU, memory, disk, network
  • Container/pod metrics
  • Database connection pools
  • Cache hit rates

Instrumentation Libraries

  • OpenTelemetry — Vendor-neutral standard for metrics, logs, and traces
  • Prometheus client libraries — For Prometheus-style metrics
  • StatsD — Simple UDP-based metrics protocol

Cardinality Considerations

High-cardinality labels (user IDs, request IDs, IP addresses) can explode metric storage costs. Use labels for things with bounded, low cardinality:

  • Good: status_code="200", method="GET", endpoint="/api/users"
  • Bad: user_id="12345", request_id="abc-123", ip="1.2.3.4"

For high-cardinality data, use logs or traces instead of metrics.
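
A rough back-of-the-envelope sketch (the label counts are illustrative): every unique label combination creates its own time series, so the series count is the product of each label's cardinality.

# Bounded labels: a few hundred series per metric
methods, endpoints, status_codes = 5, 30, 6
print(methods * endpoints * status_codes)               # 900 series

# One unbounded label (say 1,000,000 distinct user_ids) multiplies everything
user_ids = 1_000_000
print(methods * endpoints * status_codes * user_ids)    # 900,000,000 series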

Log Aggregation

Centralised log aggregation is essential for distributed systems.

Pipeline:

Application → Log shipper (e.g. Fluent Bit)
            → Message queue (e.g. Kafka)
            → Processor (e.g. Logstash)
            → Storage (e.g. Elasticsearch)
            → Query interface (e.g. Kibana)

Considerations:

  • Log volume can be enormous — plan for storage costs
  • Implement sampling or filtering for high-volume, low-value logs
  • Set retention policies based on compliance and debugging needs
  • Index strategically for query performance

Tools

Metrics and Dashboards

  • Prometheus: open-source time-series metrics collection and alerting
  • Grafana: dashboards and visualisation across many data sources

Logging

  • Elasticsearch: log storage, indexing, and search
  • Kibana: log query and visualisation interface
  • Fluent Bit / Logstash: log shipping and processing

Tracing

  • Jaeger — Open-source distributed tracing
  • Zipkin — Open-source distributed tracing
  • Tempo — Distributed tracing backend by Grafana

All-in-One Observability

  • Datadog: commercial platform covering metrics, logs, and traces
  • New Relic: commercial platform covering metrics, logs, and traces
  • Dynatrace: commercial platform covering metrics, logs, and traces

Observability Culture

  • Instrument by default — Make observability part of the development process
  • Dashboards in code — Version control your dashboards alongside your services
  • On-call reviews — Regularly review what alerts fired and why
  • Observability as a feature — Budget time for improving observability
  • Share knowledge — Document what metrics mean and how to interpret them