Observability

Monitoring and alerting provide visibility into your systems’ health and behaviour. Good observability practices help you detect problems early, diagnose issues quickly, and understand system performance over time.

The Three Pillars of Observability

Logs

Discrete events with structured or unstructured text describing what happened.

Characteristics:

High cardinality — can capture any detail
Immutable records of events
Essential for debugging and forensics
Can be expensive to store and query at scale

Best practices:

Use structured logging (JSON) for machine-parseable logs
Include correlation IDs to trace requests across services
Log at appropriate levels (DEBUG, INFO, WARN, ERROR)
Avoid logging sensitive data (PII, credentials)
Implement log rotation and retention policies

Example structured log:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123",
  "message": "Payment failed",
  "error": "Card declined",
  "user_id": "user_456",
  "amount": 99.99
}
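One way to emit entries shaped like the example above is a small JSON formatter on top of Python's standard logging module. This is a minimal sketch, not a definitive setup: the JsonFormatter class, the build_logger helper, and the "fields" key passed through extra are names invented for this illustration.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the entry.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service, stream):
    """Hypothetical helper: a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stands in for stdout or a log file
log = build_logger("payment-api", buffer)
log.error(
    "Payment failed",
    extra={"fields": {"trace_id": "abc123", "error": "Card declined"}},
)
entry = json.loads(buffer.getvalue())
```

Passing the correlation ID as a field on every call keeps it machine-searchable, which is what makes cross-service tracing through logs practical.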

Metrics

Numeric measurements aggregated over time, describing system state.

Characteristics:

Highly efficient to store and query
Good for dashboards and alerts
Limited cardinality (by design)
Show trends and patterns

Metric types:

Counter — Monotonically increasing value (requests, errors, bytes sent)
Gauge — Current value that can go up or down (temperature, queue depth, active connections)
Histogram — Distribution of values (latency percentiles, request sizes)
Summary — Quantiles pre-calculated in the client (cheap to query, but cannot be aggregated across instances the way histogram buckets can)
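To make the differences between the first three types concrete, here is a minimal in-memory sketch. It is not the API of any real metrics library; the class and method names are assumptions chosen for the illustration, and the bucket bounds are arbitrary.

```python
import bisect


class Counter:
    """Monotonically increasing value."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Current value that can move in either direction."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations in buckets with fixed upper bounds."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.bucket_counts = [0] * (len(self.bounds) + 1)  # last bucket is +Inf
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left finds the first bound >= value, the smallest bucket that fits.
        self.bucket_counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value
        self.count += 1

    def cumulative(self):
        """Running totals per bucket, the form percentile estimates are made from."""
        running, out = 0, []
        for bucket_count in self.bucket_counts:
            running += bucket_count
            out.append(running)
        return out


requests_total = Counter()
requests_total.inc()
requests_total.inc()

queue_depth = Gauge()
queue_depth.set(5)
queue_depth.set(3)

latency_seconds = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency_seconds.observe(seconds)
```

The histogram stores only bucket counts, never individual observations, which is why its storage cost stays flat however many requests are observed.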

Naming conventions:

# Format: <namespace>_<name>_<unit> (counters end in _total instead of a unit)
http_requests_total
http_request_duration_seconds
process_memory_bytes
queue_depth_messages
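A convention like this can be checked automatically, for example in code review tooling. The sketch below is only an illustration: the suffix allow-list and the is_conventional function are assumptions, and a real list would need to match your own units.

```python
import re

# Hypothetical allow-list of suffixes; extend it to match your own conventions.
KNOWN_SUFFIXES = ("total", "seconds", "bytes", "messages")
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")


def is_conventional(name):
    """True for lower snake_case names ending in a known unit or `total`."""
    suffix = name.rsplit("_", 1)[-1]
    return bool(NAME_PATTERN.match(name)) and suffix in KNOWN_SUFFIXES


checked = {
    name: is_conventional(name)
    for name in (
        "http_requests_total",
        "http_request_duration_seconds",
        "HttpRequests",
        "http_request_duration_ms",
    )
}
```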

Traces

Records of requests as they flow through distributed systems.

Characteristics:

Show the path of a request across services
Reveal latency contributions from each component
Essential for debugging distributed systems
Require instrumentation across all services

Key concepts:

Trace — The complete journey of a request
Span — A single operation within a trace
Context propagation — Passing trace IDs between services
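Context propagation can be sketched with the W3C Trace Context header format, where a traceparent value has the shape version-traceid-spanid-flags. The helpers below are hypothetical and skip validation; in practice an instrumentation library would normally do this for you.

```python
import secrets


def new_traceparent():
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming):
    """Keep the caller's trace ID, but mint a new span ID for this hop."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def handle_request(headers):
    """Hypothetical service handler: continue the trace or start a new one."""
    incoming = headers.get("traceparent")
    current = child_traceparent(incoming) if incoming else new_traceparent()
    # Outgoing calls carry the updated header so the next service joins the same trace.
    return {"traceparent": current}


edge_headers = handle_request({})                  # first service: no incoming context
downstream_headers = handle_request(edge_headers)  # second service: continues the trace

edge_parts = edge_headers["traceparent"].split("-")
downstream_parts = downstream_headers["traceparent"].split("-")
```

Both hops share one trace ID while each has its own span ID, which is what lets a tracing backend reassemble the request path.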

The Four Golden Signals

Google’s SRE book recommends monitoring these four signals for any service:

1. Latency

The time it takes to serve a request.

What to measure:

Percentiles (p50, p95, p99) — not averages
Distinguish successful vs failed request latency
Latency by endpoint/operation

Why percentiles matter:
Average latency can hide problems. If 99% of requests take 10ms and 1% take 10 seconds, the average is about 110ms (0.99 × 10ms + 0.01 × 10,000ms), which hides the fact that 1 in 100 users has a terrible experience.
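The arithmetic in the example above can be reproduced with a few lines of Python. This sketch uses the nearest-rank percentile definition, which is one of several in common use, and a made-up sample of 1,000 requests. Note that with exactly 1% slow requests the slow tail starts just past the p99 rank under this definition, so the sketch also computes p99.9.

```python
def nearest_rank(sorted_values, numerator, denominator):
    """Nearest-rank percentile; the percentile is numerator/denominator (e.g. 99/100)."""
    n = len(sorted_values)
    # Integer ceiling of n * numerator / denominator, avoiding float rounding.
    rank = max(1, (n * numerator + denominator - 1) // denominator)
    return sorted_values[rank - 1]


# Hypothetical sample: 99% of requests take 10ms, 1% take 10 seconds.
latencies_ms = sorted([10] * 990 + [10_000] * 10)

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = nearest_rank(latencies_ms, 50, 100)
p99_ms = nearest_rank(latencies_ms, 99, 100)
p999_ms = nearest_rank(latencies_ms, 999, 1000)
```

The mean of roughly 110ms describes no real request: typical requests take 10ms and the slow ones take 10 seconds. Tracking more than one high percentile (or the maximum) guards against a tail that sits right at a percentile boundary.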

2. Traffic

The amount of demand being placed on the system.

What to measure:

Requests per second
Concurrent users/connections
Messages processed per second
Transactions per second

3. Errors

The rate of failed requests.

What to measure:

HTTP 5xx errors (server errors)
HTTP 4xx errors (client errors — track separately)
Application-specific error codes
Timeout errors
Failed dependency calls

4. Saturation

How “full” your service is — how close to capacity.

What to measure:

CPU utilisation
Memory utilisation
Disk I/O and space
Network bandwidth
Connection pool usage
Queue depth
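As a rough illustration of how the four signals relate to raw data, the sketch below summarises one time window of request records. The Request type, the golden_signals function, and the choice of connection pool usage as the saturation measure are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def golden_signals(requests, window_seconds, pool_in_use, pool_size):
    """Summarise one time window of request records into the four signals."""
    ok = sorted(r.duration_ms for r in requests if r.status < 500)
    failed = [r for r in requests if r.status >= 500]
    # Nearest-rank p99 over successful requests only, so fast failures do not flatter it.
    p99 = ok[max(0, -(-len(ok) * 99 // 100) - 1)] if ok else None
    return {
        "latency_p99_ms": p99,
        "traffic_rps": len(requests) / window_seconds,
        "error_ratio": len(failed) / len(requests) if requests else 0.0,
        "saturation": pool_in_use / pool_size,
    }


# Hypothetical two-second window with three successes and one server error.
window = [Request(20, 200), Request(30, 200), Request(40, 200), Request(5, 503)]
signals = golden_signals(window, window_seconds=2, pool_in_use=8, pool_size=10)
```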

RED and USE Methods

RED Method (for services)

Rate — Requests per second
Errors — Failed requests per second
Duration — Time per request (latency distribution)

USE Method (for resources)

Utilisation — Percentage of resource busy
Saturation — Amount of work queued
Errors — Count of error events

Use RED for request-driven services; USE for infrastructure resources.

Alerting Strategies

Alert on Symptoms, Not Causes

Alert on what affects users, not on potential causes.

Bad: Alert when CPU > 80%
Good: Alert when request latency p99 > 500ms

High CPU might be fine if users aren’t affected. Focus alerts on user-facing symptoms; investigate causes during incident response.

Alert Severity

Level	Description	Response
Page	User-facing impact, requires immediate action	Wake someone up
Ticket	Needs attention but not urgent	Address during business hours
Log	For investigation, no immediate action	Review in dashboards

Only page for things that require immediate human intervention.
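The severity table can be encoded directly so that routing decisions are explicit and reviewable. The following is a minimal sketch; the classify function, its two boolean inputs, and the sample alerts are hypothetical.

```python
from enum import Enum


class Severity(Enum):
    PAGE = "Wake someone up"
    TICKET = "Address during business hours"
    LOG = "Review in dashboards"


def classify(user_facing_impact, needs_attention):
    """Pick a severity; only user-facing impact justifies a page."""
    if user_facing_impact:
        return Severity.PAGE
    if needs_attention:
        return Severity.TICKET
    return Severity.LOG


# Hypothetical alerts: (name, user-facing impact?, needs attention?)
alerts = [
    ("checkout p99 latency > 500ms", True, True),
    ("disk 70% full", False, True),
    ("cache eviction spike", False, False),
]
routed = {name: classify(impact, attention) for name, impact, attention in alerts}
```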

Avoiding Alert Fatigue

Alert fatigue is dangerous — when people receive too many alerts, they start ignoring them.

Symptoms of alert fatigue:

Alerts are routinely ignored or snoozed
On-call engineers feel burned out
Real issues get lost in noise

Solutions:

Delete alerts that never require action
Tune thresholds to reduce false positives
Combine related alerts
Add context and runbook links to alerts
Regularly review alert frequency and value

SLO-Based Alerting

Alert based on error budget burn rate rather than instantaneous values.

Multi-window, multi-burn-rate alerting:

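Fast burn: Alert when errors consume the budget at 14x the sustainable rate over a short window (catches major incidents quickly)
Slow burn: Alert when errors consume the budget at 1x or more over a longer window (catches chronic issues)

A burn rate of 1x spends exactly the whole budget over the SLO period. Because these alerts fire only when the budget is actually at risk, brief blips stay quiet while real problems still get caught.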

Example:

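# `errors` and `requests` stand in for your own counter names; the SLO is 99.9%.
# Fast burn (severe): error ratio is over 14x what the SLO allows.
# Sustained, this would exhaust a 30-day error budget in about 2 days.
rate(errors[5m]) / rate(requests[5m]) > 14 * (1 - 0.999)

# Slow burn (chronic): error ratio exceeds what the SLO allows.
# Sustained, this would exhaust the budget before the 30-day window ends.
rate(errors[6h]) / rate(requests[6h]) > 1 * (1 - 0.999)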
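The arithmetic behind the burn-rate expressions above can be written out as a short Python sketch. The function names and the 30-day SLO period are assumptions for illustration. The should_page helper shows the multi-window idea: a page needs both a short and a long window to exceed the threshold.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than budgeted the error budget is being spent."""
    return error_ratio / (1 - slo_target)


def days_to_exhaustion(rate, slo_period_days=30):
    """Days until the whole budget is gone if the burn rate stays constant."""
    return float("inf") if rate == 0 else slo_period_days / rate


def should_page(short_window_ratio, long_window_ratio, slo_target, threshold=14):
    """Page only when both a short and a long window exceed the threshold.

    Requiring both windows means the alert fires on sustained burn and
    clears soon after the problem stops.
    """
    return (
        burn_rate(short_window_ratio, slo_target) > threshold
        and burn_rate(long_window_ratio, slo_target) > threshold
    )


# Hypothetical readings against a 99.9% SLO (allowed error ratio 0.1%).
current_rate = burn_rate(0.02, 0.999)           # 2% errors is roughly 20x
days_left_at_14x = days_to_exhaustion(14)       # roughly 2.1 days
sustained = should_page(0.02, 0.016, 0.999)     # both windows above 14x
brief_spike = should_page(0.02, 0.005, 0.999)   # long window only about 5x
```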

See Reliability Metrics for more on SLOs and error budgets.

Dashboards

Dashboard Design Principles

Start with user impact — Top-level dashboards should show what users experience
Progressive disclosure — High-level overview → detailed drill-down
Consistency — Use the same colours, layouts, and conventions across dashboards
Context — Show baselines, thresholds, and time-shifted comparisons
Reduce cognitive load — Don’t overwhelm with too many charts

Dashboard Hierarchy

1. Executive/Service Overview
   └── Shows: Availability, latency, error rate, traffic

2. Service Detail
   └── Shows: All golden signals, broken down by endpoint/operation

3. Infrastructure
   └── Shows: Compute, database, network, storage metrics

4. Debug/Investigation
   └── Shows: Detailed metrics for troubleshooting

Common Dashboard Patterns

Single stat panels — Current value of key metrics (big numbers)
Time series graphs — Trends over time
Heatmaps — Distribution of values (great for latency)
Tables — Top-N lists, current state of multiple items

Instrumentation

What to Instrument

Application layer:

Incoming requests (count, latency, errors)
Outgoing requests (to databases, APIs, caches)
Business metrics (signups, purchases, uploads)
Queue processing (depth, processing time, failures)

Infrastructure layer:

CPU, memory, disk, network
Container/pod metrics
Database connection pools
Cache hit rates

Instrumentation Libraries

OpenTelemetry — Vendor-neutral standard for metrics, logs, and traces
Prometheus client libraries — For Prometheus-style metrics
StatsD — Simple UDP-based metrics protocol

Cardinality Considerations

Every unique combination of label values is stored as its own time series, so high-cardinality labels (user IDs, request IDs, IP addresses) can multiply metric storage and query costs. Use labels only for things with bounded, low cardinality:

Good: status_code="200", method="GET", endpoint="/api/users"
Bad: user_id="12345", request_id="abc-123", ip="1.2.3.4"

For high-cardinality data, use logs or traces instead of metrics.
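A quick estimate shows why this matters: the worst-case number of time series for one metric is the product of the distinct values of each label. The label counts below are made-up figures for illustration only.

```python
from math import prod


def series_count(label_cardinalities):
    """Worst-case number of time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical distinct-value counts per label.
bounded = {"status_code": 5, "method": 4, "endpoint": 20}
unbounded = {**bounded, "user_id": 100_000}

bounded_series = series_count(bounded)      # 5 * 4 * 20
unbounded_series = series_count(unbounded)  # the same, times 100,000 users
```

Adding a single user_id label turns a few hundred series into tens of millions, and that number keeps growing with the user base.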

Log Aggregation

Centralised log aggregation is essential for distributed systems.

Pipeline:

Application → Log shipper → Message queue → Processor → Storage → Query interface
                (Fluent Bit)   (Kafka)      (Logstash)   (Elasticsearch)  (Kibana)

Considerations:

Log volume can be enormous — plan for storage costs
Implement sampling or filtering for high-volume, low-value logs
Set retention policies based on compliance and debugging needs
Index strategically for query performance
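Sampling can be sketched as a filter that always keeps warnings and errors and keeps a fixed share of everything else. In this illustration the keep_log function and the entry fields are assumptions. It hashes the trace ID rather than picking at random, so all entries belonging to one request are kept or dropped together.

```python
import zlib

ALWAYS_KEEP = {"WARN", "ERROR"}


def keep_log(entry, sample_percent):
    """Keep every WARN/ERROR entry; keep a fixed share of the rest.

    Sampling hashes the trace ID, so all entries of one request are
    kept or dropped together.
    """
    if entry["level"] in ALWAYS_KEEP:
        return True
    bucket = zlib.crc32(entry["trace_id"].encode("utf-8")) % 100
    return bucket < sample_percent


# Hypothetical entries: many INFO lines and one ERROR.
logs = [{"level": "INFO", "trace_id": f"trace-{i}"} for i in range(1000)]
logs.append({"level": "ERROR", "trace_id": "trace-failed"})

kept = [entry for entry in logs if keep_log(entry, sample_percent=10)]
```

Because the decision depends only on the trace ID, every service in the pipeline makes the same choice for the same request without any coordination.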

Tools

Metrics and Dashboards

Prometheus — Open-source metrics and alerting
Grafana — Visualisation and dashboards
Datadog — Full-stack monitoring platform
New Relic — Application performance monitoring
CloudWatch (AWS), Cloud Monitoring (GCP), Azure Monitor

Logging

Elasticsearch + Logstash + Kibana (ELK) — Open-source log aggregation
Loki — Log aggregation by Grafana (Prometheus-style)
Splunk — Enterprise log management
Papertrail, Logtail

Tracing

Jaeger — Open-source distributed tracing
Zipkin — Open-source distributed tracing
Tempo — Distributed tracing backend by Grafana

All-in-One Observability

OpenTelemetry — Vendor-neutral instrumentation standard
Honeycomb — Observability for distributed systems
Lightstep — Observability platform
Grafana Cloud — Managed Grafana, Prometheus, Loki, Tempo

Observability Culture

Instrument by default — Make observability part of the development process
Dashboards in code — Version control your dashboards alongside your services
On-call reviews — Regularly review what alerts fired and why
Observability as a feature — Budget time for improving observability
Share knowledge — Document what metrics mean and how to interpret them
