Reliability metrics provide a quantitative way to define, measure, and communicate the reliability of a service. They form the foundation for setting expectations with customers and making data-driven decisions about reliability investments.

Core Concepts

SLI (Service Level Indicator)

A quantitative measure of some aspect of the service level being provided. SLIs are the raw metrics that describe how your service is performing.

Good SLIs are:

  • Directly tied to user experience
  • Measurable and objective
  • Simple to understand

Common SLI categories:

  • Availability — Proportion of time the service is operational
  • Latency — Time taken to respond to a request
  • Throughput — Rate of successful requests processed
  • Error rate — Proportion of requests that fail
  • Durability — Likelihood that data will be retained over time
  • Freshness — How up-to-date the data is

SLO (Service Level Objective)

A target value or range for a service level measured by an SLI. SLOs define what “good enough” looks like.

Example SLOs:

  • 99.9% of requests return successfully (availability)
  • 95% of requests complete in under 200ms (latency)
  • 99.99% of stored objects are retrievable (durability)

SLA (Service Level Agreement)

A contract between a service provider and customer that specifies consequences (usually financial) if SLOs are not met. SLAs are typically more conservative than internal SLOs.

Relationship: SLIs measure → SLOs target → SLAs promise

SLI: "The proportion of successful HTTP requests"
SLO: "99.9% of requests will succeed, measured monthly"
SLA: "If availability drops below 99.9%, customers receive service credits"

Choosing Good SLIs

The User Journey Approach

Start with critical user journeys and ask: “What does the user care about?”

| User Journey        | What Users Care About | SLI                             |
| ------------------- | --------------------- | ------------------------------- |
| Loading a page      | It loads quickly      | Request latency (p50, p95, p99) |
| Submitting a form   | It succeeds           | Success rate                    |
| Uploading a file    | It doesn’t get lost   | Durability                      |
| Viewing a dashboard | Data is current       | Freshness                       |

SLI Specification vs Implementation

  • Specification — What you want to measure (e.g., “user-perceived latency”)
  • Implementation — How you actually measure it (e.g., “server-side request duration from logs”)

Be aware of the gap between these. Server-side metrics miss client-side rendering time, network latency, etc. Measure as close to the user as possible.

Common SLI Patterns

Request-driven services (APIs, web servers):

  • Availability: successful requests / total requests
  • Latency: requests faster than threshold / total requests

Pipeline/batch systems:

  • Freshness: time since last successful run
  • Coverage: records processed successfully / total records

Storage systems:

  • Durability: successful reads of previously written data
  • Throughput: bytes read/written per second
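
As a sketch of the pipeline/batch patterns above (the timestamps and record counts are hypothetical):

```python
from datetime import datetime, timedelta, timezone

# Sketch: freshness and coverage SLIs for a batch pipeline. The inputs are
# hypothetical; real values would come from the pipeline's own bookkeeping.

def freshness(last_successful_run: datetime, now: datetime) -> timedelta:
    """SLI: time since the last successful pipeline run."""
    return now - last_successful_run

def coverage(records_processed: int, total_records: int) -> float:
    """SLI: proportion of records processed successfully."""
    return records_processed / total_records if total_records else 1.0

now = datetime.now(timezone.utc)
print(freshness(now - timedelta(hours=3), now))    # 3:00:00
print(f"{coverage(998_700, 1_000_000):.3%}")       # 99.870%
```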

Setting SLOs

Principles

  1. Start with user expectations — What do users actually need?
  2. Be achievable — Set targets you can realistically meet
  3. Leave room for error — Don’t set SLOs at 100%
  4. Iterate — Refine SLOs based on data and feedback

The “Nines” of Availability

| Availability         | Downtime per year | Downtime per month | Downtime per week |
| -------------------- | ----------------- | ------------------ | ----------------- |
| 99% (two nines)      | 3.65 days         | 7.3 hours          | 1.68 hours        |
| 99.9% (three nines)  | 8.76 hours        | 43.8 minutes       | 10.1 minutes      |
| 99.95%               | 4.38 hours        | 21.9 minutes       | 5.04 minutes      |
| 99.99% (four nines)  | 52.6 minutes      | 4.38 minutes       | 1.01 minutes      |
| 99.999% (five nines) | 5.26 minutes      | 26.3 seconds       | 6.05 seconds      |

Each additional “nine” is roughly 10x harder and more expensive to achieve.
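
The figures above follow directly from the definition of availability; a quick sketch to reproduce them (treating a month as one twelfth of an average year):

```python
# Sketch: allowed downtime implied by an availability target.

def allowed_downtime_seconds(availability: float, window_seconds: float) -> float:
    """Downtime permitted in a window while still meeting the target."""
    return (1 - availability) * window_seconds

YEAR = 365.25 * 24 * 3600  # seconds in an average year
for target in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
    per_year = allowed_downtime_seconds(target, YEAR)
    per_month = allowed_downtime_seconds(target, YEAR / 12)
    print(f"{target:.3%}: {per_year / 3600:.2f} h/year, {per_month / 60:.1f} min/month")
```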

SLO Windows

  • Rolling window — e.g., “99.9% over the past 30 days” — provides continuous feedback
  • Calendar window — e.g., “99.9% per calendar month” — aligns with billing and reporting cycles

Rolling windows are generally preferred for operational decisions; calendar windows for business reporting.
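
A minimal sketch of a rolling-window computation, assuming daily counts of good and total requests are available:

```python
from collections import deque

# Sketch: availability over a rolling 30-day window, fed one day at a time.
# The daily (good, total) counts are hypothetical inputs.

WINDOW_DAYS = 30
daily = deque(maxlen=WINDOW_DAYS)  # automatically drops days older than the window

def record_day(good: int, total: int) -> float:
    """Append one day of counts and return availability over the window."""
    daily.append((good, total))
    good_sum = sum(g for g, _ in daily)
    total_sum = sum(t for _, t in daily)
    return good_sum / total_sum if total_sum else 1.0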

Error Budgets

An error budget is the complement of your SLO (100% minus the target) — the amount of unreliability you’re allowed to accumulate over the SLO window.

If SLO = 99.9% availability
Error budget = 0.1% = 43.8 minutes of downtime per month
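
A sketch of the same idea for a request-based SLO, where the budget is expressed as a number of allowed failed requests (the counts are hypothetical):

```python
# Sketch: remaining error budget for a request-based availability SLO.

def error_budget_remaining(slo: float, bad_requests: int, total_requests: int) -> float:
    """Fraction of the error budget still unspent (negative = budget blown)."""
    allowed_bad = (1 - slo) * total_requests  # bad requests the budget permits
    return 1 - (bad_requests / allowed_bad) if allowed_bad else 0.0

# A 99.9% SLO over 1,000,000 requests allows 1,000 failures; 600 used so far.
print(f"{error_budget_remaining(0.999, bad_requests=600, total_requests=1_000_000):.1%}")
# -> 40.0% of the budget remaining
```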

Using Error Budgets

Error budgets create a shared framework for balancing reliability and velocity:

| Budget Status    | Action                                                 |
| ---------------- | ------------------------------------------------------ |
| Budget healthy   | Ship features, take calculated risks, run experiments  |
| Budget depleting | Slow down, increase testing, review recent changes     |
| Budget exhausted | Freeze feature releases, focus entirely on reliability |

Error Budget Policies

Document explicit policies for what happens as budget is consumed:

  • At 50% consumed: Review recent incidents, increase monitoring
  • At 75% consumed: Require additional review for risky changes
  • At 100% consumed: Halt non-critical deployments until budget recovers
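
Such a policy is easy to encode so it is applied consistently; a sketch using the example thresholds above:

```python
# Sketch: mapping error budget consumption to the example policy steps above.
# Thresholds and actions are illustrative; substitute your own policy.

POLICY = [
    (1.00, "Halt non-critical deployments until budget recovers"),
    (0.75, "Require additional review for risky changes"),
    (0.50, "Review recent incidents, increase monitoring"),
]

def policy_action(budget_consumed: float) -> str:
    """budget_consumed is a fraction, e.g. 0.8 means 80% of the budget used."""
    for threshold, action in POLICY:
        if budget_consumed >= threshold:
            return action
    return "No action required"

print(policy_action(0.8))  # -> "Require additional review for risky changes"
```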

Beyond DORA: Expanding Your Metrics

DORA metrics measure software delivery performance. Reliability metrics complement DORA by measuring operational outcomes.

| DORA Metric           | Reliability Counterpart                   |
| --------------------- | ----------------------------------------- |
| Change Failure Rate   | Error budget consumption from deployments |
| Mean Time to Recovery | Time to restore SLO compliance            |
| Deployment Frequency  | Deployment impact on error budget         |
| Lead Time for Changes | Time to deploy reliability fixes          |

Additional Operational Metrics

Toil metrics:

  • Time spent on manual, repetitive operational work
  • Ratio of toil to engineering work
  • Trend over time (should decrease)

On-call health:

  • Pages per on-call shift
  • Out-of-hours pages
  • False positive rate
  • Time to acknowledge/resolve

Incident metrics:

  • Incidents per service/team
  • Mean time to detect (MTTD)
  • Mean time to mitigate (MTTM)
  • Mean time to resolve (MTTR)
  • Incident recurrence rate
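
A sketch of how the time-based incident metrics can be derived from incident timestamps (the records below are made up):

```python
from datetime import datetime
from statistics import mean

# Sketch: MTTD / MTTM / MTTR from incident timestamps. Real data would come
# from your incident tracker; these two records are hypothetical.

incidents = [
    {"start": datetime(2024, 5, 1, 10, 0), "detected": datetime(2024, 5, 1, 10, 7),
     "mitigated": datetime(2024, 5, 1, 10, 30), "resolved": datetime(2024, 5, 1, 12, 0)},
    {"start": datetime(2024, 5, 9, 2, 15), "detected": datetime(2024, 5, 9, 2, 20),
     "mitigated": datetime(2024, 5, 9, 3, 0), "resolved": datetime(2024, 5, 9, 5, 45)},
]

def mean_minutes(from_key: str, to_key: str) -> float:
    return mean((i[to_key] - i[from_key]).total_seconds() / 60 for i in incidents)

print(f"MTTD: {mean_minutes('start', 'detected'):.0f} min")
print(f"MTTM: {mean_minutes('start', 'mitigated'):.0f} min")
print(f"MTTR: {mean_minutes('start', 'resolved'):.0f} min")
```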

Implementing SLOs

Step 1: Choose What to Measure

  • Identify 3-5 critical user journeys
  • Define SLIs for each journey
  • Start simple — you can add complexity later

Step 2: Collect Data

  • Instrument your services to emit the required metrics
  • Measure at the edge where possible (load balancer, CDN)
  • Store metrics with sufficient granularity and retention
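
As one possible approach (not the only one), a service can expose the raw counters an availability or latency SLI needs with a metrics library such as prometheus_client; the metric names and the handle_request wrapper below are illustrative:

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Sketch: emit the raw counters behind availability and latency SLIs.
# Metric names and the wrapper function are illustrative conventions only.

REQUESTS = Counter("http_requests_total", "HTTP requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request duration in seconds")

def handle_request(do_work) -> None:
    """Run a request handler, recording outcome and duration."""
    start = time.monotonic()
    try:
        do_work()
        REQUESTS.labels(status="ok").inc()
    except Exception:
        REQUESTS.labels(status="error").inc()
        raise
    finally:
        LATENCY.observe(time.monotonic() - start)

start_http_server(8000)  # expose /metrics for the monitoring system to scrape
```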

Step 3: Set Initial Targets

  • Analyse historical data to understand current performance
  • Set SLOs slightly below current performance initially
  • Avoid “aspirational” SLOs that you can’t meet
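
One simple heuristic (an assumption, not a rule) is to set the initial target near the lower end of recent measured performance, for example the 10th percentile of daily availability:

```python
from statistics import quantiles

# Sketch: derive an initial SLO target from historical daily availability.
# The values below are hypothetical; use your own measured history.

daily_availability = [0.9994, 0.9991, 0.9998, 0.9989, 0.9996, 0.9993]  # ...

# 10th percentile: a target the service already meets on most days.
p10 = quantiles(daily_availability, n=10)[0]
print(f"Suggested initial SLO: {p10:.3%}")
```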

Step 4: Alert on SLO Burn Rate

Rather than alerting on instantaneous values, alert on the rate at which you’re consuming your error budget. A burn rate of 1x means budget is being consumed exactly fast enough to exhaust it at the end of the SLO window:

  • Fast burn — Alert immediately if you’re burning budget at >14x the sustainable rate over a short window
  • Slow burn — Alert within hours if you’re burning budget at just over 1x the sustainable rate over a longer window

This approach reduces alert noise while catching real problems.
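
A sketch of the calculation for a request-based availability SLO (the counts and thresholds are illustrative):

```python
# Sketch: burn rate for a request-based availability SLO.
# A burn rate of 1.0 consumes exactly the full budget over the SLO window.

def burn_rate(bad: int, total: int, slo: float) -> float:
    """Observed error rate in a lookback window, relative to the allowed rate."""
    observed_error_rate = bad / total if total else 0.0
    allowed_error_rate = 1 - slo
    return observed_error_rate / allowed_error_rate

rate_1h = burn_rate(bad=150, total=10_000, slo=0.999)  # over the last hour
if rate_1h > 14:
    print("Fast burn: page immediately")
elif rate_1h > 1:
    print("Slow burn: investigate within hours")
```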

Step 5: Review and Iterate

  • Review SLO performance monthly or quarterly
  • Adjust targets based on data and changing requirements
  • Involve stakeholders in SLO reviews

Common Pitfalls

  • Too many SLOs — Start with 1-3 per service; more creates confusion
  • SLOs at 100% — Impossible to achieve; leaves no room for error budgets
  • Measuring the wrong thing — Server uptime ≠ user experience
  • No error budget policy — Without consequences, SLOs are just numbers
  • Set and forget — SLOs need regular review and adjustment

Tools