Reliability metrics provide a quantitative way to define, measure, and communicate the reliability of a service. They form the foundation for setting expectations with customers and making data-driven decisions about reliability investments.
Core Concepts
SLI (Service Level Indicator)
A quantitative measure of some aspect of the service level being provided. SLIs are the raw metrics that describe how your service is performing.
Good SLIs are:
- Directly tied to user experience
- Measurable and objective
- Simple to understand
Common SLI categories:
- Availability — Proportion of time the service is operational
- Latency — Time taken to respond to a request
- Throughput — Rate of successful requests processed
- Error rate — Proportion of requests that fail
- Durability — Likelihood that data will be retained over time
- Freshness — How up-to-date the data is
SLO (Service Level Objective)
A target value or range for a service level measured by an SLI. SLOs define what “good enough” looks like.
Example SLOs:
- 99.9% of requests return successfully (availability)
- 95% of requests complete in under 200ms (latency)
- 99.99% of stored objects are retrievable (durability)
SLA (Service Level Agreement)
A contract between a service provider and customer that specifies consequences (usually financial) if SLOs are not met. SLAs are typically more conservative than internal SLOs.
Relationship: SLIs measure → SLOs target → SLAs promise
SLI: "The proportion of successful HTTP requests"
SLO: "99.9% of requests will succeed, measured monthly"
SLA: "If availability drops below 99.9%, customers receive service credits"
Choosing Good SLIs
The User Journey Approach
Start with critical user journeys and ask: “What does the user care about?”
| User Journey | What Users Care About | SLI |
|---|---|---|
| Loading a page | It loads quickly | Request latency (p50, p95, p99) |
| Submitting a form | It succeeds | Success rate |
| Uploading a file | It doesn’t get lost | Durability |
| Viewing a dashboard | Data is current | Freshness |
SLI Specification vs Implementation
- Specification — What you want to measure (e.g., “user-perceived latency”)
- Implementation — How you actually measure it (e.g., “server-side request duration from logs”)
Be aware of the gap between these. Server-side metrics miss client-side rendering time, network latency, etc. Measure as close to the user as possible.
Common SLI Patterns
Request-driven services (APIs, web servers):
- Availability: successful requests / total requests
- Latency: requests faster than threshold / total requests
Pipeline/batch systems:
- Freshness: time since last successful run
- Coverage: records processed successfully / total records
Storage systems:
- Durability: successful reads of previously written data
- Throughput: bytes read/written per second
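Two of these pattern shapes, sketched with illustrative values: a ratio SLI (request latency) and a time-based SLI (pipeline freshness).

```python
from datetime import datetime, timedelta, timezone

def latency_sli(requests_under_threshold: int, total_requests: int) -> float:
    """Proportion of requests that completed faster than the latency threshold."""
    return requests_under_threshold / total_requests if total_requests else 1.0

def freshness_minutes(last_successful_run: datetime) -> float:
    """Minutes elapsed since the pipeline last completed successfully."""
    return (datetime.now(timezone.utc) - last_successful_run).total_seconds() / 60

print(f"Latency SLI: {latency_sli(95_200, 100_000):.2%}")       # 95.20%
last_run = datetime.now(timezone.utc) - timedelta(minutes=42)
print(f"Freshness: {freshness_minutes(last_run):.0f} minutes")  # ~42
```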
Setting SLOs
Principles
- Start with user expectations — What do users actually need?
- Be achievable — Set targets you can realistically meet
- Leave room for error — Don’t set SLOs at 100%
- Iterate — Refine SLOs based on data and feedback
The “Nines” of Availability
| Availability | Downtime per year | Downtime per month | Downtime per week |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | 1.68 hours |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | 10.1 minutes |
| 99.95% | 4.38 hours | 21.9 minutes | 5.04 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 6.05 seconds |
Each additional “nine” is roughly 10x harder and more expensive to achieve.
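The downtime figures in the table follow directly from the availability target. A quick sketch of the arithmetic, using an average month of 365/12 days:

```python
# allowed downtime = (1 - availability) * window length
def allowed_downtime_minutes(availability: float, window_days: float) -> float:
    return (1 - availability) * window_days * 24 * 60

for target in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
    per_month = allowed_downtime_minutes(target, 365 / 12)  # average month
    per_week = allowed_downtime_minutes(target, 7)
    print(f"{target:.3%}  {per_month:8.2f} min/month  {per_week:7.2f} min/week")
```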
SLO Windows
- Rolling window — e.g., “99.9% over the past 30 days” — provides continuous feedback
- Calendar window — e.g., “99.9% per calendar month” — aligns with billing and reporting cycles
Rolling windows are generally preferred for operational decisions; calendar windows for business reporting.
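A small sketch, with made-up downtime data, of why the two window types give different operational signals: a calendar window resets abruptly at the month boundary, while a rolling window keeps recent incidents in view until they age out.

```python
from datetime import date, timedelta

# Illustrative daily downtime in minutes: two bad days at the end of May.
downtime = {date(2024, 5, 30): 30.0, date(2024, 5, 31): 25.0}
BUDGET_MINUTES = 43.8  # 99.9% availability over ~30 days

def window_downtime(start: date, end: date) -> float:
    """Total downtime minutes for days in [start, end)."""
    days = (end - start).days
    return sum(downtime.get(start + timedelta(days=i), 0.0) for i in range(days))

# Calendar window: the June budget resets on 1 June, days after the incidents.
june = window_downtime(date(2024, 6, 1), date(2024, 7, 1))
# Rolling window: on 10 June the late-May incidents still count against the budget.
rolling = window_downtime(date(2024, 6, 10) - timedelta(days=30), date(2024, 6, 10))

print(f"June calendar window:   {june:.0f} / {BUDGET_MINUTES} minutes used")   # 0
print(f"Rolling 30d on 10 June: {rolling:.0f} / {BUDGET_MINUTES} minutes used") # 55
```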
Error Budgets
An error budget is the complement of your SLO (100% minus the target) — the amount of unreliability you’re allowed.
If SLO = 99.9% availability
Error budget = 0.1% = 43.8 minutes of downtime per month
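A minimal sketch of the same arithmetic, expressing the budget both as downtime minutes and as allowed failed requests; the traffic and downtime figures are illustrative.

```python
SLO = 0.999                                 # 99.9% availability
MONTH_MINUTES = 365 / 12 * 24 * 60          # average month

budget_minutes = (1 - SLO) * MONTH_MINUTES  # ~43.8 minutes of downtime
downtime_so_far = 12.0                      # minutes, from incident records
print(f"Time budget used: {downtime_so_far / budget_minutes:.0%}")    # ~27%

monthly_requests = 50_000_000
budget_requests = (1 - SLO) * monthly_requests  # 50,000 allowed failures
failed_so_far = 31_400
print(f"Request budget used: {failed_so_far / budget_requests:.0%}")  # ~63%
```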
Using Error Budgets
Error budgets create a shared framework for balancing reliability and velocity:
| Budget Status | Action |
|---|---|
| Budget healthy | Ship features, take calculated risks, run experiments |
| Budget depleting | Slow down, increase testing, review recent changes |
| Budget exhausted | Freeze feature releases, focus entirely on reliability |
Error Budget Policies
Document explicit policies for what happens as budget is consumed (see the sketch after this list):
- At 50% consumed: Review recent incidents, increase monitoring
- At 75% consumed: Require additional review for risky changes
- At 100% consumed: Halt non-critical deployments until budget recovers
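A sketch of such a policy expressed as code, using the thresholds above; `budget_consumed` is assumed to be the fraction of the current window’s budget already spent.

```python
def policy_action(budget_consumed: float) -> str:
    """Map error budget consumption to the documented policy response."""
    if budget_consumed >= 1.00:
        return "Halt non-critical deployments until budget recovers"
    if budget_consumed >= 0.75:
        return "Require additional review for risky changes"
    if budget_consumed >= 0.50:
        return "Review recent incidents, increase monitoring"
    return "Ship features, take calculated risks, run experiments"

for consumed in (0.2, 0.6, 0.8, 1.1):
    print(f"{consumed:.0%} consumed -> {policy_action(consumed)}")
```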
Beyond DORA: Expanding Your Metrics
DORA metrics measure software delivery performance. Reliability metrics complement DORA by measuring operational outcomes.
| DORA Metric | Reliability Counterpart |
|---|---|
| Change Failure Rate | Error budget consumption from deployments |
| Mean Time to Recovery | Time to restore SLO compliance |
| Deployment Frequency | Deployment impact on error budget |
| Lead Time for Changes | Time to deploy reliability fixes |
Additional Operational Metrics
Toil metrics:
- Time spent on manual, repetitive operational work
- Ratio of toil to engineering work
- Trend over time (should decrease)
On-call health:
- Pages per on-call shift
- Out-of-hours pages
- False positive rate
- Time to acknowledge/resolve
Incident metrics (see the sketch after this list):
- Incidents per service/team
- Mean time to detect (MTTD)
- Mean time to mitigate (MTTM)
- Mean time to resolve (MTTR)
- Incident recurrence rate
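A small sketch of how the timing metrics can be computed from incident records; the record format and timestamps are hypothetical.

```python
from datetime import datetime
from statistics import mean

incidents = [
    {"started": datetime(2024, 5, 1, 10, 0), "detected": datetime(2024, 5, 1, 10, 9),
     "resolved": datetime(2024, 5, 1, 11, 15)},
    {"started": datetime(2024, 5, 9, 2, 30), "detected": datetime(2024, 5, 9, 2, 34),
     "resolved": datetime(2024, 5, 9, 3, 0)},
]

# Mean time to detect and to resolve, in minutes, across the incident set.
mttd = mean((i["detected"] - i["started"]).total_seconds() for i in incidents) / 60
mttr = mean((i["resolved"] - i["started"]).total_seconds() for i in incidents) / 60
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")  # MTTD: 6.5 min, MTTR: 52.5 min
```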
Implementing SLOs
Step 1: Choose What to Measure
- Identify 3-5 critical user journeys
- Define SLIs for each journey
- Start simple — you can add complexity later
Step 2: Collect Data
- Instrument your services to emit the required metrics (see the sketch after this list)
- Measure at the edge where possible (load balancer, CDN)
- Store metrics with sufficient granularity and retention
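One possible instrumentation sketch using the prometheus_client Python library (an assumption; any metrics system with counters and histograms works). The metric names are hypothetical; availability can later be derived from the status-labelled counter, and latency SLIs from the histogram buckets.

```python
from prometheus_client import Counter, Histogram

# Hypothetical metric names for a request-driven service.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

def do_work() -> str:
    """Stand-in for your application logic; returns an HTTP status code."""
    return "200"

def handle_request() -> None:
    with LATENCY.time():                  # observe request duration
        status = do_work()
    REQUESTS.labels(status=status).inc()  # count requests by outcome
```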
Step 3: Set Initial Targets
- Analyse historical data to understand current performance
- Set SLOs slightly below current performance initially (see the sketch below)
- Avoid “aspirational” SLOs that you can’t meet
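A sketch of what “slightly below current performance” can look like in practice, using made-up daily availability figures.

```python
# Daily availability achieved over a recent week (illustrative).
daily_availability = [0.9996, 0.9999, 0.9987, 0.9993, 0.9998, 0.9991, 0.9995]

worst_day = min(daily_availability)
print(f"Worst observed day: {worst_day:.2%}")  # 99.87%

# Pick a target you already meet consistently, e.g. 99.8%, rather than an
# aspirational figure above what the service currently delivers.
proposed_slo = 0.998
meets = all(day >= proposed_slo for day in daily_availability)
print(f"Proposed SLO: {proposed_slo:.1%}, met on all observed days: {meets}")
```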
Step 4: Alert on SLO Burn Rate
Rather than alerting on instantaneous values, alert on the rate at which you’re consuming your error budget:
- Fast burn — Alert immediately if burning budget at >14x normal rate
- Slow burn — Alert within hours if burning at >1x normal rate
This approach reduces alert noise while catching real problems.
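A rough sketch of the burn-rate arithmetic: a rate of 1x means the budget would be exhausted exactly at the end of the window, while roughly 14x means a 30-day budget would be gone in about two days. The observed error ratios below are illustrative.

```python
SLO = 0.999

def burn_rate(observed_error_ratio: float) -> float:
    """Budget spend rate relative to uniform consumption over the window."""
    allowed_error_ratio = 1 - SLO  # 0.1% of requests may fail
    return observed_error_ratio / allowed_error_ratio

fast = burn_rate(0.02)    # 2% of requests failing over the last hour
slow = burn_rate(0.0015)  # 0.15% failing over the last 6 hours

if fast > 14:
    print(f"Fast burn: {fast:.0f}x -> page immediately")   # 20x
if slow > 1:
    print(f"Slow burn: {slow:.1f}x -> alert within hours")  # 1.5x
```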
Step 5: Review and Iterate
- Review SLO performance monthly or quarterly
- Adjust targets based on data and changing requirements
- Involve stakeholders in SLO reviews
Common Pitfalls
- Too many SLOs — Start with 1-3 per service; more creates confusion
- SLOs at 100% — Impossible to achieve; leaves no room for error budgets
- Measuring the wrong thing — Server uptime ≠ user experience
- No error budget policy — Without consequences, SLOs are just numbers
- Set and forget — SLOs need regular review and adjustment
Tools
- SLO platforms — Nobl9, Blameless
- Observability with SLO support — Datadog, Honeycomb, Google Cloud SLO Monitoring
- Open source — Sloth (Kubernetes SLO generator), Pyrra