Reliability metrics provide a quantitative way to define, measure, and communicate the reliability of a service. They form the foundation for setting expectations with customers and making data-driven decisions about reliability investments.
Core Concepts
SLI (Service Level Indicator)
A quantitative measure of some aspect of the service level being provided. SLIs are the raw metrics that describe how your service is performing.
Good SLIs are:
- Directly tied to user experience
- Measurable and objective
- Simple to understand
Common SLI categories:
- Availability — Proportion of time the service is operational
- Latency — Time taken to respond to a request
- Throughput — Rate of successful requests processed
- Error rate — Proportion of requests that fail
- Durability — Likelihood that data will be retained over time
- Freshness — How up-to-date the data is
SLO (Service Level Objective)
A target value or range for a service level measured by an SLI. SLOs define what “good enough” looks like.
Example SLOs:
- 99.9% of requests return successfully (availability)
- 95% of requests complete in under 200ms (latency)
- 99.99% of stored objects are retrievable (durability)
SLA (Service Level Agreement)
A contract between a service provider and customer that specifies consequences (usually financial) if SLOs are not met. SLAs are typically more conservative than internal SLOs.
Relationship: SLIs measure → SLOs target → SLAs promise
SLI: "The proportion of successful HTTP requests"
SLO: "99.9% of requests will succeed, measured monthly"
SLA: "If availability drops below 99.9%, customers receive service credits"
Choosing Good SLIs
The User Journey Approach
Start with critical user journeys and ask: “What does the user care about?”
| User Journey | What Users Care About | SLI |
|---|---|---|
| Loading a page | It loads quickly | Request latency (p50, p95, p99) |
| Submitting a form | It succeeds | Success rate |
| Uploading a file | It doesn’t get lost | Durability |
| Viewing a dashboard | Data is current | Freshness |
SLI Specification vs Implementation
- Specification — What you want to measure (e.g., “user-perceived latency”)
- Implementation — How you actually measure it (e.g., “server-side request duration from logs”)
Be aware of the gap between these. Server-side metrics miss client-side rendering time, network latency, etc. Measure as close to the user as possible.
Common SLI Patterns
Request-driven services (APIs, web servers):
- Availability: successful requests / total requests
- Latency: requests faster than threshold / total requests
Pipeline/batch systems:
- Freshness: time since last successful run
- Coverage: records processed successfully / total records
Storage systems:
- Durability: successful reads of previously written data
- Throughput: bytes read/written per second
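Two of these pattern shapes, sketched with illustrative values: a ratio SLI (request latency) and a time-based SLI (pipeline freshness).

```python
from datetime import datetime, timedelta, timezone

def latency_sli(requests_under_threshold: int, total_requests: int) -> float:
    """Proportion of requests that completed faster than the latency threshold."""
    return requests_under_threshold / total_requests if total_requests else 1.0

def freshness_minutes(last_successful_run: datetime) -> float:
    """Minutes elapsed since the pipeline last completed successfully."""
    return (datetime.now(timezone.utc) - last_successful_run).total_seconds() / 60

print(f"Latency SLI: {latency_sli(95_200, 100_000):.2%}")       # 95.20%
last_run = datetime.now(timezone.utc) - timedelta(minutes=42)
print(f"Freshness: {freshness_minutes(last_run):.0f} minutes")  # ~42
```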
Setting SLOs
Principles
- Start with user expectations — What do users actually need?
- Be achievable — Set targets you can realistically meet
- Leave room for error — Don’t set SLOs at 100%
- Iterate — Refine SLOs based on data and feedback
The “Nines” of Availability
| Availability | Downtime per year | Downtime per month | Downtime per week |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | 1.68 hours |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | 10.1 minutes |
| 99.95% | 4.38 hours | 21.9 minutes | 5.04 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 6.05 seconds |
Each additional “nine” is roughly 10x harder and more expensive to achieve.
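The downtime figures in the table follow directly from the availability target. A quick sketch of the arithmetic, using an average month of 365/12 days:

```python
# allowed downtime = (1 - availability) * window length
def allowed_downtime_minutes(availability: float, window_days: float) -> float:
    return (1 - availability) * window_days * 24 * 60

for target in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
    per_month = allowed_downtime_minutes(target, 365 / 12)  # average month
    per_week = allowed_downtime_minutes(target, 7)
    print(f"{target:.3%}  {per_month:8.2f} min/month  {per_week:7.2f} min/week")
```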
SLO Windows
- Rolling window — e.g., “99.9% over the past 30 days” — provides continuous feedback
- Calendar window — e.g., “99.9% per calendar month” — aligns with billing and reporting cycles
Rolling windows are generally preferred for operational decisions; calendar windows for business reporting.
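A small sketch, with made-up downtime data, of why the two window types give different operational signals: a calendar window resets abruptly at the month boundary, while a rolling window keeps recent incidents in view until they age out.

```python
from datetime import date, timedelta

# Illustrative daily downtime in minutes: two bad days at the end of May.
downtime = {date(2024, 5, 30): 30.0, date(2024, 5, 31): 25.0}
BUDGET_MINUTES = 43.8  # 99.9% availability over ~30 days

def window_downtime(start: date, end: date) -> float:
    """Total downtime minutes for days in [start, end)."""
    days = (end - start).days
    return sum(downtime.get(start + timedelta(days=i), 0.0) for i in range(days))

# Calendar window: the June budget resets on 1 June, days after the incidents.
june = window_downtime(date(2024, 6, 1), date(2024, 7, 1))
# Rolling window: on 10 June the late-May incidents still count against the budget.
rolling = window_downtime(date(2024, 6, 10) - timedelta(days=30), date(2024, 6, 10))

print(f"June calendar window:   {june:.0f} / {BUDGET_MINUTES} minutes used")   # 0
print(f"Rolling 30d on 10 June: {rolling:.0f} / {BUDGET_MINUTES} minutes used") # 55
```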
Error Budgets
An error budget is the complement of your SLO (100% minus the target) — the amount of unreliability you’re allowed.
If SLO = 99.9% availability
Error budget = 0.1% = 43.8 minutes of downtime per month
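A minimal sketch of the same arithmetic, expressing the budget both as downtime minutes and as allowed failed requests; the traffic and downtime figures are illustrative.

```python
SLO = 0.999                                 # 99.9% availability
MONTH_MINUTES = 365 / 12 * 24 * 60          # average month

budget_minutes = (1 - SLO) * MONTH_MINUTES  # ~43.8 minutes of downtime
downtime_so_far = 12.0                      # minutes, from incident records
print(f"Time budget used: {downtime_so_far / budget_minutes:.0%}")    # ~27%

monthly_requests = 50_000_000
budget_requests = (1 - SLO) * monthly_requests  # 50,000 allowed failures
failed_so_far = 31_400
print(f"Request budget used: {failed_so_far / budget_requests:.0%}")  # ~63%
```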
Using Error Budgets
Error budgets create a shared framework for balancing reliability and velocity:
| Budget Status | Action |
|---|---|
| Budget healthy | Ship features, take calculated risks, run experiments |
| Budget depleting | Slow down, increase testing, review recent changes |
| Budget exhausted | Freeze feature releases, focus entirely on reliability |
Error Budget Policies
Document explicit policies for what happens as budget is consumed (see the sketch after this list):
- At 50% consumed: Review recent incidents, increase monitoring
- At 75% consumed: Require additional review for risky changes
- At 100% consumed: Halt non-critical deployments until budget recovers
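A sketch of such a policy expressed as code, using the thresholds above; `budget_consumed` is assumed to be the fraction of the current window’s budget already spent.

```python
def policy_action(budget_consumed: float) -> str:
    """Map error budget consumption to the documented policy response."""
    if budget_consumed >= 1.00:
        return "Halt non-critical deployments until budget recovers"
    if budget_consumed >= 0.75:
        return "Require additional review for risky changes"
    if budget_consumed >= 0.50:
        return "Review recent incidents, increase monitoring"
    return "Ship features, take calculated risks, run experiments"

for consumed in (0.2, 0.6, 0.8, 1.1):
    print(f"{consumed:.0%} consumed -> {policy_action(consumed)}")
```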
Beyond DORA: Expanding Your Metrics
DORA metrics measure software delivery performance. Reliability metrics complement DORA by measuring operational outcomes.
| DORA Metric | Reliability Counterpart |
|---|---|
| Change Failure Rate | Error budget consumption from deployments |
| Mean Time to Recovery | Time to restore SLO compliance |
| Deployment Frequency | Deployment impact on error budget |
| Lead Time for Changes | Time to deploy reliability fixes |
Additional Operational Metrics
Toil metrics:
- Time spent on manual, repetitive operational work
- Ratio of toil to engineering work
- Trend over time (should decrease)
On-call health:
- Pages per on-call shift
- Out-of-hours pages
- False positive rate
- Time to acknowledge/resolve
Incident metrics (see the sketch after this list):
- Incidents per service/team
- Mean time to detect (MTTD)
- Mean time to mitigate (MTTM)
- Mean time to resolve (MTTR)
- Incident recurrence rate
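A small sketch of how the timing metrics can be computed from incident records; the record format and timestamps are hypothetical.

```python
from datetime import datetime
from statistics import mean

incidents = [
    {"started": datetime(2024, 5, 1, 10, 0), "detected": datetime(2024, 5, 1, 10, 9),
     "resolved": datetime(2024, 5, 1, 11, 15)},
    {"started": datetime(2024, 5, 9, 2, 30), "detected": datetime(2024, 5, 9, 2, 34),
     "resolved": datetime(2024, 5, 9, 3, 0)},
]

# Mean time to detect and to resolve, in minutes, across the incident set.
mttd = mean((i["detected"] - i["started"]).total_seconds() for i in incidents) / 60
mttr = mean((i["resolved"] - i["started"]).total_seconds() for i in incidents) / 60
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")  # MTTD: 6.5 min, MTTR: 52.5 min
```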
Implementing SLOs
Step 1: Choose What to Measure
- Identify 3-5 critical user journeys
- Define SLIs for each journey
- Start simple — you can add complexity later
Step 2: Collect Data
- Instrument your services to emit the required metrics (see the sketch after this list)
- Measure at the edge where possible (load balancer, CDN)
- Store metrics with sufficient granularity and retention
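One possible instrumentation sketch using the prometheus_client Python library (an assumption; any metrics system with counters and histograms works). The metric names are hypothetical; availability can later be derived from the status-labelled counter, and latency SLIs from the histogram buckets.

```python
from prometheus_client import Counter, Histogram

# Hypothetical metric names for a request-driven service.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

def do_work() -> str:
    """Stand-in for your application logic; returns an HTTP status code."""
    return "200"

def handle_request() -> None:
    with LATENCY.time():                  # observe request duration
        status = do_work()
    REQUESTS.labels(status=status).inc()  # count requests by outcome
```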
Step 3: Set Initial Targets
- Analyse historical data to understand current performance
- Set SLOs slightly below current performance initially (see the sketch below)
- Avoid “aspirational” SLOs that you can’t meet
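A sketch of what “slightly below current performance” can look like in practice, using made-up daily availability figures.

```python
# Daily availability achieved over a recent week (illustrative).
daily_availability = [0.9996, 0.9999, 0.9987, 0.9993, 0.9998, 0.9991, 0.9995]

worst_day = min(daily_availability)
print(f"Worst observed day: {worst_day:.2%}")  # 99.87%

# Pick a target you already meet consistently, e.g. 99.8%, rather than an
# aspirational figure above what the service currently delivers.
proposed_slo = 0.998
meets = all(day >= proposed_slo for day in daily_availability)
print(f"Proposed SLO: {proposed_slo:.1%}, met on all observed days: {meets}")
```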
Step 4: Alert on SLO Burn Rate
Rather than alerting on instantaneous values, alert on the rate at which you’re consuming your error budget:
- Fast burn — Alert immediately if burning budget at >14x normal rate
- Slow burn — Alert within hours if burning at >1x normal rate
This approach reduces alert noise while catching real problems.
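A rough sketch of the burn-rate arithmetic: a rate of 1x means the budget would be exhausted exactly at the end of the window, while roughly 14x means a 30-day budget would be gone in about two days. The observed error ratios below are illustrative.

```python
SLO = 0.999

def burn_rate(observed_error_ratio: float) -> float:
    """Budget spend rate relative to uniform consumption over the window."""
    allowed_error_ratio = 1 - SLO  # 0.1% of requests may fail
    return observed_error_ratio / allowed_error_ratio

fast = burn_rate(0.02)    # 2% of requests failing over the last hour
slow = burn_rate(0.0015)  # 0.15% failing over the last 6 hours

if fast > 14:
    print(f"Fast burn: {fast:.0f}x -> page immediately")   # 20x
if slow > 1:
    print(f"Slow burn: {slow:.1f}x -> alert within hours")  # 1.5x
```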
Step 5: Review and Iterate
- Review SLO performance monthly or quarterly
- Adjust targets based on data and changing requirements
- Involve stakeholders in SLO reviews
Common Pitfalls
- Too many SLOs — Start with 1-3 per service; more creates confusion
- SLOs at 100% — Impossible to achieve; leaves no room for error budgets
- Measuring the wrong thing — Server uptime ≠ user experience
- No error budget policy — Without consequences, SLOs are just numbers
- Set and forget — SLOs need regular review and adjustment
Tools
- SLO platforms — Nobl9, Blameless
- Observability with SLO support — Datadog, Honeycomb, Google Cloud SLO Monitoring
- Open source — Sloth (Kubernetes SLO generator), Pyrra