Incident management is the practice of responding to unplanned events or service interruptions and restoring the affected service to normal operation. A mature incident management process reduces downtime, builds customer trust, and turns failures into learning opportunities.
## Incident Lifecycle
### 1. Detection
Incidents can be detected through:
- Automated monitoring — Alerts from monitoring systems when thresholds are breached
- Customer reports — Support tickets, social media, or direct communication
- Internal discovery — Engineers noticing anomalies during routine work
Aim for automated detection to catch issues before customers do. The gap between incident start and detection is your Time to Detect (TTD).
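To make TTD measurable, record both timestamps on every incident and watch the trend: a falling median TTD means your monitoring coverage is improving. A minimal Python sketch (the incident records and field names are illustrative, not from any particular tool):

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical incident records: when the problem began vs. when an
# alert or report first surfaced it. Field names are illustrative.
incidents = [
    {"started_at": datetime(2024, 5, 1, 9, 0), "detected_at": datetime(2024, 5, 1, 9, 4)},
    {"started_at": datetime(2024, 5, 3, 14, 0), "detected_at": datetime(2024, 5, 3, 14, 27)},
    {"started_at": datetime(2024, 5, 7, 2, 0), "detected_at": datetime(2024, 5, 7, 2, 2)},
]

def time_to_detect(incident: dict) -> timedelta:
    """Gap between incident start and detection (TTD)."""
    return incident["detected_at"] - incident["started_at"]

print("median TTD:", median(time_to_detect(i) for i in incidents))
```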
### 2. Triage
Quickly assess the incident to determine:
- Impact — How many users/systems are affected?
- Urgency — Is it getting worse? Is there a workaround?
- Severity — Assign a severity level (see below)
### 3. Response
Assemble the right people and begin mitigation:
- Page the on-call engineer
- Escalate to subject matter experts if needed
- Communicate status to stakeholders
- Focus on mitigation first, root cause later
### 4. Resolution
Restore service to normal operation:
- Apply a fix (temporary or permanent)
- Verify the fix works
- Monitor for recurrence
### 5. Post-Incident Review
After the incident is resolved:
- Conduct a blameless postmortem
- Document findings and action items
- Share learnings with the wider organisation
## Severity Levels
A consistent severity classification helps teams prioritise and respond appropriately.
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| SEV1 / Critical | Complete service outage or major security breach | Immediate (24/7) | Production down, data breach, payment system failure |
| SEV2 / High | Significant degradation affecting many users | < 1 hour | Major feature broken, severe performance issues |
| SEV3 / Medium | Partial degradation with workaround available | < 4 hours | Non-critical feature broken, intermittent errors |
| SEV4 / Low | Minor issue with minimal impact | Next business day | Cosmetic bugs, minor inconveniences |
Adjust thresholds based on your organisation’s needs and SLAs.
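One way to keep classification consistent is to encode the table and apply it during triage. The sketch below is assumption-laden, not a standard: the `classify` inputs and thresholds are placeholders to replace with your own impact signals and SLAs.

```python
from enum import IntEnum

class Severity(IntEnum):
    SEV1 = 1  # complete outage or major security breach
    SEV2 = 2  # significant degradation affecting many users
    SEV3 = 3  # partial degradation with a workaround
    SEV4 = 4  # minor issue with minimal impact

def classify(total_outage: bool, users_affected_pct: float,
             workaround_available: bool) -> Severity:
    """Map impact signals to a severity level (thresholds illustrative)."""
    if total_outage:
        return Severity.SEV1
    if users_affected_pct >= 25 and not workaround_available:
        return Severity.SEV2
    if workaround_available:
        return Severity.SEV3
    return Severity.SEV4

print(classify(total_outage=False, users_affected_pct=40,
               workaround_available=False))  # Severity.SEV2
```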
## On-Call Practices
### Rotation Design
- Primary and secondary — Have a backup on-call to prevent single points of failure
- Rotation length — Typically 1 week; shorter rotations (3-4 days) reduce fatigue (see the scheduling sketch after this list)
- Follow the sun — For global teams, hand off to engineers in different time zones
- Fair distribution — Ensure on-call burden is shared equitably across the team
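With a fixed-length rotation and a static roster, the schedule is simple arithmetic, which also makes fairness easy to audit. A sketch (roster, epoch, and shift length are illustrative):

```python
from datetime import date

# Illustrative roster and schedule parameters.
ROSTER = ["alice", "bob", "carol", "dave"]
ROTATION_DAYS = 7            # one-week shifts
EPOCH = date(2024, 1, 1)     # the first shift starts here

def on_call(day: date) -> tuple[str, str]:
    """Return (primary, secondary) for a given day.

    The secondary is the next person in the rotation, so the backup
    burden also moves through the roster evenly.
    """
    shift = (day - EPOCH).days // ROTATION_DAYS
    primary = ROSTER[shift % len(ROSTER)]
    secondary = ROSTER[(shift + 1) % len(ROSTER)]
    return primary, secondary

print(on_call(date(2024, 1, 10)))  # second shift: ('bob', 'carol')
```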
### On-Call Expectations
- Acknowledge alerts within a defined SLA (e.g., 5-15 minutes); see the sketch after this list
- Have access to necessary tools and documentation
- Know when and how to escalate
- Document actions taken during incidents
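The acknowledge-or-escalate flow from the first and third points can be sketched as below; `page` and `wait_for_ack` are hypothetical stand-ins for a real paging tool's API:

```python
import time

ACK_SLA_SECONDS = 5 * 60  # acknowledgement SLA; illustrative

def page(person: str, alert: str) -> None:
    print(f"paging {person}: {alert}")  # stand-in for a real pager call

def wait_for_ack(timeout: float) -> bool:
    """Stand-in for polling the paging tool for an acknowledgement."""
    time.sleep(min(timeout, 1))  # shortened so the sketch runs quickly
    return False                 # pretend the page went unacknowledged

def escalate(alert: str, primary: str, secondary: str) -> None:
    page(primary, alert)
    if not wait_for_ack(ACK_SLA_SECONDS):
        page(secondary, alert)   # escalate once the ack SLA lapses

escalate("checkout error rate > 5%", "alice", "bob")
```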
### Reducing On-Call Burden
- Fix noisy alerts — If an alert pages but requires no action, fix or remove it
- Automate remediation — Script common fixes where safe to do so (see the sketch after this list)
- Improve system reliability — Fewer incidents mean fewer pages
- Compensate fairly — Recognise on-call work with time off or additional pay
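Automated remediation needs guardrails so a flapping service cannot be restarted forever without a human looking. A sketch of a budgeted restart; the `systemctl` call is one common mechanism to substitute with your own, and the limits are illustrative:

```python
import subprocess
import time

MAX_RESTARTS = 2        # per service, per window; illustrative limits
WINDOW_SECONDS = 3600
_restart_log: dict[str, list[float]] = {}

def safe_restart(service: str) -> bool:
    """Restart a service unless its restart budget is exhausted."""
    now = time.time()
    recent = [t for t in _restart_log.get(service, []) if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_RESTARTS:
        print(f"{service}: restart budget exhausted, page a human instead")
        return False
    recent.append(now)
    _restart_log[service] = recent
    # Substitute your own supervisor or orchestrator call here.
    subprocess.run(["systemctl", "restart", service], check=True)
    return True
```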
### On-Call Health
- Track pages per shift and time spent on incidents (see the sketch after this list)
- Monitor for alert fatigue and burnout
- Review on-call experience in retrospectives
- Target: most on-call shifts should be quiet
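Tracking can start as a simple aggregation over your paging tool's export; the same data surfaces the noisiest alerts to fix first. A sketch with made-up data:

```python
from collections import Counter

# Illustrative page log: (shift_owner, alert_name) pairs.
pages = [
    ("alice", "disk_full"), ("alice", "disk_full"), ("alice", "api_5xx"),
    ("bob", "disk_full"),
    ("carol", "api_5xx"), ("carol", "disk_full"), ("carol", "disk_full"),
]

pages_per_shift = Counter(owner for owner, _ in pages)
noisiest_alerts = Counter(alert for _, alert in pages).most_common(3)

print("pages per shift:", dict(pages_per_shift))
print("noisiest alerts:", noisiest_alerts)  # prime candidates for fixing
```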
## Runbooks
Runbooks are documented procedures for handling specific incidents or alerts.
### What to Include
- Alert context — What does this alert mean? Why does it fire?
- Severity assessment — How to determine impact
- Diagnostic steps — Commands, dashboards, and logs to check
- Remediation steps — How to fix or mitigate the issue
- Escalation path — Who to contact if you can’t resolve it
- Post-resolution steps — Verification and cleanup
### Runbook Best Practices
- Keep runbooks close to the alert (link directly from alert notifications; see the check after this list)
- Review and update after each use
- Write for someone unfamiliar with the system
- Include copy-pasteable commands
- Version control runbooks alongside code
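One lightweight way to enforce the first practice is a check, run in CI, that refuses alert definitions without a runbook link. The alert schema below is illustrative, not any specific monitoring tool's format:

```python
# Illustrative alert definitions; in practice these would be loaded
# from your monitoring configuration.
alerts = [
    {"name": "api_5xx_rate", "runbook": "https://runbooks.example.com/api-5xx"},
    {"name": "disk_full", "runbook": None},
]

missing = [a["name"] for a in alerts if not a.get("runbook")]
if missing:
    raise SystemExit(f"alerts without runbooks: {missing}")
```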
## Blameless Postmortems
A postmortem (or post-incident review) is a structured analysis of an incident to understand what happened and how to prevent recurrence.
### Blameless Culture
- Focus on systems and processes, not individuals
- Assume people made the best decisions with the information they had
- Blame inhibits learning and encourages hiding mistakes
- Celebrate incident reporters and those who discover issues
### Postmortem Template
## Incident Summary
- Date/Time:
- Duration:
- Severity:
- Impact:
## Timeline
- HH:MM — Event description
- HH:MM — Event description
## Root Cause
What was the underlying cause?
## Contributing Factors
What conditions allowed this to happen?
## What Went Well
- Effective responses
- Good practices observed
## What Could Be Improved
- Gaps in detection, response, or tooling
## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| ... | ... | ... |
## Lessons Learned
Key takeaways for the team
### Postmortem Process
1. Write the initial draft — Incident commander or responders
2. Review with participants — Ensure accuracy and completeness
3. Hold a review meeting — Discuss findings, generate action items
4. Publish and share — Make postmortems accessible to the organisation
5. Track action items — Follow up on completion (see the sketch below)
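Action-item follow-up is the step most often dropped, so it is worth automating the nagging. A sketch that flags overdue items (the records are illustrative):

```python
from datetime import date

# Illustrative postmortem action items.
action_items = [
    {"action": "Add alert on queue depth", "owner": "bob",
     "due": date(2024, 6, 1), "done": False},
    {"action": "Document failover steps", "owner": "carol",
     "due": date(2024, 7, 1), "done": True},
]

overdue = [a for a in action_items if not a["done"] and a["due"] < date.today()]
for item in overdue:
    print(f"OVERDUE: {item['action']} (owner: {item['owner']}, due {item['due']})")
```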
## Incident Communication
### Internal Communication
- Use a dedicated incident channel (e.g., Slack #incident-<id>)
- Designate a communications lead for larger incidents
- Provide regular updates even if there’s no change (see the sketch after this list)
- Keep a timeline of events as they happen
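Regular updates are easier to sustain when posting is a one-liner. A sketch using a Slack incoming webhook (the URL and incident id are placeholders):

```python
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def post_update(incident_id: str, status: str) -> None:
    """Post a status line to the incident channel via the webhook."""
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": f"[{incident_id}] {status}"}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

post_update("incident-142", "Mitigation in progress; next update in 30 minutes.")
```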
### External Communication
- Maintain a status page for customer visibility
- Communicate early, even if details are limited
- Update regularly until resolution
- Follow up with a customer-facing incident summary
## Incident Roles
For larger incidents, assign explicit roles:
| Role | Responsibility |
|---|---|
| Incident Commander | Coordinates response, makes decisions, delegates tasks |
| Technical Lead | Leads investigation and remediation efforts |
| Communications Lead | Handles internal and external updates |
| Scribe | Documents timeline and actions taken |
## Tools
- Paging and alerting — PagerDuty, Opsgenie, Grafana OnCall
- Status pages — Statuspage, Instatus, Cachet
- Incident management platforms — incident.io, FireHydrant, Rootly