Incident management is the practice of responding to unplanned events or service interruptions and restoring the affected service to normal operation. A mature incident management process reduces downtime, builds customer trust, and turns failures into learning opportunities.
## Incident Lifecycle
### 1. Detection
Incidents can be detected through:
- Automated monitoring — Alerts from monitoring systems when thresholds are breached
- Customer reports — Support tickets, social media, or direct communication
- Internal discovery — Engineers noticing anomalies during routine work
Aim for automated detection to catch issues before customers do. The gap between incident start and detection is your Time to Detect (TTD).
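To make TTD measurable, record both timestamps on every incident and watch the trend: a falling median TTD means your monitoring coverage is improving. A minimal Python sketch (the incident records and field names are illustrative, not from any particular tool):

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical incident records: when the problem began vs. when an
# alert or report first surfaced it. Field names are illustrative.
incidents = [
    {"started_at": datetime(2024, 5, 1, 9, 0), "detected_at": datetime(2024, 5, 1, 9, 4)},
    {"started_at": datetime(2024, 5, 3, 14, 0), "detected_at": datetime(2024, 5, 3, 14, 27)},
    {"started_at": datetime(2024, 5, 7, 2, 0), "detected_at": datetime(2024, 5, 7, 2, 2)},
]

def time_to_detect(incident: dict) -> timedelta:
    """Gap between incident start and detection (TTD)."""
    return incident["detected_at"] - incident["started_at"]

print("median TTD:", median(time_to_detect(i) for i in incidents))
```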
### 2. Triage
Quickly assess the incident to determine:
- Impact — How many users/systems are affected?
- Urgency — Is it getting worse? Is there a workaround?
- Severity — Assign a severity level (see below)
### 3. Response
Assemble the right people and begin mitigation:
- Page the on-call engineer
- Escalate to subject matter experts if needed
- Communicate status to stakeholders
- Focus on mitigation first, root cause later
### 4. Resolution
Restore service to normal operation:
- Apply a fix (temporary or permanent)
- Verify the fix works
- Monitor for recurrence
### 5. Post-Incident Review
After the incident is resolved:
- Conduct a blameless postmortem
- Document findings and action items
- Share learnings with the wider organisation
## Severity Levels
A consistent severity classification helps teams prioritise and respond appropriately.
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| SEV1 / Critical | Complete service outage or major security breach | Immediate (24/7) | Production down, data breach, payment system failure |
| SEV2 / High | Significant degradation affecting many users | < 1 hour | Major feature broken, severe performance issues |
| SEV3 / Medium | Partial degradation with workaround available | < 4 hours | Non-critical feature broken, intermittent errors |
| SEV4 / Low | Minor issue with minimal impact | Next business day | Cosmetic bugs, minor inconveniences |
Adjust thresholds based on your organisation’s needs and SLAs.
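One way to keep classification consistent is to encode the table and apply it during triage. The sketch below is assumption-laden, not a standard: the `classify` inputs and thresholds are placeholders to replace with your own impact signals and SLAs.

```python
from enum import IntEnum

class Severity(IntEnum):
    SEV1 = 1  # complete outage or major security breach
    SEV2 = 2  # significant degradation affecting many users
    SEV3 = 3  # partial degradation with a workaround
    SEV4 = 4  # minor issue with minimal impact

def classify(total_outage: bool, users_affected_pct: float,
             workaround_available: bool) -> Severity:
    """Map impact signals to a severity level (thresholds illustrative)."""
    if total_outage:
        return Severity.SEV1
    if users_affected_pct >= 25 and not workaround_available:
        return Severity.SEV2
    if workaround_available:
        return Severity.SEV3
    return Severity.SEV4

print(classify(total_outage=False, users_affected_pct=40,
               workaround_available=False))  # Severity.SEV2
```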
## On-Call Practices
### Rotation Design
- Primary and secondary — Have a backup on-call to prevent single points of failure
- Rotation length — Typically 1 week; shorter rotations (3-4 days) reduce fatigue (see the scheduling sketch after this list)
- Follow the sun — For global teams, hand off to engineers in different time zones
- Fair distribution — Ensure on-call burden is shared equitably across the team
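With a fixed-length rotation and a static roster, the schedule is simple arithmetic, which also makes fairness easy to audit. A sketch (roster, epoch, and shift length are illustrative):

```python
from datetime import date

# Illustrative roster and schedule parameters.
ROSTER = ["alice", "bob", "carol", "dave"]
ROTATION_DAYS = 7            # one-week shifts
EPOCH = date(2024, 1, 1)     # the first shift starts here

def on_call(day: date) -> tuple[str, str]:
    """Return (primary, secondary) for a given day.

    The secondary is the next person in the rotation, so the backup
    burden also moves through the roster evenly.
    """
    shift = (day - EPOCH).days // ROTATION_DAYS
    primary = ROSTER[shift % len(ROSTER)]
    secondary = ROSTER[(shift + 1) % len(ROSTER)]
    return primary, secondary

print(on_call(date(2024, 1, 10)))  # second shift: ('bob', 'carol')
```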
### On-Call Expectations
- Acknowledge alerts within a defined SLA (e.g., 5-15 minutes); see the sketch after this list
- Have access to necessary tools and documentation
- Know when and how to escalate
- Document actions taken during incidents
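The acknowledge-or-escalate flow from the first and third points can be sketched as below; `page` and `wait_for_ack` are hypothetical stand-ins for a real paging tool's API:

```python
import time

ACK_SLA_SECONDS = 5 * 60  # acknowledgement SLA; illustrative

def page(person: str, alert: str) -> None:
    print(f"paging {person}: {alert}")  # stand-in for a real pager call

def wait_for_ack(timeout: float) -> bool:
    """Stand-in for polling the paging tool for an acknowledgement."""
    time.sleep(min(timeout, 1))  # shortened so the sketch runs quickly
    return False                 # pretend the page went unacknowledged

def escalate(alert: str, primary: str, secondary: str) -> None:
    page(primary, alert)
    if not wait_for_ack(ACK_SLA_SECONDS):
        page(secondary, alert)   # escalate once the ack SLA lapses

escalate("checkout error rate > 5%", "alice", "bob")
```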
### Reducing On-Call Burden
- Fix noisy alerts — If an alert pages but requires no action, fix or remove it
- Automate remediation — Script common fixes where safe to do so (see the sketch after this list)
- Improve system reliability — Fewer incidents mean fewer pages
- Compensate fairly — Recognise on-call work with time off or additional pay
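Automated remediation needs guardrails so a flapping service cannot be restarted forever without a human looking. A sketch of a budgeted restart; the `systemctl` call is one common mechanism to substitute with your own, and the limits are illustrative:

```python
import subprocess
import time

MAX_RESTARTS = 2        # per service, per window; illustrative limits
WINDOW_SECONDS = 3600
_restart_log: dict[str, list[float]] = {}

def safe_restart(service: str) -> bool:
    """Restart a service unless its restart budget is exhausted."""
    now = time.time()
    recent = [t for t in _restart_log.get(service, []) if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_RESTARTS:
        print(f"{service}: restart budget exhausted, page a human instead")
        return False
    recent.append(now)
    _restart_log[service] = recent
    # Substitute your own supervisor or orchestrator call here.
    subprocess.run(["systemctl", "restart", service], check=True)
    return True
```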
### On-Call Health
- Track pages per shift and time spent on incidents (see the sketch after this list)
- Monitor for alert fatigue and burnout
- Review on-call experience in retrospectives
- Target: most on-call shifts should be quiet
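Tracking can start as a simple aggregation over your paging tool's export; the same data surfaces the noisiest alerts to fix first. A sketch with made-up data:

```python
from collections import Counter

# Illustrative page log: (shift_owner, alert_name) pairs.
pages = [
    ("alice", "disk_full"), ("alice", "disk_full"), ("alice", "api_5xx"),
    ("bob", "disk_full"),
    ("carol", "api_5xx"), ("carol", "disk_full"), ("carol", "disk_full"),
]

pages_per_shift = Counter(owner for owner, _ in pages)
noisiest_alerts = Counter(alert for _, alert in pages).most_common(3)

print("pages per shift:", dict(pages_per_shift))
print("noisiest alerts:", noisiest_alerts)  # prime candidates for fixing
```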
## Runbooks
Runbooks are documented procedures for handling specific incidents or alerts.
### What to Include
- Alert context — What does this alert mean? Why does it fire?
- Severity assessment — How to determine impact
- Diagnostic steps — Commands, dashboards, and logs to check
- Remediation steps — How to fix or mitigate the issue
- Escalation path — Who to contact if you can’t resolve it
- Post-resolution steps — Verification and cleanup
### Runbook Best Practices
- Keep runbooks close to the alert (link directly from alert notifications; see the check after this list)
- Review and update after each use
- Write for someone unfamiliar with the system
- Include copy-pasteable commands
- Version control runbooks alongside code
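One lightweight way to enforce the first practice is a check, run in CI, that refuses alert definitions without a runbook link. The alert schema below is illustrative, not any specific monitoring tool's format:

```python
# Illustrative alert definitions; in practice these would be loaded
# from your monitoring configuration.
alerts = [
    {"name": "api_5xx_rate", "runbook": "https://runbooks.example.com/api-5xx"},
    {"name": "disk_full", "runbook": None},
]

missing = [a["name"] for a in alerts if not a.get("runbook")]
if missing:
    raise SystemExit(f"alerts without runbooks: {missing}")
```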
## Blameless Postmortems
A postmortem (or post-incident review) is a structured analysis of an incident to understand what happened and how to prevent recurrence.
### Blameless Culture
- Focus on systems and processes, not individuals
- Assume people made the best decisions with the information they had
- Blame inhibits learning and encourages hiding mistakes
- Celebrate incident reporters and those who discover issues
### Postmortem Template
## Incident Summary
- Date/Time:
- Duration:
- Severity:
- Impact:
## Timeline
- HH:MM — Event description
- HH:MM — Event description
## Root Cause
What was the underlying cause?
## Contributing Factors
What conditions allowed this to happen?
## What Went Well
- Effective responses
- Good practices observed
## What Could Be Improved
- Gaps in detection, response, or tooling
## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| ... | ... | ... |
## Lessons Learned
Key takeaways for the team
### Postmortem Process
1. Write the initial draft — Incident commander or responders
2. Review with participants — Ensure accuracy and completeness
3. Hold a review meeting — Discuss findings, generate action items
4. Publish and share — Make postmortems accessible to the organisation
5. Track action items — Follow up on completion (see the sketch below)
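Action-item follow-up is the step most often dropped, so it is worth automating the nagging. A sketch that flags overdue items (the records are illustrative):

```python
from datetime import date

# Illustrative postmortem action items.
action_items = [
    {"action": "Add alert on queue depth", "owner": "bob",
     "due": date(2024, 6, 1), "done": False},
    {"action": "Document failover steps", "owner": "carol",
     "due": date(2024, 7, 1), "done": True},
]

overdue = [a for a in action_items if not a["done"] and a["due"] < date.today()]
for item in overdue:
    print(f"OVERDUE: {item['action']} (owner: {item['owner']}, due {item['due']})")
```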
## Incident Communication
### Internal Communication
- Use a dedicated incident channel (e.g., Slack #incident-<id>)
- Designate a communications lead for larger incidents
- Provide regular updates even if there’s no change (see the sketch after this list)
- Keep a timeline of events as they happen
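Regular updates are easier to sustain when posting is a one-liner. A sketch using a Slack incoming webhook (the URL and incident id are placeholders):

```python
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def post_update(incident_id: str, status: str) -> None:
    """Post a status line to the incident channel via the webhook."""
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": f"[{incident_id}] {status}"}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

post_update("incident-142", "Mitigation in progress; next update in 30 minutes.")
```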
### External Communication
- Maintain a status page for customer visibility
- Communicate early, even if details are limited
- Update regularly until resolution
- Follow up with a customer-facing incident summary
## Incident Roles
For larger incidents, assign explicit roles:
| Role | Responsibility |
|---|---|
| Incident Commander | Coordinates response, makes decisions, delegates tasks |
| Technical Lead | Leads investigation and remediation efforts |
| Communications Lead | Handles internal and external updates |
| Scribe | Documents timeline and actions taken |
## Tools
- Paging and alerting — PagerDuty, Opsgenie, Grafana OnCall
- Status pages — Statuspage, Instatus, Cachet
- Incident management platforms — incident.io, FireHydrant, Rootly