Disaster recovery (DR) is the practice of preparing for, responding to, and recovering from events that cause significant disruption to your systems. A well-designed DR strategy ensures business continuity when the unexpected happens.
Key Metrics
RTO (Recovery Time Objective)
The maximum acceptable time to restore service after a disaster. This is how long your business can tolerate being offline.
Examples:
- E-commerce checkout: RTO = 1 hour (every minute costs revenue)
- Internal reporting system: RTO = 24 hours (can wait until next business day)
- Archive storage: RTO = 72 hours (rarely accessed)
RPO (Recovery Point Objective)
The maximum acceptable amount of data loss measured in time. This is how much data you can afford to lose.
Examples:
- Financial transactions: RPO = 0 (no data loss acceptable)
- User-generated content: RPO = 1 hour (hourly backups acceptable)
- Log data: RPO = 24 hours (daily backups sufficient)
Relationship Between RTO and RPO
```
               Disaster occurs
                      ↓
|-------- RPO --------|-------- RTO --------|
↑                                           ↑
Last good backup             Service restored
(data loss window)        (downtime window)
```
Tighter RTO and RPO requirements increase complexity and cost.
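As a quick worked example of how the two objectives are evaluated separately: the backup interval bounds the worst-case data loss (RPO), while the estimated time to provision and restore bounds the downtime (RTO). A minimal sketch with purely illustrative numbers:

```python
from datetime import timedelta

# Hypothetical targets and estimates; substitute your own measurements.
rpo_target = timedelta(hours=1)
rto_target = timedelta(hours=4)

backup_interval = timedelta(minutes=30)              # worst-case data loss window
estimated_restore = timedelta(hours=2, minutes=30)   # provision + restore + verify

print(f"RPO met: {backup_interval <= rpo_target}")    # True: 30 min within 1 h
print(f"RTO met: {estimated_restore <= rto_target}")  # True: 2.5 h within 4 h
```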
Disaster Recovery Strategies
Strategies are listed in order of increasing cost and decreasing recovery time.
1. Backup and Restore
RTO: Hours to days | RPO: Hours to days | Cost: Low
Regularly back up data and restore it to new infrastructure when disaster strikes.
How it works:
- Schedule regular backups of data and configurations
- Store backups in a different region or cloud
- When disaster occurs, provision new infrastructure and restore data
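The offsite-copy step above might look like the following minimal sketch, assuming AWS S3 via boto3 and hypothetical bucket names; any object store with cross-region copy works the same way.

```python
import boto3

# Hypothetical bucket names; the DR bucket lives in a different region.
SOURCE_BUCKET = "myapp-backups-us-east-1"
DR_BUCKET = "myapp-backups-eu-west-1"

def copy_backup_offsite(key: str) -> None:
    """Copy a finished backup object into the DR-region bucket."""
    # Issue the copy against the destination region's endpoint.
    s3 = boto3.client("s3", region_name="eu-west-1")
    s3.copy_object(
        Bucket=DR_BUCKET,
        Key=key,
        CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
    )

copy_backup_offsite("db/2024-01-15T02-00.sql.gz")
```

Single-operation copies are limited to 5 GB on S3; larger backups need boto3's managed multipart copy, and many teams replace a script like this with bucket-level cross-region replication.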
Pros:
- Lowest cost — no standby infrastructure
- Simple to implement
Cons:
- Longest recovery time
- Requires tested restore procedures
- Potential for significant data loss
Best for: Non-critical systems, development environments, archival data
2. Pilot Light
RTO: Hours | RPO: Minutes to hours | Cost: Low-Medium
Keep a minimal version of your environment running in the DR region, with core components ready to scale up.
How it works:
- Replicate critical data continuously to DR region
- Maintain minimal infrastructure (database replicas, core services) in DR
- When disaster occurs, scale up the pilot light environment
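The scale-up step is usually just raising capacity on infrastructure that already exists but sits idle. A minimal sketch, assuming an AWS Auto Scaling group with a hypothetical name that is kept at zero instances until failover:

```python
import boto3

def activate_pilot_light(asg_name: str = "myapp-dr-web") -> None:
    """Scale the dormant DR Auto Scaling group up to serving capacity."""
    autoscaling = boto3.client("autoscaling", region_name="eu-west-1")
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=2,
        DesiredCapacity=4,
        MaxSize=8,
    )

activate_pilot_light()
```

The replicated data store is promoted separately rather than scaled; see the database replication notes under DR architecture patterns below.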
Characteristics:
- Database and data stores are kept in sync
- Compute and application servers are provisioned on-demand
- DNS/load balancers configured but pointing to primary
Best for: Systems where some downtime is acceptable but data loss is not
3. Warm Standby
RTO: Minutes to hours | RPO: Seconds to minutes | Cost: Medium-High
Run a scaled-down but functional copy of your production environment in the DR region.
How it works:
- Maintain a smaller replica of production in DR region
- All components running but at reduced capacity
- Continuously replicate data
- When disaster occurs, scale up DR environment and redirect traffic
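The final redirect is often just a DNS weight change. A minimal sketch assuming Route 53 weighted records, a hypothetical hosted zone ID, and documentation-range IP addresses:

```python
import boto3

route53 = boto3.client("route53")

def _weighted_change(name: str, set_id: str, ip: str, weight: int) -> dict:
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,
            "Weight": weight,
            "TTL": 60,  # low TTL so the change takes effect quickly
            "ResourceRecords": [{"Value": ip}],
        },
    }

def shift_traffic_to_dr(hosted_zone_id: str = "Z0HYPOTHETICAL") -> None:
    """Move all weighted DNS traffic from the primary region to the DR region."""
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={"Changes": [
            _weighted_change("app.example.com", "primary", "203.0.113.10", weight=0),
            _weighted_change("app.example.com", "dr", "198.51.100.20", weight=100),
        ]},
    )
```

Scaling the standby up to full capacity uses the same Auto Scaling call shown for the pilot light approach.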
Characteristics:
- Can handle a subset of production traffic immediately
- Faster recovery than pilot light (infrastructure already running)
- Regular testing possible without full cutover
Best for: Business-critical applications that need faster recovery
4. Multi-Site Active-Active (Hot Standby)
RTO: Near-zero | RPO: Near-zero | Cost: High
Run full production capacity in multiple regions simultaneously, with traffic distributed across all sites.
How it works:
- Production workload runs in two or more regions
- Traffic distributed via global load balancing
- Data replicated synchronously or near-synchronously
- If one region fails, others absorb the traffic
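A global load balancer handles the traffic distribution in practice, but the behaviour it provides can be sketched in a few lines: send each request to a healthy region and simply stop routing to a region whose health check fails. The endpoints here are hypothetical.

```python
import itertools
import requests

# Hypothetical regional endpoints serving the same application.
REGIONS = [
    "https://us-east-1.app.example.com",
    "https://eu-west-1.app.example.com",
]

def healthy_regions() -> list[str]:
    """Return the regions whose health endpoint answers 200 within a short timeout."""
    healthy = []
    for base in REGIONS:
        try:
            if requests.get(f"{base}/healthz", timeout=2).status_code == 200:
                healthy.append(base)
        except requests.RequestException:
            pass  # a failed region simply drops out of the rotation
    return healthy

def pick_region(_counter=itertools.count()) -> str:
    """Tiny stand-in for global load balancing: round-robin over healthy regions."""
    regions = healthy_regions()
    if not regions:
        raise RuntimeError("no healthy region available")
    return regions[next(_counter) % len(regions)]
```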
Characteristics:
- No failover required — other regions continue serving
- Highest availability and resilience
- Most complex to implement and operate
- Requires careful handling of data consistency
Best for: Mission-critical systems requiring maximum availability
Strategy Comparison
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours-days | Hours-days | $ | Low |
| Pilot Light | Hours | Minutes-hours | $$ | Medium |
| Warm Standby | Minutes-hours | Seconds-minutes | $$$ | Medium-High |
| Active-Active | Near-zero | Near-zero | $$$$ | High |
Implementing Disaster Recovery
1. Business Impact Analysis
Before choosing a strategy, understand the business requirements:
- Which systems are critical to business operations?
- What is the cost of downtime per hour?
- What is the cost of data loss?
- What compliance or regulatory requirements exist?
Map systems to appropriate RTO/RPO based on business impact.
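The output is often a small tiering table that the rest of the DR programme hangs off. A sketch with hypothetical systems and figures:

```python
from dataclasses import dataclass

@dataclass
class SystemTier:
    name: str
    downtime_cost_per_hour: int  # lost revenue or productivity, in your currency
    rto_hours: float
    rpo_hours: float
    strategy: str

# Hypothetical mapping produced by a business impact analysis.
tiers = [
    SystemTier("checkout", 50_000, rto_hours=1, rpo_hours=0, strategy="active-active"),
    SystemTier("reporting", 500, rto_hours=24, rpo_hours=4, strategy="pilot light"),
    SystemTier("archive", 50, rto_hours=72, rpo_hours=24, strategy="backup & restore"),
]

for tier in tiers:
    print(f"{tier.name}: RTO {tier.rto_hours}h, RPO {tier.rpo_hours}h -> {tier.strategy}")
```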
2. DR Architecture Patterns
Data replication:
- Synchronous — the write is acknowledged only after the replica has it (zero data loss, higher write latency)
- Asynchronous — the write is acknowledged before replication completes (data loss bounded by the replication lag, lower latency)
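The trade-off shows up even in a toy sketch: a synchronous write does not acknowledge until the replica holds the data, while an asynchronous write acknowledges immediately and replicates in the background, so anything still queued at the moment of failure is lost.

```python
import queue
import threading
import time

replica: list[str] = []   # stand-in for the copy in the DR region
backlog = queue.Queue()   # asynchronous replication queue

def write_sync(record: str) -> None:
    """Acknowledge only after the replica has the record: RPO ~ 0, higher write latency."""
    replica.append(record)   # pretend this is a cross-region network call
    print(f"sync ack: {record}")

def write_async(record: str) -> None:
    """Acknowledge immediately; anything still in `backlog` at failure time is lost."""
    backlog.put(record)
    print(f"async ack: {record}")

def replication_worker() -> None:
    while True:
        replica.append(backlog.get())   # drains the queue some time after the ack

threading.Thread(target=replication_worker, daemon=True).start()
write_sync("txn-1")
write_async("txn-2")
time.sleep(0.1)   # give the background worker a moment before the process exits
```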
DNS failover:
- Use low TTL values to enable faster failover
- Implement health checks to automate DNS changes
- Consider using Route 53, Cloud DNS, or Traffic Manager
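On Route 53, the automated building block is a health check whose ID is attached to failover-routing records; the hostnames below are hypothetical.

```python
import boto3

route53 = boto3.client("route53")

# Health check evaluated by Route 53 from its own edge locations.
response = route53.create_health_check(
    CallerReference="primary-app-healthz-1",   # any unique string per check
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary.app.example.com",
        "ResourcePath": "/healthz",
        "Port": 443,
        "RequestInterval": 30,   # seconds between checks
        "FailureThreshold": 3,   # consecutive failures before "unhealthy"
    },
)
print(response["HealthCheck"]["Id"])
```

Referencing the returned health check ID from a PRIMARY failover record, with a SECONDARY record pointing at the DR endpoint, lets Route 53 switch DNS automatically when the primary goes unhealthy.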
Database replication:
- Cross-region read replicas
- Multi-region database services (Aurora Global Database, Spanner, CockroachDB)
- Backup and restore for less critical data
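For cross-region read replicas, failover means promoting the replica to a standalone, writable instance. A minimal sketch assuming an Amazon RDS replica with a hypothetical identifier:

```python
import boto3

def promote_dr_replica(replica_id: str = "myapp-db-replica-eu-west-1") -> None:
    """Detach the cross-region read replica and make it the writable primary."""
    rds = boto3.client("rds", region_name="eu-west-1")
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)
    # Promotion is asynchronous; wait until the instance reports 'available'.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=replica_id)
```

Promotion breaks replication from the old primary, so failing back later means creating a fresh replica in the original region.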
3. Infrastructure as Code
DR benefits enormously from Infrastructure as Code:
- Consistent environments between primary and DR
- Rapid provisioning of DR resources
- Version-controlled, auditable configurations
- Tested infrastructure deployments
Store IaC in a separate location from primary infrastructure (different region, different cloud, or external repository).
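One low-effort way to satisfy that last point is to mirror the IaC repository to a second git remote outside the primary account. A sketch using plain git via subprocess, with a hypothetical remote URL:

```python
import subprocess

# Hypothetical secondary remote hosted outside the primary cloud account.
DR_REMOTE = "git@backup-git.example.com:infra/iac.git"

def mirror_iac_repo(repo_path: str = ".") -> None:
    """Push all branches and tags of the IaC repository to the DR remote."""
    subprocess.run(
        ["git", "push", "--mirror", DR_REMOTE],
        cwd=repo_path,
        check=True,
    )

mirror_iac_repo()
```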
4. Runbooks and Documentation
Create detailed runbooks for DR scenarios:
- Step-by-step failover procedures
- Verification checklists
- Communication templates
- Rollback procedures
- Contact lists for key personnel
Keep runbooks accessible outside your primary infrastructure (printed copies, external wiki, mobile-accessible).
Testing Disaster Recovery
A DR plan that hasn’t been tested is just a hypothesis.
Types of DR Tests
| Test Type | Description | Frequency | Impact |
|---|---|---|---|
| Tabletop exercise | Walk through scenarios verbally | Quarterly | None |
| Checklist review | Verify runbooks and documentation | Monthly | None |
| Simulation | Practice procedures without actual failover | Quarterly | Low |
| Parallel test | Bring up DR environment, verify functionality | Semi-annually | Low |
| Full failover | Actually switch production to DR | Annually | Medium-High |
Testing Best Practices
- Schedule tests regularly and treat them as mandatory
- Test during business hours initially (more people available)
- Document everything that happens during the test
- Conduct a retrospective after each test
- Update runbooks based on findings
- Gradually increase test complexity and realism
Game Days
Run planned “game days” where you intentionally simulate failures:
- Fail over to DR region
- Kill critical services
- Corrupt or delete data (in test environments)
- Simulate network partitions
These exercises build confidence and identify gaps before real disasters occur.
Chaos Engineering
Chaos engineering takes testing further by continuously injecting failures into systems.
Principles:
- Define steady state (what normal looks like)
- Hypothesise that steady state will continue during experiments
- Inject real-world events (server failures, network issues, dependency failures)
- Look for differences between control and experiment
Tools:
- Chaos Monkey — Randomly terminates instances (Netflix)
- Gremlin — Failure as a service
- Litmus — Kubernetes-native chaos engineering
- Chaos Mesh — Cloud-native chaos engineering platform
- AWS Fault Injection Simulator
Start small: inject minor failures in non-production, then gradually increase scope and move toward production.
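A first experiment can be very small, in the spirit of Chaos Monkey: pick one instance from a non-production group at random, terminate it, and watch whether the steady-state metric holds. A sketch assuming AWS via boto3 and a hypothetical environment tag; run it only against environments you are prepared to break.

```python
import random
import boto3

def terminate_random_instance(env_tag: str = "staging") -> str | None:
    """Chaos-Monkey-style experiment: kill one random running instance in an environment."""
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:environment", "Values": [env_tag]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return None
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim   # record the victim so the experiment is auditable
```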
Common Failure Scenarios
Infrastructure Failures
- Region or availability zone outage
- Data centre power or cooling failure
- Network connectivity loss
- Hardware failures (disk, server, network equipment)
Application Failures
- Deployment gone wrong
- Configuration errors
- Dependency failures (third-party services, APIs)
- Resource exhaustion (memory, connections, disk)
Data Failures
- Accidental deletion
- Data corruption
- Ransomware or malicious attacks
- Schema migration failures
Human Factors
- Operator error
- Security breaches
- Knowledge loss (key person leaves)
Backup Best Practices
Backups are the foundation of most DR strategies.
The 3-2-1 Rule
- 3 copies of your data
- 2 different storage types
- 1 copy offsite (different region or cloud)
Backup Verification
- Regularly test restores (not just backup completion)
- Verify data integrity after restore
- Measure actual restore time
- Automate verification where possible
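A sketch of an automated restore check, assuming the backup is a single file whose SHA-256 digest was recorded at backup time; the paths below are hypothetical. It restores to a scratch location, verifies the digest, and records how long the restore actually took.

```python
import hashlib
import shutil
import time
from pathlib import Path

def verify_restore(backup: Path, scratch: Path, expected_sha256: str) -> float:
    """Restore a backup to a scratch path, verify its digest, return restore time in seconds."""
    scratch.parent.mkdir(parents=True, exist_ok=True)

    start = time.monotonic()
    shutil.copy2(backup, scratch)   # stand-in for the real restore procedure
    elapsed = time.monotonic() - start

    digest = hashlib.sha256(scratch.read_bytes()).hexdigest()
    if digest != expected_sha256:
        raise RuntimeError(f"integrity check failed for {backup}")
    return elapsed

# Self-contained demo: fake a backup file and the digest recorded at backup time.
backup_file = Path("/tmp/backups/db-2024-01-15.sql.gz")
backup_file.parent.mkdir(parents=True, exist_ok=True)
backup_file.write_bytes(b"pretend this is a database dump")
recorded_digest = hashlib.sha256(backup_file.read_bytes()).hexdigest()

seconds = verify_restore(backup_file, Path("/tmp/restore-test/db.sql.gz"), recorded_digest)
print(f"restore took {seconds:.1f}s")
```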
Backup Security
- Encrypt backups at rest and in transit
- Protect backup credentials separately from primary credentials
- Use immutable backups where possible (protection against ransomware)
- Implement access controls and audit logging
Cloud-Specific Considerations
AWS
- AWS's published DR strategy guidance (the four approaches described above)
- Multi-AZ deployments for high availability
- Cross-region replication for DR
- AWS Backup for centralised backup management
- Route 53 health checks and DNS failover
Google Cloud
- Regional and multi-regional resources
- Cloud SQL cross-region replicas
- Spanner for global distribution
- Cloud DNS with geolocation routing
Azure
- Availability Zones and paired regions
- Azure Site Recovery for VM replication
- Geo-redundant storage (GRS)
- Traffic Manager for DNS-based failover