Disaster recovery (DR) is the practice of preparing for, responding to, and recovering from events that cause significant disruption to your systems. A well-designed DR strategy ensures business continuity when the unexpected happens.
Key Metrics
RTO (Recovery Time Objective)
The maximum acceptable time to restore service after a disaster. This is how long your business can tolerate being offline.
Examples:
- E-commerce checkout: RTO = 1 hour (every minute costs revenue)
- Internal reporting system: RTO = 24 hours (can wait until next business day)
- Archive storage: RTO = 72 hours (rarely accessed)
RPO (Recovery Point Objective)
The maximum acceptable amount of data loss measured in time. This is how much data you can afford to lose.
Examples:
- Financial transactions: RPO = 0 (no data loss acceptable)
- User-generated content: RPO = 1 hour (hourly backups acceptable)
- Log data: RPO = 24 hours (daily backups sufficient)
Relationship Between RTO and RPO
```
               Disaster occurs
                      ↓
|-------- RPO --------|-------- RTO --------|
↑                                           ↑
Last good backup             Service restored
(data loss window)        (downtime window)
```
Tighter RTO and RPO requirements increase complexity and cost.
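As a quick worked example of how the two objectives are evaluated separately: the backup interval bounds the worst-case data loss (RPO), while the estimated time to provision and restore bounds the downtime (RTO). A minimal sketch with purely illustrative numbers:

```python
from datetime import timedelta

# Hypothetical targets and estimates; substitute your own measurements.
rpo_target = timedelta(hours=1)
rto_target = timedelta(hours=4)

backup_interval = timedelta(minutes=30)              # worst-case data loss window
estimated_restore = timedelta(hours=2, minutes=30)   # provision + restore + verify

print(f"RPO met: {backup_interval <= rpo_target}")    # True: 30 min within 1 h
print(f"RTO met: {estimated_restore <= rto_target}")  # True: 2.5 h within 4 h
```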
Disaster Recovery Strategies
Strategies are listed in order of increasing cost and decreasing recovery time.
1. Backup and Restore
RTO: Hours to days | RPO: Hours to days | Cost: Low
Regularly back up data and restore it to new infrastructure when disaster strikes.
How it works:
- Schedule regular backups of data and configurations
- Store backups in a different region or cloud
- When disaster occurs, provision new infrastructure and restore data
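The offsite-copy step above might look like the following minimal sketch, assuming AWS S3 via boto3 and hypothetical bucket names; any object store with cross-region copy works the same way.

```python
import boto3

# Hypothetical bucket names; the DR bucket lives in a different region.
SOURCE_BUCKET = "myapp-backups-us-east-1"
DR_BUCKET = "myapp-backups-eu-west-1"

def copy_backup_offsite(key: str) -> None:
    """Copy a finished backup object into the DR-region bucket."""
    # Issue the copy against the destination region's endpoint.
    s3 = boto3.client("s3", region_name="eu-west-1")
    s3.copy_object(
        Bucket=DR_BUCKET,
        Key=key,
        CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
    )

copy_backup_offsite("db/2024-01-15T02-00.sql.gz")
```

Single-operation copies are limited to 5 GB on S3; larger backups need boto3's managed multipart copy, and many teams replace a script like this with bucket-level cross-region replication.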
Pros:
- Lowest cost — no standby infrastructure
- Simple to implement
Cons:
- Longest recovery time
- Requires tested restore procedures
- Potential for significant data loss
Best for: Non-critical systems, development environments, archival data
2. Pilot Light
RTO: Hours | RPO: Minutes to hours | Cost: Low-Medium
Keep a minimal version of your environment running in the DR region, with core components ready to scale up.
How it works:
- Replicate critical data continuously to DR region
- Maintain minimal infrastructure (database replicas, core services) in DR
- When disaster occurs, scale up the pilot light environment
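The scale-up step is usually just raising capacity on infrastructure that already exists but sits idle. A minimal sketch, assuming an AWS Auto Scaling group with a hypothetical name that is kept at zero instances until failover:

```python
import boto3

def activate_pilot_light(asg_name: str = "myapp-dr-web") -> None:
    """Scale the dormant DR Auto Scaling group up to serving capacity."""
    autoscaling = boto3.client("autoscaling", region_name="eu-west-1")
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=2,
        DesiredCapacity=4,
        MaxSize=8,
    )

activate_pilot_light()
```

The replicated data store is promoted separately rather than scaled; see the database replication notes under DR architecture patterns below.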
Characteristics:
- Database and data stores are kept in sync
- Compute and application servers are provisioned on-demand
- DNS/load balancers configured but pointing to primary
Best for: Systems where some downtime is acceptable but data loss is not
3. Warm Standby
RTO: Minutes to hours | RPO: Seconds to minutes | Cost: Medium-High
Run a scaled-down but functional copy of your production environment in the DR region.
How it works:
- Maintain a smaller replica of production in DR region
- All components running but at reduced capacity
- Continuously replicate data
- When disaster occurs, scale up DR environment and redirect traffic
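The final redirect is often just a DNS weight change. A minimal sketch assuming Route 53 weighted records, a hypothetical hosted zone ID, and documentation-range IP addresses:

```python
import boto3

route53 = boto3.client("route53")

def _weighted_change(name: str, set_id: str, ip: str, weight: int) -> dict:
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,
            "Weight": weight,
            "TTL": 60,  # low TTL so the change takes effect quickly
            "ResourceRecords": [{"Value": ip}],
        },
    }

def shift_traffic_to_dr(hosted_zone_id: str = "Z0HYPOTHETICAL") -> None:
    """Move all weighted DNS traffic from the primary region to the DR region."""
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={"Changes": [
            _weighted_change("app.example.com", "primary", "203.0.113.10", weight=0),
            _weighted_change("app.example.com", "dr", "198.51.100.20", weight=100),
        ]},
    )
```

Scaling the standby up to full capacity uses the same Auto Scaling call shown for the pilot light approach.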
Characteristics:
- Can handle a subset of production traffic immediately
- Faster recovery than pilot light (infrastructure already running)
- Regular testing possible without full cutover
Best for: Business-critical applications that need faster recovery
4. Multi-Site Active-Active (Hot Standby)
RTO: Near-zero | RPO: Near-zero | Cost: High
Run full production capacity in multiple regions simultaneously, with traffic distributed across all sites.
How it works:
- Production workload runs in two or more regions
- Traffic distributed via global load balancing
- Data replicated synchronously or near-synchronously
- If one region fails, others absorb the traffic
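A global load balancer handles the traffic distribution in practice, but the behaviour it provides can be sketched in a few lines: send each request to a healthy region and simply stop routing to a region whose health check fails. The endpoints here are hypothetical.

```python
import itertools
import requests

# Hypothetical regional endpoints serving the same application.
REGIONS = [
    "https://us-east-1.app.example.com",
    "https://eu-west-1.app.example.com",
]

def healthy_regions() -> list[str]:
    """Return the regions whose health endpoint answers 200 within a short timeout."""
    healthy = []
    for base in REGIONS:
        try:
            if requests.get(f"{base}/healthz", timeout=2).status_code == 200:
                healthy.append(base)
        except requests.RequestException:
            pass  # a failed region simply drops out of the rotation
    return healthy

def pick_region(_counter=itertools.count()) -> str:
    """Tiny stand-in for global load balancing: round-robin over healthy regions."""
    regions = healthy_regions()
    if not regions:
        raise RuntimeError("no healthy region available")
    return regions[next(_counter) % len(regions)]
```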
Characteristics:
- No failover required — other regions continue serving
- Highest availability and resilience
- Most complex to implement and operate
- Requires careful handling of data consistency
Best for: Mission-critical systems requiring maximum availability
Strategy Comparison
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours-days | Hours-days | $ | Low |
| Pilot Light | Hours | Minutes-hours | $$ | Medium |
| Warm Standby | Minutes-hours | Seconds-minutes | $$$ | Medium-High |
| Active-Active | Near-zero | Near-zero | $$$$ | High |
Implementing Disaster Recovery
1. Business Impact Analysis
Before choosing a strategy, understand the business requirements:
- Which systems are critical to business operations?
- What is the cost of downtime per hour?
- What is the cost of data loss?
- What compliance or regulatory requirements exist?
Map systems to appropriate RTO/RPO based on business impact.
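The output is often a small tiering table that the rest of the DR programme hangs off. A sketch with hypothetical systems and figures:

```python
from dataclasses import dataclass

@dataclass
class SystemTier:
    name: str
    downtime_cost_per_hour: int  # lost revenue or productivity, in your currency
    rto_hours: float
    rpo_hours: float
    strategy: str

# Hypothetical mapping produced by a business impact analysis.
tiers = [
    SystemTier("checkout", 50_000, rto_hours=1, rpo_hours=0, strategy="active-active"),
    SystemTier("reporting", 500, rto_hours=24, rpo_hours=4, strategy="pilot light"),
    SystemTier("archive", 50, rto_hours=72, rpo_hours=24, strategy="backup & restore"),
]

for tier in tiers:
    print(f"{tier.name}: RTO {tier.rto_hours}h, RPO {tier.rpo_hours}h -> {tier.strategy}")
```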
2. DR Architecture Patterns
Data replication:
- Synchronous — the write is acknowledged only after the replica has it (zero data loss, higher write latency)
- Asynchronous — the write is acknowledged before replication completes (data loss bounded by the replication lag, lower latency)
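The trade-off shows up even in a toy sketch: a synchronous write does not acknowledge until the replica holds the data, while an asynchronous write acknowledges immediately and replicates in the background, so anything still queued at the moment of failure is lost.

```python
import queue
import threading
import time

replica: list[str] = []   # stand-in for the copy in the DR region
backlog = queue.Queue()   # asynchronous replication queue

def write_sync(record: str) -> None:
    """Acknowledge only after the replica has the record: RPO ~ 0, higher write latency."""
    replica.append(record)   # pretend this is a cross-region network call
    print(f"sync ack: {record}")

def write_async(record: str) -> None:
    """Acknowledge immediately; anything still in `backlog` at failure time is lost."""
    backlog.put(record)
    print(f"async ack: {record}")

def replication_worker() -> None:
    while True:
        replica.append(backlog.get())   # drains the queue some time after the ack

threading.Thread(target=replication_worker, daemon=True).start()
write_sync("txn-1")
write_async("txn-2")
time.sleep(0.1)   # give the background worker a moment before the process exits
```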
DNS failover:
- Use low TTL values to enable faster failover
- Implement health checks to automate DNS changes
- Consider using Route 53, Cloud DNS, or Traffic Manager
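On Route 53, the automated building block is a health check whose ID is attached to failover-routing records; the hostnames below are hypothetical.

```python
import boto3

route53 = boto3.client("route53")

# Health check evaluated by Route 53 from its own edge locations.
response = route53.create_health_check(
    CallerReference="primary-app-healthz-1",   # any unique string per check
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary.app.example.com",
        "ResourcePath": "/healthz",
        "Port": 443,
        "RequestInterval": 30,   # seconds between checks
        "FailureThreshold": 3,   # consecutive failures before "unhealthy"
    },
)
print(response["HealthCheck"]["Id"])
```

Referencing the returned health check ID from a PRIMARY failover record, with a SECONDARY record pointing at the DR endpoint, lets Route 53 switch DNS automatically when the primary goes unhealthy.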
Database replication:
- Cross-region read replicas
- Multi-region database services (Aurora Global Database, Spanner, CockroachDB)
- Backup and restore for less critical data
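For cross-region read replicas, failover means promoting the replica to a standalone, writable instance. A minimal sketch assuming an Amazon RDS replica with a hypothetical identifier:

```python
import boto3

def promote_dr_replica(replica_id: str = "myapp-db-replica-eu-west-1") -> None:
    """Detach the cross-region read replica and make it the writable primary."""
    rds = boto3.client("rds", region_name="eu-west-1")
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)
    # Promotion is asynchronous; wait until the instance reports 'available'.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=replica_id)
```

Promotion breaks replication from the old primary, so failing back later means creating a fresh replica in the original region.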
3. Infrastructure as Code
DR benefits enormously from Infrastructure as Code:
- Consistent environments between primary and DR
- Rapid provisioning of DR resources
- Version-controlled, auditable configurations
- Tested infrastructure deployments
Store IaC in a separate location from primary infrastructure (different region, different cloud, or external repository).
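One low-effort way to satisfy that last point is to mirror the IaC repository to a second git remote outside the primary account. A sketch using plain git via subprocess, with a hypothetical remote URL:

```python
import subprocess

# Hypothetical secondary remote hosted outside the primary cloud account.
DR_REMOTE = "git@backup-git.example.com:infra/iac.git"

def mirror_iac_repo(repo_path: str = ".") -> None:
    """Push all branches and tags of the IaC repository to the DR remote."""
    subprocess.run(
        ["git", "push", "--mirror", DR_REMOTE],
        cwd=repo_path,
        check=True,
    )

mirror_iac_repo()
```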
4. Runbooks and Documentation
Create detailed runbooks for DR scenarios:
- Step-by-step failover procedures
- Verification checklists
- Communication templates
- Rollback procedures
- Contact lists for key personnel
Keep runbooks accessible outside your primary infrastructure (printed copies, external wiki, mobile-accessible).
Testing Disaster Recovery
A DR plan that hasn’t been tested is just a hypothesis.
Types of DR Tests
| Test Type | Description | Frequency | Impact |
|---|---|---|---|
| Tabletop exercise | Walk through scenarios verbally | Quarterly | None |
| Checklist review | Verify runbooks and documentation | Monthly | None |
| Simulation | Practice procedures without actual failover | Quarterly | Low |
| Parallel test | Bring up DR environment, verify functionality | Semi-annually | Low |
| Full failover | Actually switch production to DR | Annually | Medium-High |
Testing Best Practices
- Schedule tests regularly and treat them as mandatory
- Test during business hours initially (more people available)
- Document everything that happens during the test
- Conduct a retrospective after each test
- Update runbooks based on findings
- Gradually increase test complexity and realism
Game Days
Run planned “game days” where you intentionally simulate failures:
- Fail over to DR region
- Kill critical services
- Corrupt or delete data (in test environments)
- Simulate network partitions
These exercises build confidence and identify gaps before real disasters occur.
Chaos Engineering
Chaos engineering takes testing further by continuously injecting failures into systems.
Principles:
- Define steady state (what normal looks like)
- Hypothesise that steady state will continue during experiments
- Inject real-world events (server failures, network issues, dependency failures)
- Look for differences between control and experiment
Tools:
- Chaos Monkey — Randomly terminates instances (Netflix)
- Gremlin — Failure as a service
- Litmus — Kubernetes-native chaos engineering
- Chaos Mesh — Cloud-native chaos engineering platform
- AWS Fault Injection Simulator
Start small: inject minor failures in non-production, then gradually increase scope and move toward production.
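A first experiment can be very small, in the spirit of Chaos Monkey: pick one instance from a non-production group at random, terminate it, and watch whether the steady-state metric holds. A sketch assuming AWS via boto3 and a hypothetical environment tag; run it only against environments you are prepared to break.

```python
import random
import boto3

def terminate_random_instance(env_tag: str = "staging") -> str | None:
    """Chaos-Monkey-style experiment: kill one random running instance in an environment."""
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:environment", "Values": [env_tag]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return None
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim   # record the victim so the experiment is auditable
```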
Common Failure Scenarios
Infrastructure Failures
- Region or availability zone outage
- Data centre power or cooling failure
- Network connectivity loss
- Hardware failures (disk, server, network equipment)
Application Failures
- Deployment gone wrong
- Configuration errors
- Dependency failures (third-party services, APIs)
- Resource exhaustion (memory, connections, disk)
Data Failures
- Accidental deletion
- Data corruption
- Ransomware or malicious attacks
- Schema migration failures
Human Factors
- Operator error
- Security breaches
- Knowledge loss (key person leaves)
Backup Best Practices
Backups are the foundation of most DR strategies.
The 3-2-1 Rule
- 3 copies of your data
- 2 different storage types
- 1 copy offsite (different region or cloud)
Backup Verification
- Regularly test restores (not just backup completion)
- Verify data integrity after restore
- Measure actual restore time
- Automate verification where possible
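A sketch of an automated restore check, assuming the backup is a single file whose SHA-256 digest was recorded at backup time; the paths below are hypothetical. It restores to a scratch location, verifies the digest, and records how long the restore actually took.

```python
import hashlib
import shutil
import time
from pathlib import Path

def verify_restore(backup: Path, scratch: Path, expected_sha256: str) -> float:
    """Restore a backup to a scratch path, verify its digest, return restore time in seconds."""
    scratch.parent.mkdir(parents=True, exist_ok=True)

    start = time.monotonic()
    shutil.copy2(backup, scratch)   # stand-in for the real restore procedure
    elapsed = time.monotonic() - start

    digest = hashlib.sha256(scratch.read_bytes()).hexdigest()
    if digest != expected_sha256:
        raise RuntimeError(f"integrity check failed for {backup}")
    return elapsed

# Self-contained demo: fake a backup file and the digest recorded at backup time.
backup_file = Path("/tmp/backups/db-2024-01-15.sql.gz")
backup_file.parent.mkdir(parents=True, exist_ok=True)
backup_file.write_bytes(b"pretend this is a database dump")
recorded_digest = hashlib.sha256(backup_file.read_bytes()).hexdigest()

seconds = verify_restore(backup_file, Path("/tmp/restore-test/db.sql.gz"), recorded_digest)
print(f"restore took {seconds:.1f}s")
```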
Backup Security
- Encrypt backups at rest and in transit
- Protect backup credentials separately from primary credentials
- Use immutable backups where possible (protection against ransomware)
- Implement access controls and audit logging
Cloud-Specific Considerations
AWS
- AWS's published DR strategy guidance (the four approaches described above)
- Multi-AZ deployments for high availability
- Cross-region replication for DR
- AWS Backup for centralised backup management
- Route 53 health checks and DNS failover
Google Cloud
- Regional and multi-regional resources
- Cloud SQL cross-region replicas
- Spanner for global distribution
- Cloud DNS with geolocation routing
Azure
- Availability Zones and paired regions
- Azure Site Recovery for VM replication
- Geo-redundant storage (GRS)
- Traffic Manager for DNS-based failover