Why I'm writing about this
About a month ago, I was planning the migration of my product to a new cloud environment.
I was mapping out all the integrations that came with the product I'd recently taken over. And when I asked how one of them worked, my dev team said: "We actually don't know. A different team manages that."
But which team? Where? Nobody knew.
And my brain immediately went to: what happens if it breaks? How do we fix something we don't even understand, with no one to call?
I think a lot of you will recognise this. Whether you inherited a product or built one from scratch, there's always a point where you have to ask: what's our plan when things go wrong?
Because if you don't have a plan, panic takes over. And panic is not a recovery strategy.
What is disaster recovery?
Disaster recovery (DR) is the set of documented processes your team follows to restore normal operations after a serious incident.
Not "we'll figure it out." A plan. Written down. Tested.
It covers things like:
→ A database going down or getting corrupted
→ An API failing under load or being taken out by a DDoS attack
→ A third-party integration breaking
→ A cyberattack or ransomware event
→ Human error during a deployment or migration
If those sound abstract, here are two real-world examples you've probably heard about:
CrowdStrike (2024): A faulty software update pushed to millions of Windows machines caused what was widely described as the largest IT outage in history. Banks, airlines, hospitals, and broadcasters all went offline. For many machines, the fix meant booting into recovery mode and removing the faulty file by hand, one machine at a time. Companies without clear recovery procedures were paralysed for days.
Marks & Spencer (2025): A cyberattack knocked out their online ordering system for weeks. Click-and-collect gone. Online checkout gone. Estimated losses ran into hundreds of millions of pounds. A well-rehearsed DR plan doesn't prevent attacks, but it dramatically shortens recovery time.
The point isn't to scare you. It's this: disasters happen to everyone. The difference between a bad day and a catastrophe is whether procedure takes over from panic.
How to create your disaster recovery plan
Step 1: Map your architecture
You can't protect what you don't understand.
Start with a clear picture of your application and all its components: the database, APIs, integrations, infrastructure, and any third-party dependencies.
If you inherited a product, this is your starting point. If gaps exist (like my mystery integration), that's the first thing to fix.
→ Draw or document your app's components and how they connect
→ Note who owns each one, internally and externally
→ Flag any areas where knowledge is missing or sits with one person only
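One way to keep that map honest is to store it as data rather than only as a diagram. Here's a minimal Python sketch, assuming an inventory format I made up for illustration: the `Component` fields, the component names, and the owners are all hypothetical, not taken from any tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Component:
    """One box on the architecture map (hypothetical format)."""
    name: str
    owner: Optional[str] = None          # owning team; None means nobody knows
    depends_on: list = field(default_factory=list)
    people_who_know_it: int = 0          # how many people can explain how it works


def knowledge_gaps(components):
    """Return the reasons each component needs attention, keyed by name."""
    known = {c.name for c in components}
    gaps = {}
    for c in components:
        reasons = []
        if c.owner is None:
            reasons.append("no known owner")
        if c.people_who_know_it == 0:
            reasons.append("nobody understands it")
        elif c.people_who_know_it == 1:
            reasons.append("knowledge sits with one person")
        for dep in c.depends_on:
            if dep not in known:
                reasons.append(f"undocumented dependency: {dep}")
        if reasons:
            gaps[c.name] = reasons
    return gaps


# Illustrative inventory; every name here is made up.
inventory = [
    Component("web-app", "product team", ["api"], 4),
    Component("api", "product team", ["database", "mystery-integration"], 3),
    Component("database", "platform team", [], 1),
    Component("mystery-integration", None, [], 0),
]

for name, reasons in knowledge_gaps(inventory).items():
    print(f"{name}: {', '.join(reasons)}")
```

Anything this prints is a gap to close before you move on to the next step.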
Step 2: Identify what can go wrong
Go component by component and ask: what's the realistic failure mode here?
Think across categories:
→ Infrastructure: server outage, cloud region failure, storage corruption
→ Data: database overload, data loss, failed backup
→ Network: DDoS attack, API rate limits breached, connectivity loss
→ Security: unauthorised access, ransomware, credential compromise
→ Human error: bad deployment, misconfiguration, accidental deletion
→ Third-party: external API goes down, vendor changes terms, integration breaks
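If you want to make sure no component gets skipped, the categories above can be turned into a review checklist. This is only a sketch: the `review_prompts` helper and the component names are my own illustration, and the failure modes are copied straight from the list above.

```python
# Failure categories and examples, copied from the list above.
FAILURE_MODES = {
    "Infrastructure": ["server outage", "cloud region failure", "storage corruption"],
    "Data": ["database overload", "data loss", "failed backup"],
    "Network": ["DDoS attack", "API rate limits breached", "connectivity loss"],
    "Security": ["unauthorised access", "ransomware", "credential compromise"],
    "Human error": ["bad deployment", "misconfiguration", "accidental deletion"],
    "Third-party": ["external API goes down", "vendor changes terms", "integration breaks"],
}


def review_prompts(components):
    """Build one review question per (component, category) pair."""
    prompts = []
    for component in components:
        for category, examples in FAILURE_MODES.items():
            prompts.append(
                f"{component} / {category}: could any of these realistically happen? "
                f"({', '.join(examples)})"
            )
    return prompts


# Hypothetical component names, for illustration only.
for prompt in review_prompts(["database", "orders API"]):
    print(prompt)
```

Not every pair will apply (a frontend rarely has a failed backup), but asking the question is quicker than discovering the answer during an incident.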
Step 3: Define business criticality
Not every failure needs the same response. Prioritise by impact.
For each scenario, ask:
→ How many users does this affect?
→ Does it stop the business from operating?
→ Is there a regulatory or financial consequence?
High criticality = you need a documented recovery strategy. Low criticality = monitor and fix in normal working hours.
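The three questions can be sketched as a simple classifier. Treat this as one possible rule, not the rule: the `Scenario` fields, the example scenarios, and the 50% user-share threshold are assumptions I picked for illustration, and your own cut-offs will differ.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    users_affected: int             # "How many users does this affect?"
    stops_business: bool            # "Does it stop the business from operating?"
    regulatory_or_financial: bool   # "Is there a regulatory or financial consequence?"


def criticality(scenario, total_users, user_share_threshold=0.5):
    """Classify a scenario as 'high' or 'low'. The threshold is an assumed default."""
    if scenario.stops_business or scenario.regulatory_or_financial:
        return "high"
    share = scenario.users_affected / total_users if total_users else 0.0
    return "high" if share >= user_share_threshold else "low"


# Made-up scenarios for illustration.
scenarios = [
    Scenario("database unavailable", 400, True, True),
    Scenario("email notifications delayed", 30, False, False),
]
for s in scenarios:
    print(f"{s.name}: {criticality(s, total_users=400)}")
```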
Step 4: Build a recovery strategy for each high-priority scenario
For each critical disaster type, define:
→ What happens (the specific failure)
→ Who responds (named roles, not just "the dev team")
→ How long recovery is expected to take (your Recovery Time Objective, or RTO)
→ What's next after recovery (post-incident review, communication to stakeholders)
Write it down. Make it accessible. Review it when your product changes significantly.
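To check that a strategy really is "written down", you could capture the four items above as a record and validate it. A rough sketch follows: the `RecoveryStrategy` class, its field names, and the list of "too vague" responder labels are all hypothetical choices of mine.

```python
from dataclasses import dataclass
from datetime import timedelta

# Responder labels treated as too vague (an assumed list; extend as needed).
VAGUE_RESPONDERS = {"the dev team", "dev team", "someone", "tbd"}


@dataclass
class RecoveryStrategy:
    what_happens: str     # the specific failure
    who_responds: list    # named roles
    rto: timedelta        # Recovery Time Objective
    whats_next: list      # post-incident review, stakeholder communication, ...

    def problems(self):
        """List what is missing before this strategy counts as documented."""
        found = []
        if not self.what_happens.strip():
            found.append("describe the specific failure")
        if not self.who_responds:
            found.append("name at least one responder role")
        for role in self.who_responds:
            if role.strip().lower() in VAGUE_RESPONDERS:
                found.append(f"'{role}' is not a named role")
        if self.rto <= timedelta(0):
            found.append("set a positive RTO")
        if not self.whats_next:
            found.append("add post-recovery steps")
        return found


# A deliberately incomplete draft, for illustration.
draft = RecoveryStrategy(
    what_happens="Order API returns errors under load",
    who_responds=["the dev team"],
    rto=timedelta(hours=4),
    whats_next=[],
)
print(draft.problems())
```

An empty list means the draft covers all four items. It says nothing about whether the plan works, which is what testing is for.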
A worked example
Let's say your product is an internal order management system. It processes thousands of orders from suppliers, handles material purchasing, and is used daily by procurement and operations teams.
Step 1 - Architecture overview
→ Frontend: web app used by procurement and operations teams
→ Backend: REST API service handling order processing logic
→ Database: PostgreSQL storing all order and supplier data
→ External integrations: supplier EDI connections, ERP system, email notification service
→ Infrastructure: cloud-hosted, single region
Steps 2 & 3 - Disaster scenarios
Step 4 - Recovery strategy: database unavailable
Scenario: PostgreSQL instance goes down. All order data becomes inaccessible. Procurement and operations teams cannot view, create, or update orders.
Recovery Time Objective (RTO): 2 hours
Recovery Point Objective (RPO), the maximum amount of data you can afford to lose: back to the last automated backup (max 1 hour of data loss)
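Those two targets are easy to check with a bit of date arithmetic, both before an incident (is the backup schedule tight enough?) and after one (did we hit the targets?). Here's a small sketch: the 2-hour and 1-hour values come from the example above, while the function names and the incident timeline are made up.

```python
from datetime import datetime, timedelta

RTO = timedelta(hours=2)   # from the example: service restored within 2 hours
RPO = timedelta(hours=1)   # from the example: at most 1 hour of data lost


def backup_schedule_meets_rpo(backup_interval, rpo=RPO):
    """Worst case, the failure hits just before the next backup,
    so the data lost equals one full backup interval."""
    return backup_interval <= rpo


def review_incident(last_backup, failed_at, restored_at, rto=RTO, rpo=RPO):
    """Compare one incident's actual downtime and data loss with the targets."""
    downtime = restored_at - failed_at
    data_loss = failed_at - last_backup
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "met_rto": downtime <= rto,
        "met_rpo": data_loss <= rpo,
    }


# Hypothetical incident timeline.
result = review_incident(
    last_backup=datetime(2025, 3, 4, 9, 0),
    failed_at=datetime(2025, 3, 4, 9, 40),
    restored_at=datetime(2025, 3, 4, 12, 10),
)
print("Met RTO:", result["met_rto"], "| Met RPO:", result["met_rpo"])
print("Hourly backups meet RPO:", backup_schedule_meets_rpo(timedelta(hours=1)))
```

In this made-up timeline the data loss is 40 minutes (inside the RPO) but the outage lasts 2.5 hours (outside the RTO), which is exactly the kind of finding a post-incident review should surface.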
Download the template
I've put together a simple disaster recovery template you can adapt for your own product.
It covers architecture mapping, disaster identification, criticality scoring, and recovery strategy format.
[CLICK HERE TO DOWNLOAD]