What happens when your product breaks and no one knows why

Hi Reader,

Today I have another story from the battlefield.

It's the kind of work that doesn't feel exciting but saves you from a disaster. Literally.

Do you have a plan for when something goes seriously wrong with your product?

A database gets overloaded. An API goes down. A cyberattack hits. A critical integration just... stops working.

What do you do?

That's what we're talking about today.

Today in 10 minutes you will:

Understand what disaster recovery actually means for internal PMs
Learn the four steps to build your own DR plan
See a worked example with architecture, disaster types, and recovery strategies
Download a free template to start mapping your own

Why I'm writing about this

About a month ago, I was planning the migration of my product to a new cloud environment.

I was mapping out all the integrations that came with the product I'd recently taken over. And when I asked how one of them worked, my dev team said: "We actually don't know. A different team manages that."

But which team? Where? Nobody knew.

And my brain immediately went to: what happens if it breaks? How do we fix something we don't even understand, with no one to call?

I think a lot of you will recognise this. Whether you inherited a product or built one from scratch, there's always a point where you have to ask: what's our plan when things go wrong?

Because if you don't have a plan, panic takes over. And panic is not a recovery strategy.

What is disaster recovery?

Disaster recovery (DR) is the set of documented processes your team follows to restore normal operations after a serious incident.

Not "we'll figure it out." A plan. Written down. Tested.

It covers things like:

→ A database going down or getting corrupted

→ An API failing under load or being taken out by a DDoS attack

→ A third-party integration breaking

→ A cyberattack or ransomware event

→ Human error during a deployment or migration

If those sound abstract, here are two real-world examples you've probably heard about:

CrowdStrike (2024): A faulty software update pushed to millions of Windows machines caused what was widely described as the largest IT outage in history. Banks, airlines, hospitals, and broadcasters all went offline. The fix required manually rebooting and repairing affected machines one by one. Companies without clear recovery procedures were paralyzed for days.

Marks & Spencer (2025): A cyberattack knocked out their online ordering system for weeks. Click-and-collect gone. Online checkout gone. Estimated losses ran into hundreds of millions of pounds. A well-rehearsed DR plan doesn't prevent attacks, but it can dramatically shorten recovery time.

The point isn't to scare you. It's this: disasters happen to everyone. The difference between a bad day and a catastrophe is whether procedure takes over from panic.

How to create your disaster recovery plan

Step 1: Map your architecture

You can't protect what you don't understand.

Start with a clear picture of your application and all its components: the database, APIs, integrations, infrastructure, and any third-party dependencies.

If you inherited a product, this is your starting point. If gaps exist (like my mystery integration), that's the first thing to fix.

→ Draw or document your app's components and how they connect

→ Note who owns each one, internally and externally

→ Flag any areas where knowledge is missing or sits with one person only
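If a diagram feels like too much to start with, a simple inventory works too. Here is a minimal sketch in Python, assuming a made-up `Component` record and invented sample entries (none of this is a standard tool). It lists each component, its owner, and how many people understand it, then flags the gaps.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Component:
    """One box on the architecture map (all fields are illustrative assumptions)."""
    name: str
    owner: Optional[str] = None        # accountable team, None if nobody knows
    depends_on: List[str] = field(default_factory=list)
    people_who_know_it: int = 0        # how many people can explain how it works


def find_gaps(components: List[Component]) -> List[str]:
    """Return warnings for missing owners and knowledge held by one person or fewer."""
    gaps = []
    for c in components:
        if c.owner is None:
            gaps.append(f"{c.name}: no known owner")
        if c.people_who_know_it <= 1:
            gaps.append(f"{c.name}: knowledge sits with {c.people_who_know_it} person(s)")
    return gaps


# Hypothetical inventory, loosely based on the mystery integration story.
inventory = [
    Component("Web app", owner="Product team", depends_on=["API"], people_who_know_it=4),
    Component("API", owner="Product team",
              depends_on=["Database", "Mystery integration"], people_who_know_it=3),
    Component("Database", owner="Platform team", people_who_know_it=2),
    Component("Mystery integration", owner=None, people_who_know_it=0),
]

for warning in find_gaps(inventory):
    print(warning)
```

A spreadsheet with the same four columns does the job just as well. The point is that the gaps become a list you can act on.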

Step 2: Identify what can go wrong

Go component by component and ask: what's the realistic failure mode here?

Think across categories:

→ Infrastructure: server outage, cloud region failure, storage corruption

→ Data: database overload, data loss, failed backup

→ Network: DDoS attack, API rate limits breached, connectivity loss

→ Security: unauthorised access, ransomware, credential compromise

→ Human error: bad deployment, misconfiguration, accidental deletion

→ Third-party: external API goes down, vendor changes terms, integration breaks
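To make sure no component or category gets skipped, you can generate the review questions up front. This is only a sketch: the category names come from the list above, while the `review_grid` function and the component names are assumptions for illustration.

```python
from itertools import product
from typing import List

# The six categories listed above.
CATEGORIES = ["Infrastructure", "Data", "Network", "Security", "Human error", "Third-party"]


def review_grid(components: List[str]) -> List[str]:
    """Pair every component with every category as a prompt for the team."""
    return [
        f"{component} / {category}: what is the realistic failure mode?"
        for component, category in product(components, CATEGORIES)
    ]


# Hypothetical component names.
prompts = review_grid(["Database", "API"])
for prompt in prompts:
    print(prompt)
```

Many cells will end up as "not applicable", and that's fine. The grid is there so that the decision to skip something is deliberate.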

Step 3: Define business criticality

Not every failure needs the same response. Prioritise by impact.

For each scenario, ask:

→ How many users does this affect?

→ Does it stop the business from operating?

→ Is there a regulatory or financial consequence?

High criticality = you need a documented recovery strategy. Low criticality = monitor and fix in normal working hours.
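The three questions can be turned into a simple scoring rule. The sketch below is one possible rule, not the only one: it assumes that a clear "yes" to any question means high criticality, and that "many users" means half or more (the `user_threshold` default). Adjust both assumptions to your own context.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    share_of_users_affected: float        # estimate between 0.0 and 1.0
    stops_business: bool
    regulatory_or_financial_impact: bool


def criticality(s: Scenario, user_threshold: float = 0.5) -> str:
    """Return 'high' if any of the three questions gets a clear yes, else 'low'.

    The any-yes rule and the 0.5 threshold are illustrative assumptions.
    """
    if s.stops_business or s.regulatory_or_financial_impact:
        return "high"
    if s.share_of_users_affected >= user_threshold:
        return "high"
    return "low"


# Hypothetical scenarios with estimated numbers.
scenarios = [
    Scenario("Database unavailable", 1.0, True, True),
    Scenario("Notification email delayed", 0.1, False, False),
]
for s in scenarios:
    print(f"{s.name}: {criticality(s)}")
```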

Step 4: Build a recovery strategy for each high-priority scenario

For each critical disaster type, define:

→ What happens (the specific failure)

→ Who responds (named roles, not just "the dev team")

→ How long recovery is expected to take (your Recovery Time Objective, or RTO)

→ What's next after recovery (post-incident review, communication to stakeholders)

Write it down. Make it accessible. Review it when your product changes significantly.
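A quick completeness check helps keep written strategies honest. Here is a minimal sketch, assuming a hypothetical `RecoveryStrategy` record with the four elements above. The list of "vague" responder names is my own assumption. Extend it with whatever your team tends to write.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class RecoveryStrategy:
    failure: str               # what happens
    responders: List[str]      # who responds (named roles)
    rto: timedelta             # how long recovery is expected to take
    follow_up: List[str]       # what's next after recovery


# Illustrative examples of responders that are too vague to act on.
VAGUE_RESPONDERS = {"the dev team", "dev team", "someone", "tbd"}


def missing_parts(s: RecoveryStrategy) -> List[str]:
    """List what is still missing before the strategy is usable."""
    problems = []
    if not s.failure.strip():
        problems.append("failure is not described")
    if not s.responders or any(r.strip().lower() in VAGUE_RESPONDERS for r in s.responders):
        problems.append("responders must be named roles")
    if s.rto <= timedelta(0):
        problems.append("RTO must be a positive duration")
    if not s.follow_up:
        problems.append("no post-recovery steps listed")
    return problems


# Hypothetical first draft of a strategy.
draft = RecoveryStrategy(
    failure="Database unavailable",
    responders=["the dev team"],
    rto=timedelta(hours=2),
    follow_up=[],
)
print(missing_parts(draft))
```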

A worked example

Let's say your product is an internal order management system. It processes thousands of orders from suppliers, handles material purchasing, and is used daily by procurement and operations teams.

Step 1 - Architecture overview

→ Frontend: web app used by procurement and operations teams

→ Backend: REST API service handling order processing logic

→ Database: PostgreSQL storing all order and supplier data

→ External integrations: supplier EDI connections, ERP system, email notification service

→ Infrastructure: cloud-hosted, single region

Steps 2 & 3 - Disaster scenarios

Step 4 - Recovery strategy: database unavailable

Scenario: PostgreSQL instance goes down. All order data becomes inaccessible. Procurement and operations teams cannot view, create, or update orders.

Recovery Time Objective (RTO): 2 hours

Recovery Point Objective (RPO): Last automated backup (max 1 hour of data loss)
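RTO and RPO are easier to grasp with numbers. The sketch below takes the two targets from this example (2 hours and 1 hour) and checks them against an incident timeline. The timestamps are invented for illustration, and `check_incident` is a hypothetical helper, not part of any tool.

```python
from datetime import datetime, timedelta

RTO = timedelta(hours=2)   # maximum acceptable downtime (from the example above)
RPO = timedelta(hours=1)   # maximum acceptable data loss (from the example above)


def check_incident(last_backup: datetime, outage_start: datetime,
                   service_restored: datetime) -> dict:
    """Compare an incident timeline against the RTO and RPO targets."""
    data_loss_window = outage_start - last_backup   # work done since the last backup
    downtime = service_restored - outage_start      # how long users were blocked
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= RPO,
        "rto_met": downtime <= RTO,
    }


# Invented timeline: backup at 09:00, outage at 09:40, restored at 11:10.
result = check_incident(
    last_backup=datetime(2025, 1, 15, 9, 0),
    outage_start=datetime(2025, 1, 15, 9, 40),
    service_restored=datetime(2025, 1, 15, 11, 10),
)
print(result)
```

In this made-up timeline, 40 minutes of data is lost and the service is down for 1 hour 30 minutes, so both targets are met. If backups ran only every 3 hours, the 1-hour RPO could not be guaranteed no matter how fast the team responds.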

Download the template

I've put together a simple disaster recovery template you can adapt for your own product.

It covers architecture mapping, disaster identification, criticality scoring, and recovery strategy format.

[CLICK HERE TO DOWNLOAD]

Behind the Scenes

I finally built it.

My first workshop for internal Product Managers, walking them through what discovery actually looks like for internal products.

Not the startup B2B/B2C discovery practices you hear about online.

The real practices that you can implement with your next feature or project. The steps that won't leave you lying awake at night wondering, "Did I miss anything? Will it be a disaster?"

It takes place on June 11 at 2 PM CET. Check it out here: https://workshop.mariakorteleva.com/

What about you?

Do you have a disaster recovery plan for your product, or is it mostly "we'd figure it out"?

Hit reply and let me know. I'm curious how many internal PMs have actually done this exercise.

See you next week,

Maria


Maria Korteleva