What happens when your product breaks and no one knows why


Hi Reader,

Today's issue is another story from the battlefield.

The kind of work that doesn't feel exciting, but the kind that saves you from a disaster. Literally.

Do you have a plan for when something goes seriously wrong with your product?

A database gets overloaded. An API goes down. A cyberattack hits. A critical integration just... stops working.

What do you do?

That's what we're talking about today.

Today in 10 minutes you will:

  • Understand what disaster recovery actually means for internal PMs
  • Learn the four steps to build your own DR plan
  • See a worked example with architecture, disaster types, and recovery strategies
  • Download a free template to start mapping your own

Why I'm writing about this

About a month ago, I was planning the migration of my product to a new cloud environment.

I was mapping out all the integrations that came with the product I'd recently taken over. And when I asked how one of them worked, my dev team said: "We actually don't know. A different team manages that."

But which team? Where? Nobody knew.

And my brain immediately went to: what happens if it breaks? How do we fix something we don't even understand, with no one to call?

I think a lot of you will recognise this. Whether you inherited a product or built one from scratch, there's always a point where you have to ask: what's our plan when things go wrong?

Because if you don't have a plan, panic takes over. And panic is not a recovery strategy.


What is disaster recovery?

Disaster recovery (DR) is the set of documented processes your team follows to restore normal operations after a serious incident.

Not "we'll figure it out." A plan. Written down. Tested.

It covers things like:

→ A database going down or getting corrupted

→ An API failing under load or being taken out by a DDoS attack

→ A third-party integration breaking

→ A cyberattack or ransomware event

→ Human error during a deployment or migration

If those sound abstract, here are two real-world examples you've probably heard about:

CrowdStrike (2024): A faulty software update pushed to millions of Windows machines caused what is widely described as the largest IT outage in history. Banks, airlines, hospitals and broadcasters all went offline. In many cases the fix meant someone working on each affected machine by hand, one by one. Companies without clear recovery procedures were paralysed for days.

Marks & Spencer (2025): A cyberattack knocked out their online ordering system for weeks. Click-and-collect gone. Online checkout gone. Estimated losses ran into hundreds of millions of pounds. A well-rehearsed DR plan doesn't prevent attacks, but it can dramatically shorten recovery time.

The point isn't to scare you. It's this: disasters happen to everyone. The difference between a bad day and a catastrophe is whether procedure takes over from panic.


How to create your disaster recovery plan

Step 1: Map your architecture

You can't protect what you don't understand.

Start with a clear picture of your application and all its components: the database, APIs, integrations, infrastructure, and any third-party dependencies.

If you inherited a product, this is your starting point. If gaps exist (like my mystery integration), that's the first thing to fix.

→ Draw or document your app's components and how they connect

→ Note who owns each one, internally and externally

→ Flag any areas where knowledge is missing or sits with one person only
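If you want this map somewhere more structured than a diagram, here is a minimal Python sketch of the ownership check. The `Component` class, the `knowledge_gaps` function and all component and owner names are hypothetical; the point is simply to make "no known owner" and "only one person knows this" visible at a glance.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One piece of the product: a database, API, integration, etc."""
    name: str
    kind: str
    # People or teams who can actually fix it when it breaks.
    owners: list[str] = field(default_factory=list)


def knowledge_gaps(components: list[Component]) -> dict[str, str]:
    """Flag components with no known owner, or only a single owner."""
    gaps = {}
    for c in components:
        if not c.owners:
            gaps[c.name] = "no known owner"
        elif len(c.owners) == 1:
            gaps[c.name] = f"knowledge sits with one person: {c.owners[0]}"
    return gaps


# Hypothetical example, including a "we actually don't know" integration.
architecture = [
    Component("web-frontend", "frontend", owners=["Alex", "Sam"]),
    Component("order-api", "api", owners=["Alex"]),
    Component("orders-db", "database", owners=["Platform team", "Sam"]),
    Component("legacy-integration", "integration"),
]

print(knowledge_gaps(architecture))
# {'order-api': 'knowledge sits with one person: Alex', 'legacy-integration': 'no known owner'}
```

Every flagged entry is a question to chase before a disaster, not during one.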

Step 2: Identify what can go wrong

Go component by component and ask: what's the realistic failure mode here?

Think across categories:

Infrastructure: server outage, cloud region failure, storage corruption

Data: database overload, data loss, failed backup

Network: DDoS attack, API rate limits breached, connectivity loss

Security: unauthorised access, ransomware, credential compromise

Human error: bad deployment, misconfiguration, accidental deletion

Third-party: external API goes down, vendor changes terms, integration breaks
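To make sure no category gets skipped for any component, you can turn the list above into a checklist with one row per component and category, and work through it with the team. A minimal Python sketch; the function names and the `realistic` field are hypothetical.

```python
# The six categories and example failure modes from the list above.
FAILURE_CATEGORIES = {
    "Infrastructure": ["server outage", "cloud region failure", "storage corruption"],
    "Data": ["database overload", "data loss", "failed backup"],
    "Network": ["DDoS attack", "API rate limits breached", "connectivity loss"],
    "Security": ["unauthorised access", "ransomware", "credential compromise"],
    "Human error": ["bad deployment", "misconfiguration", "accidental deletion"],
    "Third-party": ["external API goes down", "vendor changes terms", "integration breaks"],
}


def build_checklist(components: list[str]) -> list[dict]:
    """One row per (component, category) pair for the team to review."""
    return [
        {
            "component": component,
            "category": category,
            "examples": examples,
            "realistic": None,  # the team sets True/False during the review
        }
        for component in components
        for category, examples in FAILURE_CATEGORIES.items()
    ]


def open_items(rows: list[dict]) -> list[dict]:
    """Rows nobody has assessed yet."""
    return [row for row in rows if row["realistic"] is None]


rows = build_checklist(["orders-db", "order-api"])
rows[0]["realistic"] = True  # infrastructure failure of orders-db: realistic
print(len(rows), len(open_items(rows)))  # 12 11
```

The rows marked realistic become your disaster scenarios for the next step.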

Step 3: Define business criticality

Not every failure needs the same response. Prioritise by impact.

For each scenario, ask:

→ How many users does this affect?

→ Does it stop the business from operating?

→ Is there a regulatory or financial consequence?

High criticality = you need a documented recovery strategy. Low criticality = monitor and fix in normal working hours.
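The three questions can become a simple, repeatable rule so every scenario is scored the same way. A minimal Python sketch; the 50% user threshold and the example scenarios are hypothetical, so tune the rule to what "critical" means in your business.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    users_affected: int            # how many users lose the ability to work
    stops_business: bool           # does it stop the business from operating?
    regulatory_or_financial: bool  # fines, audit findings, lost revenue?


# Hypothetical rule: "high" if the business stops, if there is a regulatory
# or financial consequence, or if more than half of all users are affected.
USER_SHARE_THRESHOLD = 0.5

RESPONSE = {
    "high": "document a recovery strategy",
    "low": "monitor and fix in normal working hours",
}


def criticality(s: Scenario, total_users: int) -> str:
    if s.stops_business or s.regulatory_or_financial:
        return "high"
    if total_users > 0 and s.users_affected / total_users > USER_SHARE_THRESHOLD:
        return "high"
    return "low"


scenarios = [
    Scenario("Order database unavailable", 300, True, True),
    Scenario("Email notifications delayed", 40, False, False),
]
for s in scenarios:
    level = criticality(s, total_users=300)
    print(f"{s.name}: {level} -> {RESPONSE[level]}")
```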

Step 4: Build a recovery strategy for each high-priority scenario

For each critical disaster type, define:

What happens (the specific failure)

Who responds (named roles, not just "the dev team")

How long recovery is expected to take (your Recovery Time Objective, or RTO)

What's next after recovery (post-incident review, communication to stakeholders)

Write it down. Make it accessible. Review it when your product changes significantly.
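A recovery strategy is essentially a small record with four required fields, so a quick completeness check can catch the classic gaps, such as "the dev team" as the only responder or nothing planned after recovery. A minimal Python sketch; the `RecoveryStrategy` class and the list of vague responder names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical list of answers that don't name a real role.
VAGUE_RESPONDERS = {"the dev team", "dev team", "it", "someone", "tbd"}


@dataclass
class RecoveryStrategy:
    what_happens: str        # the specific failure
    who_responds: list[str]  # named roles, e.g. "On-call backend engineer"
    rto: timedelta           # Recovery Time Objective
    next_steps: list[str]    # e.g. post-incident review, stakeholder update

    def problems(self) -> list[str]:
        """List what is still missing before the strategy is complete."""
        issues = []
        if not self.what_happens.strip():
            issues.append("describe the specific failure")
        if not self.who_responds:
            issues.append("name at least one responder role")
        for who in self.who_responds:
            if who.strip().lower() in VAGUE_RESPONDERS:
                issues.append(f"'{who}' is not a named role")
        if self.rto <= timedelta(0):
            issues.append("set a positive RTO")
        if not self.next_steps:
            issues.append("add what happens after recovery")
        return issues


draft = RecoveryStrategy("PostgreSQL down", ["The dev team"], timedelta(hours=2), [])
print(draft.problems())
# ["'The dev team' is not a named role", 'add what happens after recovery']
```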

A worked example

Let's say your product is an internal order management system. It processes thousands of orders from suppliers, handles material purchasing, and is used daily by procurement and operations teams.

Step 1 - Architecture overview

→ Frontend: web app used by procurement and operations teams

→ Backend: REST API service handling order processing logic

→ Database: PostgreSQL storing all order and supplier data

→ External integrations: supplier EDI connections, ERP system, email notification service

→ Infrastructure: cloud-hosted, single region
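Once the components are listed, it helps to note which ones depend on which, so you can see how far a single failure spreads before moving on to scenarios. A minimal Python sketch for the example system; the dependency edges are assumptions (for instance, that frontend, backend and database all run in the same single cloud region). It is deliberately pessimistic: every dependency is treated as a hard one, so a component that would merely degrade still counts as impacted.

```python
# Assumed dependencies for the example order management system:
# component -> components it needs in order to work.
DEPENDS_ON = {
    "frontend": ["backend", "cloud region"],
    "backend": ["database", "supplier EDI", "ERP", "email service", "cloud region"],
    "database": ["cloud region"],
    "supplier EDI": [],
    "ERP": [],
    "email service": [],
    "cloud region": [],
}


def impacted_by(failed: str, depends_on: dict[str, list[str]] = DEPENDS_ON) -> set[str]:
    """Every component that directly or indirectly depends on `failed`."""
    impacted: set[str] = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for component, deps in depends_on.items():
            if current in deps and component not in impacted:
                impacted.add(component)
                frontier.append(component)
    return impacted


print(sorted(impacted_by("database")))      # ['backend', 'frontend']
print(sorted(impacted_by("cloud region")))  # ['backend', 'database', 'frontend']
```

Running it for the cloud region shows why "single region" deserves its own scenario: everything goes down with it.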

Step 2 & 3 - Disaster scenarios

Step 4 - Recovery strategy: database unavailable

Scenario: PostgreSQL instance goes down. All order data becomes inaccessible. Procurement and operations teams cannot view, create, or update orders.

Recovery Time Objective (RTO): 2 hours

Recovery Point Objective (RPO, the maximum data loss you can accept): restore from the last automated backup (max 1 hour of data loss)
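Both objectives are easy to sanity-check with a little arithmetic. With hourly backups, the worst case is a failure just before the next backup runs, which loses almost a full hour of data: exactly what a 1-hour RPO allows. The 2-hour RTO is only credible if a restore drill actually finishes within 2 hours. A minimal Python sketch; the backup intervals and drill timings are hypothetical, and it ignores how long the backup itself takes to run.

```python
from datetime import datetime, timedelta

RTO = timedelta(hours=2)  # from the worked example
RPO = timedelta(hours=1)  # max 1 hour of data loss


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # Failing just before the next backup loses (almost) one full interval.
    return backup_interval


def data_lost(last_backup: datetime, failure_time: datetime) -> timedelta:
    # Actual loss in a real incident: everything written since the last backup.
    return failure_time - last_backup


def meets_objectives(backup_interval: timedelta, drill_restore_time: timedelta) -> dict[str, bool]:
    return {
        "rpo_met": worst_case_data_loss(backup_interval) <= RPO,
        "rto_met": drill_restore_time <= RTO,
    }


print(meets_objectives(timedelta(hours=1), timedelta(minutes=95)))
# {'rpo_met': True, 'rto_met': True}
print(meets_objectives(timedelta(hours=4), timedelta(hours=3)))
# {'rpo_met': False, 'rto_met': False}
print(data_lost(datetime(2025, 6, 1, 9, 0), datetime(2025, 6, 1, 9, 40)))
# 0:40:00
```

If a drill comes back over 2 hours, either the recovery steps or the RTO needs to change before a real incident tests it for you.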


Download the template

I've put together a simple disaster recovery template you can adapt for your own product.

It covers architecture mapping, disaster identification, criticality scoring, and recovery strategy format.

[CLICK HERE TO DOWNLOAD]

Behind the Scenes

I finally built it.

My first workshop for internal Product Managers, walking you through what discovery actually looks like for internal products.

Not the startup B2B/B2C discovery practices you hear about online.

The real practices you can apply to your next feature or project. The steps that stop you lying awake at night wondering, "Did I miss anything? Will it be a disaster?"

It takes place on June 11 at 2 PM CET. Check it out here: https://workshop.mariakorteleva.com/

What about you?

Do you have a disaster recovery plan for your product, or is it mostly "we'd figure it out"?

Hit reply and let me know. I'm curious how many internal PMs have actually done this exercise.

See you next week,

Maria

Frankfurt am Main, 60311, Germany

Maria Korteleva

Hi, I’m Maria. For the past 7 years, I’ve been building internal products across FMCG and tech companies. Now, I share everything I’ve learned to help junior PMs master delivery, from technical skills to stakeholder communication. Join 200+ Internal PMs who get weekly insights from the Build Internal Products newsletter.
