Monitoring Isn’t an Ops Problem


Hi Reader,

If you’re constantly getting incidents and tickets from users…
and your operations team is stuck reproducing issues without always finding the root cause…

You might need to invest in monitoring and observability for your product.

And if that sentence already makes you slightly uncomfortable, this newsletter is for you.

Today in 10 minutes you will:

  • Learn why monitoring is a product responsibility, not just an ops task
  • Understand the difference between incidents, problems, monitoring, and observability
  • Get a simple mental model for how these concepts connect
  • Learn what to monitor first, with practical examples at each level

My experience with monitoring

I was lucky enough to work with excellent observability and operations experts early in my career.

They taught me a distinction that completely changed how I think about incidents:

Fixing an incident is reactive.
Fixing a problem is strategic.

When I worked on manufacturing business processes, this distinction really mattered.

These systems were business-critical.
SLAs were tight.
Downtime was expensive, financially and emotionally.

Our job wasn’t just to get things back up and running.

Our job was to prevent the same incidents from happening again.

Because the risk, stress, and value lost every single time were simply too high.

The mistake I see teams make

The solution is not to design systems that never fail.

That’s unrealistic.

Incidents will happen.
Things will break.
Dependencies will behave in unexpected ways.

The real lever is this:

→ Design systems for monitoring and observability
→ Train teams to actually use them

That’s how you move from firefighting…
to learning…
to prevention.


The 4 concepts that made it click for me

I can’t explain the full discipline of operations in one newsletter.

But I can explain the parts you need to understand as a product manager.

Because throwing your product “over the wall” to operations is not good internal product management.
Your product is your responsibility across its entire lifecycle.

I boil it down to four concepts.


Monitoring

Tells you that something is wrong.

→ Is the system up?
→ Are errors or latency spiking?
→ Should someone pay attention right now?

Monitoring detects change.
It triggers alerts.
It doesn’t explain why.
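
To make this concrete: at its core, a monitoring check is just a comparison against a threshold. Here's a tiny Python sketch — the names and numbers are made up, not from any particular tool:

```python
# Minimal sketch of a monitoring check: compare a signal to a
# threshold and decide whether to alert. Names and values are illustrative.

def should_alert(error_rate: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Alert when the error rate exceeds the baseline by more than
    the tolerance. This says THAT something changed, not WHY."""
    return error_rate > baseline + tolerance

print(should_alert(error_rate=0.12, baseline=0.02))  # spike over baseline -> alert
print(should_alert(error_rate=0.02, baseline=0.02))  # normal reading -> no alert
```

Notice what's missing: nothing in that function knows which service broke or what changed. That's the gap observability fills.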


Observability

Tells you why something is wrong.

→ Where is the failure happening?
→ Which service, request, or dependency caused it?
→ What changed before things broke?

Observability turns signals into understanding.


Incidents

User impact happening now.

→ Things are broken or unusable
→ Speed and coordination matter
→ Goal: restore service quickly

Incidents are about response, not learning.


Problems

The underlying cause behind incidents.

→ Structural issues
→ Recurring failure patterns
→ Missing safeguards or visibility

Problems are where prevention happens.


How they connect

Monitoring, incidents, observability, and problems form a loop.

Monitoring → detects incidents
Incidents → show user impact
Observability → explains what happened
Problems → tell you what to fix so it doesn’t happen again

Every improvement feeds back into better monitoring.


You might notice that the whole loop starts with one thing: monitoring.

Without the right monitoring in place:
→ incidents are harder to spot
→ observability has nothing to work with
→ problems stay hidden

That’s why the next question isn’t “Do we monitor?”
It’s “What do we monitor first?”


The Monitoring Pyramid

Think of monitoring as a pyramid.

The bottom layers are boring and absolutely essential.
The top layers are powerful but useless without the foundation.


Level 1 – User impact (foundation)

Can users do their job?

→ Is the product reachable?
→ Are key flows working?
→ Are users experiencing failures or extreme slowness?

This is where incidents become visible.

What this practically means

At this level, you monitor user journeys, not systems.

Examples:

  1. Monitor critical errors on key user flows
    • Login
    • Submit / save / approve
    • Any “if this breaks, work stops” journey
    • Spikes in HTTP 5xx or 4xx on these paths
  2. Monitor application availability
    • Is the app up or down from a user perspective?
  3. Monitor user-side response time
    • Pages loading extremely slowly
    • Requests timing out
    • Sudden latency spikes

If you skip this level, your first alert will be a user ticket.
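
A synthetic check at this level boils down to: probe a critical flow, then classify what the user would experience. Here's a small Python sketch — the status codes and the 3-second threshold are illustrative assumptions, not a standard:

```python
# Sketch of a synthetic user-journey check: probe one critical flow
# (e.g. login) and classify the result from the user's perspective.
from typing import Optional

def classify_probe(status_code: Optional[int], elapsed_s: float) -> str:
    """Turn one probe of a key flow into a verdict. Thresholds are assumptions."""
    if status_code is None or status_code >= 500:
        return "DOWN"        # user cannot complete the flow at all
    if status_code >= 400 or elapsed_s > 3.0:
        return "DEGRADED"    # flow responds, but users are suffering
    return "OK"

print(classify_probe(200, 0.4))   # healthy
print(classify_probe(503, 0.1))   # server error -> DOWN
print(classify_probe(200, 7.2))   # extreme slowness -> DEGRADED
```

The point of the sketch: the check is framed around the journey ("can a user log in?"), not around any internal system.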


Level 2 – Application health

Is the application behaving as expected?

→ Is something degrading overall?
→ Is this affecting many users or just one?

This layer confirms: “Yes, this is a real issue.”

What this practically means

Here you monitor aggregated application signals.

Examples:

  1. Error rates
    • Percentage of failed requests
    • Trends over time, not individual errors
  2. Latency metrics
    • Average vs p95 / p99 response time
    • Sudden shifts compared to baseline
  3. Throughput
    • Request volume dropping or spiking unexpectedly

This level helps you distinguish:
“one user had a bad day”
vs
“the system is unhealthy.”


Level 3 – Services and dependencies

Which part of the system is failing?

→ Is it our service or a dependency?
→ Is one component dragging everything down?

This is where observability starts to become possible.

What this practically means

Here you break the app into parts.

Examples:

  1. Monitor individual services or APIs
    • Error rate per service
    • Latency per endpoint
  2. Monitor dependencies
    • External APIs
    • Internal downstream systems
    • Message queues or brokers
  3. Monitor retries and timeouts
    • Silent retry storms
    • Backlogs building up

This level turns “the app is slow”
into “this service times out when calling that dependency.”
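
The mechanics of that sentence are just a breakdown: instead of one aggregate error number, you count failures per (service, dependency) pair. A sketch, with invented service names:

```python
# Sketch: breaking timeouts down per service -> dependency pair
# to find which component is actually failing. Names are illustrative.
from collections import defaultdict

calls = [  # (service, dependency, timed_out)
    ("checkout", "payments-api", True),
    ("checkout", "payments-api", True),
    ("checkout", "inventory-db", False),
    ("search",   "inventory-db", False),
]

totals = defaultdict(int)
timeouts = defaultdict(int)
for service, dependency, timed_out in calls:
    key = (service, dependency)
    totals[key] += 1
    timeouts[key] += timed_out

for key, total in totals.items():
    print(f"{key[0]} -> {key[1]}: {timeouts[key] / total:.0%} timeouts")
```

Same data, different grouping — and suddenly "the app is slow" becomes "checkout times out every time it calls payments-api."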


Level 4 – Infrastructure signals

Is the system under pressure?

→ Are resources saturated?
→ Is performance degrading over time?

These signals explain why things get worse.

What this practically means

Here you look at capacity and pressure, not user behavior.

Examples:

  • CPU usage
  • Memory consumption
  • Disk I/O
  • Network saturation
  • Queue depth

On their own, these metrics don’t tell a story.

But combined with the levels above, they often explain why incidents repeat.
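
As a sketch of what "under pressure" looks like in code: take raw readings, compare them to capacity thresholds, and emit flags. The thresholds here are made up — real ones depend entirely on your system:

```python
# Sketch: flagging resource saturation from raw readings.
# Thresholds are illustrative assumptions, not recommendations.
import shutil

def saturation_flags(cpu_pct: float, mem_pct: float, queue_depth: int) -> list:
    flags = []
    if cpu_pct > 85:
        flags.append("cpu saturated")
    if mem_pct > 90:
        flags.append("memory pressure")
    if queue_depth > 1000:
        flags.append("queue backing up")
    return flags

# Disk usage can come straight from the standard library:
usage = shutil.disk_usage("/")
disk_pct = usage.used / usage.total * 100

print(saturation_flags(cpu_pct=92.0, mem_pct=40.0, queue_depth=1500))
```

A flag like "queue backing up" means nothing to a user — but paired with a Level 1 alert, it's often the reason the same incident keeps coming back.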


Level 5 – Logs and traces (advanced)

What actually happened in this specific case?

→ What failed?
→ In what order?
→ For this exact request?

This is where root causes live.

What this practically means

Here you invest in deep investigation tooling.

Examples:

  1. Structured logs
    • Consistent formats
    • Meaningful error messages
    • Searchable fields
  2. Correlation IDs
    • One request across multiple services
  3. Distributed tracing
    • End-to-end request paths
    • Where time is spent
    • Where failures occur

Without the lower layers, this becomes expensive noise.
With them, it’s incredibly powerful.
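
To show what "structured logs with a correlation ID" actually look like, here's a stdlib-only sketch. The field names are my own convention, not a standard:

```python
# Sketch: structured JSON log lines carrying a correlation ID,
# so one request can be followed across services. Field names are assumptions.
import datetime
import json
import uuid

def log_event(correlation_id: str, service: str, message: str, **fields) -> str:
    """Emit one searchable, structured log line as JSON."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "correlation_id": correlation_id,  # same ID in every service this request touches
        "service": service,
        "message": message,
        **fields,
    }
    return json.dumps(record)

# One request, one ID, traceable end to end:
cid = str(uuid.uuid4())
print(log_event(cid, "checkout", "calling payments", dependency="payments-api"))
print(log_event(cid, "payments", "charge failed", error="timeout"))
```

Because every line is JSON with the same fields, "show me everything that happened to this request" becomes a single search on the correlation ID.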


A simple PM rule of thumb

If a signal doesn’t help you:
→ detect an incident
→ understand why it happened
→ or prevent it next time

It probably doesn’t belong in your monitoring setup.

Behind the Scenes

This week was… disruptive.

We got so much snow that transport across Germany was completely thrown off.
My days were basically: office → train → waiting on the train → moving a bit → waiting again.

A lot of hanging around.
A lot of rescheduling.

But at least there was a silver lining.

Sitting on the train with nowhere to rush to, looking outside, everything covered in snow, it was actually kind of beautiful.
Not the most productive week.
But a very wintery one.

What do you think?

How does monitoring look for your product today?

→ Mostly user tickets
→ Alerts, but not very helpful
→ Or a setup you actually trust during incidents?

Hit reply and tell me where you are right now.

See you next Tuesday,

Maria

Frankfurt am Main, 60311, Germany
Unsubscribe · Preferences

Maria Korteleva

Hi, I’m Maria. For the past 7 years, I’ve been building internal products across FMCG and tech companies. Now, I share everything I’ve learned to help junior PMs master delivery, from technical skills to stakeholder communication. Join 80+ internal PMs who get weekly insights from the Build Internal Products newsletter.
