Monitoring Isn’t an Ops Problem


Hi Reader,

If you’re constantly getting incidents and tickets from users…
and your operations team is stuck reproducing issues without always finding the root cause…

You might need to invest in monitoring and observability for your product.

And if that sentence already makes you slightly uncomfortable, this newsletter is for you.

Today in 10 minutes you will:

  • Learn why monitoring is a product responsibility, not just an ops task
  • Understand the difference between incidents, problems, monitoring, and observability
  • Get a simple mental model for how these concepts connect
  • Learn what to monitor first, with practical examples at each level

My experience with monitoring

I was lucky enough to work with excellent observability and operations experts early in my career.

They taught me a distinction that completely changed how I think about incidents:

Fixing an incident is reactive.
Fixing a problem is strategic.

When I worked on manufacturing business processes, this distinction really mattered.

These systems were business-critical.
SLAs were tight.
Downtime was expensive, financially and emotionally.

Our job wasn’t just to get things back up and running.

Our job was to prevent the same incidents from happening again.

Because the risk, stress, and value lost every single time were simply too high.

The mistake I see teams make

The solution is not to design systems that never fail.

That’s unrealistic.

Incidents will happen.
Things will break.
Dependencies will behave in unexpected ways.

The real lever is this:

→ Design systems for monitoring and observability
→ Train teams to actually use them

That’s how you move from firefighting…
to learning…
to prevention.


The 4 concepts that made it click for me

I can’t explain the full discipline of operations in one newsletter.

But I can explain the parts you need to understand as a product manager.

Because throwing your product “over the wall” to operations is not good internal product management.
Your product is your responsibility across its entire lifecycle.

I boil it down to four concepts.


Monitoring

Tells you that something is wrong.

→ Is the system up?
→ Are errors or latency spiking?
→ Should someone pay attention right now?

Monitoring detects change.
It triggers alerts.
It doesn’t explain why.
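
To make this concrete: at its core, a monitoring check is just a comparison against a threshold. Here's a tiny Python sketch — the names and numbers are made up, not from any particular tool:

```python
# Minimal sketch of a monitoring check: compare a signal to a
# threshold and decide whether to alert. Names and values are illustrative.

def should_alert(error_rate: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Alert when the error rate exceeds the baseline by more than
    the tolerance. This says THAT something changed, not WHY."""
    return error_rate > baseline + tolerance

print(should_alert(error_rate=0.12, baseline=0.02))  # spike over baseline -> alert
print(should_alert(error_rate=0.02, baseline=0.02))  # normal reading -> no alert
```

Notice what's missing: nothing in that function knows which service broke or what changed. That's the gap observability fills.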


Observability

Tells you why something is wrong.

→ Where is the failure happening?
→ Which service, request, or dependency caused it?
→ What changed before things broke?

Observability turns signals into understanding.


Incidents

User impact happening now.

→ Things are broken or unusable
→ Speed and coordination matter
→ Goal: restore service quickly

Incidents are about response, not learning.


Problems

The underlying cause behind incidents.

→ Structural issues
→ Recurring failure patterns
→ Missing safeguards or visibility

Problems are where prevention happens.


How they connect

Monitoring, incidents, observability, and problems form a loop.

Monitoring → detects incidents
Incidents → show user impact
Observability → explains what happened
Problems → tell you what to fix so it doesn’t happen again

Every improvement feeds back into better monitoring.


You might notice that the whole loop starts with one thing: monitoring.

Without the right monitoring in place:
→ incidents are harder to spot
→ observability has nothing to work with
→ problems stay hidden

That’s why the next question isn’t “Do we monitor?”
It’s “What do we monitor first?”


The Monitoring Pyramid

Think of monitoring as a pyramid.

The bottom layers are boring and absolutely essential.
The top layers are powerful but useless without the foundation.


Level 1 – User impact (foundation)

Can users do their job?

→ Is the product reachable?
→ Are key flows working?
→ Are users experiencing failures or extreme slowness?

This is where incidents become visible.

What this practically means

At this level, you monitor user journeys, not systems.

Examples:

  1. Monitor critical errors on key user flows
    • Login
    • Submit / save / approve
    • Any “if this breaks, work stops” journey
    • Spikes in HTTP 5xx or 4xx on these paths
  2. Monitor application availability
    • Is the app up or down from a user perspective?
  3. Monitor user-side response time
    • Pages loading extremely slowly
    • Requests timing out
    • Sudden latency spikes

If you skip this level, your first alert will be a user ticket.
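
A synthetic check at this level boils down to: probe a critical flow, then classify what the user would experience. Here's a small Python sketch — the status codes and the 3-second threshold are illustrative assumptions, not a standard:

```python
# Sketch of a synthetic user-journey check: probe one critical flow
# (e.g. login) and classify the result from the user's perspective.
from typing import Optional

def classify_probe(status_code: Optional[int], elapsed_s: float) -> str:
    """Turn one probe of a key flow into a verdict. Thresholds are assumptions."""
    if status_code is None or status_code >= 500:
        return "DOWN"        # user cannot complete the flow at all
    if status_code >= 400 or elapsed_s > 3.0:
        return "DEGRADED"    # flow responds, but users are suffering
    return "OK"

print(classify_probe(200, 0.4))   # healthy
print(classify_probe(503, 0.1))   # server error -> DOWN
print(classify_probe(200, 7.2))   # extreme slowness -> DEGRADED
```

The point of the sketch: the check is framed around the journey ("can a user log in?"), not around any internal system.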


Level 2 – Application health

Is the application behaving as expected?

→ Is something degrading overall?
→ Is this affecting many users or just one?

This layer confirms: “Yes, this is a real issue.”

What this practically means

Here you monitor aggregated application signals.

Examples:

  1. Error rates
    • Percentage of failed requests
    • Trends over time, not individual errors
  2. Latency metrics
    • Average vs p95 / p99 response time
    • Sudden shifts compared to baseline
  3. Throughput
    • Request volume dropping or spiking unexpectedly

This level helps you distinguish:
“one user had a bad day”
vs
“the system is unhealthy.”


Level 3 – Services and dependencies

Which part of the system is failing?

→ Is it our service or a dependency?
→ Is one component dragging everything down?

This is where observability starts to become possible.

What this practically means

Here you break the app into parts.

Examples:

  1. Monitor individual services or APIs
    • Error rate per service
    • Latency per endpoint
  2. Monitor dependencies
    • External APIs
    • Internal downstream systems
    • Message queues or brokers
  3. Monitor retries and timeouts
    • Silent retry storms
    • Backlogs building up

This level turns “the app is slow”
into “this service times out when calling that dependency.”
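
The mechanics of that sentence are just a breakdown: instead of one aggregate error number, you count failures per (service, dependency) pair. A sketch, with invented service names:

```python
# Sketch: breaking timeouts down per service -> dependency pair
# to find which component is actually failing. Names are illustrative.
from collections import defaultdict

calls = [  # (service, dependency, timed_out)
    ("checkout", "payments-api", True),
    ("checkout", "payments-api", True),
    ("checkout", "inventory-db", False),
    ("search",   "inventory-db", False),
]

totals = defaultdict(int)
timeouts = defaultdict(int)
for service, dependency, timed_out in calls:
    key = (service, dependency)
    totals[key] += 1
    timeouts[key] += timed_out

for key, total in totals.items():
    print(f"{key[0]} -> {key[1]}: {timeouts[key] / total:.0%} timeouts")
```

Same data, different grouping — and suddenly "the app is slow" becomes "checkout times out every time it calls payments-api."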


Level 4 – Infrastructure signals

Is the system under pressure?

→ Are resources saturated?
→ Is performance degrading over time?

These signals explain why things get worse.

What this practically means

Here you look at capacity and pressure, not user behavior.

Examples:

  • CPU usage
  • Memory consumption
  • Disk I/O
  • Network saturation
  • Queue depth

On their own, these metrics don’t tell a story.

But combined with the levels above, they often explain why incidents repeat.
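
As a sketch of what "under pressure" looks like in code: take raw readings, compare them to capacity thresholds, and emit flags. The thresholds here are made up — real ones depend entirely on your system:

```python
# Sketch: flagging resource saturation from raw readings.
# Thresholds are illustrative assumptions, not recommendations.
import shutil

def saturation_flags(cpu_pct: float, mem_pct: float, queue_depth: int) -> list:
    flags = []
    if cpu_pct > 85:
        flags.append("cpu saturated")
    if mem_pct > 90:
        flags.append("memory pressure")
    if queue_depth > 1000:
        flags.append("queue backing up")
    return flags

# Disk usage can come straight from the standard library:
usage = shutil.disk_usage("/")
disk_pct = usage.used / usage.total * 100

print(saturation_flags(cpu_pct=92.0, mem_pct=40.0, queue_depth=1500))
```

A flag like "queue backing up" means nothing to a user — but paired with a Level 1 alert, it's often the reason the same incident keeps coming back.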


Level 5 – Logs and traces (advanced)

What actually happened in this specific case?

→ What failed?
→ In what order?
→ For this exact request?

This is where root causes live.

What this practically means

Here you invest in deep investigation tooling.

Examples:

  1. Structured logs
    • Consistent formats
    • Meaningful error messages
    • Searchable fields
  2. Correlation IDs
    • One request across multiple services
  3. Distributed tracing
    • End-to-end request paths
    • Where time is spent
    • Where failures occur

Without the lower layers, this becomes expensive noise.
With them, it’s incredibly powerful.
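
To show what "structured logs with a correlation ID" actually look like, here's a stdlib-only sketch. The field names are my own convention, not a standard:

```python
# Sketch: structured JSON log lines carrying a correlation ID,
# so one request can be followed across services. Field names are assumptions.
import datetime
import json
import uuid

def log_event(correlation_id: str, service: str, message: str, **fields) -> str:
    """Emit one searchable, structured log line as JSON."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "correlation_id": correlation_id,  # same ID in every service this request touches
        "service": service,
        "message": message,
        **fields,
    }
    return json.dumps(record)

# One request, one ID, traceable end to end:
cid = str(uuid.uuid4())
print(log_event(cid, "checkout", "calling payments", dependency="payments-api"))
print(log_event(cid, "payments", "charge failed", error="timeout"))
```

Because every line is JSON with the same fields, "show me everything that happened to this request" becomes a single search on the correlation ID.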


A simple PM rule of thumb

If a signal doesn’t help you:
→ detect an incident
→ understand why it happened
→ or prevent it next time

It probably doesn’t belong in your monitoring setup.

Behind the Scenes

This week was… disruptive.

We got so much snow that transport across Germany was completely thrown off.
My days were basically: office → train → waiting on the train → moving a bit → waiting again.

A lot of hanging around.
A lot of rescheduling.

But at least there was a silver lining.

Sitting on the train with nowhere to rush to, looking outside, everything covered in snow, it was actually kind of beautiful.
Not the most productive week.
But a very wintery one.

What do you think?

How does monitoring look for your product today?

→ Mostly user tickets
→ Alerts, but not very helpful
→ Or a setup you actually trust during incidents?

Hit reply and tell me where you are right now.

See you next Tuesday,

Maria

Frankfurt am Main, 60311, Germany
Unsubscribe · Preferences

Maria Korteleva

Hi, I’m Maria. For the past 7 years, I’ve been building internal products across FMCG and tech companies. Now, I share everything I’ve learned to help junior PMs master delivery, from technical skills to stakeholder communication. Join 80+ internal PMs who get weekly insights from the Build Internal Products newsletter.
