Ask HN: Debugging failures in large, interconnected backend systems

I’m trying to understand how teams actually debug production issues in systems made up of multiple services and external integrations (e.g. Stripe, Twilio, internal microservices, queues, webhooks).

In practice, when something breaks, it seems like the workflow is usually:

an alert fires (Datadog/Sentry/CloudWatch/etc.)

or a customer complains

engineers then start checking logs, traces, and dashboards across multiple systems

and eventually reconstruct by hand what happened across services
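
To make that last step concrete, here is a rough sketch of the stitching as I picture it, not a description of any particular team's tooling. It assumes every service emits JSON log lines and that a shared `request_id` field is propagated between them; the field names, service names, and sample log lines are all made up for illustration. Events that arrive without a `request_id` land in a separate bucket, which is roughly where the "missing context" problem shows up.

```python
import json
from collections import defaultdict
from datetime import datetime

# Hypothetical log lines from three services. The schema (ts, service,
# request_id, level, msg) is an assumption, not any vendor's real format.
API_LOGS = [
    '{"ts": "2024-05-01T12:00:00.120+00:00", "service": "api", "request_id": "req-42", "level": "info", "msg": "POST /checkout received"}',
    '{"ts": "2024-05-01T12:00:00.300+00:00", "service": "api", "request_id": "req-43", "level": "info", "msg": "GET /orders received"}',
]
PAYMENT_LOGS = [
    '{"ts": "2024-05-01T12:00:00.480+00:00", "service": "payments", "request_id": "req-42", "level": "info", "msg": "charge created"}',
    '{"ts": "2024-05-01T12:00:01.700+00:00", "service": "payments", "level": "info", "msg": "provider webhook received"}',
]
WORKER_LOGS = [
    '{"ts": "2024-05-01T12:00:02.910+00:00", "service": "worker", "request_id": "req-42", "level": "error", "msg": "webhook handler timed out"}',
]

MISSING = "<missing>"  # bucket for events that lost their correlation ID


def parse_lines(lines):
    """Parse JSON log lines and turn the timestamp into a datetime."""
    for line in lines:
        event = json.loads(line)
        event["ts"] = datetime.fromisoformat(event["ts"])
        yield event


def build_timelines(*sources):
    """Group events from all sources by request_id, sorted by timestamp."""
    timelines = defaultdict(list)
    for lines in sources:
        for event in parse_lines(lines):
            timelines[event.get("request_id", MISSING)].append(event)
    for events in timelines.values():
        events.sort(key=lambda e: e["ts"])
    return timelines


def format_timeline(events):
    """Render one request's events as aligned text lines."""
    return [
        f"{e['ts'].time()} {e['service']:<9} {e['level']:<5} {e['msg']}"
        for e in events
    ]


if __name__ == "__main__":
    timelines = build_timelines(API_LOGS, PAYMENT_LOGS, WORKER_LOGS)
    print("Timeline for req-42:")
    for line in format_timeline(timelines["req-42"]):
        print("  " + line)
    print(f"Events with no request_id: {len(timelines[MISSING])}")
```

The sketch only works when the ID survives every hop. The webhook event in the sample data has no `request_id`, so it cannot be attached to the failed checkout without some other key, such as a provider-side charge or event ID.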

What I’m curious about:

How do you actually trace a single failed request or transaction across multiple services today?

What tools do you rely on most in practice (not in theory)?

Where does it usually break down: logs, tracing, instrumentation, or just missing context?

How long does it typically take to go from “something is wrong” → “we know exactly why it broke”?

What part of this is still mostly manual stitching together of information?

Trying to understand what the real pain points are in practice, especially in systems with lots of external integrations and async flows.

1 point | by Ifedayo_s 1 hour ago

1 comment

  • verdverm 14 minutes ago
    A good harness with read access to systems and code is the best place to start these days.