The Observability Stack That Couldn't See Anything
The first sign something was wrong wasn’t an alert. It wasn’t a spike in the error rate or a pager going off. It was a message from finance at the end of the month asking why room revenue was lower than expected.
The engineering team pulled up the logs. Order service: reservation created. Room service: room assigned. Payment service: charge processed. Everything green. Every service reporting success. But somehow, users were checking into rooms they hadn’t actually paid for. The money wasn’t there.
Nobody could explain it. Not because the system wasn’t logging; it was. Not because there was no observability stack; there was. Grafana was deployed. Loki was ingesting logs from every service. Tempo was ready for traces. The team had spent two sprints setting it all up and had proudly declared themselves production-ready.
And when the moment of truth came, they were just as blind as before.
Everything logged. Nothing traced. And somewhere in the gap between those two things, money disappeared.
The Night Finance Asked a Question Nobody Could Answer
Here’s what the investigation actually looked like.
The booking flow spanned three services. A user would place an order, which triggered a room reservation in the inventory service, followed by a payment charge in the payment service. If payment failed, the reservation should roll back. Clean saga pattern, well-designed on paper.
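For concreteness, the happy path and the compensation step look roughly like the sketch below. This isn’t the team’s actual code: Order, inventoryClient, and paymentClient are hypothetical stand-ins, and imports are omitted.

// Sketch of the booking saga: reserve, charge, compensate on failure.
// Order, inventoryClient, and paymentClient are hypothetical stand-ins.
func placeOrder(ctx context.Context, order Order) error {
    reservationID, err := inventoryClient.Reserve(ctx, order.RoomType)
    if err != nil {
        return fmt.Errorf("reserve room: %w", err)
    }

    if err := paymentClient.Charge(ctx, order.Card, order.Amount); err != nil {
        // Payment failed: the saga must compensate by releasing the room.
        if rbErr := inventoryClient.Release(ctx, reservationID); rbErr != nil {
            return errors.Join(err, rbErr) // both failures must surface
        }
        return fmt.Errorf("charge payment: %w", err)
    }
    return nil
}

That Release call in the error branch is exactly the step that will matter later.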
When the team started digging, the logs were technically there. reservation_id: abc-123 in the order service. room_id: room-42 in the inventory service. transaction_id: txn-789 in the payment service. But here’s the problem: those three IDs had no relationship to each other in the log system. There was no way to answer the simplest question in debugging: did these three log entries come from the same user request?
What the team actually had was three independent event streams that happened to contain similar timestamps. They couldn’t tell if txn-789 was the payment attempt for reservation abc-123, or for a completely different booking. They couldn’t reconstruct what happened for any specific user. They couldn’t see whether the rollback ran or silently failed. The fraud detection logic had triggered on certain transactions, causing payment to return an error, but the rollback that should have followed never executed cleanly, and nobody could trace why.
The logs told them what happened inside each box. Nobody had built anything to tell them what happened between the boxes.
Logs Are Not Traces
This is the distinction that most teams learn too late, usually in a conversation with finance.
A log entry is a record of something that happened inside one service at one point in time. It’s valuable. It’s necessary. But it’s inherently local: it has no awareness of the larger request it’s part of.
A trace is a record of a single request’s journey across every service it touched, with timing at each step, parent-child relationships between operations, and a single ID that ties everything together. When a user places an order, a distributed trace shows you the HTTP call that came in, the inventory check that happened as a result, the payment attempt that followed, and the rollback that triggered when payment failed, all as one unified story, not three separate entries in three separate log streams.
The difference matters most when things go wrong in between services. Log-based debugging forces you to manually correlate events using timestamps and IDs that your own developers happened to include in log messages. Trace-based debugging shows you the exact path, the exact failure point, and the exact moment the saga broke down.
One is archaeology. The other is a flight recorder.
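The two can point at each other, though. Once tracing actually works, you can stamp every log line with the active trace ID, turning each Loki entry into a pointer back into the Tempo trace it belongs to. A minimal sketch using Go’s slog, assuming the logger is already configured and reservationID is in scope:

// Pull the active span from the request context and attach its
// trace ID to the log line, so logs can be joined to the trace.
span := trace.SpanFromContext(ctx)
slog.InfoContext(ctx, "reservation created",
    "reservation_id", reservationID,
    "trace_id", span.SpanContext().TraceID().String(),
)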
The Three-Piece Puzzle
Here’s where teams get burned, and it’s almost never obvious until you’re staring at Tempo wondering why every service has its own separate trace.
OpenTelemetry distributed tracing requires three components to be present simultaneously. Not two. All three. And the frustrating part is that missing any one of them produces no error, no warning, and no indication that anything is wrong. Your metrics still look fine. Your logs still come in. Your traces still appear in Tempo. They’re just all disconnected from each other.
The propagator. This is what teaches OpenTelemetry how to read and write trace context across service boundaries. Without it, every service starts a brand new trace when it receives a request, because it doesn’t know how to look for an existing trace ID in the incoming headers.
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
    propagation.TraceContext{}, // W3C traceparent/tracestate headers
    propagation.Baggage{},      // W3C baggage header
))
The instrumented HTTP transport. When your service makes an outbound HTTP call to another service, the trace context lives in the Go context.Context attached to the request. Without an instrumented transport, nothing reads that context and injects the traceparent header into the outgoing request. The downstream service receives the call with no trace context, so it starts a new trace.
// Wrap the default transport so every outbound call carries traceparent.
client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
The server middleware. When your service receives an inbound HTTP request, it needs to extract the trace context from the traceparent header and attach it to the request context. Without middleware, your handlers have no trace context, so any spans they create float in isolation, disconnected from the upstream caller.
// Extract trace context from the traceparent header before handlers run.
r.Use(otelgin.Middleware("order-svc"))
Three pieces. All required. The propagator defines the language. The transport writes it into outbound requests. The middleware reads it from inbound requests. Remove any one of them and you get the worst possible failure mode for an observability tool: it runs silently, produces data, and gives you false confidence that everything is working.
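Put together, the per-service wiring is small. Here’s a sketch of all three pieces in one place; the TracerProvider and exporter setup that actually ships spans to Tempo is omitted, and the service name follows the article’s example:

// All three pieces for one service (TracerProvider/exporter setup omitted).
func setupTracing() (*http.Client, *gin.Engine) {
    // 1. Propagator: the shared language (W3C traceparent + baggage).
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))

    // 2. Outbound: this client injects traceparent into every call it makes.
    client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

    // 3. Inbound: the middleware extracts traceparent and attaches the
    // trace context to each request's context before handlers run.
    r := gin.New()
    r.Use(otelgin.Middleware("order-svc"))

    return client, r
}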
The hotel booking team had deployed OpenTelemetry across all three services. But the payment service was added later by a different developer who didn’t know about the propagator setup. Two services spoke the same tracing language. One didn’t. The result was three islands of telemetry with no bridges between them.
If you want to see these three components wired together end-to-end, I put together a working demo at github.com/vourteen14/grafana-tempo that shows exactly this pattern across three Go services connected to Grafana, Loki, and Tempo. You can reproduce both the broken and the working states by pulling one of the three components out and watching the traces fall apart.
What Google’s W3C Bet Tells Us
The propagator mechanism isn’t arbitrary. It implements the W3C Trace Context specification, which became an internet standard precisely because distributed tracing kept breaking when different frameworks and vendors used incompatible header formats.
Before standardization, a request leaving a Java service would carry a Zipkin-format trace header. The Python service receiving it wouldn’t recognize the format, would start a fresh trace, and the correlation would be lost. Different vendors, different formats, different incompatibilities. Google, Microsoft, and major observability vendors pushed for a single standard because they’d all watched their customers lose trace correlation at service boundaries.
The traceparent header is what that standardization produced. Every compliant instrumentation library knows how to read it and write it. But “compliant” requires that you explicitly set up the propagator. It doesn’t happen automatically. And because the failure is silent, teams deploy distributed tracing thinking it’s working, run it in production for months, and only discover the broken linkage when they need to debug something specific.
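The header itself is small enough to read by eye. This value is the example from the W3C spec:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

Four dash-separated fields: the format version (00), the 16-byte trace ID that stays constant for the whole request, the 8-byte span ID of the caller, and the trace flags (01 means sampled). Each compliant hop preserves the trace ID and substitutes its own span ID as the parent for the next service.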
Google’s own Cloud Trace documentation leads with propagation setup. Not metrics. Not dashboards. Propagation. Because without it, nothing else matters.
What Actually Happened to the Money
Once the hotel booking team correctly configured all three components and redeployed, the answer was visible in sixty seconds.
The unified trace showed the full story. An order comes in. Inventory reserves a room; that span is there, with the reservation_id attached as an attribute. The payment service attempts to charge the card; that span is there too, marked red, with a fraud-check event showing the transaction amount exceeded the card’s fraud limit. Payment returns an error. Back in the order service, the rollback logic runs, but there’s a red span there as well, with an event that says release_reservation_failed: context deadline exceeded.
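Red spans and fraud-check events don’t appear on their own; the service code records them. A sketch of what the payment service’s fraud path might look like, with fraudLimit and ErrFraudLimit as hypothetical names:

ctx, span := otel.Tracer("payment-svc").Start(ctx, "charge-card")
defer span.End()

if order.Amount > fraudLimit {
    // Record the evidence on the span, then mark the span as failed
    // so it renders red in Tempo.
    span.AddEvent("fraud_check_failed", trace.WithAttributes(
        attribute.Float64("amount", order.Amount),
    ))
    span.SetStatus(codes.Error, "amount exceeded card fraud limit")
    return ErrFraudLimit
}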
The rollback was timing out. The room was already assigned. Payment had failed. But the reservation release call was exceeding its timeout, returning silently, and leaving the room in a reserved state. Users could check in because the room showed as reserved in the system. Finance noticed because the payment never cleared.
The bug wasn’t exotic. A missing timeout configuration in the HTTP client used for the rollback call. But finding it without the distributed trace would have taken days, if it was found at all. The logs had all the individual events. None of them could tell you they were part of the same request, or that the rollback that should have followed a payment failure had silently timed out instead.
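The shape of the fix is worth spelling out. A compensation step shouldn’t inherit a request context whose deadline is already nearly spent; it needs its own budget, and its failure needs to be loud. A sketch, reusing the hypothetical inventoryClient from earlier (context.WithoutCancel needs Go 1.21+):

// Detach from the parent's cancellation while keeping its values,
// trace context included, and give the rollback a fresh deadline.
rbCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), 5*time.Second)
defer cancel()

if err := inventoryClient.Release(rbCtx, reservationID); err != nil {
    // Don't let a failed rollback vanish: record it on the span and return it.
    span.RecordError(err)
    span.SetStatus(codes.Error, "release_reservation_failed")
    return fmt.Errorf("release reservation %s: %w", reservationID, err)
}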
The False Confidence Problem
There’s a specific kind of blind spot that observability tooling creates when it’s partially broken. It’s worse than having no tooling at all.
When you have nothing, you know you’re flying blind. You build careful logging, you add manual correlation IDs, you trace things through by hand. It’s slow and painful, but you’re aware of your limitations.
When you have a partially broken observability stack, you believe you’re covered. You show stakeholders the Grafana dashboard. You point at the traces in Tempo. You explain that every request is being tracked. You have the confidence of someone who has done the work, except the work has a gap in it that you can’t see, because broken tracing doesn’t announce itself.
This is what makes the three-piece puzzle dangerous. It’s not a misconfiguration that causes errors. It’s a misconfiguration that causes gaps, and gaps are invisible until the exact moment you need the data that should have been in them.
We’ve gotten good at building observability infrastructure. Helm charts, managed collectors, pre-built Grafana dashboards, cloud-native tracing backends. The tooling has never been more accessible. But accessibility makes it easy to go through the motions of observability without achieving it. A deployed OpenTelemetry collector isn’t the same thing as distributed tracing that works. A Tempo datasource in Grafana isn’t the same thing as traces that are connected across your services.
The hotel booking system had every component in place. Prometheus scraping metrics. Loki aggregating logs. Tempo storing traces. OTEL Collector routing everything to the right backend. The team had done everything right in terms of infrastructure, and nothing right in terms of the three lines that actually make distributed tracing work.
Real observability isn’t a stack you deploy. It’s a property your system either has or doesn’t, and the gap between the two can be as small as a missing propagator configuration.
The money disappears quietly. The traces stay disconnected. And the dashboards stay green the whole time.