09 May 2026
There is a particular kind of anxiety that comes with being a DevOps engineer. Not the kind from outages or failed deployments, though those are present too. The quieter kind. The background hum of knowing that something could break at any moment and that when it does, people will be waiting for you to fix it.
The standard assumption baked into that responsibility is that you will be at your desk. That you have a terminal open, or can get one open quickly. That your laptop is somewhere nearby, that you can SSH into things, run commands, check logs, and do the actual work.
Most of the time that is true. And then sometimes you are on a commute, or standing in a queue, or halfway through a trip with your laptop sitting at home, and a notification arrives telling you something is down.
More …
01 May 2026
The first sign something was wrong wasn’t an alert. It wasn’t a spike in the error rate or a pager going off. It was a message from finance at the end of the month asking why room revenue was lower than expected.
The engineering team pulled up the logs. Order service: reservation created. Room service: room assigned. Payment service: charge processed. Everything green. Every service reporting success. But somehow, users were checking into rooms they hadn’t actually paid for. The money wasn’t there.
Nobody could explain it. Not because the system wasn't logging; it was. Not because there was no observability stack; there was. Grafana was deployed. Loki was ingesting logs from every service. Tempo was ready for traces. The team had spent two sprints setting it all up and had proudly declared themselves production-ready.
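The teaser hints at the gap: every service reported its own success, but nothing tied the three reports back to one booking. A minimal sketch of what cross-service correlation looks like, assuming Node 18+ and using hypothetical service names and header key:

```ts
import { randomUUID } from "crypto";

const CORRELATION_HEADER = "x-correlation-id";

// Hypothetical order-service handler: every log line and every downstream
// call carries the same ID, so the flow can be queried end to end.
async function handleReservation(req: Request): Promise<Response> {
  // Reuse the inbound ID if one exists; otherwise mint it at the edge.
  const correlationId = req.headers.get(CORRELATION_HEADER) ?? randomUUID();

  console.log(JSON.stringify({
    service: "order",
    event: "reservation_created",
    correlationId,
  }));

  // Forward the same ID so the room and payment logs can join on it later.
  await fetch("http://room-service/assign", {
    method: "POST",
    headers: { [CORRELATION_HEADER]: correlationId },
  });

  return new Response("ok", { headers: { [CORRELATION_HEADER]: correlationId } });
}
```

With that ID in every line, a single Loki query can surface bookings where the reservation succeeded but a matching payment entry never appeared.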
More …
16 Mar 2026
Google OAuth was broken in production. Not intermittently. Completely. Every user who clicked “Login with Google” hit an error page. The callback URL was wrong, NextAuth was rejecting it, and nothing in our recent deployments explained why.
The logs showed a clean 308 redirect. HAProxy was adding a trailing slash to our OAuth callback path. One character, silently appended, was enough to invalidate the entire flow.
What made it worse was that we had not touched the Ingress responsible for that route. No config changes, no deployments to that service, nothing. The Ingress looked exactly as it should.
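The mechanics of why one appended character is fatal are worth spelling out: OAuth providers, Google included, match the redirect_uri parameter byte-for-byte against the registered callback URL. An illustrative sketch (the URLs here are hypothetical, not the ones from the incident):

```ts
// Google matches redirect_uri exactly against the registered value, so the
// slash appended by HAProxy's 308 produces a URI it has never seen.
const registered = "https://app.example.com/api/auth/callback/google";
const afterRedirect = registered + "/"; // the path after the 308 rewrite

console.log(registered === afterRedirect); // false: exact match fails
```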
More …
16 Mar 2026
There was a time when the database schema was clean. Table names followed a consistent pattern. Database changes went through a defined process. Developers knew what they were responsible for and where the boundary was. The infrastructure team could manage environments with confidence because what they saw in development roughly resembled what they would see in production.
Then the principal backend engineer left.
Not immediately, and not in a single dramatic incident, but gradually: the kind of collapse you only recognize clearly in hindsight. Standards became suggestions, and suggestions became optional.
More …
02 Mar 2026
Switching ingress controllers is not a lift-and-shift operation. NGINX and HAProxy are built on different architectural assumptions, and those differences compound at every layer — from how configuration is loaded to how certificates are selected to how the system behaves when a single rule is malformed.
This is a post-migration review of what we found, what broke, and what needs to be in place before any team runs this in production.
More …