20 Oct 2025
The alert sound is burned into my brain now. That specific PagerDuty tone that means something is really wrong. Not “a pod restarted” wrong. Not “latency spike” wrong. The kind of wrong that makes your stomach drop before you even look at your phone.
Late Sunday night. I’d finally convinced myself to stop checking Slack every five minutes and actually relax. Big mistake.
More …
16 Oct 2025
Running an API gateway in Kubernetes isn’t straightforward. Most documentation glosses over real issues like the etcd image registry problem after the VMware acquisition, CRD-based configuration patterns, and plugin troubleshooting. This guide covers deploying APISIX with local chart customization to handle these issues, implementing traffic management patterns (rate limiting, circuit breaker, caching) through Kubernetes CRDs
More …
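For a concrete sense of the CRD-based traffic management the excerpt above refers to, here is a minimal sketch of a rate-limited route, assuming the ApisixRoute CRD shipped with apisix-ingress-controller (API version v2). The hostname, namespace, and backend Service are illustrative, and the exact schema varies between controller versions:

```yaml
# Sketch: an ApisixRoute that proxies /orders/* to a backend Service
# and applies APISIX's limit-count plugin for rate limiting.
apiVersion: apisix.apache.org/v2
kind: ApisixRoute
metadata:
  name: orders-route            # illustrative name
  namespace: demo               # illustrative namespace
spec:
  http:
  - name: orders-rule
    match:
      hosts:
      - api.example.com
      paths:
      - /orders/*
    backends:
    - serviceName: orders-svc   # assumed backend Service
      servicePort: 8080
    plugins:
    - name: limit-count         # APISIX rate-limiting plugin
      enable: true
      config:
        count: 100              # allow 100 requests...
        time_window: 60         # ...per 60-second window
        rejected_code: 429      # respond 429 once the limit is hit
        key: remote_addr        # limit per client IP
```

Circuit breaking and caching follow the same shape: APISIX’s api-breaker and proxy-cache plugins attach to a route through the same plugins list.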
16 Sep 2025
The Challenge
Running Keycloak in production is notoriously challenging. Common pain points include session loss during scaling, complex external cache configuration, and keeping sessions persistent across multiple replicas while maintaining high availability. Traditional approaches often require external Infinispan clusters or Redis, adding operational complexity and potential failure points.
Solution Overview
Instead of managing external caching systems, we can leverage Keycloak’s built-in clustering capabilities with Kubernetes-native service discovery. This approach uses JGroups with DNS-based discovery through headless services,
More …
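As a rough sketch of what that DNS-based discovery looks like in practice, assuming the Quarkus-based Keycloak distribution and its built-in `kubernetes` cache stack (names, namespace, and image tag here are illustrative):

```yaml
# Headless Service: with clusterIP set to None, a DNS lookup of the
# service name returns the individual Keycloak pod IPs, which JGroups
# DNS_PING uses to discover cluster members.
apiVersion: v1
kind: Service
metadata:
  name: keycloak-headless          # illustrative name
spec:
  clusterIP: None
  selector:
    app: keycloak
  ports:
  - name: jgroups
    port: 7800                     # JGroups transport port
---
# Clustering-related parts of the Keycloak StatefulSet.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: keycloak
spec:
  serviceName: keycloak-headless
  replicas: 3
  selector:
    matchLabels:
      app: keycloak
  template:
    metadata:
      labels:
        app: keycloak
    spec:
      containers:
      - name: keycloak
        image: quay.io/keycloak/keycloak:26.0   # illustrative version
        # The "kubernetes" cache stack wires the embedded Infinispan/JGroups
        # transport to DNS_PING discovery.
        args: ["start", "--cache=ispn", "--cache-stack=kubernetes"]
        env:
        - name: JAVA_OPTS_APPEND
          # Point DNS_PING at the headless Service above
          # (adjust the namespace to wherever the Service lives).
          value: "-Djgroups.dns.query=keycloak-headless.default.svc.cluster.local"
        ports:
        - name: http
          containerPort: 8080
        - name: jgroups
          containerPort: 7800
```

Scaling replicas up or down updates the headless Service’s DNS records automatically, which is what lets the embedded caches re-form the cluster without an external cache tier.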
01 Sep 2025
Yesterday’s daily standup was supposed to be 15 minutes. It turned into a 2-hour debugging session instead. A feature we’d already demoed to the client last week suddenly wasn’t showing up in production. The app team kept saying “it works fine in staging,” while DevOps insisted “infrastructure looks good on our end.” Meanwhile, the client kept asking when it would go live.
What made it frustrating was that all our monitoring was green. Database connections healthy, API response times normal, zero error rate. But somehow the new feature just wasn’t there. No error logs, no exceptions, nothing crashed.
More …
24 Aug 2025
Your service is down. Again. Third time this week, and it’s only Tuesday.
The alerts started flooding in around lunch. Connection timeouts to ActiveMQ, database pool exhaustion, pods stuck in CrashLoopBackOff. Your team scrambles to investigate, but here’s the weird part: every individual component looks healthy. ActiveMQ is running fine, database performance is normal, pods would start successfully if they could just get past the connection phase.
You check the obvious suspects. Network? Fine. DNS? Working. Firewall rules? All good. Load balancer health checks? Passing. So what’s killing your perfectly healthy infrastructure?
More …