
The Self-Healing Systems That Kill Themselves

Your service is down. Again. Third time this week, and it’s only Tuesday.

The alerts started flooding in around lunch. Connection timeouts to ActiveMQ, database pool exhaustion, pods stuck in CrashLoopBackOff. Your team scrambles to investigate, but here’s the weird part: every individual component looks healthy. ActiveMQ is running fine, database performance is normal, pods would start successfully if they could just get past the connection phase.

You check the obvious suspects. Network? Fine. DNS? Working. Firewall rules? All good. Load balancer health checks? Passing. So what’s killing your perfectly healthy infrastructure?

The answer is hiding in your application logs, buried under millions of entries that all say the same thing: “Connection failed, retrying in 0ms.”

That retry logic you implemented to make your system resilient? It’s not healing your system. It’s strangling it to death, one connection attempt at a time.

The ActiveMQ Murder Mystery

Here’s how I learned that good intentions can kill production systems.

We had a beautiful microservices architecture. Clean separation of concerns, proper message patterns, everything communicating through ActiveMQ. The kind of setup that looks great in architecture reviews and works perfectly in development.

One afternoon, our order processing service started throwing connection errors. Just occasional timeouts, nothing dramatic. The kind of thing that happens in distributed systems all the time. Our retry logic kicked in, exactly as designed.

Except “exactly as designed” turned out to be exactly wrong.

The connection timeout was set to 5 seconds. Reasonable for development, way too aggressive for production network conditions. When the connection failed, our retry logic immediately tried again. No backoff period, no pause, just instant retry. The developer who wrote it probably thought, “Networks are unreliable, we need to be persistent.”

What we got wasn’t persistence. We got a connection storm.

Every failed connection triggered an immediate retry. We had twelve instances of the service running, all experiencing the same timeout issue, all retrying at the exact same rate. That’s twelve connection attempts per second becoming twelve hundred connection attempts per second in the span of a few minutes.
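Stripped to its essentials, the logic behaved something like this. This is a hypothetical reconstruction in Python, not our actual client code; `attempt_connection` stands in for the ActiveMQ connect call:

```python
def connect_with_naive_retry(attempt_connection):
    """Naive 'resilient' retry: the instant an attempt fails, try again.

    No backoff, no attempt limit. Across a dozen instances, every
    failure translates directly into another connection attempt
    against the broker with zero delay in between.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return attempt_connection(), attempts
        except ConnectionError:
            continue  # instant retry: this is the storm

# Simulate a broker that rejects the first 1000 attempts.
failures = iter(range(1000))
def flaky_connect():
    try:
        next(failures)
    except StopIteration:
        return "connected"
    raise ConnectionError("broker refused")

result, attempts = connect_with_naive_retry(flaky_connect)
print(result, attempts)
```

The loop eventually "works," which is exactly why nobody questioned it: 1,001 attempts fire with zero delay between them, and the caller only ever sees the success.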

ActiveMQ started spending more CPU cycles handling connection attempts than processing actual messages. The message broker was healthy, but it couldn’t accept new connections because it was drowning in retry attempts from services that were trying to “help.”

Here’s what really pisses me off about this whole thing. We spent hours debugging network issues that didn’t exist. We checked DNS configurations that were fine. We investigated load balancer settings that were working perfectly. Meanwhile, the real problem was sitting right there in our application code, disguised as a feature.

All the while, our application logs were generating 30 million entries per day. Thirty million variations of “connection failed, will retry immediately.” Our log aggregation system started struggling. Our monitoring became useless. Our alerting turned into noise.

The retry logic designed to make our system bulletproof had turned our services into a firing squad aimed at our own infrastructure.

When Good Architecture Goes Bad

Three weeks later, different service, same story. Database connection pool exhausted, applications timing out, users complaining. But PostgreSQL was running beautifully. Query performance was great, no resource contention, plenty of available connections.

Except there weren’t available connections. The pool was full of retry attempts from applications that couldn’t get connections because the pool was full of retry attempts. Perfect circular logic that would have been funny if it weren’t bringing down production.

Then it happened again with our Kubernetes cluster. Pods were failing health checks and going into CrashLoopBackOff, but the applications themselves were healthy. They were crashing because they couldn’t establish connections during startup. Kubernetes helpfully restarted them, creating more connection attempts during startup, making connections even less likely to succeed.

I’ve seen this exact scenario play out at three different companies now. Different tech stacks, different teams, but always the same root cause hiding in plain sight.

That’s when I realized we weren’t dealing with infrastructure failures. We were dealing with a fundamental design flaw in how we think about resilience.

This Is Happening Everywhere

Look, I get it. Every microservice framework ships with retry logic because distributed systems are unreliable by nature. Every database driver, every message queue client, every cloud SDK comes with automatic retry built in. The problem is most of these defaults were written by people who never had to debug them at 3 AM.

Think about what happens when you deploy a microservice with default retry settings to a Kubernetes cluster. One pod times out connecting to a database. It retries immediately. The retry fails, so it tries again. And again. The pod crashes, so Kubernetes starts a new one, which immediately starts the same retry pattern. Scale this across dozens of pods, and you’ve created a distributed denial of service attack against your own database.

Cloud services make this worse, not better. AWS RDS has connection limits. Google Cloud SQL has quotas. Azure Service Bus implements throttling. When your retry logic hits these limits, cloud providers don’t just slow you down. They can block you entirely, turning a five-second network hiccup into a multi-hour outage.

Load balancers add another layer of complexity. When your services can’t connect to backends due to retry storms, load balancers mark those backends as unhealthy and stop routing traffic to them. But your retry logic doesn’t know about load balancer decisions, so it keeps hammering services that have been intentionally taken out of rotation.

Service meshes, API gateways, message brokers, databases. Every component in your infrastructure can become a victim of well-intentioned but poorly implemented retry logic.

Understanding the Retry Storm Effect

Retry storms kill systems through resource exhaustion, but not in the way you might expect.

Most services are designed to handle steady-state load with occasional spikes. A database optimized for 10,000 queries per second might only be able to establish 100 new connections per second. When retry logic goes haywire and starts dominating the load pattern, connection establishment becomes the bottleneck: suddenly most of your database’s work is managing connection handshakes instead of running actual queries.

The failure isn’t dramatic. Systems don’t crash with out-of-memory errors or CPU exhaustion. They just become progressively less capable of doing useful work. Response times increase, success rates decrease, and eventually the system becomes effectively unavailable even though it’s technically running.

Logging makes everything worse. Each failed connection attempt generates a log entry. Each retry generates another entry. Modern distributed systems can generate millions of log entries per hour during retry storms, overwhelming log aggregation systems and making debugging nearly impossible.

The feedback loops are vicious. Monitoring systems that depend on the same infrastructure start reporting false positives. Alerting becomes unreliable because it’s based on systems that are struggling under retry load. Teams spend their time debugging phantom problems while the real issue, the retry logic itself, remains invisible.

What Companies That Actually Scale Have Learned

Netflix learned this the hard way when their retry logic started taking down healthy services. Their solution wasn’t more sophisticated retry logic. It was knowing when to stop trying. Circuit breakers everywhere, exponential backoff on everything, and aggressive limits on retry attempts.

Google’s Cloud SDKs ship with exponential backoff and jitter by default because they’ve watched thousands of customers accidentally DDoS their own services with overly aggressive retry logic.

AWS builds intelligent backoff into their SDKs because they got tired of customers creating outages by being too persistent with failed API calls.

Kubernetes implements exponential backoff for pod restarts because the platform learned that immediate restarts often make problems worse. CrashLoopBackOff isn’t a punishment for failing pods. It’s protection against pods that would retry themselves to death.

The thing that drives me crazy is how obvious this all seems in retrospect. Of course unlimited retries are bad. Of course you need backoff and jitter. Of course you need circuit breakers. But somehow we keep building systems that turn good intentions into cascading failures.

Building Actually Resilient Systems

Real resilience requires knowing when to stop trying.

Exponential backoff with jitter is the foundation. When something fails, wait before retrying. If it fails again, wait longer. Add randomness so multiple clients don’t retry in perfect synchronization. This prevents retry storms while still providing protection against transient failures.
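A minimal sketch of that pattern in Python, with a hard attempt limit built in (the function name and parameters are illustrative, not from any particular library):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry with capped exponential backoff and full jitter.

    The wait window doubles on each failure (capped at max_delay), and
    the actual sleep is a random point inside that window, so many
    clients hitting the same failure spread out instead of retrying in
    lockstep. After max_attempts, give up and report failure upstream.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: fail fast, don't loop forever
            window = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, window))

# Example: an operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("broker unavailable")
    return "ok"

result = retry_with_backoff(flaky, base_delay=0.01)
print(result)
```

The jitter is the part teams most often skip, and it matters most at scale: without it, twelve instances that failed at the same instant all come back at the same instant.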

Circuit breakers provide system-level protection. When failure rates exceed thresholds, stop retrying entirely for a period of time. Let the struggling system recover instead of continuing to attack it with requests. Circuit breakers turn temporary problems into temporary outages, which is almost always better than turning temporary problems into extended disasters.

Maximum retry limits prevent infinite loops. No matter how critical a request seems, there has to be a point where you give up and report failure upstream. Infinite retries in distributed systems consume infinite resources.

Intelligent logging preserves signal while reducing noise. Log the first failure and the final outcome, but don’t log every retry attempt. Use sampling and aggregation to maintain observability without creating log storms.
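One way to apply that rule, sketched in Python with the standard `logging` module (the function name is made up for illustration):

```python
import logging

logger = logging.getLogger("retry")

def retry_with_quiet_logging(operation, max_attempts=5):
    """Log the first failure and the final outcome; stay silent in between.

    A retry storm then produces two log lines per operation instead of
    one line per attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = operation()
            if attempt > 1:
                logger.info("succeeded after %d attempts", attempt)
            return result
        except ConnectionError as exc:
            if attempt == 1:
                logger.warning("first failure: %s (will retry up to %d times)",
                               exc, max_attempts - 1)
            if attempt == max_attempts:
                logger.error("giving up after %d attempts", max_attempts)
                raise
```

A retry counter in your metrics system can carry the per-attempt detail far more cheaply than a log line can.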

Rate limiting your own outbound requests prevents your clients from overwhelming target systems. If you’re going to retry, do it at a rate that the target system can actually handle.
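A token bucket is one common way to enforce that cap on the client side. A minimal Python sketch (illustrative, not tuned numbers):

```python
import time

class TokenBucket:
    """Client-side rate limiter: cap outbound attempts (including
    retries) at `rate` per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self):
        # Refill tokens for the time elapsed since the last call.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Demo: with capacity 5, only the first burst of back-to-back
# attempts passes; the rest are refused until tokens refill.
bucket = TokenBucket(rate=10, capacity=5)
allowed = sum(bucket.try_acquire() for _ in range(20))
print(allowed)
```

Retry logic that checks `try_acquire()` before each attempt turns down smoothly under failure instead of ramping up.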

The Hard Truth About Resilience

The next time your “resilient” system mysteriously becomes unreliable, ask these questions:

What happens when this retry logic runs across all instances of your service simultaneously? Are you accidentally creating coordinated attacks against your own infrastructure?

Do your retries use exponential backoff and jitter, or do they retry at constant intervals? Constant retry rates almost always create thundering herd problems at scale.

Is there a maximum number of retry attempts, or could this code theoretically retry forever? Systems that retry infinitely consume resources infinitely.

How are you handling retry logging? Are you creating millions of log entries per day for retry attempts, overwhelming your observability stack?

Do you have circuit breakers to recognize when retrying is counterproductive? Systems that never give up often never recover.

The hardest lesson in distributed systems is that persistence isn’t always a virtue. Sometimes the most resilient thing you can do is fail fast, back off, and give struggling systems room to breathe.

Your retry logic was supposed to make your systems bulletproof. Instead, it made them self-destructive. The difference between systems that survive production and systems that don’t isn’t avoiding failures. It’s knowing when to stop making them worse.