The Open Source Bait and Switch Nobody Talks About

We needed an API gateway. Kong was $30k/year, AWS API Gateway had its own cost trap. Tyk’s open source gateway looked like the answer: free, performant, written in Go.

The problem was route management. Tyk uses imperative API calls by default, but our infrastructure is fully declarative. Everything lives in Git, deployed with kubectl apply. We needed an operator.

Tyk has one. It’s called Tyk Operator and it’s exactly what we needed: declarative, GitOps-ready.

It’s also enterprise-only.

The open source gateway is free. The tooling that makes it actually usable in a cloud-native setup costs money. Core product free, operational tooling paywalled. So I built my own.

What I Built

tyk-cm-routes-controller is 565 lines of Python using Kopf. The idea is straightforward: watch TykRoute custom resources, pull the API definition from the spec, validate it, write it to a ConfigMap as JSON, optionally trigger a rollout restart, and update the resource status. Done.
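
In rough shape, the reconcile path looked like the sketch below. This is a minimal illustration, not the actual source: the CRD group, version, and field names are placeholders.

```python
import json

import kopf
from kubernetes import client, config

CONFIGMAP = "tyk-routes"  # the single shared ConfigMap (more on that below)

@kopf.on.startup()
def init(**_):
    config.load_incluster_config()

@kopf.on.create("tyk.example.com", "v1alpha1", "tykroutes")
@kopf.on.update("tyk.example.com", "v1alpha1", "tykroutes")
def reconcile(spec, name, namespace, patch, **_):
    api_def = dict(spec["apiDefinition"])  # pull the Tyk API definition
    if "listen_path" not in api_def.get("proxy", {}):  # validate
        raise kopf.PermanentError("apiDefinition.proxy.listen_path is required")

    # Write the route into the shared ConfigMap as JSON.
    v1 = client.CoreV1Api()
    cm = v1.read_namespaced_config_map(CONFIGMAP, namespace)
    cm.data = {**(cm.data or {}), f"{name}.json": json.dumps(api_def)}
    v1.replace_namespaced_config_map(CONFIGMAP, namespace, cm)

    # Status reflects only the ConfigMap write. That detail matters later.
    patch["status"] = {
        "state": "active",
        "listenPath": api_def["proxy"]["listen_path"],
        "targetConfigMap": CONFIGMAP,
    }
```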

It worked perfectly in staging with a dozen routes. We shipped it to production.

Eight months later we’d had three production incidents, merge conflicts had become a recurring headache, and we kept running into ConfigMap size limits. The gap between “code that works” and “code that works at scale” hit us hard.

The Merge Conflict Problem

The original design put all routes into a single ConfigMap. Tyk reads route definitions from mounted files at startup, so one ConfigMap with all routes made sense.

It made sense until four teams started deploying independently.

Payments wanted to add a refund endpoint. Products wanted to add an inventory check. Both branched from the same commit. Both modified the same ConfigMap. Git has no idea how to auto-merge 2000 lines of JSON route definitions, so you end up with 57 conflict hunks, every one of them needing manual resolution.

An engineer spends 40 minutes resolving it manually. The YAML looks valid and the PR gets approved. Three routes get silently dropped. Six hours later, production starts throwing 404s.

This wasn’t a one-off. The probability of a conflict scales with how often teams deploy. We tried Slack coordination, namespace separation, and not committing the ConfigMap at all. Each workaround introduced a different problem. The root cause was architectural: GitOps assumes independent resources, but a ConfigMap is atomic. The entire object gets versioned as a unit.

Race Conditions

Kopf gives you decorators for the three lifecycle events: create, update, delete. Simple enough.

The uniqueness check for listen_path wasn’t atomic, though. The flow was: read the ConfigMap, check whether the path already exists, write the new route if it doesn’t. With multiple operator replicas, or two TykRoute resources created at the same time, both checks pass before either write completes. You end up with duplicate listen paths in production: a classic check-then-act race.
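
If the write goes through a patch, which carries no resourceVersion precondition, Kubernetes accepts both writes. Reduced to its essence, with the same illustrative names as the sketch above:

```python
import json

import kopf

def ensure_route(v1, namespace, name, api_def):
    path = api_def["proxy"]["listen_path"]

    # 1. Read a snapshot of the shared ConfigMap.
    cm = v1.read_namespaced_config_map("tyk-routes", namespace)
    routes = [json.loads(v) for v in (cm.data or {}).values()]

    # 2. Check uniqueness against that snapshot.
    if any(r["proxy"]["listen_path"] == path for r in routes):
        raise kopf.PermanentError(f"listen_path {path!r} already in use")

    # 3. Write. A second reconcile running between steps 2 and 3 checked
    #    the same stale snapshot, passed, and lands its write here too.
    v1.patch_namespaced_config_map(
        "tyk-routes", namespace,
        {"data": {f"{name}.json": json.dumps(api_def)}},
    )
```

A replace that carries the snapshot’s resourceVersion would at least turn the second write into a 409 conflict and force a re-read.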

The rollout restart had its own issue. After updating the ConfigMap, the operator would patch the deployment annotations to trigger a restart, then immediately set the TykRoute status to active. No waiting, no verification. The operator had no idea if the rollout succeeded, if pods were crashlooping, or if Tyk rejected the config.
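
The restart step, roughly: patch a pod-template annotation (the same trick kubectl rollout restart uses), then declare victory. The deployment name here is illustrative.

```python
import datetime

from kubernetes import client

def restart_and_report(patch, namespace):
    client.AppsV1Api().patch_namespaced_deployment(
        "tyk-gateway", namespace,
        {"spec": {"template": {"metadata": {"annotations": {
            "kubectl.kubernetes.io/restartedAt":
                datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }}}}},
    )
    # Fire and forget: no wait for the rollout, no readiness check, no
    # confirmation that Tyk actually loaded the new routes.
    patch["status"] = {"state": "active"}
```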

Fixing this properly would’ve meant integrating with the Tyk admin API, watching rollout status, and adding locking. More complexity, more failure modes. We chose simplicity over robustness.

The 1MB Wall

Kubernetes caps the data in a ConfigMap at a hard 1MiB.

A complex route with auth, rate limiting, transforms, and caching is around 10KB. A simple route is about 2KB. Call the average 5KB. Divide 1MiB by 5KB and you get roughly 200 routes before you hit the ceiling.

Fine with 12 routes in staging. Not fine when you have an entire API surface spread across multiple teams.

The failure mode is silent until you hit it. Kubernetes rejects the update with a cryptic error. New routes stop working, existing routes keep serving traffic. If nobody’s watching closely, it takes a while to figure out what happened.
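
A preflight check before each write doesn’t lift the limit, but it turns the cryptic rejection into a loud, early error. A sketch, under the same assumptions as above:

```python
import kopf

LIMIT = 1024 * 1024           # the hard 1MiB cap on ConfigMap data
SOFT_LIMIT = int(LIMIT * 0.8)

def check_capacity(data: dict, logger):
    size = sum(len(k) + len(v.encode()) for k, v in data.items())
    if size >= LIMIT:
        raise kopf.PermanentError(
            f"route ConfigMap would exceed the 1MiB limit ({size} bytes)")
    if size >= SOFT_LIMIT:
        logger.warning("route ConfigMap at %d%% of the 1MiB limit",
                       100 * size // LIMIT)
```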

The workarounds weren’t great either. Multiple ConfigMaps means multiple volume mounts and complexity that spreads into the Tyk deployment config. Increasing the cluster limit is a non-starter with most platform teams. We ended up with manual capacity planning: monitor the size, split across namespaces when we get close.

Automated route deployment, manual storage capacity planning.

Status That Lies

The TykRoute status had fields for state, listenPath, targetConfigMap, and conditions. When you looked at a route and saw state: active, it meant the ConfigMap update succeeded.

It did not mean the route was serving traffic.

A rollout could’ve failed. Pods could be crashlooping. Tyk could’ve rejected the config entirely. The operator only knew about the ConfigMap write. The deployment pipeline saw active and marked success. Monitoring saw active and stayed quiet. Meanwhile, requests were hitting 404s.

Real status would require integrating with the Tyk admin API, watching rollout completion, and running health checks against the routes. That’s a much bigger project, and we chose not to do it. The operator reports what it knows, which turns out not to be what operators actually need to know.
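
For scale, just the middle piece, watching rollout completion, would look something like this. The condition is roughly what kubectl rollout status checks; the deployment name is illustrative, and the Tyk admin API integration and per-route health checks would still sit on top.

```python
from kubernetes import client

def rollout_complete(namespace: str, name: str = "tyk-gateway") -> bool:
    # Done when the controller has observed the latest generation and
    # every replica is both updated and available.
    d = client.AppsV1Api().read_namespaced_deployment(name, namespace)
    return (
        (d.status.observed_generation or 0) >= d.metadata.generation
        and (d.status.updated_replicas or 0) == (d.spec.replicas or 0)
        and (d.status.available_replicas or 0) == (d.spec.replicas or 0)
    )
```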

What Tyk Enterprise Probably Does

I can’t say for certain, but the problems we hit are obvious enough that a real product would’ve had to solve them. Probably database writes instead of a single ConfigMap. Actual Tyk API integration so status reflects real availability. Some mechanism for teams to deploy independently without merge conflicts.

The quality gap between a weekend project and production-grade tooling is real and expected.

The Trap

The pattern is predictable in hindsight. Choose open source to avoid the licensing cost. Discover that production use requires the paywalled tooling. By then you’re already committed to the technology. So you build it yourself.

A PoC is a weekend. Production-grade with all the edge cases is a different story. Maintenance over time is harder still.

The comparison spreadsheet made it look obvious: Enterprise at $30k/year versus operator at $0/year. The spreadsheet didn’t have a line for engineering time spent building, debugging, explaining limitations, fielding questions, and responding to incidents. At market rate for a senior engineer, that time quickly adds up to more than the license cost.

Engineering time doesn’t show up as an incremental cost on budgets the way a vendor invoice does. The CFO sees infrastructure cost down and doesn’t see velocity down. By the time the productivity drag is obvious, you’re maintaining critical infrastructure and the migration cost is high enough that you just keep going.

Tyk’s setup is well-designed for this. The gateway itself is solid and production-ready. The gaps only appear when you start doing K8s-native workflows. By then, traffic is flowing through it, teams depend on it, and migration is a serious project.

We saved the license cost and inherited the development and maintenance cost. Eight months in, I’m not sure we came out ahead.

What I’d Think About Differently

Don’t use a single ConfigMap for data that grows without bound. Don’t report state as active when you’ve only verified the config write. Coordination logic that works for one team breaks for multiple teams in ways that aren’t obvious until you’re in it.

More importantly: “can we build this” is the wrong question. The right question is whether the operational burden is worth the cost savings. Sometimes the answer is yes, especially when the commercial tool genuinely doesn’t fit your requirements or you have real spare capacity. But when a commercial tool exists and has broad adoption, it usually exists because the problem is harder than it looks. The price reflects engineering effort that’s already been spent solving problems you haven’t hit yet.

For tyk-cm-routes-controller: the routes work and the workflow is declarative. But merge conflicts are a recurring tax, the size limit requires manual management, the status is misleading, and every edge case requires manual verification. Fixing the design problems properly would take weeks of work.

We built it to save money. We might have.