The Open Source Bait and Switch Nobody Talks About
We needed an API gateway. Enterprise options like Kong ($30k/year) and AWS API Gateway were expensive. Tyk’s open source gateway looked perfect—free, performant, written in Go.
The problem: Tyk uses imperative API calls for route management. Our infrastructure is declarative—everything in Git, deployed via kubectl apply. We needed a Kubernetes operator.
Tyk has one: Tyk Operator. Declarative, GitOps-ready, exactly what we needed.
Except it’s enterprise-only.
The open source gateway is free. The operator that makes it usable in cloud-native environments costs exactly what we tried to avoid. Classic bait and switch—core product free, operational tooling paywalled.
So I built my own.
The Implementation
tyk-cm-routes-controller - 565 lines of Python using Kopf. A simple controller pattern (sketched after the list):
- Watch TykRoute custom resources
- Extract API definition from spec
- Validate and write to ConfigMap as JSON
- Trigger rollout restart (optional)
- Update resource status
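The create path is representative. A trimmed sketch, not the actual source; the CRD group, resource names, and namespace below are illustrative:

```python
import json
from datetime import datetime, timezone

import kopf
from kubernetes import client, config

# Illustrative names -- the real controller reads these from its own config.
CONFIGMAP = "api-routes"     # the single ConfigMap holding every route
NAMESPACE = "tyk"            # namespace of the gateway deployment
DEPLOYMENT = "tyk-gateway"   # deployment to restart after a config change


@kopf.on.startup()
def init(**_):
    config.load_incluster_config()   # assumes the operator runs in-cluster


@kopf.on.create("tyk.example.com", "v1alpha1", "tykroutes")
def create_route(spec, name, patch, logger, **_):
    core = client.CoreV1Api()
    apps = client.AppsV1Api()

    # 1. Extract and minimally validate the API definition from the CR spec.
    api_def = dict(spec.get("apiDefinition", {}))
    listen_path = api_def.get("proxy", {}).get("listen_path")
    if not listen_path:
        raise kopf.PermanentError("spec.apiDefinition.proxy.listen_path is required")

    # 2. Write it into the shared ConfigMap as one more JSON file.
    core.patch_namespaced_config_map(
        CONFIGMAP, NAMESPACE,
        body={"data": {f"{name}.json": json.dumps(api_def)}},
    )

    # 3. Optionally trigger a rollout restart so the gateway re-reads its files.
    apps.patch_namespaced_deployment(
        DEPLOYMENT, NAMESPACE,
        body={"spec": {"template": {"metadata": {"annotations": {
            "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat(),
        }}}}},
    )

    # 4. Update the resource status -- note this only proves the ConfigMap write.
    patch.status["state"] = "active"
    patch.status["listenPath"] = listen_path
    patch.status["targetConfigMap"] = f"{NAMESPACE}/{CONFIGMAP}"
    logger.info("route %s written to %s", name, CONFIGMAP)
```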
Worked perfectly in staging with a dozen routes. Shipped to production.
Eight months later: three production incidents, constant merge conflicts, and ConfigMap size limit headaches. The gap between code that works and code that works at scale became painfully clear.
Problem #1: The ConfigMap Collaboration Disaster
Architecture: All routes in a single ConfigMap. Tyk reads route definitions from files at startup, so mount one ConfigMap with all routes.
Works fine for one team. Breaks with multiple teams deploying independently.
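Concretely, every team's route lands as another key in the same object; roughly (route names and paths made up):

```python
# Rough shape of the shared ConfigMap: one JSON blob per route, all in one object.
api_routes = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "api-routes", "namespace": "tyk"},
    "data": {
        "payments-refund.json":    '{"proxy": {"listen_path": "/payments/refunds/"}}',
        "products-inventory.json": '{"proxy": {"listen_path": "/products/inventory/"}}',
        # ...plus every other route from every other team
    },
}
```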
The Issue: Four teams (Payments, Auth, Products, Orders) all write to the same api-routes ConfigMap. Every deployment potentially conflicts.
Scenario: Payments adds a refund endpoint. Products adds an inventory check. Both branch from the same commit; both modify the ConfigMap. Git can’t auto-merge: a 2,000-line diff, 57 JSON route blobs all marked conflicting.
An engineer spends 40 minutes resolving it manually. The YAML is valid, the PR is approved. Three routes are accidentally dropped. Production 404s six hours later.
This is Tuesday. Conflict probability scales with deployment frequency × team count.
Workarounds tried:
- Coordination Slack channel: Manual scheduling defeats the purpose of automation
- Namespace separation: Requires mounting multiple ConfigMaps; the complexity spreads
- Don’t commit the ConfigMap: Breaks the audit trail; the source of truth moves to the cluster
Root cause: GitOps assumes independent resources. ConfigMaps are atomic—entire object is versioned as a unit. Tyk Enterprise probably uses database writes or smart sharding.
Problem #2: Race Conditions
Using Kopf: Three handlers (create, update, delete) triggered by resource changes.
Race #1: Uniqueness Check
Handler validates listen_path uniqueness:
- Read ConfigMap
- Check if path exists
- Write new route if unique
Not atomic. With multiple operator replicas or concurrent TykRoute creates, both checks pass before either writes. Result: duplicate listen paths in production.
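In code, the sequence is the classic check-then-act pattern; a sketch using the same illustrative names as above:

```python
import json

import kopf
from kubernetes import client

CONFIGMAP, NAMESPACE = "api-routes", "tyk"   # illustrative names, as above


def write_route_if_unique(core: client.CoreV1Api, name: str, api_def: dict):
    listen_path = api_def["proxy"]["listen_path"]

    # 1. Read the shared ConfigMap.
    cm = core.read_namespaced_config_map(CONFIGMAP, NAMESPACE)
    taken = {
        json.loads(blob).get("proxy", {}).get("listen_path")
        for blob in (cm.data or {}).values()
    }

    # 2. Check uniqueness against that (already stale) snapshot.
    if listen_path in taken:
        raise kopf.PermanentError(f"listen_path {listen_path} already in use")

    # 3. Write. Nothing stops a second replica that read the same snapshot
    #    from also passing step 2 and writing a duplicate listen path.
    core.patch_namespaced_config_map(
        CONFIGMAP, NAMESPACE,
        body={"data": {f"{name}.json": json.dumps(api_def)}},
    )
```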
Race #2: Async Rollout
Flow:
- Update ConfigMap
- Trigger rollout restart (patch deployment annotations)
- Immediately set TykRoute status to active
No waiting for rollout completion. No verification Tyk loaded the route.
Result: Status shows active, but:
- Rollout might have failed
- Pods could be crashlooping
- Tyk might have rejected invalid config
Operator only knows ConfigMap was updated, not whether Tyk applied it.
A real fix would require Tyk API integration, rollout status watching, and locking: more complexity, more failure modes. We chose simplicity over robustness.
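For the record, the locking piece wouldn't need an external lock: Kubernetes already gives you optimistic concurrency through resourceVersion. It does, however, mean a compare-and-retry loop around every write. A sketch of that piece alone, not something we shipped:

```python
import json

from kubernetes import client
from kubernetes.client.rest import ApiException

CONFIGMAP, NAMESPACE = "api-routes", "tyk"   # illustrative names, as above


def write_route_atomically(core: client.CoreV1Api, name: str, api_def: dict,
                           retries: int = 5):
    """Check-then-write pinned to resourceVersion, retried on 409 conflicts."""
    listen_path = api_def["proxy"]["listen_path"]
    for _ in range(retries):
        cm = core.read_namespaced_config_map(CONFIGMAP, NAMESPACE)
        data = dict(cm.data or {})
        taken = {
            json.loads(blob).get("proxy", {}).get("listen_path")
            for blob in data.values()
        }
        if listen_path in taken:
            raise ValueError(f"listen_path {listen_path} already in use")
        data[f"{name}.json"] = json.dumps(api_def)
        cm.data = data
        try:
            # replace() carries cm.metadata.resource_version from the read;
            # the API server rejects it with 409 Conflict if anything wrote
            # to the ConfigMap in between, and we re-read and try again.
            core.replace_namespaced_config_map(CONFIGMAP, NAMESPACE, cm)
            return
        except ApiException as exc:
            if exc.status != 409:
                raise
    raise RuntimeError("gave up after repeated ConfigMap write conflicts")
```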
Problem #3: ConfigMap Size Limit
Hard limit: 1 MiB per ConfigMap (a Kubernetes/etcd constraint).
Math:
- Complex route (auth, rate limiting, transforms, caching): ~10KB
- Simple route: ~2KB
- Average: ~5KB
- Capacity: ~200 routes
Fine in staging with 12 routes. Not fine with entire API surface across teams.
Failure mode: Silent until you hit it. Kubernetes rejects the update with a cryptic error. New routes fail to deploy; existing routes keep working.
Fix options:
- Multiple ConfigMaps: Requires multiple volume mounts, coupling spreads to Tyk deployment config
- Increase cluster limit: Cluster-wide parameter, admins won’t change it
- ConfigMap sharding: Retrofit would require API changes, migrations, retraining
Actual workaround: Manual capacity planning. Monitor size, split across namespaces when close.
Automated route deployment, manual storage capacity planning.
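The monitoring itself is a few lines; the point is that it has to exist at all. A sketch (the warning threshold is arbitrary):

```python
from kubernetes import client

LIMIT = 1024 * 1024      # Kubernetes caps ConfigMap data at 1 MiB
WARN_AT = 0.8 * LIMIT    # arbitrary point at which to start planning a split


def configmap_bytes(core: client.CoreV1Api, name: str = "api-routes",
                    namespace: str = "tyk") -> int:
    """Approximate serialized size of the ConfigMap's data section."""
    cm = core.read_namespaced_config_map(name, namespace)
    return sum(len(k) + len(v) for k, v in (cm.data or {}).items())


def check_capacity(core: client.CoreV1Api) -> None:
    used = configmap_bytes(core)
    if used > WARN_AT:
        # At ~5KB per average route this fires somewhere around 160 routes.
        print(f"api-routes at {used / LIMIT:.0%} of the 1 MiB limit: plan a split")
```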
Problem #4: Status Subresource Illusion
Status fields: state, listenPath, targetConfigMap, conditions
Operator sets state: active when ConfigMap update succeeds. Looks like useful observability.
The lie: Status = “ConfigMap updated”, NOT “route serving traffic”
Scenario: TykRoute shows active, but:
- Rollout restart failed
- Pods crashlooping
- Tyk rejected the config
Deployment pipeline sees active, marks success. Monitoring sees active, no alert. Traffic gets 404s.
The gap: “ConfigMap updated” ≠ “route available”
Real fix would need:
- Tyk admin API integration
- Rollout status watching
- Health checks for routes
- API credentials, retry logic, timeout handling
Operator becomes full Tyk runtime integration. More complexity, more maintenance.
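For a sense of scale, here is just the rollout-watching piece, leaving the Tyk API half aside (a sketch; the deployment name and thresholds are illustrative):

```python
import time

from kubernetes import client


def wait_for_rollout(apps: client.AppsV1Api, name: str = "tyk-gateway",
                     namespace: str = "tyk", timeout: int = 300) -> bool:
    """Roughly what `kubectl rollout status deployment/<name>` waits for."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        dep = apps.read_namespaced_deployment(name, namespace)
        want = dep.spec.replicas or 0
        status = dep.status
        done = (
            (status.observed_generation or 0) >= dep.metadata.generation
            and (status.updated_replicas or 0) == want
            and (status.available_replicas or 0) == want
        )
        if done:
            # Pods are up -- which still says nothing about whether Tyk
            # accepted the route definitions it just re-read.
            return True
        time.sleep(5)
    return False  # only now would it be honest to withhold state: "active"
```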
Chose: Technical correctness (reports what it knows) over operational usefulness (reports what users need).
What Tyk Enterprise Likely Does Better
Storage: Probably database writes or smart sharding, not single ConfigMap
Verification: Tyk API integration, actual health checks, status reflects real availability
Collaboration: An architecture that avoids merge conflicts, or workflows that route around them
Observability: Metrics, drift detection, proper alerting
Support: Documentation, runbooks, engineers who respond to tickets
Weekend project vs. production-grade tooling. Quality gap is expected.
The Open Source Bait and Switch
Pattern: Core product open source. Operational tooling enterprise-only.
The trap:
- Choose open source to save money
- Discover production requires paywalled tooling
- Already committed to the tech
- Build it yourself
Reality check:
- PoC in a weekend ✓
- Production-grade with edge cases = hard
- Maintenance over time = harder
Cost calculation: Weekend build + dozens of hours debugging + explaining limitations + fielding questions. At market rate for senior engineers, exceeds Enterprise license multiple times over.
Why we keep building: Engineering time doesn’t show as incremental cost on budgets. CFO sees infrastructure cost down, doesn’t see velocity down.
By the time velocity impact is obvious, you’re maintaining critical infrastructure. Migration activation energy is high. You keep maintaining.
Tyk’s trap is well-designed: Gateway is solid and production-ready. Gaps only appear with K8s-native workflows. By then, traffic is flowing, teams depend on it, migration is massive.
Saved license cost, inherited development + maintenance cost. Total cost of ownership might be higher, but it’s distributed across engineering time, not a budget line item.
Lessons Learned
Technical:
- Don’t store unbounded data in a single ConfigMap
- Don’t report state: active when all you’ve verified is a config write
- Simple coordination logic stops scaling once you add replicas or teams
Operational:
- Prototype ≠ production infrastructure
- Technical problem ≠ organizational problem
- Works in staging ≠ works with multiple teams
Strategic: Built to avoid Enterprise license. Eight months later, engineering time spent > license cost.
Time could’ve gone to features, reliability improvements, reducing debt. Instead: maintaining operator with annoying limitations.
“Should We Build This?” vs “Can We Build This?”
Ask: Is the operational burden worth the cost savings?
Yes, build when:
- Unique requirements commercial tools don’t meet
- Spare engineering capacity exists
- Problem truly unique to your environment
No, don’t build when:
- Commercial tool exists because problem is harder than it looks
- License cost reflects production-grade engineering effort
- Building moves cost from vendor to you
For tyk-cm-routes-controller: routes work and the workflow is declarative. But the merge conflicts, size limits, false status confidence, and manual verification remain: design limitations that would take weeks to fix, versus paying for Enterprise.
The Real Cost of Free
We calculate: License fees, instance hours, bandwidth (appear on budgets)
We don’t calculate: Engineering time maintaining tools, velocity lost to complexity, opportunity cost (spread across productivity drags, not one line item)
Comparison spreadsheet: Enterprise $30k/year vs operator $0/year
Reality: Build + maintain + debug + support + explain limitations + velocity lost to workarounds + incident response. At market rate > Enterprise license.
Open source benefits are real: Control, transparency, modifiability
But sometimes free is expensive: Commercial price reflects solving problems you haven’t hit yet.
The hard part: Knowing which situation you’re in before you’re committed.
Built to save money. Eight months later, not sure we did.