The Open Source Bait and Switch Nobody Talks About

We needed an API gateway. Kong was $30k/year, AWS API Gateway had its own cost trap. Tyk’s open source gateway looked like the answer: free, performant, written in Go.

The problem was route management. Tyk uses imperative API calls by default, but our infrastructure is fully declarative. Everything lives in Git, deployed with kubectl apply. We needed an operator.

Tyk has one. It’s called Tyk Operator and it’s exactly what we needed: declarative, GitOps-ready.

It’s also enterprise-only.

The open source gateway is free. The tooling that makes it actually usable in a cloud-native setup costs money. Core product free, operational tooling paywalled. So I built my own.

What I Built

tyk-cm-routes-controller is 565 lines of Python using Kopf. The idea is straightforward: watch TykRoute custom resources, pull the API definition from the spec, validate it, write it to a ConfigMap as JSON, optionally trigger a rollout restart, and update the resource status. Done.
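
In rough shape, the reconcile path looked like the sketch below. This is a minimal illustration, not the actual source: the CRD group, version, and field names are placeholders.

```python
import json

import kopf
from kubernetes import client, config

CONFIGMAP = "tyk-routes"  # the single shared ConfigMap (more on that below)

@kopf.on.startup()
def init(**_):
    config.load_incluster_config()

@kopf.on.create("tyk.example.com", "v1alpha1", "tykroutes")
@kopf.on.update("tyk.example.com", "v1alpha1", "tykroutes")
def reconcile(spec, name, namespace, patch, **_):
    api_def = dict(spec["apiDefinition"])  # pull the Tyk API definition
    if "listen_path" not in api_def.get("proxy", {}):  # validate
        raise kopf.PermanentError("apiDefinition.proxy.listen_path is required")

    # Write the route into the shared ConfigMap as JSON.
    v1 = client.CoreV1Api()
    cm = v1.read_namespaced_config_map(CONFIGMAP, namespace)
    cm.data = {**(cm.data or {}), f"{name}.json": json.dumps(api_def)}
    v1.replace_namespaced_config_map(CONFIGMAP, namespace, cm)

    # Status reflects only the ConfigMap write. That detail matters later.
    patch["status"] = {
        "state": "active",
        "listenPath": api_def["proxy"]["listen_path"],
        "targetConfigMap": CONFIGMAP,
    }
```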

It worked perfectly in staging with a dozen routes. We shipped it to production.

Eight months later we’d had three production incidents, merge conflicts had become a recurring headache, and we kept running into ConfigMap size limits. The gap between “code that works” and “code that works at scale” hit us hard.

The Merge Conflict Problem

The original design put all routes into a single ConfigMap. Tyk reads route definitions from mounted files at startup, so one ConfigMap with all routes made sense.

It made sense until four teams started deploying independently.

Payments wanted to add a refund endpoint. Products wanted to add an inventory check. Both branched from the same commit. Both modified the same ConfigMap. Git has no idea how to auto-merge 2000 lines of JSON route definitions, so you end up with 57 conflict hunks, every one of them needing manual resolution.

An engineer spends 40 minutes resolving it manually. The YAML looks valid and the PR gets approved. Three routes get silently dropped. Six hours later, production starts throwing 404s.

This wasn’t a one-off. The probability of a conflict scales with how often teams deploy. We tried Slack coordination, namespace separation, and not committing the ConfigMap at all. Each workaround introduced a different problem. The root cause was architectural: GitOps assumes independent resources, but a ConfigMap is atomic. The entire object gets versioned as a unit.

Race Conditions

Kopf gives you decorators for the three lifecycle events: create, update, delete. Simple enough.

The uniqueness check for listen_path wasn’t atomic, though. The flow was: read the ConfigMap, check whether the path already exists, write the new route if it doesn’t. With multiple operator replicas, or two TykRoute resources created at the same time, both checks pass before either write completes. You end up with duplicate listen paths in production: a classic check-then-act race.
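
If the write goes through a patch, which carries no resourceVersion precondition, Kubernetes accepts both writes. Reduced to its essence, with the same illustrative names as the sketch above:

```python
import json

import kopf

def ensure_route(v1, namespace, name, api_def):
    path = api_def["proxy"]["listen_path"]

    # 1. Read a snapshot of the shared ConfigMap.
    cm = v1.read_namespaced_config_map("tyk-routes", namespace)
    routes = [json.loads(v) for v in (cm.data or {}).values()]

    # 2. Check uniqueness against that snapshot.
    if any(r["proxy"]["listen_path"] == path for r in routes):
        raise kopf.PermanentError(f"listen_path {path!r} already in use")

    # 3. Write. A second reconcile running between steps 2 and 3 checked
    #    the same stale snapshot, passed, and lands its write here too.
    v1.patch_namespaced_config_map(
        "tyk-routes", namespace,
        {"data": {f"{name}.json": json.dumps(api_def)}},
    )
```

A replace that carries the snapshot’s resourceVersion would at least turn the second write into a 409 conflict and force a re-read.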

The rollout restart had its own issue. After updating the ConfigMap, the operator would patch the deployment annotations to trigger a restart, then immediately set the TykRoute status to active. No waiting, no verification. The operator had no idea if the rollout succeeded, if pods were crashlooping, or if Tyk rejected the config.
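
The restart step, roughly: patch a pod-template annotation (the same trick kubectl rollout restart uses), then declare victory. The deployment name here is illustrative.

```python
import datetime

from kubernetes import client

def restart_and_report(patch, namespace):
    client.AppsV1Api().patch_namespaced_deployment(
        "tyk-gateway", namespace,
        {"spec": {"template": {"metadata": {"annotations": {
            "kubectl.kubernetes.io/restartedAt":
                datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }}}}},
    )
    # Fire and forget: no wait for the rollout, no readiness check, no
    # confirmation that Tyk actually loaded the new routes.
    patch["status"] = {"state": "active"}
```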

Fixing this properly would’ve meant integrating with the Tyk admin API, watching rollout status, and adding locking. More complexity, more failure modes. We chose simplicity over robustness.

The 1MB Wall

Kubernetes caps the data in a ConfigMap at a hard 1MiB.

A complex route with auth, rate limiting, transforms, and caching is around 10KB. A simple route is about 2KB. Call the average 5KB. Divide 1MiB by 5KB and you get roughly 200 routes before you hit the ceiling.

Fine with 12 routes in staging. Not fine when you have an entire API surface spread across multiple teams.

The failure mode is silent until you hit it. Kubernetes rejects the update with a cryptic error. New routes stop working, existing routes keep serving traffic. If nobody’s watching closely, it takes a while to figure out what happened.
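
A preflight check before each write doesn’t lift the limit, but it turns the cryptic rejection into a loud, early error. A sketch, under the same assumptions as above:

```python
import kopf

LIMIT = 1024 * 1024           # the hard 1MiB cap on ConfigMap data
SOFT_LIMIT = int(LIMIT * 0.8)

def check_capacity(data: dict, logger):
    size = sum(len(k) + len(v.encode()) for k, v in data.items())
    if size >= LIMIT:
        raise kopf.PermanentError(
            f"route ConfigMap would exceed the 1MiB limit ({size} bytes)")
    if size >= SOFT_LIMIT:
        logger.warning("route ConfigMap at %d%% of the 1MiB limit",
                       100 * size // LIMIT)
```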

The workarounds weren’t great either. Multiple ConfigMaps means multiple volume mounts and complexity that spreads into the Tyk deployment config. Increasing the cluster limit is a non-starter with most platform teams. We ended up with manual capacity planning: monitor the size, split across namespaces when we get close.

Automated route deployment, manual storage capacity planning.

Status That Lies

The TykRoute status had fields for state, listenPath, targetConfigMap, and conditions. When you looked at a route and saw state: active, it meant the ConfigMap update succeeded.

It did not mean the route was serving traffic.

A rollout could’ve failed. Pods could be crashlooping. Tyk could’ve rejected the config entirely. The operator only knew about the ConfigMap write. The deployment pipeline saw active and marked success. Monitoring saw active and stayed quiet. Meanwhile, requests were hitting 404s.

Real status would require integrating with the Tyk admin API, watching rollout completion, and running health checks against the routes. That’s a much bigger project, and we chose not to do it. The operator reports what it knows, which turns out not to be what operators actually need to know.
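
For scale, just the middle piece, watching rollout completion, would look something like this. The condition is roughly what kubectl rollout status checks; the deployment name is illustrative, and the Tyk admin API integration and per-route health checks would still sit on top.

```python
from kubernetes import client

def rollout_complete(namespace: str, name: str = "tyk-gateway") -> bool:
    # Done when the controller has observed the latest generation and
    # every replica is both updated and available.
    d = client.AppsV1Api().read_namespaced_deployment(name, namespace)
    return (
        (d.status.observed_generation or 0) >= d.metadata.generation
        and (d.status.updated_replicas or 0) == (d.spec.replicas or 0)
        and (d.status.available_replicas or 0) == (d.spec.replicas or 0)
    )
```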

What Tyk Enterprise Probably Does

I can’t say for certain, but the problems we hit are obvious enough that a real product would’ve had to solve them. Probably database writes instead of a single ConfigMap. Actual Tyk API integration so status reflects real availability. Some mechanism for teams to deploy independently without merge conflicts.

The quality gap between a weekend project and production-grade tooling is real and expected.

The Trap

The pattern is predictable in hindsight. Choose open source to avoid the licensing cost. Discover that production use requires the paywalled tooling. By then you’re already committed to the technology. So you build it yourself.

A PoC is a weekend. Production-grade with all the edge cases is a different story. Maintenance over time is harder still.

The comparison spreadsheet made it look obvious: Enterprise at $30k/year versus operator at $0/year. The spreadsheet didn’t have a line for engineering time spent building, debugging, explaining limitations, fielding questions, and responding to incidents. At market rate for a senior engineer, that time quickly adds up to more than the license cost.

Engineering time doesn’t show up as an incremental cost on budgets the way a vendor invoice does. The CFO sees infrastructure cost down and doesn’t see velocity down. By the time the productivity drag is obvious, you’re maintaining critical infrastructure and the migration cost is high enough that you just keep going.

Tyk’s setup is well-designed for this. The gateway itself is solid and production-ready. The gaps only appear when you start doing K8s-native workflows. By then, traffic is flowing through it, teams depend on it, and migration is a serious project.

We saved the license cost and inherited the development and maintenance cost. Eight months in, I’m not sure we came out ahead.

What I’d Think About Differently

Don’t use a single ConfigMap for data that grows without bound. Don’t report state as active when you’ve only verified the config write. Coordination logic that works for one team breaks for multiple teams in ways that aren’t obvious until you’re in it.

More importantly: “can we build this” is the wrong question. The right question is whether the operational burden is worth the cost savings. Sometimes the answer is yes, especially when the commercial tool genuinely doesn’t fit your requirements or you have real spare capacity. But when a commercial tool exists and has broad adoption, it usually exists because the problem is harder than it looks. The price reflects engineering effort that’s already been spent solving problems you haven’t hit yet.

For tyk-cm-routes-controller: the routes work and the workflow is declarative. But merge conflicts are a recurring tax, the size limit requires manual management, the status is misleading, and every edge case requires manual verification. Fixing the design problems properly would take weeks of work.

We built it to save money. We might have.