If you manage a large-scale Kubernetes cluster, you know the problem: horizontally scaling a custom controller doesn't really work. Every replica receives the complete event stream from the API server and deserializes every Pod, every ConfigMap, every Secret. On a 5,000-node cluster, that is a massive waste of resources.
Kubernetes v1.36, released in May 2026, finally provides a native solution to this problem with Coordinated Leader Election. This guide explains how it works and how to implement it in your production workloads.
The Problem: Why Horizontal Controller Scaling Is Broken
Before understanding the solution, we need to understand the problem. Let's look at a concrete example.
The Real Cost of Watching Everything
Imagine a controller managing a Custom Resource in a 5,000-node cluster. You run 3 replicas for high availability. Here's what happens:
Every replica receives the entire event stream
The API server sends the same events (Pod created, ConfigMap modified, Node added) to each replica. If the cluster has 10,000 Pods, each replica receives and processes the events for all 10,000 of them.
Redundant deserialization
Each replica deserializes the same JSON objects into Go structs. That's wasted CPU, multiplied by the number of replicas.
Duplicated memory
Each replica maintains its own cache of cluster objects. 3 replicas means 3 copies of the complete state in memory.
The Illusion of Leader Election
The classic leader election pattern in Kubernetes doesn't solve this problem. Yes, only one replica is the "leader" performing actions. But all replicas continue receiving and processing events to be ready to take over.
According to benchmarks presented at KubeCon 2026, a typical custom controller on a 5,000-node cluster consumes about three times the resources with 3 replicas as it does with 1, with no throughput gain.
The Solution: Coordinated Leader Election (CLE)
Kubernetes v1.36 introduces Coordinated Leader Election, a mechanism that allows replicas to truly share work instead of duplicating it.
How It Works
CLE divides the event stream into "shards" based on a hash of objects. Each replica only processes the shards assigned to it:
- Dynamic partitioning: Shards are distributed among active replicas. If a replica fails, its shards are redistributed to others.
- Targeted watch: Each replica only establishes a watch on its assigned shards. The API server only sends relevant events to it.
- Minimal shared state: Replicas communicate their state via Leases, without needing to synchronize complete cluster state.
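The partitioning idea above can be sketched with the standard library alone. This is a minimal illustration, not the actual CLE implementation: the FNV-1a hash, the "namespace/name" key format, the round-robin assignment, and the names shardFor and assignShards are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps an object key (for example "namespace/name") to a shard
// index in [0, shardCount) using an FNV-1a hash. shardCount must be > 0.
func shardFor(key string, shardCount int) int {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(shardCount))
}

// assignShards spreads shard indices across replicas round-robin and
// returns a map of replica name -> owned shards.
func assignShards(shardCount int, replicas []string) map[string][]int {
	owned := make(map[string][]int, len(replicas))
	if len(replicas) == 0 {
		return owned
	}
	for s := 0; s < shardCount; s++ {
		r := replicas[s%len(replicas)]
		owned[r] = append(owned[r], s)
	}
	return owned
}

func main() {
	replicas := []string{"ctrl-0", "ctrl-1", "ctrl-2"}
	owned := assignShards(12, replicas)
	for _, r := range replicas {
		fmt.Println(r, "owns shards", owned[r])
	}

	// The same key always hashes to the same shard, so exactly one
	// replica is responsible for a given object at any time.
	key := "default/my-pod"
	fmt.Printf("%s -> shard %d\n", key, shardFor(key, 12))
}
```

Because the hash depends only on the object key, no replica has to ask another which shard an object belongs to; only the shard-to-replica mapping has to be shared.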
Measured Gains
Tests on production clusters show significant results:
| Metric | Before CLE (3 replicas) | After CLE (3 replicas) |
|--------|-------------------------|------------------------|
| Total CPU | 300% baseline | 110% baseline |
| Total memory | 300% baseline | 105% baseline |
| p99 latency | 2.4s | 0.8s |
| Throughput | 1x | 2.8x |
Throughput nearly triples because replicas work in parallel on different objects instead of all processing the same events.
Implementing CLE in Your Controllers
Here's how to enable Coordinated Leader Election in your own controllers.
Prerequisites
- Kubernetes cluster v1.36+
- Controller-runtime v0.19+ (for Kubebuilder/Operator SDK-based controllers)
- Feature gate CoordinatedLeaderElection=true enabled on the API server
Controller Configuration
For a controller-runtime based controller, configuration is done in the Manager:
package main

import (
	"log"
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/leaderelection"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "my-controller-leader",
		LeaderElectionNamespace: "my-namespace",
		// Enable CLE
		LeaderElectionConfig: leaderelection.Config{
			CoordinatedEnabled: true,
			ShardCount:         12,                    // Number of shards
			Identity:           os.Getenv("POD_NAME"), // Unique per replica
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	// ... rest of config (register controllers, then start mgr)
}
Choosing the Number of Shards
The shard count determines partitioning granularity:
- Too few shards: Replicas can't distribute work evenly. 4 shards for 5 replicas means one replica stays idle.
- Too many shards: Coordination overhead. Each shard requires a Lease, and reassignments become more frequent.
Rule of thumb: use 3x the maximum planned replica count. For a controller that can scale to 5 replicas, 15 shards is a good starting point.
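The trade-off can be checked with a little arithmetic. The sketch below assumes shards are spread as evenly as possible across replicas; recommendedShards and shardSpread are illustrative helper names, not part of any library.

```go
package main

import "fmt"

// recommendedShards applies the rule of thumb from the text:
// three shards per replica at the maximum planned scale.
func recommendedShards(maxReplicas int) int {
	return 3 * maxReplicas
}

// shardSpread returns the smallest and largest number of shards held by
// any replica when shards are spread as evenly as possible.
// replicas must be > 0.
func shardSpread(shards, replicas int) (lo, hi int) {
	lo = shards / replicas
	hi = lo
	if shards%replicas != 0 {
		hi++
	}
	return lo, hi
}

func main() {
	fmt.Println("shards for 5 max replicas:", recommendedShards(5))

	// 4 shards over 5 replicas: the least-loaded replica holds 0 shards.
	lo, hi := shardSpread(4, 5)
	fmt.Println("4 shards, 5 replicas:", lo, "to", hi, "shards each")

	// 15 shards stay reasonably balanced while scaling from 3 to 5 replicas.
	for _, r := range []int{3, 4, 5} {
		lo, hi = shardSpread(15, r)
		fmt.Println("15 shards,", r, "replicas:", lo, "to", hi, "shards each")
	}
}
```

With 15 shards the busiest replica never holds more than one shard above the least busy one at 3, 4, or 5 replicas, which is the reason for choosing a multiple of the maximum replica count.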
Failover Handling
CLE automatically handles failures:
- Detection: A replica that doesn't renew its Lease is considered failed after the LeaseDuration (default: 15 seconds).
- Reassignment: The failed replica's shards are reassigned to remaining replicas.
- Reconciliation: New shard owners trigger a full reconciliation of affected objects.
Total failover time is generally under 20 seconds, comparable to classic leader election.
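The detection and reassignment steps can be sketched as a single pass over in-memory state. This is a simplified model under stated assumptions, not the real coordination logic: Leases are reduced to a last-renewal timestamp per replica, orphaned shards go to the least-loaded survivor, and the names failover and example are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// failover reassigns shards whose owner has not renewed its lease within
// leaseDuration. owners maps shard -> replica and is updated in place;
// lastRenew maps replica -> time of its last lease renewal.
// It returns the reassigned shards, which then need a full reconciliation.
func failover(owners map[int]string, lastRenew map[string]time.Time,
	now time.Time, leaseDuration time.Duration) []int {

	// Detection: a replica is alive if it renewed within leaseDuration.
	alive := []string{}
	isAlive := map[string]bool{}
	for replica, renewed := range lastRenew {
		if now.Sub(renewed) <= leaseDuration {
			alive = append(alive, replica)
			isAlive[replica] = true
		}
	}
	if len(alive) == 0 {
		return nil // nobody left to take over
	}
	sort.Strings(alive) // deterministic tie-breaking

	load := map[string]int{}
	orphaned := []int{}
	for shard, owner := range owners {
		if isAlive[owner] {
			load[owner]++
		} else {
			orphaned = append(orphaned, shard)
		}
	}
	sort.Ints(orphaned)

	// Reassignment: give each orphaned shard to the least-loaded survivor.
	for _, shard := range orphaned {
		target := alive[0]
		for _, r := range alive[1:] {
			if load[r] < load[target] {
				target = r
			}
		}
		owners[shard] = target
		load[target]++
	}
	return orphaned
}

// example builds 12 shards over three replicas, lets ctrl-2 miss its
// renewals, and runs one failover pass.
func example() (map[int]string, []int) {
	now := time.Now()
	replicas := []string{"ctrl-0", "ctrl-1", "ctrl-2"}
	owners := map[int]string{}
	for s := 0; s < 12; s++ {
		owners[s] = replicas[s%3]
	}
	lastRenew := map[string]time.Time{
		"ctrl-0": now.Add(-5 * time.Second),
		"ctrl-1": now.Add(-5 * time.Second),
		"ctrl-2": now.Add(-20 * time.Second), // missed the 15s window
	}
	moved := failover(owners, lastRenew, now, 15*time.Second)
	return owners, moved
}

func main() {
	owners, moved := example()
	fmt.Println("reassigned shards:", moved)
	fmt.Println("new owner of shard 2:", owners[2])
}
```

The returned shard list is what the new owners would feed into the reconciliation step: every object hashing to one of those shards gets a full reconcile.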
Advanced Use Cases
Multi-Cluster with Coordinated Leader Election
For multi-cluster deployments, CLE can be combined with custom sharding topologies:
LeaderElectionConfig: leaderelection.Config{
	CoordinatedEnabled: true,
	ShardCount:         24,
	ShardingStrategy: leaderelection.TopologyAwareSharding{
		TopologyKey: "topology.kubernetes.io/zone",
	},
},
This configuration distributes shards by zone, reducing network latency between the controller and the objects it manages.
Monitoring and Observability
CLE metrics are exposed via the standard /metrics endpoint:
- controller_leader_election_shard_count: Number of shards assigned to this replica
- controller_leader_election_shard_transitions_total: Number of shard reassignments
- controller_leader_election_shard_reconcile_latency_seconds: Post-failover reconciliation latency
Integrate these metrics into your Prometheus/Grafana dashboards to track sharding health.
Migrating from Classic Leader Election
If you have existing controllers using standard leader election, here's how to migrate to CLE.
Step 1: Check Compatibility
CLE requires your controller to be idempotent and stateless between reconciliations. If your controller maintains an in-memory cache that doesn't rebuild from watches, CLE won't work correctly.
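As a rough illustration of what "idempotent and stateless between reconciliations" means in practice, the sketch below derives the desired state from the object alone and only writes when something differs. The tenant type, its fields, and the "my-controller" label value are hypothetical and exist only for this example.

```go
package main

import "fmt"

// tenant is a hypothetical custom resource reduced to the fields the
// example needs.
type tenant struct {
	Name   string
	Plan   string
	Labels map[string]string
}

// reconcile brings the object's labels in line with the desired state and
// reports whether it changed anything. The desired state is computed from
// the object itself, not from memory of earlier runs, so any replica can
// pick the object up after a shard reassignment.
func reconcile(obj *tenant) bool {
	want := map[string]string{
		"plan":       obj.Plan,
		"managed-by": "my-controller",
	}
	if obj.Labels == nil {
		obj.Labels = map[string]string{}
	}
	changed := false
	for k, v := range want {
		if obj.Labels[k] != v {
			obj.Labels[k] = v
			changed = true
		}
	}
	return changed
}

func main() {
	obj := &tenant{Name: "acme", Plan: "pro"}
	fmt.Println("first run changed something:", reconcile(obj))
	fmt.Println("second run changed something:", reconcile(obj))
}
```

The second call is a no-op, which is the property that makes the full reconciliation after a failover safe to run on objects the previous owner had already handled.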
Step 2: Test in Staging
Deploy your controller with CLE enabled in a test cluster. Simulate replica failures and verify objects are correctly reconciled after reassignment.
Step 3: Progressive Deployment
In production, use a canary deployment:
- Deploy a new ReplicaSet with CLE enabled
- Assign a fraction of the shards to this ReplicaSet
- Monitor metrics for 24-48h
- If stable, migrate completely
Points of Attention
- Event ordering: With CLE, event order is only guaranteed per shard. If your logic depends on global ordering, it needs adaptation.
- Lease resources: CLE creates additional Lease objects. Ensure your RBAC allows Lease creation.
Best Practices for Large-Scale Clusters
CLE is particularly useful for large clusters (over 1,000 nodes). Here are some recommendations:
Sizing Replicas
The recommended formula is:
Number of replicas = ceil(Managed objects / 2000)
A controller managing 10,000 Custom Resources should have 5 replicas with CLE enabled.
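The formula translates directly into integer arithmetic. In this sketch, replicasFor is an illustrative helper name, and returning at least one replica for an empty cluster is an assumption added for the example.

```go
package main

import "fmt"

// replicasFor computes ceil(managedObjects / objectsPerReplica) using
// integer math. It returns at least 1 so the controller keeps running
// even when there is nothing to manage (an assumption of this sketch).
// objectsPerReplica must be > 0.
func replicasFor(managedObjects, objectsPerReplica int) int {
	if managedObjects <= 0 {
		return 1
	}
	return (managedObjects + objectsPerReplica - 1) / objectsPerReplica
}

func main() {
	for _, objects := range []int{1500, 2001, 6000, 10000} {
		fmt.Println(objects, "objects ->", replicasFor(objects, 2000), "replicas")
	}
}
```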
Configuring Resources
With CLE, each replica consumes about 1/N of the resources of a replica without CLE (where N is the replica count). Adjust your requests/limits accordingly.
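A starting point for per-replica requests can be derived from the single-replica baseline. The helper below is a sketch: perReplicaMilliCPU is an illustrative name, and the optional overhead percentage (for example 110, matching the total CPU figure in the measured-gains table) is an assumption to validate against your own metrics.

```go
package main

import "fmt"

// perReplicaMilliCPU splits a single-replica CPU baseline (in millicores)
// across replicas. overheadPercent scales the total first: 100 means the
// plain 1/N estimate, 110 adds 10% for coordination overhead.
// replicas must be > 0.
func perReplicaMilliCPU(baselineMilli, replicas, overheadPercent int) int {
	return baselineMilli * overheadPercent / 100 / replicas
}

func main() {
	// A controller that needs 1500m as a single replica, run as 3 replicas.
	fmt.Println("plain 1/N estimate:", perReplicaMilliCPU(1500, 3, 100), "m")
	fmt.Println("with 10% overhead: ", perReplicaMilliCPU(1500, 3, 110), "m")
}
```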
Avoiding Thundering Herd
After a massive failover (all replicas restart), all shards are reconciled simultaneously. Configure a RateLimiter to avoid overwhelming the API server:
RateLimiter: workqueue.NewItemExponentialFailureRateLimiter(
	100*time.Millisecond, // base delay
	30*time.Second,       // max delay
),
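To see what those two delays mean in practice, the sketch below recomputes the per-item backoff schedule with the standard library: the delay doubles with each consecutive failure of the same item until it hits the cap. The backoff function is a standalone illustration written for this guide, not client-go code.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// backoff returns base * 2^failures, capped at maxDelay. failures is the
// number of times the item has already failed.
func backoff(base, maxDelay time.Duration, failures int) time.Duration {
	d := float64(base) * math.Pow(2, float64(failures))
	if d > float64(maxDelay) {
		return maxDelay
	}
	return time.Duration(d)
}

func main() {
	base, maxDelay := 100*time.Millisecond, 30*time.Second
	for failures := 0; failures <= 9; failures++ {
		fmt.Printf("after %d failures: retry in %v\n",
			failures, backoff(base, maxDelay, failures))
	}
}
```

With a 100 ms base and a 30 s cap, the cap is reached on the tenth attempt. Note that this kind of limiter only slows down retries of items that fail; it does not bound the overall request rate when many healthy items are reconciled at once. client-go's workqueue.DefaultControllerRateLimiter combines per-item backoff with a token bucket for that reason, and a similar combination is worth considering here.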
Real-World Migration: A Step-by-Step Case Study
To make the migration tangible, here is how a SaaS company managing a custom controller for tenant provisioning made the switch on a 3,800-node cluster running 6,000 Custom Resources.
Week 1: Baseline measurement. The team instrumented the existing 3-replica controller with detailed Prometheus metrics. Baseline numbers: 4.2 CPU cores and 9 GB RAM consumed across the three replicas, with p99 reconciliation latency at 3.1 seconds during peak load. They captured a full week of data to account for daily and weekly traffic patterns.
Week 2: Code adaptation. Two changes were needed. First, the controller's in-memory cache (which assumed it saw every event) was refactored to rebuild from watches on startup instead of from a snapshot. Second, RBAC was extended to allow Lease creation in the controller's namespace. Total code diff: 87 lines, plus 140 lines of new tests.
Week 3: Staging validation. The team deployed to a smaller staging cluster (200 nodes, 800 CRs) with CLE enabled. They simulated three failure modes: graceful pod termination, abrupt OOM kill, and network partition between the controller and the API server. All three recovered within 18 seconds, well under the team's 60-second SLO.
Week 4: Canary in production. A new ReplicaSet with CLE enabled was deployed alongside the existing one, taking 25% of shards initially. After 48 hours of stable metrics with no reconciliation errors, the cutover was completed and the old ReplicaSet retired.
Results after migration: 1.6 CPU cores total (down 62%), 3.4 GB RAM (down 62%), p99 latency at 0.9 seconds, and throughput at 2.7x baseline. The migration paid for itself in cloud savings within six weeks.
Impact on Cloud Costs
Efficient horizontal controller scaling has a direct impact on your infrastructure costs.
Before CLE: Scaling a controller from 1 to 3 replicas triples your compute costs for that controller, without improving performance.
With CLE: Scaling from 1 to 3 replicas increases costs by about 10-15% while tripling throughput.
For a company managing multiple Kubernetes clusters, savings can represent tens of thousands of dollars per month.
At Claro Digital, we help businesses optimize their cloud infrastructure. Adopting Kubernetes v1.36 and CLE is part of the recommendations we integrate into our infrastructure audits.
Conclusion
Kubernetes v1.36 finally brings a native solution to the horizontal controller scaling problem. Coordinated Leader Election enables true work distribution among replicas, reducing costs and improving performance.
If you manage large-scale clusters or high-volume custom controllers, migrating to CLE should be a priority. The resource and latency gains more than justify the migration effort.
FAQ
Is Coordinated Leader Election backward compatible?
Yes. If you don't enable the CoordinatedLeaderElection feature gate, your controllers will continue using classic leader election. Migration is opt-in.
Can I use CLE with existing controllers without code changes?
No. The controller code must be updated to configure CLE in the Manager. However, if you use controller-runtime, the changes are minimal (a few lines of configuration).
How many shards should I configure?
Rule of thumb: 3x the maximum planned replica count. For 5 max replicas, use 15 shards. Too many shards create overhead; too few limit work distribution.
Does CLE work with multi-region clusters?
Yes, but with caveats. CLE uses Leases stored in the API server. If your replicas are geographically distributed, Lease renewal latency can cause false positive failovers. Use TopologyAwareSharding to optimize.
What are the prerequisites for enabling CLE?
Kubernetes cluster v1.36+, controller-runtime v0.19+, feature gate CoordinatedLeaderElection=true on the API server, and an idempotent controller without shared state between reconciliations.
