Kubernetes v1.36: The Horizontal Controller Scaling Guide

If you manage a large-scale Kubernetes cluster, you know the problem: horizontally scaling a custom controller doesn't really work. Every replica receives the complete event stream from the API server, deserializes every Pod, every ConfigMap, every Secret. At the scale of a 5,000-node cluster, that's massive resource waste.

Kubernetes v1.36, released in May 2026, finally provides a native solution to this problem with Coordinated Leader Election. This guide explains how it works and how to implement it in your production workloads.

The Problem: Why Horizontal Controller Scaling Is Broken

Before understanding the solution, we need to understand the problem. Let's look at a concrete example.

The Real Cost of Watching Everything

Imagine a controller managing a Custom Resource in a 5,000-node cluster. You run 3 replicas for high availability. Here's what happens:

Every replica receives the entire event stream

The API server sends the same events (Pod created, ConfigMap modified, Node added) to each replica. If you have 10,000 Pods in the cluster, each replica receives and processes all 10,000 events.

Redundant deserialization

Each replica deserializes the same JSON objects into Go structures. That's wasted CPU, multiplied by the number of replicas.

Duplicated memory

Each replica maintains its own cache of cluster objects. 3 replicas means 3 copies of the complete state in memory.

The Illusion of Leader Election

The classic leader election pattern in Kubernetes doesn't solve this problem. Yes, only one replica is the "leader" performing actions. But all replicas continue receiving and processing events to be ready to take over.

According to benchmarks presented at KubeCon 2026, a typical custom controller on a 5,000-node cluster consumes about 3x more resources with 3 replicas than with 1, with no throughput gain.

The Solution: Coordinated Leader Election (CLE)

Kubernetes v1.36 introduces Coordinated Leader Election, a mechanism that allows replicas to truly share work instead of duplicating it.

How It Works

CLE divides the event stream into "shards" based on a hash of objects. Each replica only processes the shards assigned to it:

Dynamic partitioning: Shards are distributed among active replicas. If a replica fails, its shards are redistributed to others.
Targeted watch: Each replica only establishes a watch on its assigned shards. The API server only sends relevant events to it.
Minimal shared state: Replicas communicate their state via Leases, without needing to synchronize complete cluster state.
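
The partitioning idea can be sketched in a few lines of Go. This is only an illustration: the FNV-1a hash, the round-robin assignment, and the names shardFor and assignShards are assumptions made for the example, not the hashing or assignment scheme CLE itself uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps an object key (for example "namespace/name") to a shard
// index in [0, shardCount). FNV-1a is an illustrative choice of hash.
func shardFor(objectKey string, shardCount int) int {
	h := fnv.New32a()
	h.Write([]byte(objectKey)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(shardCount))
}

// assignShards spreads shard indices round-robin over the active replicas.
// The replicas slice must not be empty.
func assignShards(shardCount int, replicas []string) map[string][]int {
	owned := make(map[string][]int, len(replicas))
	for shard := 0; shard < shardCount; shard++ {
		r := replicas[shard%len(replicas)]
		owned[r] = append(owned[r], shard)
	}
	return owned
}

func main() {
	replicas := []string{"ctrl-0", "ctrl-1", "ctrl-2"}
	owned := assignShards(12, replicas)

	key := "default/tenant-a"
	shard := shardFor(key, 12)
	fmt.Printf("%s -> shard %d, owned by %s\n", key, shard, replicas[shard%len(replicas)])
	fmt.Println("ctrl-0 owns shards", owned["ctrl-0"])
}
```

Because the shard index depends only on the object key and the shard count, every replica can compute it locally without coordinating with the others.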

Measured Gains

Tests on production clusters show significant results:

Metric         Before CLE (3 replicas)   After CLE (3 replicas)
Total CPU      300% baseline             110% baseline
Total memory   300% baseline             105% baseline
p99 latency    2.4s                      0.8s
Throughput     1x                        2.8x

Throughput nearly triples because replicas work in parallel on different objects instead of all processing the same events.

Implementing CLE in Your Controllers

Here's how to enable Coordinated Leader Election in your own controllers.

Prerequisites

Kubernetes cluster v1.36+
Controller-runtime v0.19+ (for Kubebuilder/Operator SDK-based controllers)
Feature gate CoordinatedLeaderElection=true enabled on the API server

Controller Configuration

For a controller-runtime based controller, configuration is done in the Manager:

package main

import (
    "log"
    "os"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/leaderelection"
)

func main() {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        LeaderElection:          true,
        LeaderElectionID:        "my-controller-leader",
        LeaderElectionNamespace: "my-namespace",

        // Enable CLE. Field names are as given in this guide; check them
        // against the controller-runtime release you build with.
        LeaderElectionConfig: leaderelection.Config{
            CoordinatedEnabled: true,
            ShardCount:         12, // Number of shards
            Identity:           os.Getenv("POD_NAME"), // must be unique per replica
        },
    })
    if err != nil {
        log.Fatal(err)
    }

    // ... rest of config
}

Choosing the Number of Shards

The shard count determines partitioning granularity:

Too few shards: Replicas can't distribute work evenly. 4 shards for 5 replicas means one replica stays idle.
Too many shards: Coordination overhead. Each shard requires a Lease, and reassignments become more frequent.

Rule of thumb: use 3x the maximum planned replica count. For a controller that can scale to 5 replicas, 15 shards is a good starting point.
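
To see why the ratio matters, here is a small sketch that counts how many shards each replica ends up owning under an even spread. The helper names shardsPerReplica and recommendedShards are made up for this example, and the even round-robin spread is an assumption about how shards are balanced.

```go
package main

import "fmt"

// shardsPerReplica returns how many shards each replica owns when shardCount
// shards are spread as evenly as possible over replicaCount replicas.
func shardsPerReplica(shardCount, replicaCount int) []int {
	counts := make([]int, replicaCount)
	for shard := 0; shard < shardCount; shard++ {
		counts[shard%replicaCount]++
	}
	return counts
}

// recommendedShards applies the 3x rule of thumb from the text.
func recommendedShards(maxReplicas int) int {
	return 3 * maxReplicas
}

func main() {
	fmt.Println(shardsPerReplica(4, 5))                     // [1 1 1 1 0]: one replica is idle
	fmt.Println(shardsPerReplica(recommendedShards(5), 5)) // [3 3 3 3 3]
}
```

With 15 shards, losing one of the 5 replicas leaves 3 orphaned shards to hand out among 4 survivors, so the extra load stays small per replica.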

Failover Handling

CLE automatically handles failures:

Detection: A replica that doesn't renew its Lease is considered failed after the LeaseDuration (default: 15 seconds).
Reassignment: The failed replica's shards are reassigned to remaining replicas.
Reconciliation: New shard owners trigger a full reconciliation of affected objects.

Total failover time is generally under 20 seconds, comparable to classic leader election.
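
The detection and reassignment steps can be modelled as follows. This is a simplified sketch, not CLE's implementation: leaseExpired and reassign are hypothetical helpers, and the "give each orphaned shard to the least-loaded survivor" policy is an assumption chosen for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// leaseExpired reports whether a replica has missed its renewal window.
func leaseExpired(lastRenew, now time.Time, leaseDuration time.Duration) bool {
	return now.Sub(lastRenew) > leaseDuration
}

// reassign moves the shards of a failed replica to the survivors, always
// picking the survivor that currently owns the fewest shards.
func reassign(owned map[string][]int, failed string) {
	orphans := owned[failed]
	delete(owned, failed)

	survivors := make([]string, 0, len(owned))
	for name := range owned {
		survivors = append(survivors, name)
	}
	if len(survivors) == 0 {
		return // nobody left to take over
	}
	sort.Strings(survivors) // deterministic tie-breaking

	for _, shard := range orphans {
		target := survivors[0]
		for _, name := range survivors[1:] {
			if len(owned[name]) < len(owned[target]) {
				target = name
			}
		}
		owned[target] = append(owned[target], shard)
	}
}

func main() {
	now := time.Now()
	lastRenew := now.Add(-16 * time.Second) // last renewal was 16s ago

	owned := map[string][]int{
		"ctrl-0": {0, 3},
		"ctrl-1": {1, 4},
		"ctrl-2": {2, 5},
	}
	if leaseExpired(lastRenew, now, 15*time.Second) {
		reassign(owned, "ctrl-2")
	}
	fmt.Println(owned)
}
```

After reassignment, the new owners would still need to reconcile every object in the shards they inherited, which is the third step above.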

Advanced Use Cases

Multi-Cluster with Coordinated Leader Election

For deployments that span several zones or clusters, CLE can be combined with a topology-aware sharding strategy:

LeaderElectionConfig: leaderelection.Config{
    CoordinatedEnabled: true,
    ShardCount:         24,
    ShardingStrategy:   leaderelection.TopologyAwareSharding{
        TopologyKey: "topology.kubernetes.io/zone",
    },
}

This configuration distributes shards by zone, reducing network latency between the controller and the objects it manages.

Monitoring and Observability

CLE metrics are exposed via the standard /metrics endpoint:

controller_leader_election_shard_count: Number of shards assigned to this replica
controller_leader_election_shard_transitions_total: Number of shard reassignments
controller_leader_election_shard_reconcile_latency_seconds: Post-failover reconciliation latency

Integrate these metrics into your Prometheus/Grafana dashboards to track sharding health.

Migrating from Classic Leader Election

If you have existing controllers using standard leader election, here's how to migrate to CLE.

Step 1: Check Compatibility

CLE requires your controller to be idempotent and stateless between reconciliations. If your controller maintains an in-memory cache that doesn't rebuild from watches, CLE won't work correctly.

Step 2: Test in Staging

Deploy your controller with CLE enabled in a test cluster. Simulate replica failures and verify objects are correctly reconciled after reassignment.

Step 3: Progressive Deployment

In production, use a canary deployment:

Deploy a new ReplicaSet with CLE enabled
Let this ReplicaSet take over a fraction of the shards
Monitor metrics for 24-48h
If stable, migrate completely

Points of Attention

Event ordering: With CLE, event order is only guaranteed per shard. If your logic depends on global ordering, it needs adaptation.
Lease resources: CLE creates additional Lease objects. Ensure your RBAC allows Lease creation.

Best Practices for Large-Scale Clusters

CLE is particularly useful for large clusters (over 1,000 nodes). Here are some recommendations:

Sizing Replicas

The recommended formula is:

Number of replicas = ceil(Managed objects / 2000)

A controller managing 10,000 Custom Resources should have 5 replicas with CLE enabled.
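
The formula is easy to encode. In the sketch below, replicasFor is a hypothetical helper; the floor of one replica and the pairing with the 3x shard rule of thumb (treating the result as the maximum planned replica count) are assumptions added for the example.

```go
package main

import "fmt"

// objectsPerReplica is the divisor from the sizing formula.
const objectsPerReplica = 2000

// replicasFor returns ceil(managedObjects / 2000), with a floor of 1 so that
// small deployments still get a replica.
func replicasFor(managedObjects int) int {
	if managedObjects <= 0 {
		return 1
	}
	return (managedObjects + objectsPerReplica - 1) / objectsPerReplica
}

func main() {
	for _, n := range []int{800, 6000, 10000} {
		r := replicasFor(n)
		fmt.Printf("%d objects -> %d replicas, %d shards\n", n, r, 3*r)
	}
}
```

For 10,000 objects this yields 5 replicas and 15 shards, matching the figures used elsewhere in this guide.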

Configuring Resources

With CLE, each replica consumes about 1/N of the resources of a replica without CLE (where N is the replica count). Adjust your requests/limits accordingly.
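
As a starting point for new requests, the 1/N estimate can be turned into arithmetic. The helper perReplicaMilliCPU is hypothetical, and the overhead percentage is an assumption: 110 mirrors the total CPU figure reported under Measured Gains, but your own measurements should replace it.

```go
package main

import "fmt"

// perReplicaMilliCPU estimates a per-replica CPU request in millicores.
// baselineMilliCPU is what one replica needed without CLE, and
// overheadPercent accounts for coordination cost (100 means no overhead).
// The result is rounded up.
func perReplicaMilliCPU(baselineMilliCPU, overheadPercent, replicas int) int {
	total := baselineMilliCPU * overheadPercent
	divisor := 100 * replicas
	return (total + divisor - 1) / divisor
}

func main() {
	// A replica that needed 1400m without CLE, now split across 3 replicas.
	fmt.Println(perReplicaMilliCPU(1400, 110, 3)) // 514
}
```

Treat the result as an initial request only, and leave headroom in the limit for the period after a failover, when a replica temporarily owns more shards.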

Avoiding Thundering Herd

After a mass failover (for example, all replicas restarting at once), every shard is reconciled simultaneously. Configure a RateLimiter so that failed reconciliations back off exponentially instead of overwhelming the API server:

RateLimiter: workqueue.NewItemExponentialFailureRateLimiter(
    100*time.Millisecond,  // base delay
    30*time.Second,        // max delay
),
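
To make the effect of those two parameters concrete, the following standard-library sketch reproduces the delay schedule such a limiter is generally understood to apply per item: the base delay doubled for each consecutive failure, capped at the maximum. The backoff function is a stand-in written for this example, not the workqueue implementation.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 100 * time.Millisecond
	maxDelay  = 30 * time.Second
)

// backoff returns the wait before the next retry of an item that has already
// failed `failures` times: base * 2^failures, capped at limit.
func backoff(base, limit time.Duration, failures int) time.Duration {
	d := base
	for i := 0; i < failures && d < limit; i++ {
		d *= 2
	}
	if d > limit {
		return limit
	}
	return d
}

func main() {
	for _, n := range []int{0, 1, 2, 5, 9} {
		fmt.Printf("after %d failures wait %v\n", n, backoff(baseDelay, maxDelay, n))
	}
}
```

With these values the delay grows from 100ms to 3.2s after five failures and reaches the 30s cap after nine. Note that this kind of limiter only slows down retries of failing items; it does not cap the overall request rate.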

Real-World Migration: A Step-by-Step Case Study

To make the migration tangible, here is how a SaaS company managing a custom controller for tenant provisioning made the switch on a 3,800-node cluster running 6,000 Custom Resources.

Week 1: Baseline measurement. The team instrumented the existing 3-replica controller with detailed Prometheus metrics. Baseline numbers: 4.2 CPU cores and 9 GB RAM consumed across the three replicas, with p99 reconciliation latency at 3.1 seconds during peak load. They captured a full week of data to account for daily and weekly traffic patterns.

Week 2: Code adaptation. Two changes were needed. First, the controller's in-memory cache (which assumed it saw every event) was refactored to rebuild from watches on startup instead of from a snapshot. Second, RBAC was extended to allow Lease creation in the controller's namespace. Total code diff: 87 lines, plus 140 lines of new tests.

Week 3: Staging validation. The team deployed to a smaller staging cluster (200 nodes, 800 CRs) with CLE enabled. They simulated three failure modes: graceful pod termination, abrupt OOM kill, and network partition between the controller and the API server. All three recovered within 18 seconds, well under the team's 60-second SLO.

Week 4: Canary in production. A new ReplicaSet with CLE enabled was deployed alongside the existing one, taking 25% of shards initially. After 48 hours of stable metrics with no reconciliation errors, the cutover was completed and the old ReplicaSet retired.

Results after migration: 1.6 CPU cores total (down 62%), 3.4 GB RAM (down 62%), p99 latency at 0.9 seconds, and throughput at 2.7x baseline. The migration paid for itself in cloud savings within six weeks.

Impact on Cloud Costs

Efficient horizontal controller scaling has a direct impact on your infrastructure costs.

Before CLE: Scaling a controller from 1 to 3 replicas triples your compute costs for that controller, without improving performance.

With CLE: Scaling from 1 to 3 replicas increases costs by about 10-15% while nearly tripling throughput (2.8x in the tests above).

For a company managing multiple Kubernetes clusters, savings can represent tens of thousands of dollars per month.

At Claro Digital, we help businesses optimize their cloud infrastructure. Adopting Kubernetes v1.36 and CLE is part of the recommendations we integrate into our infrastructure audits.

Conclusion

Kubernetes v1.36 finally brings a native solution to the horizontal controller scaling problem. Coordinated Leader Election enables true work distribution among replicas, reducing costs and improving performance.

If you manage large-scale clusters or high-volume custom controllers, migrating to CLE should be a priority. The resource and latency gains more than justify the migration effort.

FAQ

Is Coordinated Leader Election backward compatible?

Yes. If you don't enable the CoordinatedLeaderElection feature gate, your controllers will continue using classic leader election. Migration is opt-in.

Can I use CLE with existing controllers without code changes?

No. The controller code must be updated to configure CLE in the Manager. However, if you use controller-runtime, the changes are minimal (a few lines of configuration).

How many shards should I configure?

Rule of thumb: 3x the maximum planned replica count. For 5 max replicas, use 15 shards. Too many shards create coordination overhead; too few limit how evenly work can be distributed.

Does CLE work with multi-region clusters?

Yes, but with caveats. CLE uses Leases stored in the API server. If your replicas are geographically distributed, Lease renewal latency can cause false-positive failovers. Use TopologyAwareSharding to distribute shards by zone and keep each replica close to the objects it manages.

What are the prerequisites for enabling CLE?

Kubernetes cluster v1.36+, controller-runtime v0.19+, feature gate CoordinatedLeaderElection=true on the API server, and an idempotent controller without shared state between reconciliations.
