## Summary

During replica scale-down (`replicasCount: 2 → 1`), the operator executes `SYSTEM DROP REPLICA` before the deleted replica's Keeper session has expired. The command fails with `Code: 305, Can't drop replica ... because it's active`, and the operator does not retry. This leaves stale replica metadata (znodes) in ClickHouse Keeper permanently.
## Environment

- Operator version: 0.26.0
- ClickHouse version: 24.8 (Altinity Stable)
- Database engine: `Atomic` with `ReplicatedMergeTree` tables (38 tables)
- Keeper `session_timeout_ms`: 120000 (CHI spec)
- Cluster layout: 2 shards × 2 replicas
## Steps to Reproduce

- Deploy a CHI with `replicasCount: 2`
- Scale down to `replicasCount: 1`
- Observe operator logs
## Observed Behavior

Kubernetes resource deletion (Pod, StatefulSet, Service) completes normally. Then:
```
07:03:43Z Drop replica: chi-dev-botmanager-bmch-0-1 at 0-0
07:03:43Z FAILED to drop replica on host 0-1, err: Code: 305, Can't drop replica ..., because it's active
07:03:44Z Drop replica: chi-dev-botmanager-bmch-1-1 at 1-0
07:03:44Z FAILED to drop replica on host 1-1, err: Code: 305, Can't drop replica ..., because it's active
07:03:44Z FAILED single try. No retries will be made
07:03:44Z processed replicas: 2
```
Timeline: Pods deleted at 07:02:08Z–07:02:18Z, `DROP REPLICA` attempted at 07:03:43Z — approximately 86–95 seconds after pod termination, which is less than the configured `session_timeout_ms` of 120000.

Result: After reconcile completes, the deleted replicas become inactive in `system.replicas`, but their znodes remain in Keeper:
```sql
SELECT path, groupArray(name) FROM system.zookeeper
WHERE path LIKE '%/replicas'
  AND path LIKE '%botmanager%'
GROUP BY path;
-- Both deleted replica names (bmch-0-1, bmch-1-1) still present as children
-- 38 stale paths per deleted replica (one per ReplicatedMergeTree table)
```
`system.replicas` shows `total_replicas = 2` instead of the expected `1`.
## Root Cause

In `pkg/controller/chi/worker-deleter.go`, `dropZKReplicas()` calls `dropZKReplica()` for each removed host. The underlying `HostDropReplica()` in `pkg/model/chi/schemer/schemer.go` uses `SetRetry(false)`:

```go
// Single attempt, no retry on failure
opts.SetRetry(false)
```

There is no wait for the replica to become inactive before executing `SYSTEM DROP REPLICA`: the regular scale-down path in `dropZKReplicas` iterates removed hosts and immediately attempts the drop.

`reconcile.host.wait.replicas.delay` does not apply to this code path — it controls replication lag wait, not post-deletion session cleanup.
## Expected Behavior

The operator should either:

- Wait for `replica_is_active = 0` (poll `system.replicas`) before executing `SYSTEM DROP REPLICA`, or
- Retry on `Code: 305` (`because it's active`) with backoff, at least for `session_timeout_ms` plus a buffer
## Workaround

Reducing `session_timeout_ms` from 120000 to 60000 in the CHI spec narrows the race window enough that the operator's drop attempt lands after session expiry.
We verified this with live experiments on a dev cluster:
| `session_timeout_ms` | `DROP REPLICA` result |
|---|---|
| 120000 | ❌ `because it's active` (reproduced twice) |
| 60000 | ✅ Success (verified twice consecutively) |
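For reference, the reduced timeout is set in the CHI spec's ZooKeeper/Keeper section. A sketch, assuming the standard `spec.configuration.zookeeper` layout of the CHI resource (host name is a placeholder):

```yaml
spec:
  configuration:
    zookeeper:
      nodes:
        - host: keeper.example.svc   # placeholder Keeper endpoint
      session_timeout_ms: 60000      # down from 120000 to narrow the race window
```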
For cases where the race is still hit, manual cleanup is required:

```sql
-- Run on the surviving replica of each shard
SYSTEM DROP REPLICA 'chi-dev-botmanager-bmch-0-1';
SYSTEM DROP REPLICA 'chi-dev-botmanager-bmch-1-1';
```
## Related
(For shard removal, `shardFunc` is a no-op, whereas replica removal does invoke `hostFunc` but fails due to this timing issue.)