Added proposal for auto-rebalance on imbalanced cluster feature in operator by ShubhamRwt · Pull Request #211 · strimzi/proposals

ShubhamRwt · 2026-03-30T07:48:19Z

This PR aims to introduce the self-healing feature in Strimzi. This proposal contains all the comments and suggestion left on the old proposal . This proposal aim to utilize the auto-rebalancing feature of Strimzi to introduce the self healing.

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

scholzj · 2026-04-01T12:38:27Z

@ShubhamRwt So, is this a draft? Or is it ready for review? It is not completely clear as the PR is not a Draft but you did not requested the review from all maintainers but only from two specific people.

ShubhamRwt · 2026-04-01T12:53:29Z

@scholzj It is ready for review, Sorry I didn't tagged everyone

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

scholzj

It is not clear what the lifecycle of the KafkaRebalance resources is
What is the impact of maintenance time windows?
What is the testing strategy?

ppatierno · 2026-04-07T13:09:32Z

+
+#### What happens if some rebalance fails:
+
+With the new `imbalance` mode, we will be introducing two new states to the FSM called `RebalanceOnImbalance` and `RebalanceOnImbalanceNotComplete`


Why do we need an additional state for failed rebalance? IIRC there is no such a specific state for failures with auto-rebalancing on scale up/down. Or am I wrong?

I think it was me overthinking that in case some failure happens and it might need a human intervention and we might not need the auto rebalance to happen again and again but I think this is not really required. I will remove it

So what state will it move to if there is an error? Idle?

Will the generate KR CR move to NotReady (or whatever the fail state for that is)? How will a human operator know this happened?

The generated KR Cr would move to NotReady, yes but I think the user can check the logs but I dont think there is any other thing

Is the answer clear based on my previous reply?

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

egyedt · 2026-04-16T12:12:58Z

+To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
+To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
+We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
+We will not enable the other anomaly detection classes, related to goal violations, that would require manual interventions at the infrastructure level such as disk or broker failures.


I think it would be nice, if we make events or WARN logs in Strimzi if there are other type of anomalies around Kafka cluster, which cannot be automatically fixed. WDYT?

AFAIK disk and broker failures would surface through the Strimzi Kafka CR status and normal metrics monitoring channels so I am not sure having CC also report that is particularly useful?

If it is reported in "other channels" then it is fine to me!

I guess that depends on what disk and broker failures really are. I do not think disk failures are necessarily something the operator would notice? Broker failures - if that means the broker is unresponsive - then it would actually fail the reconciliation I think. But if the broker accepts connections but doesn't for example replicate any data, we would not notice it.

I think all those situations are better surfaced through monitoring the appropriate metrics, that is all CC is doing anyway.

egyedt · 2026-04-16T12:17:17Z

+* `remove-brokers` - auto-rebalancing on scale down
+
+To leverage the automated rebalance on imbalanced clusters (those with detected goal violations), we will be introducing a new mode to the auto-rebalancing feature.
+The new mode will be called `imbalance`, which means that cluster imbalance was detected and rebalancing should be applied to all the brokers.


nit: I think we should call it rebalance or rebalance-broker instead of imbalance
I prefer rebalance or rebalance-broker, since it will match with CC terminology and also add-brokers, remove-brokers have a verb, while "imbalance" is just an adjective. So to match the current convention better.

I disagree, to me, the mode is the reason for the auto-rebalance being triggered.

In which case we might want to make it past-tense: "imbalanced"

I disagree, to me, the mode is the reason for the auto-rebalance being triggered.

I also disagree with you, if it is the reason, then why we call it 'mode'? It should be called 'reason' then!
In this current form, for me, the rebalance-brokers would match the convention created by add-brokers and remove-brokers...

But all the modes are rebalancing the brokers?

The "mode" terminology is coming from the KafkaRebalance spec where you specify which mode you want to use for rebalancing: full, add-brokers, remove-brokers or remove-disks. All these modes maps to the corresponding CC endpoints. Maybe we had a discussion about a good naming long time ago about what to use in the auto-rebalancing and we ending with the same because the mapping was one to one. I can see @egyedt point where, within the auto-rebalancing, we are actually specifying "when I should trigger an auto-rebalancing?" on adding brokers, on removing brokers or ... whenever there is an anomaly. What the current proposal is going to do underneath in terms of KafkaRebalance is using the "full" mode which is using the /rebalance endpoint (full because using ALL brokers within the cluster). Maybe we can stick with "full" instead of "imbalance" for now? Or we should change the field name but it needs deprecating this and adding a new one.

I still see this is an ongoing discussion - Do we plan to use imbalance or full mode?

My take always was that it is the action we are going to take:

Add brokers

Remove brokers

Not the cause or event. For me, the imbalance is maybe not perfect (I think I originally preferred something else), but I think it is understandable for all users what it means and describes well the intent. full on the other hand is meaningless for me and makes it hard to imagine when it will be done.

While for the other two modes the type is the same as the mode in KafkaRebalance, for them it works fine to describe everything. full is fine as a mode in KafkaRebalance. It says what will happen. But in the Kafka CR autorebalance section, the keyword should focus on the intent and when it will happen. full does not explain that. imbalance does.

Ok I am fine with sticking with imbalance.

egyedt · 2026-04-16T12:27:54Z

+This could cause potential conflicts with other administration operations and is the primary reason self-healing has been disabled until now.
+To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
+To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
+We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.


Maybe for some anomalies, the call of the fix-offline-replicas endpoint can be useful!
Beside the usage of rebalance endpoint.

How would we decide which anomalies to call the different endpoint for?

There should be some logic about this in Strimzi if we decide to use fix-offline-replicas endpoint too. Otherwise we can use only the rebalance endpoint, it is totally fine to me. I just mentioned here as an interesting idea.

I didn't know about this /fix_offline_replicas endpoint at all. If there is a specific anomaly raised about it, maybe it could make sense to use it but it's not trivial. For how the rebalance operator works today, Each CC endpoint call is mappend on a corresponding "mode" within the KafkaRebalance, so hitting two endpoints would mean creating two KafkaRebalance: one using the /rebalance endpoint and another one using the /fix_offline_replicas (which is anyway not supported now).

tomncooper

I have had a pass. I left a few comments.

As other reviewers have identified, the main point is checking if a rebalance (manual or automatic) is ongoing before you do anything else.

For auto rebalances if seems you can check the state machine, for manual you could check CC directly or the status of the KR CR. If a manual KR CR already exists then something is going to check it's status with CC, depending on the ordering of those operations it may happen before you need to do your checks, in which case you could use that. If it is after or ordering can't be guaranteed then you might be best querying CC directly.

tomncooper · 2026-04-16T15:57:02Z

+This could cause potential conflicts with other administration operations and is the primary reason self-healing has been disabled until now.
+To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
+To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
+We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.


How would we decide which anomalies to call the different endpoint for?

tomncooper · 2026-04-16T15:58:32Z

+To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
+To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
+We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
+We will not enable the other anomaly detection classes, related to goal violations, that would require manual interventions at the infrastructure level such as disk or broker failures.


AFAIK disk and broker failures would surface through the Strimzi Kafka CR status and normal metrics monitoring channels so I am not sure having CC also report that is particularly useful?

tomncooper · 2026-04-16T16:01:03Z

+* `remove-brokers` - auto-rebalancing on scale down
+
+To leverage the automated rebalance on imbalanced clusters (those with detected goal violations), we will be introducing a new mode to the auto-rebalancing feature.
+The new mode will be called `imbalance`, which means that cluster imbalance was detected and rebalancing should be applied to all the brokers.


I disagree, to me, the mode is the reason for the auto-rebalance being triggered.

tomncooper · 2026-04-16T16:33:04Z

+
+#### What happens if some rebalance fails:
+
+With the new `imbalance` mode, we will be introducing two new states to the FSM called `RebalanceOnImbalance` and `RebalanceOnImbalanceNotComplete`


So what state will it move to if there is an error? Idle?

tomncooper · 2026-04-16T16:33:55Z

+
+#### What happens if some rebalance fails:
+
+With the new `imbalance` mode, we will be introducing two new states to the FSM called `RebalanceOnImbalance` and `RebalanceOnImbalanceNotComplete`


Will the generate KR CR move to NotReady (or whatever the fail state for that is)? How will a human operator know this happened?

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

egyedt · 2026-05-06T13:50:02Z

Is there an estimate when this PR will be ready?
Is there a lot of open questions?
If help is needed with the questions then feel free to answer me in this thread and I will try to help!

ShubhamRwt · 2026-05-06T13:56:34Z

@egyedt Hi, I pushed the suggestions yesterday. I think this proposal is ready for next set of reviews. I hope the new suggestions fixes the open questions

ShubhamRwt · 2026-05-06T13:57:17Z

@scholzj @ppatierno @tomncooper can you guys have another pass at this, please. Thankyou

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

scholzj

I sitll don't see any considerations ont he maintenance time window which I think I raised before. It would be good to have it included in the proposal (whether it will follow the window, or if not, why not)

scholzj · 2026-05-19T07:32:43Z

+## Motivation
+
+Currently, if the cluster is imbalanced, the user would need to manually rebalance the cluster by using the `KafkaRebalance` custom resource.
+Every cluster requires manual intervention to trigger rebalancing when imbalances are detected, regardless of its size.


This sentence sounds strange. I would maybe remove it. There are many people who live without rebalancing all their life. Especially if you have small cluster with 3 brokers and use replication factor 3, there is actually not much to rebalance.

I guess that Shubham meant to say something like "if your cluster is unbalanced and you want to do some rebalancing, the only way you have today is the manual one, there is no automation which is the goal for this proposal". I don't think he meant to say that all clusters of any size needs rebalancing. Maybe he can clarify the meaning.

Yes, I assume so, but we should clarify it.

I meant that with respect to Strimzi, yes, If we want to do any rebalance due to imbalance then we have to do it manually and there is now way for auto rebalance on imbalance mechanism.

scholzj · 2026-05-19T17:09:23Z

+* `remove-brokers` - auto-rebalancing on scale down
+
+To leverage the automated rebalance on imbalanced clusters (those with detected goal violations), we will be introducing a new mode to the auto-rebalancing feature.
+The new mode will be called `imbalance`, which means that cluster imbalance was detected and rebalancing should be applied to all the brokers.


My take always was that it is the action we are going to take:

Add brokers

Remove brokers

Not the cause or event. For me, the imbalance is maybe not perfect (I think I originally preferred something else), but I think it is understandable for all users what it means and describes well the intent. full on the other hand is meaningless for me and makes it hard to imagine when it will be done.

While for the other two modes the type is the same as the mode in KafkaRebalance, for them it works fine to describe everything. full is fine as a mode in KafkaRebalance. It says what will happen. But in the Kafka CR autorebalance section, the keyword should focus on the intent and when it will happen. full does not explain that. imbalance does.

scholzj · 2026-05-21T12:23:43Z

+4. If the key is not present or empty, the operator uses the default anomaly detection goals defined in `CruiseControlConfiguration.CRUISE_CONTROL_DEFAULT_ANOMALY_DETECTION_GOALS_LIST`:
+   - `RACK_AWARENESS_GOAL`
+   - `MIN_TOPIC_LEADERS_PER_BROKER_GOAL`
+   - `REPLICA_CAPACITY_GOAL`
+   - `DISK_CAPACITY_GOAL`


So, are these goals what is in CRUISE_CONTROL_DEFAULT_ANOMALY_DETECTION_GOALS_LIST? It is a bit confusing what the list of goals means here.

Yes, Shubham is listing what is in the CRUISE_CONTROL_DEFAULT_ANOMALY_DETECTION_GOALS_LIST. Would you suggest to remove the list and referencing to the constant definition only (within the code) or something else? Just trying to understand what "confusing" means here.

I would probably just reference the list. Or one can do something like to make it more clear...

4. If the key is not present or empty, the operator uses the default anomaly detection goals defined in `CruiseControlConfiguration.CRUISE_CONTROL_DEFAULT_ANOMALY_DETECTION_GOALS_LIST`. Currently, the list contains: - ...

I will try to make it simpler if it sounds confusing

scholzj · 2026-05-21T12:26:04Z

+The `KafkaRebalance` has 4 modes: `full`, `add-broker`, `remove-broker` and `remove-disks` mode.
+The `imbalance` mode will be mapped to the `full` mode in the generated KafkaRebalance resource which means that the generated `KafkaRebalance` custom resource will have the mode set as `full` which within the Strimzi rebalancing operator means calling the Cruise Control API to run a rebalancing taking all brokers into account.
+
+The generated `KafkaRebalance` custom resource will be called `<my-cluster-name>-auto-rebalancing-goal-violation`, where the `<my-cluster-name>` part comes from the `metadata.name` in the `Kafka` custom resource, `goal-violation` means that a goal related imbalance was detected in the cluster.


Should the mode name be in the name? E.g. <my-cluster-name>-auto-rebalancing-imbalance? That will make it easier to understand for the user.

I named it <my-cluster-name>-auto-rebalancing-goal-violation since we only support goal violations for now. I think if we think about future like have fix for topic anomalies etc then it would be great to have distinct names based on the anomaly type.

Isn't imbalance the distinct type you can name it after?

It sound good too but I wanted to name it as per the anomaly. (since we are handling goal violations so <my-cluster-name>-auto-rebalancing-goal-violation but I am open to change the names.

@ppatierno @tomncooper any preference from your side?

scholzj · 2026-05-21T12:26:53Z

+  # ... other rebalancing related configuration
+```
+
+The operator also sets a finalizer, named `strimzi.io/auto-rebalancing`, on the generated `KafkaRebalance` custom resource.


Do we use this also for adding and removing brokers? Worth mentioning here whether it is the same pattern or not.

Yeah that's the same.

Yes, thats the same. I will add a sentence to it

scholzj · 2026-05-23T09:35:13Z

+When the `KafkaRebalanceAssemblyOperator` reconciles a KafkaRebalance resource and detects a transition to a terminal state:
+1. Extracts the cluster name from the KafkaRebalance labels
+2. Retrieves the corresponding Kafka CR
+3. Updates `Kafka.status.lastRebalanceCompletionTime` to the current timestamp
+4. Patches the Kafka CR status


No, this would not work and cause problems with the operators writing over each other and causing conflicts. You should follow the principle to write from one place and read from the other.

I agree with Jakub: the KafkaAssemblyOperator is the owner of Kafka CR and it should be the only one to update the Kafka CR status (note: the KafkaAutoRebalanceReconciler can do that because it's part of such operator and not an operator itself). On the other side the KafkaRebalanceAssemblyOperator owns KafkaRebalance and it can't write into Kafka CR. I can't see an easy solution tough.
The KafkaAssemblyOperator having a watch on KafkaRebalance resources to detect when one ends in a terminal state to update the last rebalance time? :-/ Doing the same but poll-based on each reconciliation? :-/ (but we could lose a KafkaRebalance that was deleted by the user right after the rebalancing operation).

The obvious option is having some Config Map 🤷. I also wonder if it would make long term make sense to drop the `KafkaRebalanceAssemblyOperator and integrate the logic here. But I'm not sure I fully understand all the possible consequences. Mainly whether the fact that all rebalances would be done only if we get to the end of the reconciliation could cause any issues.

scholzj · 2026-05-23T13:55:15Z

+**Important Edge Case:** If a goal violation contains BOTH fixable goals and unfixable goals (i.e., both `fixableViolatedGoals` and `unfixableViolatedGoals` lists are non-empty), the operator will NOT trigger a rebalance. This follows Cruise Control's design principle that unfixable goals typically indicate infrastructure constraints (e.g., insufficient racks for rack-awareness, insufficient brokers for replica capacity) that cannot be resolved through partition reassignment alone. Attempting to rebalance only the fixable goals while unfixable goals remain would result in an incomplete fix and could lead to repeated rebalance attempts.
+
+The operator can retrieve the unfixable goals from the `unfixableViolatedGoals` list in the JSON response.
+When unfixable goals are detected, the operator will set a warning condition in the status of the `Kafka` CR to alert users that manual intervention is required:


So, will the unfixable goals persist in Cruise Control? Or will the warning disappear right in the next reconciliations?

The later one is true but since the anomalies are unfixabale we can expect the error to pop up again once CC detects it again

So, if the warning is not stable and we cannot guarantee that we can maintain it in every reconciliation because CC raises it right away, you should consider just logging it instead of using the conditions.

scholzj · 2026-05-23T13:55:45Z

+  conditions:
+    - type: Warning
+      status: "True"
+      reason: UnfixableViolatedGoal


This should probably have some better name to properly relate to the feature it is about.

scholzj · 2026-05-23T14:00:24Z

+- **Frequent rebalances**: More than 3 auto-rebalances triggered within 1 hour may indicate a persisting anomaly
+- **Repeated goal violations**: Same goal violation detected multiple times in a short time window (e.g., >2 times in 30 minutes)
+- **Rebalance failures**: Unfixable goals detected or repeated NotReady states requiring manual intervention
+


Will we add these to the alerts we ship?

I think it would be better to let users to create their own alerts as they can then choose how and when they want to trigger the alerts as per their cluster but we can definitely ship some example for the same

scholzj · 2026-05-23T14:02:06Z

+## Future Scope
+In the future, we plan to introduce auto-rebalance for topic-related and metric-related imbalances.


Suggested change

## Future Scope

In the future, we plan to introduce auto-rebalance for topic-related and metric-related imbalances.

## Future Scope

In the future, we plan to introduce auto-rebalance for topic-related and metric-related imbalances.

ShubhamRwt · 2026-05-24T05:32:26Z

@scholzj thanks for the reviews, I will answer them in a while. I was just thinking about maintenance window. I don't think we really need a maintenance window. I would assume that we would like to have it such that if the users can avoid self healing for some time whey they are going some ugrades/ repairs to their cluster. But I think if they want to avoid it for some time, they can just disable the feature and enable it back once they think they are good to go.

scholzj · 2026-05-24T17:04:33Z

@scholzj thanks for the reviews, I will answer them in a while. I was just thinking about maintenance window. I don't think we really need a maintenance window. I would assume that we would like to have it such that if the users can avoid self healing for some time whey they are going some ugrades/ repairs to their cluster. But I think if they want to avoid it for some time, they can just disable the feature and enable it back once they think they are good to go.

@ShubhamRwt I personally think hooking it into maintenance time window makes sense. It is not about upgrades - you would do those IN the maintenance window in the first place. It is not about repairs - you would do those either in the maintenance window or in an emergency.

It is about for example about reducing disruption caused by the rebalancing in peak hours of your business. That is where you do not want to do cert renewals. And where you possibly also do not want to do rebalances. Sure, you can manually switch it off. But I do not think it works here. If you tell someone to "switch off" the auto-rebalancing every morning and switch it on every evening when going home they will never use it.

In any case, one way or another, this should be part of the proposal.

ppatierno · 2026-05-25T10:19:37Z

+## Motivation
+
+Currently, if the cluster is imbalanced, the user would need to manually rebalance the cluster by using the `KafkaRebalance` custom resource.
+Every cluster requires manual intervention to trigger rebalancing when imbalances are detected, regardless of its size.


I guess that Shubham meant to say something like "if your cluster is unbalanced and you want to do some rebalancing, the only way you have today is the manual one, there is no automation which is the goal for this proposal". I don't think he meant to say that all clusters of any size needs rebalancing. Maybe he can clarify the meaning.

ppatierno · 2026-05-25T10:29:52Z

+* `remove-brokers` - auto-rebalancing on scale down
+
+To leverage the automated rebalance on imbalanced clusters (those with detected goal violations), we will be introducing a new mode to the auto-rebalancing feature.
+The new mode will be called `imbalance`, which means that cluster imbalance was detected and rebalancing should be applied to all the brokers.


Ok I am fine with sticking with imbalance.

ppatierno · 2026-05-25T10:34:28Z

+4. If the key is not present or empty, the operator uses the default anomaly detection goals defined in `CruiseControlConfiguration.CRUISE_CONTROL_DEFAULT_ANOMALY_DETECTION_GOALS_LIST`:
+   - `RACK_AWARENESS_GOAL`
+   - `MIN_TOPIC_LEADERS_PER_BROKER_GOAL`
+   - `REPLICA_CAPACITY_GOAL`
+   - `DISK_CAPACITY_GOAL`


Yes, Shubham is listing what is in the CRUISE_CONTROL_DEFAULT_ANOMALY_DETECTION_GOALS_LIST. Would you suggest to remove the list and referencing to the constant definition only (within the code) or something else? Just trying to understand what "confusing" means here.

ppatierno · 2026-05-25T10:35:56Z

+  - Users must correct the template configuration to resume imbalance auto-rebalancing
+
+The `KafkaRebalance` has 4 modes: `full`, `add-broker`, `remove-broker` and `remove-disks` mode.
+The `imbalance` mode will be mapped to the `full` mode in the generated KafkaRebalance resource which means that the generated `KafkaRebalance` custom resource will have the mode set as `full` which within the Strimzi rebalancing operator means calling the Cruise Control API to run a rebalancing taking all brokers into account.


Suggested change

The `imbalance` mode will be mapped to the `full` mode in the generated KafkaRebalance resource which means that the generated `KafkaRebalance` custom resource will have the mode set as `full` which within the Strimzi rebalancing operator means calling the Cruise Control API to run a rebalancing taking all brokers into account.

The `imbalance` mode will be mapped to the `full` mode in the generated `KafkaRebalance` resource which means that the generated `KafkaRebalance` custom resource will have the mode set as `full` which within the Strimzi rebalancing operator means calling the Cruise Control API to run a rebalancing taking all brokers into account.

ppatierno · 2026-05-25T10:37:13Z

+  # ... other rebalancing related configuration
+```
+
+The operator also sets a finalizer, named `strimzi.io/auto-rebalancing`, on the generated `KafkaRebalance` custom resource.


Yeah that's the same.

ppatierno · 2026-05-25T10:43:06Z

+Talking about the priority of rebalance operations, scale down rebalance operations have the highest priority, followed by scale up, and then rebalance on imbalance.
+The priority ordering is designed to prevent wasted work and optimize resource usage:
+- **Scale-down has highest priority**: Moving partition replicas to brokers that will soon be removed wastes resources and time, so scale-down operations must complete first
+- **Scale-up has second priority**: Quickly utilizing new capacity is important for cluster stability
+- **Imbalance rebalancing has lowest priority**: It addresses steady-state optimization rather than cluster topology changes, so it can wait for scaling operations to complete


No there is no such a queue, I think this is trying to summarize what's going to happen in the corresponding FSM (i.e. a anomaly is skipped if there is a rebalance on scale-down, ...). I would agree it could sound like a queue handling with priorities. @ShubhamRwt because we have the FSM explained later on, I guess you could even remove this list? Wdyt @scholzj ?

ppatierno · 2026-05-25T10:43:51Z

+- **Manual rebalances**: Created by users with arbitrary names
+
+The implementation leverages the existing `KafkaRebalanceUtils` class which is already used throughout the Strimzi codebase for state extraction and validation.
+All KafkaRebalance resources (manual and auto-generated) have the `strimzi.io/cluster` label, allowing the operator to discover them via label selector.


Suggested change

All KafkaRebalance resources (manual and auto-generated) have the `strimzi.io/cluster` label, allowing the operator to discover them via label selector.

All `KafkaRebalance` resources (manual and auto-generated) have the `strimzi.io/cluster` label, allowing the operator to discover them via label selector.

ppatierno · 2026-05-25T10:45:44Z

+
+**Tracking Rebalance Completion Times:**
+
+Since auto-generated KafkaRebalance resources are deleted upon completion or failure, and manual KafkaRebalance resources can be deleted by users, we need a persistent location to track completion times.


Suggested change

Since auto-generated KafkaRebalance resources are deleted upon completion or failure, and manual KafkaRebalance resources can be deleted by users, we need a persistent location to track completion times.

Since auto-generated `KafkaRebalance` resources are deleted upon completion or failure, and manual `KafkaRebalance` resources can be deleted by users, we need a persistent location to track completion times.

ppatierno · 2026-05-25T10:57:58Z

+**State Machine Handling:**
+
+Both inter-broker and intra-broker rebalancing use the same `RebalanceOnImbalance` state.
+The operator tracks which type of rebalance is currently running and executes them sequentially within this state:


I see a problem here. The states within an FSM are persisted into the Kafka CR so that if the cluster operator crashes it can recover from it was before. You are adding a new (internal) "state" for the RebalanceOnImbalance .... state :-) which makes a differentiation between rebalanceDisk true or false. It means that when the operator recover from a crash and it recovers the state RebalanceOnImbalance, it doesn't know if the rebalanceDisk it was handling was true or false. How are you going to solve this?

I understand your concern here. You are right. I think we can either think of having two separate state RebalanceOnImbalanceInterBrokerBalancing and RebalanceOnImbalanceIntraBrokerBalancing or the other way can be introduction of new field called pendingOperation or phase to which will track the state of the RebalanceOnImbalance mode.

Not sure I would make the FSM more complicated by adding an additional state tbh. Maybe just storing the rebalanceDisk field would be enough. Let's see if others have a better idea.

ppatierno · 2026-05-25T11:24:26Z

+When the `KafkaRebalanceAssemblyOperator` reconciles a KafkaRebalance resource and detects a transition to a terminal state:
+1. Extracts the cluster name from the KafkaRebalance labels
+2. Retrieves the corresponding Kafka CR
+3. Updates `Kafka.status.lastRebalanceCompletionTime` to the current timestamp
+4. Patches the Kafka CR status


I agree with Jakub: the KafkaAssemblyOperator is the owner of Kafka CR and it should be the only one to update the Kafka CR status (note: the KafkaAutoRebalanceReconciler can do that because it's part of such operator and not an operator itself). On the other side the KafkaRebalanceAssemblyOperator owns KafkaRebalance and it can't write into Kafka CR. I can't see an easy solution tough.
The KafkaAssemblyOperator having a watch on KafkaRebalance resources to detect when one ends in a terminal state to update the last rebalance time? :-/ Doing the same but poll-based on each reconciliation? :-/ (but we could lose a KafkaRebalance that was deleted by the user right after the rebalancing operation).

ShubhamRwt added 3 commits March 30, 2026 12:42

proposal for auto-rebalance on imbalance

3c034bb

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

Fixing minot issues

9d215a2

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

Remove unnecessory lines

4f3d6e5

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

ShubhamRwt requested review from ppatierno and tomncooper March 30, 2026 08:33

Fix broken link

d8965cf

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

ShubhamRwt requested review from a team and scholzj April 1, 2026 12:53

ppatierno requested review from Frawless, im-konge, katheris, see-quick and tinaselenge and removed request for a team April 1, 2026 12:55

Refined the proposal

7bfa09f

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

scholzj reviewed Apr 7, 2026

View reviewed changes

ppatierno reviewed Apr 7, 2026

View reviewed changes

Added suggestions from PP and JS

50b8936

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

egyedt reviewed Apr 16, 2026

View reviewed changes

tomncooper reviewed Apr 16, 2026

View reviewed changes

ShubhamRwt added 2 commits May 5, 2026 17:53

Added suggestions by TC, PP, JS and EG

813f57d

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

update diagram

60ae9e6

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

Add minor edits

86d1681

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

scholzj mentioned this pull request May 10, 2026

Add automatic offline log directory detection and rolling restart strimzi/strimzi-kafka-operator#12726

Closed

acgtun mentioned this pull request May 10, 2026

Proposal 142: Automatic detection and recovery of offline log directories #221

Closed

scholzj reviewed May 23, 2026

View reviewed changes

ppatierno reviewed May 25, 2026

View reviewed changes

mly-zju mentioned this pull request May 28, 2026

Add proposal for controller readiness probe based on KRaft raft state #225

Open


		#### What happens if some rebalance fails:

		With the new `imbalance` mode, we will be introducing two new states to the FSM called `RebalanceOnImbalance` and `RebalanceOnImbalanceNotComplete`

		## Future Scope
		In the future, we plan to introduce auto-rebalance for topic-related and metric-related imbalances.

	The `imbalance` mode will be mapped to the `full` mode in the generated KafkaRebalance resource which means that the generated `KafkaRebalance` custom resource will have the mode set as `full` which within the Strimzi rebalancing operator means calling the Cruise Control API to run a rebalancing taking all brokers into account.
	The `imbalance` mode will be mapped to the `full` mode in the generated `KafkaRebalance` resource which means that the generated `KafkaRebalance` custom resource will have the mode set as `full` which within the Strimzi rebalancing operator means calling the Cruise Control API to run a rebalancing taking all brokers into account.

	All KafkaRebalance resources (manual and auto-generated) have the `strimzi.io/cluster` label, allowing the operator to discover them via label selector.
	All `KafkaRebalance` resources (manual and auto-generated) have the `strimzi.io/cluster` label, allowing the operator to discover them via label selector.


		Tracking Rebalance Completion Times:

		Since auto-generated KafkaRebalance resources are deleted upon completion or failure, and manual KafkaRebalance resources can be deleted by users, we need a persistent location to track completion times.

	Since auto-generated KafkaRebalance resources are deleted upon completion or failure, and manual KafkaRebalance resources can be deleted by users, we need a persistent location to track completion times.
	Since auto-generated `KafkaRebalance` resources are deleted upon completion or failure, and manual `KafkaRebalance` resources can be deleted by users, we need a persistent location to track completion times.

Conversation

ShubhamRwt commented Mar 30, 2026

Uh oh!

scholzj commented Apr 1, 2026

Uh oh!

ShubhamRwt commented Apr 1, 2026

Uh oh!

scholzj left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!