Skip to content

Added proposal for auto-rebalance on imbalanced cluster feature in operator#211

Open
ShubhamRwt wants to merge 9 commits into
strimzi:mainfrom
ShubhamRwt:autoRebalanceImbalance
Open

Added proposal for auto-rebalance on imbalanced cluster feature in operator#211
ShubhamRwt wants to merge 9 commits into
strimzi:mainfrom
ShubhamRwt:autoRebalanceImbalance

Conversation

@ShubhamRwt
Copy link
Copy Markdown
Contributor

This PR aims to introduce the self-healing feature in Strimzi. This proposal contains all the comments and suggestion left on the old proposal . This proposal aim to utilize the auto-rebalancing feature of Strimzi to introduce the self healing.

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
@scholzj
Copy link
Copy Markdown
Member

scholzj commented Apr 1, 2026

@ShubhamRwt So, is this a draft? Or is it ready for review? It is not completely clear as the PR is not a Draft but you did not requested the review from all maintainers but only from two specific people.

@ShubhamRwt
Copy link
Copy Markdown
Contributor Author

@scholzj It is ready for review, Sorry I didn't tagged everyone

@ShubhamRwt ShubhamRwt requested review from a team and scholzj April 1, 2026 12:53
@ppatierno ppatierno requested review from Frawless, im-konge, katheris, see-quick and tinaselenge and removed request for a team April 1, 2026 12:55
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Copy link
Copy Markdown
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • It is not clear what the lifecycle of the KafkaRebalance resources is
  • What is the impact of maintenance time windows?
  • What is the testing strategy?

Comment thread images/136-auto-rebalance-imbalance.png
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated

#### What happens if some rebalance fails:

With the new `imbalance` mode, we will be introducing two new states to the FSM called `RebalanceOnImbalance` and `RebalanceOnImbalanceNotComplete`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need an additional state for failed rebalance? IIRC there is no such a specific state for failures with auto-rebalancing on scale up/down. Or am I wrong?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was me overthinking that in case some failure happens and it might need a human intervention and we might not need the auto rebalance to happen again and again but I think this is not really required. I will remove it

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what state will it move to if there is an error? Idle?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the generate KR CR move to NotReady (or whatever the fail state for that is)? How will a human operator know this happened?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generated KR Cr would move to NotReady, yes but I think the user can check the logs but I dont think there is any other thing

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the answer clear based on my previous reply?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from me

Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
We will not enable the other anomaly detection classes, related to goal violations, that would require manual interventions at the infrastructure level such as disk or broker failures.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice, if we make events or WARN logs in Strimzi if there are other type of anomalies around Kafka cluster, which cannot be automatically fixed. WDYT?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK disk and broker failures would surface through the Strimzi Kafka CR status and normal metrics monitoring channels so I am not sure having CC also report that is particularly useful?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is reported in "other channels" then it is fine to me!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that depends on what disk and broker failures really are. I do not think disk failures are necessarily something the operator would notice? Broker failures - if that means the broker is unresponsive - then it would actually fail the reconciliation I think. But if the broker accepts connections but doesn't for example replicate any data, we would not notice it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all those situations are better surfaced through monitoring the appropriate metrics, that is all CC is doing anyway.

* `remove-brokers` - auto-rebalancing on scale down

To leverage the automated rebalance on imbalanced clusters (those with detected goal violations), we will be introducing a new mode to the auto-rebalancing feature.
The new mode will be called `imbalance`, which means that cluster imbalance was detected and rebalancing should be applied to all the brokers.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we should call it rebalance or rebalance-broker instead of imbalance
I prefer rebalance or rebalance-broker, since it will match with CC terminology and also add-brokers, remove-brokers have a verb, while "imbalance" is just an adjective. So to match the current convention better.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, to me, the mode is the reason for the auto-rebalance being triggered.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which case we might want to make it past-tense: "imbalanced"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, to me, the mode is the reason for the auto-rebalance being triggered.

I also disagree with you, if it is the reason, then why we call it 'mode'? It should be called 'reason' then!
In this current form, for me, the rebalance-brokers would match the convention created by add-brokers and remove-brokers...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But all the modes are rebalancing the brokers?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "mode" terminology is coming from the KafkaRebalance spec where you specify which mode you want to use for rebalancing: full, add-brokers, remove-brokers or remove-disks. All these modes maps to the corresponding CC endpoints. Maybe we had a discussion about a good naming long time ago about what to use in the auto-rebalancing and we ending with the same because the mapping was one to one. I can see @egyedt point where, within the auto-rebalancing, we are actually specifying "when I should trigger an auto-rebalancing?" on adding brokers, on removing brokers or ... whenever there is an anomaly. What the current proposal is going to do underneath in terms of KafkaRebalance is using the "full" mode which is using the /rebalance endpoint (full because using ALL brokers within the cluster). Maybe we can stick with "full" instead of "imbalance" for now? Or we should change the field name but it needs deprecating this and adding a new one.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see this is an ongoing discussion - Do we plan to use imbalance or full mode?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My take always was that it is the action we are going to take:

  • Add brokers
  • Remove brokers

Not the cause or event. For me, the imbalance is maybe not perfect (I think I originally preferred something else), but I think it is understandable for all users what it means and describes well the intent. full on the other hand is meaningless for me and makes it hard to imagine when it will be done.

While for the other two modes the type is the same as the mode in KafkaRebalance, for them it works fine to describe everything. full is fine as a mode in KafkaRebalance. It says what will happen. But in the Kafka CR autorebalance section, the keyword should focus on the intent and when it will happen. full does not explain that. imbalance does.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I am fine with sticking with imbalance.

Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
This could cause potential conflicts with other administration operations and is the primary reason self-healing has been disabled until now.
To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for some anomalies, the call of the fix-offline-replicas endpoint can be useful!
Beside the usage of rebalance endpoint.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would we decide which anomalies to call the different endpoint for?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be some logic about this in Strimzi if we decide to use fix-offline-replicas endpoint too. Otherwise we can use only the rebalance endpoint, it is totally fine to me. I just mentioned here as an interesting idea.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about this /fix_offline_replicas endpoint at all. If there is a specific anomaly raised about it, maybe it could make sense to use it but it's not trivial. For how the rebalance operator works today, Each CC endpoint call is mappend on a corresponding "mode" within the KafkaRebalance, so hitting two endpoints would mean creating two KafkaRebalance: one using the /rebalance endpoint and another one using the /fix_offline_replicas (which is anyway not supported now).

Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Copy link
Copy Markdown
Member

@tomncooper tomncooper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have had a pass. I left a few comments.

As other reviewers have identified, the main point is checking if a rebalance (manual or automatic) is ongoing before you do anything else.

For auto rebalances if seems you can check the state machine, for manual you could check CC directly or the status of the KR CR. If a manual KR CR already exists then something is going to check it's status with CC, depending on the ordering of those operations it may happen before you need to do your checks, in which case you could use that. If it is after or ordering can't be guaranteed then you might be best querying CC directly.

Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
This could cause potential conflicts with other administration operations and is the primary reason self-healing has been disabled until now.
To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would we decide which anomalies to call the different endpoint for?

To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will be the responsibility of the Strimzi Cluster Operator.
To enable this, we will use an approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
We will not enable the other anomaly detection classes, related to goal violations, that would require manual interventions at the infrastructure level such as disk or broker failures.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK disk and broker failures would surface through the Strimzi Kafka CR status and normal metrics monitoring channels so I am not sure having CC also report that is particularly useful?

* `remove-brokers` - auto-rebalancing on scale down

To leverage the automated rebalance on imbalanced clusters (those with detected goal violations), we will be introducing a new mode to the auto-rebalancing feature.
The new mode will be called `imbalance`, which means that cluster imbalance was detected and rebalancing should be applied to all the brokers.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, to me, the mode is the reason for the auto-rebalance being triggered.


#### What happens if some rebalance fails:

With the new `imbalance` mode, we will be introducing two new states to the FSM called `RebalanceOnImbalance` and `RebalanceOnImbalanceNotComplete`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what state will it move to if there is an error? Idle?


#### What happens if some rebalance fails:

With the new `imbalance` mode, we will be introducing two new states to the FSM called `RebalanceOnImbalance` and `RebalanceOnImbalanceNotComplete`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the generate KR CR move to NotReady (or whatever the fail state for that is)? How will a human operator know this happened?

Comment thread 136-auto-rebalance-on-imbalanced-clusters.md
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
Comment thread 136-auto-rebalance-on-imbalanced-clusters.md Outdated
ShubhamRwt added 2 commits May 5, 2026 17:53
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
@egyedt
Copy link
Copy Markdown

egyedt commented May 6, 2026

Is there an estimate when this PR will be ready?
Is there a lot of open questions?
If help is needed with the questions then feel free to answer me in this thread and I will try to help!

@ShubhamRwt
Copy link
Copy Markdown
Contributor Author

@egyedt Hi, I pushed the suggestions yesterday. I think this proposal is ready for next set of reviews. I hope the new suggestions fixes the open questions

@ShubhamRwt
Copy link
Copy Markdown
Contributor Author

@scholzj @ppatierno @tomncooper can you guys have another pass at this, please. Thankyou

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Copy link
Copy Markdown
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sitll don't see any considerations ont he maintenance time window which I think I raised before. It would be good to have it included in the proposal (whether it will follow the window, or if not, why not)

## Motivation

Currently, if the cluster is imbalanced, the user would need to manually rebalance the cluster by using the `KafkaRebalance` custom resource.
Every cluster requires manual intervention to trigger rebalancing when imbalances are detected, regardless of its size.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence sounds strange. I would maybe remove it. There are many people who live without rebalancing all their life. Especially if you have small cluster with 3 brokers and use replication factor 3, there is actually not much to rebalance.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that Shubham meant to say something like "if your cluster is unbalanced and you want to do some rebalancing, the only way you have today is the manual one, there is no automation which is the goal for this proposal". I don't think he meant to say that all clusters of any size needs rebalancing. Maybe he can clarify the meaning.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I assume so, but we should clarify it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that with respect to Strimzi, yes, If we want to do any rebalance due to imbalance then we have to do it manually and there is now way for auto rebalance on imbalance mechanism.

* `remove-brokers` - auto-rebalancing on scale down

To leverage the automated rebalance on imbalanced clusters (those with detected goal violations), we will be introducing a new mode to the auto-rebalancing feature.
The new mode will be called `imbalance`, which means that cluster imbalance was detected and rebalancing should be applied to all the brokers.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My take always was that it is the action we are going to take:

  • Add brokers
  • Remove brokers

Not the cause or event. For me, the imbalance is maybe not perfect (I think I originally preferred something else), but I think it is understandable for all users what it means and describes well the intent. full on the other hand is meaningless for me and makes it hard to imagine when it will be done.

While for the other two modes the type is the same as the mode in KafkaRebalance, for them it works fine to describe everything. full is fine as a mode in KafkaRebalance. It says what will happen. But in the Kafka CR autorebalance section, the keyword should focus on the intent and when it will happen. full does not explain that. imbalance does.

Comment on lines +200 to +204
4. If the key is not present or empty, the operator uses the default anomaly detection goals defined in `CruiseControlConfiguration.CRUISE_CONTROL_DEFAULT_ANOMALY_DETECTION_GOALS_LIST`:
- `RACK_AWARENESS_GOAL`
- `MIN_TOPIC_LEADERS_PER_BROKER_GOAL`
- `REPLICA_CAPACITY_GOAL`
- `DISK_CAPACITY_GOAL`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, are these goals what is in CRUISE_CONTROL_DEFAULT_ANOMALY_DETECTION_GOALS_LIST? It is a bit confusing what the list of goals means here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Shubham is listing what is in the CRUISE_CONTROL_DEFAULT_ANOMALY_DETECTION_GOALS_LIST. Would you suggest to remove the list and referencing to the constant definition only (within the code) or something else? Just trying to understand what "confusing" means here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably just reference the list. Or one can do something like to make it more clear...

4. If the key is not present or empty, the operator uses the default anomaly detection goals defined in `CruiseControlConfiguration.CRUISE_CONTROL_DEFAULT_ANOMALY_DETECTION_GOALS_LIST`.
   Currently, the list contains:
     - ...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to make it simpler if it sounds confusing

The `KafkaRebalance` has 4 modes: `full`, `add-broker`, `remove-broker` and `remove-disks` mode.
The `imbalance` mode will be mapped to the `full` mode in the generated KafkaRebalance resource which means that the generated `KafkaRebalance` custom resource will have the mode set as `full` which within the Strimzi rebalancing operator means calling the Cruise Control API to run a rebalancing taking all brokers into account.

The generated `KafkaRebalance` custom resource will be called `<my-cluster-name>-auto-rebalancing-goal-violation`, where the `<my-cluster-name>` part comes from the `metadata.name` in the `Kafka` custom resource, `goal-violation` means that a goal related imbalance was detected in the cluster.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the mode name be in the name? E.g. <my-cluster-name>-auto-rebalancing-imbalance? That will make it easier to understand for the user.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I named it <my-cluster-name>-auto-rebalancing-goal-violation since we only support goal violations for now. I think if we think about future like have fix for topic anomalies etc then it would be great to have distinct names based on the anomaly type.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't imbalance the distinct type you can name it after?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sound good too but I wanted to name it as per the anomaly. (since we are handling goal violations so <my-cluster-name>-auto-rebalancing-goal-violation but I am open to change the names.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ppatierno @tomncooper any preference from your side?

# ... other rebalancing related configuration
```

The operator also sets a finalizer, named `strimzi.io/auto-rebalancing`, on the generated `KafkaRebalance` custom resource.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we use this also for adding and removing brokers? Worth mentioning here whether it is the same pattern or not.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's the same.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thats the same. I will add a sentence to it

Comment on lines +391 to +395
When the `KafkaRebalanceAssemblyOperator` reconciles a KafkaRebalance resource and detects a transition to a terminal state:
1. Extracts the cluster name from the KafkaRebalance labels
2. Retrieves the corresponding Kafka CR
3. Updates `Kafka.status.lastRebalanceCompletionTime` to the current timestamp
4. Patches the Kafka CR status
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this would not work and cause problems with the operators writing over each other and causing conflicts. You should follow the principle to write from one place and read from the other.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Jakub: the KafkaAssemblyOperator is the owner of Kafka CR and it should be the only one to update the Kafka CR status (note: the KafkaAutoRebalanceReconciler can do that because it's part of such operator and not an operator itself). On the other side the KafkaRebalanceAssemblyOperator owns KafkaRebalance and it can't write into Kafka CR. I can't see an easy solution tough.
The KafkaAssemblyOperator having a watch on KafkaRebalance resources to detect when one ends in a terminal state to update the last rebalance time? :-/ Doing the same but poll-based on each reconciliation? :-/ (but we could lose a KafkaRebalance that was deleted by the user right after the rebalancing operation).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The obvious option is having some Config Map 🤷. I also wonder if it would make long term make sense to drop the `KafkaRebalanceAssemblyOperator and integrate the logic here. But I'm not sure I fully understand all the possible consequences. Mainly whether the fact that all rebalances would be done only if we get to the end of the reconciliation could cause any issues.

**Important Edge Case:** If a goal violation contains BOTH fixable goals and unfixable goals (i.e., both `fixableViolatedGoals` and `unfixableViolatedGoals` lists are non-empty), the operator will NOT trigger a rebalance. This follows Cruise Control's design principle that unfixable goals typically indicate infrastructure constraints (e.g., insufficient racks for rack-awareness, insufficient brokers for replica capacity) that cannot be resolved through partition reassignment alone. Attempting to rebalance only the fixable goals while unfixable goals remain would result in an incomplete fix and could lead to repeated rebalance attempts.

The operator can retrieve the unfixable goals from the `unfixableViolatedGoals` list in the JSON response.
When unfixable goals are detected, the operator will set a warning condition in the status of the `Kafka` CR to alert users that manual intervention is required:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, will the unfixable goals persist in Cruise Control? Or will the warning disappear right in the next reconciliations?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The later one is true but since the anomalies are unfixabale we can expect the error to pop up again once CC detects it again

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if the warning is not stable and we cannot guarantee that we can maintain it in every reconciliation because CC raises it right away, you should consider just logging it instead of using the conditions.

conditions:
- type: Warning
status: "True"
reason: UnfixableViolatedGoal
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably have some better name to properly relate to the feature it is about.

Comment on lines +639 to +642
- **Frequent rebalances**: More than 3 auto-rebalances triggered within 1 hour may indicate a persisting anomaly
- **Repeated goal violations**: Same goal violation detected multiple times in a short time window (e.g., >2 times in 30 minutes)
- **Rebalance failures**: Unfixable goals detected or repeated NotReady states requiring manual intervention

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we add these to the alerts we ship?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to let users to create their own alerts as they can then choose how and when they want to trigger the alerts as per their cluster but we can definitely ship some example for the same

Comment on lines +913 to +914
## Future Scope
In the future, we plan to introduce auto-rebalance for topic-related and metric-related imbalances.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## Future Scope
In the future, we plan to introduce auto-rebalance for topic-related and metric-related imbalances.
## Future Scope
In the future, we plan to introduce auto-rebalance for topic-related and metric-related imbalances.

@ShubhamRwt
Copy link
Copy Markdown
Contributor Author

@scholzj thanks for the reviews, I will answer them in a while. I was just thinking about maintenance window. I don't think we really need a maintenance window. I would assume that we would like to have it such that if the users can avoid self healing for some time whey they are going some ugrades/ repairs to their cluster. But I think if they want to avoid it for some time, they can just disable the feature and enable it back once they think they are good to go.

@scholzj
Copy link
Copy Markdown
Member

scholzj commented May 24, 2026

@scholzj thanks for the reviews, I will answer them in a while. I was just thinking about maintenance window. I don't think we really need a maintenance window. I would assume that we would like to have it such that if the users can avoid self healing for some time whey they are going some ugrades/ repairs to their cluster. But I think if they want to avoid it for some time, they can just disable the feature and enable it back once they think they are good to go.

@ShubhamRwt I personally think hooking it into maintenance time window makes sense. It is not about upgrades - you would do those IN the maintenance window in the first place. It is not about repairs - you would do those either in the maintenance window or in an emergency.

It is about for example about reducing disruption caused by the rebalancing in peak hours of your business. That is where you do not want to do cert renewals. And where you possibly also do not want to do rebalances. Sure, you can manually switch it off. But I do not think it works here. If you tell someone to "switch off" the auto-rebalancing every morning and switch it on every evening when going home they will never use it.

In any case, one way or another, this should be part of the proposal.

## Motivation

Currently, if the cluster is imbalanced, the user would need to manually rebalance the cluster by using the `KafkaRebalance` custom resource.
Every cluster requires manual intervention to trigger rebalancing when imbalances are detected, regardless of its size.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that Shubham meant to say something like "if your cluster is unbalanced and you want to do some rebalancing, the only way you have today is the manual one, there is no automation which is the goal for this proposal". I don't think he meant to say that all clusters of any size needs rebalancing. Maybe he can clarify the meaning.

* `remove-brokers` - auto-rebalancing on scale down

To leverage the automated rebalance on imbalanced clusters (those with detected goal violations), we will be introducing a new mode to the auto-rebalancing feature.
The new mode will be called `imbalance`, which means that cluster imbalance was detected and rebalancing should be applied to all the brokers.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I am fine with sticking with imbalance.

Comment on lines +200 to +204
4. If the key is not present or empty, the operator uses the default anomaly detection goals defined in `CruiseControlConfiguration.CRUISE_CONTROL_DEFAULT_ANOMALY_DETECTION_GOALS_LIST`:
- `RACK_AWARENESS_GOAL`
- `MIN_TOPIC_LEADERS_PER_BROKER_GOAL`
- `REPLICA_CAPACITY_GOAL`
- `DISK_CAPACITY_GOAL`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Shubham is listing what is in the CRUISE_CONTROL_DEFAULT_ANOMALY_DETECTION_GOALS_LIST. Would you suggest to remove the list and referencing to the constant definition only (within the code) or something else? Just trying to understand what "confusing" means here.

- Users must correct the template configuration to resume imbalance auto-rebalancing

The `KafkaRebalance` has 4 modes: `full`, `add-broker`, `remove-broker` and `remove-disks` mode.
The `imbalance` mode will be mapped to the `full` mode in the generated KafkaRebalance resource which means that the generated `KafkaRebalance` custom resource will have the mode set as `full` which within the Strimzi rebalancing operator means calling the Cruise Control API to run a rebalancing taking all brokers into account.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The `imbalance` mode will be mapped to the `full` mode in the generated KafkaRebalance resource which means that the generated `KafkaRebalance` custom resource will have the mode set as `full` which within the Strimzi rebalancing operator means calling the Cruise Control API to run a rebalancing taking all brokers into account.
The `imbalance` mode will be mapped to the `full` mode in the generated `KafkaRebalance` resource which means that the generated `KafkaRebalance` custom resource will have the mode set as `full` which within the Strimzi rebalancing operator means calling the Cruise Control API to run a rebalancing taking all brokers into account.

# ... other rebalancing related configuration
```

The operator also sets a finalizer, named `strimzi.io/auto-rebalancing`, on the generated `KafkaRebalance` custom resource.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's the same.

Comment on lines +274 to +278
Talking about the priority of rebalance operations, scale down rebalance operations have the highest priority, followed by scale up, and then rebalance on imbalance.
The priority ordering is designed to prevent wasted work and optimize resource usage:
- **Scale-down has highest priority**: Moving partition replicas to brokers that will soon be removed wastes resources and time, so scale-down operations must complete first
- **Scale-up has second priority**: Quickly utilizing new capacity is important for cluster stability
- **Imbalance rebalancing has lowest priority**: It addresses steady-state optimization rather than cluster topology changes, so it can wait for scaling operations to complete
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No there is no such a queue, I think this is trying to summarize what's going to happen in the corresponding FSM (i.e. a anomaly is skipped if there is a rebalance on scale-down, ...). I would agree it could sound like a queue handling with priorities. @ShubhamRwt because we have the FSM explained later on, I guess you could even remove this list? Wdyt @scholzj ?

- **Manual rebalances**: Created by users with arbitrary names

The implementation leverages the existing `KafkaRebalanceUtils` class which is already used throughout the Strimzi codebase for state extraction and validation.
All KafkaRebalance resources (manual and auto-generated) have the `strimzi.io/cluster` label, allowing the operator to discover them via label selector.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
All KafkaRebalance resources (manual and auto-generated) have the `strimzi.io/cluster` label, allowing the operator to discover them via label selector.
All `KafkaRebalance` resources (manual and auto-generated) have the `strimzi.io/cluster` label, allowing the operator to discover them via label selector.


**Tracking Rebalance Completion Times:**

Since auto-generated KafkaRebalance resources are deleted upon completion or failure, and manual KafkaRebalance resources can be deleted by users, we need a persistent location to track completion times.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Since auto-generated KafkaRebalance resources are deleted upon completion or failure, and manual KafkaRebalance resources can be deleted by users, we need a persistent location to track completion times.
Since auto-generated `KafkaRebalance` resources are deleted upon completion or failure, and manual `KafkaRebalance` resources can be deleted by users, we need a persistent location to track completion times.

**State Machine Handling:**

Both inter-broker and intra-broker rebalancing use the same `RebalanceOnImbalance` state.
The operator tracks which type of rebalance is currently running and executes them sequentially within this state:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a problem here. The states within an FSM are persisted into the Kafka CR so that if the cluster operator crashes it can recover from it was before. You are adding a new (internal) "state" for the RebalanceOnImbalance .... state :-) which makes a differentiation between rebalanceDisk true or false. It means that when the operator recover from a crash and it recovers the state RebalanceOnImbalance, it doesn't know if the rebalanceDisk it was handling was true or false. How are you going to solve this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your concern here. You are right. I think we can either think of having two separate state RebalanceOnImbalanceInterBrokerBalancing and RebalanceOnImbalanceIntraBrokerBalancing or the other way can be introduction of new field called pendingOperation or phase to which will track the state of the RebalanceOnImbalance mode.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I would make the FSM more complicated by adding an additional state tbh. Maybe just storing the rebalanceDisk field would be enough. Let's see if others have a better idea.

Comment on lines +391 to +395
When the `KafkaRebalanceAssemblyOperator` reconciles a KafkaRebalance resource and detects a transition to a terminal state:
1. Extracts the cluster name from the KafkaRebalance labels
2. Retrieves the corresponding Kafka CR
3. Updates `Kafka.status.lastRebalanceCompletionTime` to the current timestamp
4. Patches the Kafka CR status
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Jakub: the KafkaAssemblyOperator is the owner of Kafka CR and it should be the only one to update the Kafka CR status (note: the KafkaAutoRebalanceReconciler can do that because it's part of such operator and not an operator itself). On the other side the KafkaRebalanceAssemblyOperator owns KafkaRebalance and it can't write into Kafka CR. I can't see an easy solution tough.
The KafkaAssemblyOperator having a watch on KafkaRebalance resources to detect when one ends in a terminal state to update the last rebalance time? :-/ Doing the same but poll-based on each reconciliation? :-/ (but we could lose a KafkaRebalance that was deleted by the user right after the rebalancing operation).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants