Add proposal for alerting on stuck reconciliations by nikita-kibitkin · Pull Request #224 · strimzi/proposals

nikita-kibitkin · 2026-05-17T17:32:52Z

This proposal suggests a solution for strimzi/strimzi-kafka-operator#11634.

Introduces a per-resource gauge that exports the start time of the in-progress reconciliation, so users can alert on
reconciliations that started but never completed without scanning operator logs.

Signed-off-by: Nikita Kibitkin <nikita.n.kibitkin@gmail.com>

scholzj

Thanks for the proposal and sorry it took so long to review. I left some clarification comments, but I think it looks mostly good otherwise.

scholzj · 2026-05-24T21:48:20Z

+* Identify the affected resource by kind, namespace, and name.
+* Expose how long the reconciliation has been running.
+* Avoid false alerts after completed reconciliations, including deletion reconciliations.
+* Keep deployment-specific thresholds out of the operator itself.


What does this mean?

Reworded this section. The bullets now describe what information the metric gives to Prometheus rules: identify the resource, calculate elapsed time from the start timestamp, stop matching completed reconciliations after cleanup, and keep the threshold in the alert rule instead of baking a timeout into the operator.
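To make the four rule goals concrete, here is a minimal sketch in plain Java of the comparison an alert rule would perform on this gauge. It is not Prometheus or Strimzi code: the class `StuckReconciliationRule`, the `kind/namespace/name` key format, and the threshold value are hypothetical and only illustrate that the operator exports a start time while the rule owns the threshold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for an alert rule; names and key format are assumptions.
final class StuckReconciliationRule {
    // The threshold is deployment-specific, so it lives in the rule, not in the operator.
    private final long thresholdSeconds;

    StuckReconciliationRule(long thresholdSeconds) {
        this.thresholdSeconds = thresholdSeconds;
    }

    /**
     * exported maps "kind/namespace/name" to the gauge value, which is the
     * start time of the in-progress reconciliation in epoch seconds.
     * Only series that are currently exported appear in the map.
     */
    List<String> evaluate(Map<String, Long> exported, long nowEpochSeconds) {
        List<String> firing = new ArrayList<>();
        // TreeMap only gives the result a deterministic order.
        for (Map.Entry<String, Long> series : new TreeMap<>(exported).entrySet()) {
            long elapsedSeconds = nowEpochSeconds - series.getValue();
            if (elapsedSeconds > thresholdSeconds) {
                firing.add(series.getKey());
            }
        }
        return firing;
    }
}
```

A resource whose series is no longer exported cannot match, which is how completed reconciliations drop out of the alert without a tombstone value.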

scholzj · 2026-05-24T21:52:05Z

+
+The operator will not define a built-in timeout. Different Strimzi deployments can have different reconciliation durations, so the threshold should be part of the user's Prometheus alerting rule.
+
+Removing the local meter also avoids tombstone values such as `-1`. Once Prometheus observes that the series is no longer exported, the series becomes stale and instant-vector alerts stop matching the old value. This is the same cleanup model already used by per-resource metrics such as `strimzi.resource.state`, where the operator removes local meters when they no longer apply.


For someone who is not a Prometheus expert ... can you provide more details on when and how Prometheus will mark it stale? It matters here because it defines how reliable the alerting would be.

Expanded the lifecycle section. The start time is only the gauge value, not an explicit Prometheus sample timestamp. After the local meter is removed, the next successful scrape of the same endpoint no longer returns the series, Prometheus marks it stale, and instant-vector selectors stop returning it after that stale marker.
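The lifecycle described here can be sketched in plain Java as follows. This is a simplified model under stated assumptions, not the operator's metrics code and not how Prometheus is implemented: `ActiveReconciliationGauges` and `ScrapeObserver` are hypothetical names, and the observer only imitates the rule that a series missing from the next successful scrape is treated as stale.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical operator-side view: one entry per in-progress reconciliation.
final class ActiveReconciliationGauges {
    private final Map<String, Long> startTimes = new ConcurrentHashMap<>();

    // The start time is the gauge value, not a sample timestamp.
    void started(String resourceKey, long startEpochSeconds) {
        startTimes.put(resourceKey, startEpochSeconds);
    }

    // Completion removes the meter instead of writing a tombstone such as -1.
    void completed(String resourceKey) {
        startTimes.remove(resourceKey);
    }

    // What a scrape of the endpoint would return at this moment.
    Map<String, Long> scrape() {
        return Map.copyOf(startTimes);
    }
}

// Hypothetical scraper-side view: compares consecutive successful scrapes.
final class ScrapeObserver {
    private Set<String> previous = new HashSet<>();

    /** Returns the series present in the previous scrape but absent from this one. */
    Set<String> observe(Map<String, Long> scrape) {
        Set<String> stale = new HashSet<>(previous);
        stale.removeAll(scrape.keySet());
        previous = new HashSet<>(scrape.keySet());
        return stale;
    }
}
```

In this model the delay between completion and the alert clearing is bounded by the scrape interval, because the series only disappears once a later scrape no longer returns it.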

scholzj · 2026-05-24T21:53:04Z

+
+## Proposal
+
+The operators will expose a new gauge metric for active reconciliations:


Maybe it would be good to be more specific here ... does "operators" here mean all 3 operators (Cluster / Topic / User)? Or does it mean the internal operators inside the Cluster Operator (e.g. Kafka, KafkaConnect, ...)? You touch on it later, but it might be good to clarify it right at the beginning.

Made this explicit in the proposal section: this means the Cluster Operator, Topic Operator, and User Operator. I also clarified below that the Topic Operator uses a different path (`BatchingTopicController`) from the Cluster/User Operator paths.

scholzj · 2026-05-24T21:54:14Z

+* `Reconciliation` exposes `kind()`, `namespace()`, and `name()`.
+* In the Cluster Operator, `AbstractOperator.withLock(...)` starts the progress-warning timer after the lock is acquired and cancels it when the asynchronous reconciliation completes.
+* In the Topic and User Operator controller loop, `AbstractControllerLoop.reconcileWrapper(...)` starts the progress-warning task before calling `reconcile(...)` and cancels it in the `finally` block.


Do these have access to the metrics objects?

Yes, clarified in the implementation section. `AbstractOperator` has an `OperatorMetricsHolder` through `metrics()`, `AbstractControllerLoop` exposes a `ControllerMetricsHolder`, and `BatchingTopicController` already has a `TopicOperatorMetricsHolder` field. The helper can therefore live on the existing metrics holders instead of passing a new metrics dependency into the wrappers.
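As a rough illustration of a helper that lives on a metrics holder, here is a self-contained Java sketch of the synchronous controller-loop shape. Everything in it is hypothetical: `SketchMetricsHolder`, `SketchControllerLoop`, and the methods `reconciliationStarted` and `reconciliationFinished` are invented for this example and are not the Strimzi classes or their real signatures. It does not cover the asynchronous Cluster Operator path, where cleanup would run when the reconciliation completes rather than in a `finally` block.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical holder; the real metrics holders have a different API.
final class SketchMetricsHolder {
    private final Map<String, Long> activeStartTimes = new ConcurrentHashMap<>();

    void reconciliationStarted(String resourceKey, long startEpochSeconds) {
        activeStartTimes.put(resourceKey, startEpochSeconds);
    }

    void reconciliationFinished(String resourceKey) {
        activeStartTimes.remove(resourceKey);
    }

    boolean isActive(String resourceKey) {
        return activeStartTimes.containsKey(resourceKey);
    }
}

// Hypothetical wrapper mirroring the start-before / cleanup-in-finally pattern.
final class SketchControllerLoop {
    private final SketchMetricsHolder metrics;

    SketchControllerLoop(SketchMetricsHolder metrics) {
        this.metrics = metrics;
    }

    void reconcileWrapper(String resourceKey, long nowEpochSeconds, Runnable reconcile) {
        // Record the start time before the reconciliation runs.
        metrics.reconciliationStarted(resourceKey, nowEpochSeconds);
        try {
            reconcile.run();
        } finally {
            // Runs on success and on failure; a reconciliation that never
            // returns keeps its entry, which is what the alert detects.
            metrics.reconciliationFinished(resourceKey);
        }
    }
}
```

Because the wrapper only talks to the holder it already has, no new metrics dependency needs to be passed in, which matches the point made in the reply above.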

ppatierno · 2026-05-25T13:05:45Z

Overall it looks OK. I will wait for answers to the questions Jakub asked. Also, could you please split the paragraphs so that each sentence is on its own line? That makes it easier to comment on a specific line during review. Thanks!

nikita-kibitkin · 2026-05-26T13:18:15Z

Thanks for the review.

I pushed two commits:

* a mechanical reflow commit applying one sentence per line;
* a clarification commit addressing the review comments.

The proposal now clarifies the Prometheus rule goals, the Cluster / Topic / User Operator scope, Prometheus staleness behavior and where the implementation can access the existing metrics holders.

Add proposal for alerting on stuck reconciliations

0ace8ba

Signed-off-by: Nikita Kibitkin <nikita.n.kibitkin@gmail.com>

nikita-kibitkin mentioned this pull request May 17, 2026

Create alerting mechanism for stuck reconciliations strimzi/strimzi-kafka-operator#11634

Open

scholzj requested review from Frawless, PaulRMellor, im-konge, katheris, ppatierno, samuel-hawker, scholzj, see-quick, sknot-rh, tinaselenge and tombentley May 17, 2026 19:55

scholzj reviewed May 24, 2026

nikita-kibitkin added 2 commits May 26, 2026 14:07

Reflow proposal text for review

2e53372

Signed-off-by: Nikita Kibitkin <nikita.n.kibitkin@gmail.com>

Clarify stuck reconciliation proposal

f3e7508

Signed-off-by: Nikita Kibitkin <nikita.n.kibitkin@gmail.com>

nikita-kibitkin wants to merge 3 commits into strimzi:main from nikita-kibitkin:144-stuck-reconciliation-alerting

3 participants

