Skip to content

Add proposal for alerting on stuck reconciliations#224

Open
nikita-kibitkin wants to merge 3 commits into
strimzi:mainfrom
nikita-kibitkin:144-stuck-reconciliation-alerting
Open

Add proposal for alerting on stuck reconciliations#224
nikita-kibitkin wants to merge 3 commits into
strimzi:mainfrom
nikita-kibitkin:144-stuck-reconciliation-alerting

Conversation

@nikita-kibitkin
Copy link
Copy Markdown

This proposal suggests a solution for strimzi/strimzi-kafka-operator#11634.

Introduces a per-resource gauge that exports the start time of the in-progress reconciliation, so users can alert on
reconciliations that started but never completed without scanning operator logs.

Signed-off-by: Nikita Kibitkin <nikita.n.kibitkin@gmail.com>
Copy link
Copy Markdown
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the proposal and sorry it took so long to review. I left some clarification comments, but I think it looks mostly good otherwise.

Comment thread 144-stuck-reconciliation-alerting.md Outdated
* Identify the affected resource by kind, namespace, and name.
* Expose how long the reconciliation has been running.
* Avoid false alerts after completed reconciliations, including deletion reconciliations.
* Keep deployment-specific thresholds out of the operator itself.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded this section. The bullets now describe what information the metric gives to Prometheus rules: identify the resource, calculate elapsed time from the start timestamp, stop matching completed reconciliations after cleanup, and keep the threshold in the alert rule instead of baking a timeout into the operator.

Comment thread 144-stuck-reconciliation-alerting.md Outdated

The operator will not define a built-in timeout. Different Strimzi deployments can have different reconciliation durations, so the threshold should be part of the user's Prometheus alerting rule.

Removing the local meter also avoids tombstone values such as `-1`. Once Prometheus observes that the series is no longer exported, the series becomes stale and instant-vector alerts stop matching the old value. This is the same cleanup model already used by per-resource metrics such as `strimzi.resource.state`, where the operator removes local meters when they no longer apply.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For someone who is not a Prometheus expert ... can you provide more details on when/how will Prometheus mark it stale? It matters here because it defines how reliable the alerting would be.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expanded the lifecycle section. The start time is only the gauge value, not an explicit Prometheus sample timestamp. After the local meter is removed, the next successful scrape of the same endpoint no longer returns the series, Prometheus marks it stale, and instant-vector selectors stop returning it after that stale marker.

Comment thread 144-stuck-reconciliation-alerting.md Outdated

## Proposal

The operators will expose a new gauge metric for active reconciliations:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be good to be more specific here ... does operators here mean all 3 operators? Cluster / Topic / User? Or does it mean the internal operators inside the Cluster Operator (e.g. Kafka, KafkaConnect, ...)? You touch on it later. But it might be good to clarify it right at the beginning.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this explicit in the proposal section: this means the Cluster Operator, Topic Operator, and User Operator. I also clarified below that the Topic Operator uses a different path (BatchingTopicController) from the Cluster/User Operator paths.

Comment thread 144-stuck-reconciliation-alerting.md Outdated
Comment on lines +73 to +75
* `Reconciliation` exposes `kind()`, `namespace()`, and `name()`.
* In the Cluster Operator, `AbstractOperator.withLock(...)` starts the progress-warning timer after the lock is acquired and cancels it when the asynchronous reconciliation completes.
* In the Topic and User Operator controller loop, `AbstractControllerLoop.reconcileWrapper(...)` starts the progress-warning task before calling `reconcile(...)` and cancels it in the `finally` block.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these have access to the metrics objects?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, clarified in the implementation section. AbstractOperator has an OperatorMetricsHolder through metrics(), AbstractControllerLoop exposes a ControllerMetricsHolder, and BatchingTopicController already has a TopicOperatorMetricsHolder field. The helper can therefore live on the existing metrics holders instead of passing a new metrics dependency into the wrappers.

@ppatierno
Copy link
Copy Markdown
Member

Overall it looks ok. I will wait for answers to the questions asked by Jakub. Also, could you please split the paragraph with one sentence per line please? It helps on reviewing to make comments on every specific line. Thanks!

Signed-off-by: Nikita Kibitkin <nikita.n.kibitkin@gmail.com>
Signed-off-by: Nikita Kibitkin <nikita.n.kibitkin@gmail.com>
@nikita-kibitkin
Copy link
Copy Markdown
Author

Thanks for the review.

I pushed two commits:

  • a mechanical reflow commit applying one sentence per line;
  • a clarification commit addressing the review comments.

The proposal now clarifies the Prometheus rule goals, the Cluster / Topic / User Operator scope, Prometheus staleness behavior and where the implementation can access the existing metrics holders.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants