Add proposal for alerting on stuck reconciliations#224
Conversation
Signed-off-by: Nikita Kibitkin <nikita.n.kibitkin@gmail.com>
scholzj
left a comment
There was a problem hiding this comment.
Thanks for the proposal and sorry it took so long to review. I left some clarification comments, but I think it looks mostly good otherwise.
| * Identify the affected resource by kind, namespace, and name. | ||
| * Expose how long the reconciliation has been running. | ||
| * Avoid false alerts after completed reconciliations, including deletion reconciliations. | ||
| * Keep deployment-specific thresholds out of the operator itself. |
There was a problem hiding this comment.
Reworded this section. The bullets now describe what information the metric gives to Prometheus rules: identify the resource, calculate elapsed time from the start timestamp, stop matching completed reconciliations after cleanup, and keep the threshold in the alert rule instead of baking a timeout into the operator.
|
|
||
| The operator will not define a built-in timeout. Different Strimzi deployments can have different reconciliation durations, so the threshold should be part of the user's Prometheus alerting rule. | ||
|
|
||
| Removing the local meter also avoids tombstone values such as `-1`. Once Prometheus observes that the series is no longer exported, the series becomes stale and instant-vector alerts stop matching the old value. This is the same cleanup model already used by per-resource metrics such as `strimzi.resource.state`, where the operator removes local meters when they no longer apply. |
There was a problem hiding this comment.
For someone who is not a Prometheus expert ... can you provide more details on when/how will Prometheus mark it stale? It matters here because it defines how reliable the alerting would be.
There was a problem hiding this comment.
Expanded the lifecycle section. The start time is only the gauge value, not an explicit Prometheus sample timestamp. After the local meter is removed, the next successful scrape of the same endpoint no longer returns the series, Prometheus marks it stale, and instant-vector selectors stop returning it after that stale marker.
|
|
||
| ## Proposal | ||
|
|
||
| The operators will expose a new gauge metric for active reconciliations: |
There was a problem hiding this comment.
Maybe it would be good to be more specific here ... does operators here mean all 3 operators? Cluster / Topic / User? Or does it mean the internal operators inside the Cluster Operator (e.g. Kafka, KafkaConnect, ...)? You touch on it later. But it might be good to clarify it right at the beginning.
There was a problem hiding this comment.
Made this explicit in the proposal section: this means the Cluster Operator, Topic Operator, and User Operator. I also clarified below that the Topic Operator uses a different path (BatchingTopicController) from the Cluster/User Operator paths.
| * `Reconciliation` exposes `kind()`, `namespace()`, and `name()`. | ||
| * In the Cluster Operator, `AbstractOperator.withLock(...)` starts the progress-warning timer after the lock is acquired and cancels it when the asynchronous reconciliation completes. | ||
| * In the Topic and User Operator controller loop, `AbstractControllerLoop.reconcileWrapper(...)` starts the progress-warning task before calling `reconcile(...)` and cancels it in the `finally` block. |
There was a problem hiding this comment.
Do these have access to the metrics objects?
There was a problem hiding this comment.
Yes, clarified in the implementation section. AbstractOperator has an OperatorMetricsHolder through metrics(), AbstractControllerLoop exposes a ControllerMetricsHolder, and BatchingTopicController already has a TopicOperatorMetricsHolder field. The helper can therefore live on the existing metrics holders instead of passing a new metrics dependency into the wrappers.
|
Overall it looks ok. I will wait for answers to the questions asked by Jakub. Also, could you please split the paragraph with one sentence per line please? It helps on reviewing to make comments on every specific line. Thanks! |
Signed-off-by: Nikita Kibitkin <nikita.n.kibitkin@gmail.com>
Signed-off-by: Nikita Kibitkin <nikita.n.kibitkin@gmail.com>
|
Thanks for the review. I pushed two commits:
The proposal now clarifies the Prometheus rule goals, the Cluster / Topic / User Operator scope, Prometheus staleness behavior and where the implementation can access the existing metrics holders. |
This proposal suggests a solution for strimzi/strimzi-kafka-operator#11634.
Introduces a per-resource gauge that exports the start time of the in-progress reconciliation, so users can alert on
reconciliations that started but never completed without scanning operator logs.