Context
We noticed that we weren't explicitly notified about upgrades which are stuck due to a missing admin ack. We can already check whether an upgrade won't succeed due to a missing admin ack with the following query:
cluster_operator_conditions{condition="Upgradeable",endpoint="metrics",name="version",reason="AdminAckRequired"} == 0
However, we should write a more sophisticated query which takes into account the current and desired cluster versions of the running upgrade job, and which fires only when the upgrade job attempts a minor upgrade that can't succeed due to the missing admin ack.
Additionally, we didn't realize that maintenance had been stuck on one cluster for more than 10 hours due to the missing admin ack. We should add an alert for upgrade jobs which have been running for longer than roughly 6 hours (exact duration TBD).
Task deliverables
- Component configures an alert which fires if the maintenance takes too long
- Component configures an alert which fires if the upgrade job is blocked due to a missing admin ack
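
As a starting point for discussion, the two alerts could be sketched as Prometheus alerting rules along the following lines. This is only a rough sketch under several assumptions: the group name, alert names, severities, and the 10m hold time are placeholders; the first rule reuses the simple check from the context as-is and does not yet compare the current and desired cluster versions; and the metric `example_upgrade_job_active` in the second rule is hypothetical and stands in for whatever metric actually reports a running upgrade job. The 6h duration is the tentative value from the context and still TBD.

```yaml
groups:
  - name: upgrade-job-alerts-sketch  # placeholder group name
    rules:
      # Simple variant: fires while the cluster version is not upgradeable
      # because an admin ack is required. The expression is the existing
      # check from the context; it does NOT yet take the current and desired
      # versions of the running upgrade job into account.
      - alert: UpgradeBlockedByMissingAdminAck  # placeholder alert name
        expr: |
          cluster_operator_conditions{condition="Upgradeable",endpoint="metrics",name="version",reason="AdminAckRequired"} == 0
        for: 10m  # assumed hold time to avoid firing on short-lived states
        labels:
          severity: warning  # assumed severity
        annotations:
          summary: Cluster upgrade is blocked by a missing admin ack.

      # Long-running maintenance: fires when an upgrade job has been active
      # for longer than the (tentative) 6 hours.
      # ASSUMPTION: example_upgrade_job_active is a hypothetical metric that
      # is 1 while an upgrade job is running. Replace it with the real one.
      - alert: UpgradeJobRunningTooLong  # placeholder alert name
        expr: |
          example_upgrade_job_active == 1
        for: 6h  # exact duration TBD
        labels:
          severity: warning  # assumed severity
        annotations:
          summary: Upgrade job has been running for more than 6 hours.
```

Using `for:` means the alert only fires once the condition has held continuously for the stated duration, which fits the "running for longer than 6 hours" requirement without needing a start timestamp metric.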