Create additional alerts for stuck upgrades #31

@simu

Context

We noticed that we aren't explicitly notified about upgrades which are stuck due to a missing admin ack. We can already check whether an upgrade won't succeed due to a missing admin ack with cluster_operator_conditions{condition="Upgradeable",endpoint="metrics",name="version",reason="AdminAckRequired"} == 0. However, we should write a more sophisticated query which takes into account the current and desired cluster version of the running upgrade job, and which fires when the upgrade job attempts a minor upgrade that can't succeed because the admin ack is missing.
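As a starting point, the existing check could be wrapped in an alerting rule roughly like the sketch below. It uses only the query quoted above and does not yet compare the current and desired cluster versions, so it would fire for any missing admin ack rather than only for a blocked minor upgrade. The group name, alert name, `for` duration, and severity are assumptions chosen for illustration.

```yaml
groups:
  - name: upgrade-admin-ack            # assumed group name
    rules:
      - alert: UpgradeBlockedByMissingAdminAck   # assumed alert name
        # Value 0 means the Upgradeable condition is false with reason AdminAckRequired.
        expr: |
          cluster_operator_conditions{condition="Upgradeable",endpoint="metrics",name="version",reason="AdminAckRequired"} == 0
        for: 10m                       # assumed; avoids firing on short-lived blips
        labels:
          severity: warning            # assumed severity
        annotations:
          summary: Cluster upgrade cannot proceed because an admin ack is missing.
```

The more sophisticated variant asked for here would additionally need to join this expression with whatever metrics expose the running upgrade job and its target version.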

Additionally, we didn't realize that maintenance was stuck on one cluster for more than 10 hours due to the missing admin ack. We should add an alert for upgrade jobs which have been running for longer than roughly 6 hours (exact duration TBD).
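A duration-based alert could look something like the following sketch. The metric `hypothetical_upgrade_job_start_timestamp_seconds` is a placeholder, not a metric known to exist; it stands in for whatever series reports when an active upgrade job started. The 6 hour threshold mirrors the tentative value above, and the alert name and severity are likewise assumptions.

```yaml
groups:
  - name: upgrade-duration             # assumed group name
    rules:
      - alert: UpgradeJobRunningTooLong          # assumed alert name
        # Placeholder metric: replace with the real series that exposes the
        # start time (Unix seconds) of the currently running upgrade job.
        expr: |
          time() - hypothetical_upgrade_job_start_timestamp_seconds > 6 * 60 * 60
        labels:
          severity: warning            # assumed severity
        annotations:
          summary: Upgrade job has been running for more than 6 hours.
```

If only a state metric for the job is available rather than a start timestamp, a `for: 6h` clause on an "upgrade job is active" expression would be another way to express the same condition.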

Task deliverables

  • Component configures an alert which fires if the maintenance takes too long
  • Component configures an alert which fires if the upgrade job is blocked due to a missing admin ack

Labels: enhancement
