Skip to content

Latest commit

 

History

History
509 lines (388 loc) · 14.9 KB

File metadata and controls

509 lines (388 loc) · 14.9 KB

Resource Lifecycle Management

This document describes two complementary features for managing the lifecycle of operator-managed resources:

  1. Tombstoning - Clean deletion of obsolete resources during upgrades
  2. Root Exclusion - Prevention of resource creation from Day 0

Overview

During operator upgrades, resources can become obsolete (removed features, renamed resources, consolidated configurations). Simply removing YAML from the assets directory leaves orphaned objects in the cluster. These features provide explicit, safe mechanisms for lifecycle management.

Tombstoning

Purpose

Tombstoning enables explicit deletion of legacy resources with built-in safety checks. When a feature is removed or a resource is renamed, the old resource is moved to the tombstones directory to be deleted during the next reconciliation.

Directory Structure

/assets
  ├── active/          # Current managed resources (Apply - Desired State)
  │   ├── descheduler/
  │   ├── hco/
  │   ├── operators/
  │   └── metadata.yaml
  └── tombstones/      # Obsolete resources (Delete - Legacy)
      └── v1.1-cleanup/  # Optional organizational subfolders

/test
  └── crds/            # CRDs for testing (envtest/Kind) - not deployed by autopilot

Tombstone File Format

Tombstone files are minimal YAML manifests with a required safety label:

apiVersion: v1
kind: ConfigMap
metadata:
  name: obsolete-tuning-config
  namespace: openshift-cnv
  labels:
    platform.kubevirt.io/managed-by: virt-platform-autopilot  # REQUIRED

Required fields:

  • apiVersion
  • kind
  • metadata.name
  • metadata.labels["platform.kubevirt.io/managed-by"] = "virt-platform-autopilot"

Optional field:

  • metadata.namespace (omit for cluster-scoped resources)

Safety Mechanism

The tombstoning system includes multiple safety checks:

  1. Label verification: Resources are only deleted if they have the exact label platform.kubevirt.io/managed-by=virt-platform-autopilot
  2. Load-time validation: Tombstone files are validated when loaded - missing labels cause startup failure
  3. Best-effort execution: If one tombstone fails to delete, others are still processed
  4. Idempotency: Already-deleted resources are silently skipped (no error)

Workflow

Creating a tombstone:

  1. Identify obsolete resource (e.g., assets/active/descheduler/old-config.yaml)
  2. Move file to tombstones directory:
    git mv assets/active/descheduler/old-config.yaml assets/tombstones/v1.1-cleanup/
  3. Verify the file contains the required platform.kubevirt.io/managed-by label
  4. Commit and release

On operator upgrade:

  1. Operator loads tombstones from assets/tombstones/
  2. For each tombstone:
    • Check if resource exists in cluster
    • Verify it has the management label
    • Delete if label matches (skip if label missing/incorrect)
  3. Emit events and update metrics

Removing a tombstone (after 2-3 releases):

  1. Confirm resource is deleted from all supported clusters
  2. Remove file from tombstones directory:
    git rm assets/tombstones/v1.1-cleanup/old-config.yaml
  3. Commit

RBAC Automation

The RBAC generator automatically scans the tombstones/ directory and adds the delete verb to the ClusterRole for any resource types found:

make generate-rbac

Generated ClusterRole example:

- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["delete", "create", "get", "list", "patch", "update", "watch"]  # delete added

Observability

Metrics:

kubevirt_autopilot_tombstone_status{kind, name, namespace}
# Values:
#  1 = Resource still exists (not yet deleted)
#  0 = Resource deleted successfully
# -1 = Deletion error
# -2 = Skipped (label mismatch - safety check triggered)

Events:

  • TombstoneDeleted (Normal): Resource successfully deleted
  • TombstoneFailed (Warning): Deletion failed (check logs, RBAC, finalizers)
  • TombstoneSkipped (Warning): Label mismatch - resource not managed by autopilot

Alert:

- alert: VirtPlatformTombstoneStuck
  expr: kubevirt_autopilot_tombstone_status < 0
  for: 30m

See runbook: docs/runbooks/VirtPlatformTombstoneStuck.md

Root Exclusion (Day 0 Prevention)

Purpose

Root exclusion prevents specific resources from being created in the first place. This is useful for:

  • Disabling features not relevant to the deployment
  • Preventing resource creation in environments where they would fail
  • Temporary workarounds for known issues
  • Excluding groups of related resources using wildcards

Annotation Format

Set the platform.kubevirt.io/disabled-resources annotation on the HyperConverged CR using YAML syntax:

apiVersion: hco.kubevirt.io/v1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
  annotations:
    platform.kubevirt.io/disabled-resources: |
      - kind: KubeDescheduler  # Cluster-scoped resource
        name: cluster

      - kind: ConfigMap
        namespace: openshift-cnv
        name: virt-tuning-*      # Wildcard for multiple configs

      - kind: Service
        namespace: prod-*        # Namespace wildcard
        name: metrics

      - kind: Secret             # Omit namespace = all namespaces
        name: credentials-*

YAML Structure:

  • Array of exclusion rules
  • Each rule requires:
    • kind: Resource kind (case-sensitive, e.g., "ConfigMap")
    • name: Resource name (supports wildcards with *)
  • Optional field:
    • namespace: Target namespace (supports wildcards, omit to match all namespaces)

Wildcard Support:

  • * matches any sequence of characters
  • Examples: virt-*, *-config, prod-*
  • Glob pattern semantics (uses filepath.Match)

Namespace Matching:

  • Specify namespace for exact or wildcard namespace matching
  • Omit namespace field to match resources in any namespace (including cluster-scoped)
  • Empty namespace in rule = matches all namespaces

Implementation

  1. Operator parses the annotation as YAML on each reconciliation
  2. Invalid YAML logs an error and continues without exclusions (fail-open)
  3. After rendering assets, filters out excluded resources in-memory using pattern matching
  4. Excluded resources are never applied (ServerSideApply is never called)
  5. Logs each skipped resource for transparency

Example log:

Skipping resource due to Root Exclusion kind=ConfigMap namespace=openshift-cnv name=virt-handler annotation=platform.kubevirt.io/disabled-resources

Use Cases

Disable KubeDescheduler (cluster-scoped):

annotations:
  platform.kubevirt.io/disabled-resources: |
    - kind: KubeDescheduler
      name: cluster

Disable swap on specific clusters:

annotations:
  platform.kubevirt.io/disabled-resources: |
    - kind: MachineConfig
      name: 50-swap-enable

Disable all virt tuning configs in openshift-cnv namespace:

annotations:
  platform.kubevirt.io/disabled-resources: |
    - kind: ConfigMap
      namespace: openshift-cnv
      name: virt-*

Disable metrics service in all prod namespaces:

annotations:
  platform.kubevirt.io/disabled-resources: |
    - kind: Service
      namespace: prod-*
      name: metrics

Disable specific secret across all namespaces:

annotations:
  platform.kubevirt.io/disabled-resources: |
    - kind: Secret
      name: credentials-db

Multiple exclusions:

annotations:
  platform.kubevirt.io/disabled-resources: |
    - kind: KubeDescheduler
      name: cluster
    - kind: ConfigMap
      namespace: openshift-cnv
      name: virt-*
    - kind: PersesDataSource
      namespace: openshift-cnv
      name: virt-metrics

Comparison with mode: unmanaged

Feature Root Exclusion mode: unmanaged
Scope Specific resources (Kind/Namespace/Name + wildcards) Individual resource (annotation per object)
When Day 0 (prevents creation) Day 1+ (stops reconciliation)
RBAC No impact (resource never created) Full RBAC still required
Wildcards Supported (name and namespace) Not applicable
Namespace filtering Supported Not applicable
Use case Disable features/patterns cluster-wide Opt out of management per resource

When to use Root Exclusion:

  • Cluster-wide feature disablement
  • Resources that should never be created
  • Temporary workarounds before feature flag available
  • Pattern-based exclusions (e.g., all virt-* configs)
  • Namespace-specific exclusions

When to use mode: unmanaged:

  • Per-resource customization
  • Gradual migration to external management
  • Temporary user overrides

Features

  • Case-sensitive kind: ConfigMapconfigmap
  • Wildcard support: Use * in name or namespace fields
  • Namespace filtering: Exclude resources in specific namespaces or namespace patterns
  • Any-namespace matching: Omit namespace field to match resources in all namespaces
  • Error handling: Invalid YAML logs error but continues (fail-open)
  • Pattern validation: Invalid glob patterns are skipped gracefully

Migration Note

This is a breaking change from the previous comma-separated format ("Kind/Name, Kind/Name"). The old format is no longer supported. Update your HyperConverged annotations to use the new YAML syntax.

Before (old format - no longer supported):

platform.kubevirt.io/disabled-resources: "KubeDescheduler/cluster, MachineConfig/50-swap-enable"

After (new format):

platform.kubevirt.io/disabled-resources: |
  - kind: KubeDescheduler
    name: cluster
  - kind: MachineConfig
    name: 50-swap-enable

Troubleshooting

Tombstone Not Deleted

  1. Check if resource exists:

    kubectl get <kind> <name> -n <namespace>
  2. Verify label:

    kubectl get <kind> <name> -n <namespace> -o jsonpath='{.metadata.labels}'

    Should contain: "platform.kubevirt.io/managed-by": "virt-platform-autopilot"

  3. Check for finalizers:

    kubectl get <kind> <name> -n <namespace> -o jsonpath='{.metadata.finalizers}'

    If finalizers present, they may block deletion. Check operator logs for the finalizer owner.

  4. Check RBAC:

    kubectl auth can-i delete <resource> --as system:serviceaccount:openshift-cnv:virt-platform-autopilot
  5. Check events:

    kubectl get events -n openshift-cnv --field-selector involvedObject.kind=HyperConverged

Resource Created Despite Root Exclusion

  1. Verify annotation syntax:

    kubectl get hco kubevirt-hyperconverged -n openshift-cnv \
      -o jsonpath='{.metadata.annotations.platform\.kubevirt\.io/disabled-resources}'
  2. Check operator logs for "Skipping resource due to Root Exclusion" message

  3. Verify Kind/Name matches exactly (case-sensitive)

  4. Check if resource was created before annotation was added

    • Root exclusion only prevents creation, doesn't delete existing resources
    • Use tombstoning to remove existing resources

Best Practices

Tombstoning

  1. Lifecycle: Keep tombstones for 2-3 releases, then remove
  2. Organization: Use subdirectories like v1.1-cleanup/ for clarity
  3. Testing: Test tombstone deletion in staging before production release
  4. Monitoring: Set up alerts for stuck tombstones
  5. Documentation: Document why resources were tombstoned (commit message)

Root Exclusion

  1. Temporary: Use root exclusion as a temporary measure, not permanent solution
  2. Documentation: Document why resources are excluded
  3. Alternatives: Consider if feature gates or component-level disable is better
  4. Migration path: Plan to remove exclusions when proper fix is available

General

  1. Prefer feature flags: Use metadata.yaml conditions when possible
  2. Gradual rollout: Test lifecycle changes in dev/staging first
  3. Monitor metrics: Watch tombstone_status and compliance_status metrics
  4. Clean up: Remove old tombstones and unused exclusions regularly

Examples

Example 1: Rename a ConfigMap

Before (v1.0):

assets/active/observability/metrics-config.yaml

After (v1.1):

  1. Create new resource:

    assets/active/observability/prometheus-config.yaml
    
  2. Move old resource to tombstones:

    git mv assets/active/observability/metrics-config.yaml \
           assets/tombstones/v1.1-cleanup/metrics-config.yaml
  3. Verify label in tombstone file:

    labels:
      platform.kubevirt.io/managed-by: virt-platform-autopilot
  4. Release v1.1 - operator will:

    • Create prometheus-config
    • Delete metrics-config (tombstone)

Example 2: Disable Feature Temporarily

User wants to disable KubeDescheduler due to known issue:

kubectl annotate hco kubevirt-hyperconverged -n openshift-cnv \
  platform.kubevirt.io/disabled-resources='- kind: KubeDescheduler
  name: cluster'

Or using kubectl patch:

kubectl patch hco kubevirt-hyperconverged -n openshift-cnv --type=merge -p '
metadata:
  annotations:
    platform.kubevirt.io/disabled-resources: |
      - kind: KubeDescheduler
        name: cluster
'

Operator will skip creating/updating KubeDescheduler.

To re-enable:

kubectl annotate hco kubevirt-hyperconverged -n openshift-cnv \
  platform.kubevirt.io/disabled-resources-

Example 3: Remove Obsolete Feature

Removing MTV (Migration Toolkit) integration:

  1. Create tombstone:

    # assets/tombstones/v1.2-cleanup/mtv-operator.yaml
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: mtv-operator
      namespace: openshift-mtv
      labels:
        platform.kubevirt.io/managed-by: virt-platform-autopilot
  2. Remove from active:

    git rm assets/active/operators/mtv.yaml.tpl
  3. Update metadata.yaml to remove MTV asset entry

  4. Regenerate RBAC:

    make generate-rbac  # Adds delete verb for Subscription
  5. Release - operator deletes MTV subscription

  6. After 2 releases, clean up tombstone:

    git rm assets/tombstones/v1.2-cleanup/mtv-operator.yaml

Migration Guide (for Existing Deployments)

If you have an existing deployment and want to adopt lifecycle management:

  1. Audit current resources: Identify which resources are managed by autopilot

    kubectl get all,cm,secrets -A -l platform.kubevirt.io/managed-by=virt-platform-autopilot
  2. Ensure labels: All managed resources should have the management label

    • Newer versions auto-apply labels
    • Older resources may need manual labeling
  3. Plan tombstones: For any resources to be removed, create tombstones with proper labels

  4. Test in staging: Validate tombstone deletion in non-production environment

  5. Monitor: Watch metrics and events during rollout

References

  • Specification: /claude_assets/reclaiming_leftovers.md
  • Runbook: docs/runbooks/VirtPlatformTombstoneStuck.md
  • RBAC generation: cmd/rbac-gen/main.go
  • Implementation: pkg/engine/tombstone.go, pkg/engine/exclusion.go