Skip to content

Documentation of the controller states/conditions and their interaction#264

Open
fwiesel wants to merge 1 commit intomainfrom
documentation
Open

Documentation of the controller states/conditions and their interaction#264
fwiesel wants to merge 1 commit intomainfrom
documentation

Conversation

@fwiesel
Copy link
Contributor

@fwiesel fwiesel commented Mar 12, 2026

This is a documentation of the various controllers and their interactions.

Summary by CodeRabbit

  • Documentation
    • Added comprehensive guide for OpenStack Hypervisor Operator controllers, including controller responsibilities, interactions, condition semantics, state diagrams, workflow descriptions, coordination rules, and testing guidance.

@fwiesel fwiesel marked this pull request as ready for review March 12, 2026 13:30
@notandy notandy self-requested a review March 12, 2026 17:38
Copy link
Contributor

@notandy notandy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file name is incorrect, since it includes way more than just an diagram.
Please consider splitting parts.
Also please put any docs into a docs/ directory, like it's common with Go project layouts

6. **DecommissionController** (`decomission_controller.go`)
- Handles hypervisor offboarding
- Cleans up OpenStack service and resource provider
- Runs during node termination
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which a litte bit more information. like how it monitors node termination and what are the steps.


### Auxiliary Controllers

8. **HypervisorInstanceHaController** (`hypervisor_instance_ha_controller.go`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this controller should not be in here

- Creates cert-manager `Certificate` CRs for libvirt TLS
- Generates RSA 4096-bit certificates covering all node IPs/DNS names

## Spec Fields
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should generate the spec documentation with go-docs directly from the spec. This is doomed to be outdated quickly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I haven't come around to that yet.


11. **NodeCertificateController** (`node_certificate_controller.go`)
- Creates cert-manager `Certificate` CRs for libvirt TLS
- Generates RSA 4096-bit certificates covering all node IPs/DNS names
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is hallucination

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not follow, which part?

Here is the RSA 4096-bit, and here it collects all IPs and DNS names of the node.
And the certificate is a cert-manager Certificate CR for libvirt TLS.

Copy link
Contributor

@notandy notandy Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The controller doesn't generate certificates, it requests them from cert-manager

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By creating the cert-manager Certificate Custom-Resource (CR).

| `Spec.Maintenance` | string | HypervisorMaintenanceController | Triggers maintenance mode. Values: `""` (none), `manual`, `auto`, `ha`, `termination`. |
| `Spec.HighAvailability` | bool | HypervisorInstanceHaController | Enables or disables instance HA registration with `kvm-ha-service`. |
| `Spec.Aggregates` | []string | AggregatesController | OpenStack aggregates to apply once onboarding reaches the Handover phase. Set from the `nova.openstack.cloud.sap/aggregates` node annotation. |
| `Spec.CustomTraits` | []string | TraitsController | Placement API traits to sync. Set from the `nova.openstack.cloud.sap/custom-traits` node annotation. |
Copy link
Member

@PhilippMatthes PhilippMatthes Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not document the spec/status fields directly in the code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


| Condition | Description | Set By | Required for `Ready=True` |
|-----------|-------------|--------|---------------------------|
| `Ready` | Overall readiness of the hypervisor | HypervisorController, OnboardingController, HypervisorMaintenanceController, DecommissionController | *(this is the readiness condition itself)* |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should document what readiness means in this case. Is it equivalent to "this node is schedulable"? So, do I need to look at other status conditions to determine it, or is this a summary condition that is always False when one of the required sub-conditions is unready. For example, Should this always be "False" when the hypervisor carries a disabled status condition set to "True"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or more precisely, we should make it clear which sub-conditions lead to Ready being "False".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is a common request. I'll have to integrate that.

| `Offboarded` | Cleanup completed | DecommissionController | Absent — when `True`, Ready=False (reason: Offboarded); this is a terminal state |
| `AggregatesUpdated` | OpenStack aggregates synced | AggregatesController | `True` — required for onboarding to complete (which sets Ready=True); not directly re-checked at runtime |
| `TraitsUpdated` | Placement API traits synced | TraitsController | `True` — required for onboarding to complete (which sets Ready=True); not directly re-checked at runtime |
| `HaEnabled` | Instance HA enabled/disabled state | HypervisorInstanceHaController | `True` — required for onboarding to complete when `Spec.HighAvailability=true`; `False` required before GardenerNodeLifecycleController allows node deletion during offboarding |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say, usually it is not important to know which controller owns a status condition, as long as it is clear what the status condition means and why it was triggered, so consumers of this api can react appropriately and don't disagree in the resulting action.

- `Maintenance` - In maintenance mode
- `Evicted` - VMs evacuated
- `Evicting` - VM evacuation in progress
- `Onboarding` - Initial setup in progress
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the pattern "Evicted" and "Evicting". In this way we can always tell if the process has started and if it was completed. Based on this, the controller can also tell if it needs to restart the process, monitor an ongoing process, or do anything else.

Should we continue that pattern for Onboarding, Decommissioning, and Offboarding as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it depends on where you write the things.
"Evicting" and "Evicted" are used in the Condition of the type Ready as a reason on why the node is not ready.
The steady state after Onboarding is Ready, I am not sure changing it to Onboarded is helping there.
Using it as the Reason for of the long-running conditions would introduce a different pattern than the norm Running/Failed/Succeeded.

But there are doubts that it make sense to use long-running conditions at all. (PR.


**Note:** Instance HA is disabled during `manual`/`auto`/`termination` maintenance once eviction completes and `Evicting=False` is set. After maintenance recovery (service re-enabled, `Evicting` condition removed), `HypervisorInstanceHaController` re-enables HA. For `ha` maintenance mode (no Eviction CR created), the `Evicting` condition is never set to `False`, so HA is not disabled by the controller — the HA service handles evacuation directly.

### 3. Termination Flow (Ready -> Offboarded)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to document this here? To understand that logic, I would read the go code and hope these thoughts are documented there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the problem with the go-code now is a bit that the code is not linearly readable anymore.
It greatly depends on the interplay of various controllers. Something, that I recognize as a downside.

My intend with the document here is, that one can identify in which controller the whole thing might be stuck, without having to understand the whole code.

This is a documentation of the various controllers and their
interactions.
@coderabbitai
Copy link

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Walkthrough

A new comprehensive documentation file for the OpenStack Hypervisor Operator detailing controllers, their responsibilities, Kubernetes conditions, state diagrams, coordination rules, and operational flows.

Changes

Cohort / File(s) Summary
Documentation
docs/controllers.md
New guide documenting controller catalog, spec fields, conditions semantics, state diagrams, detailed flow descriptions for creation/onboarding/maintenance/termination, controller coordination rules, and error handling guidance.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Controllers aligned with care,
Conditions dance through the air,
States and flows now crystal clear,
Documentation holds them dear!
Operators hop with delight,
The Hypervisor's path shines bright.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change—a documentation file detailing controller states, conditions, and their interactions, which matches the file content and PR objectives.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch documentation
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can approve the review once all CodeRabbit's comments are resolved.

Enable the reviews.request_changes_workflow setting to automatically approve the review once all CodeRabbit's comments are resolved.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
docs/controllers.md (1)

95-96: ⚠️ Potential issue | 🟡 Minor

Clarify certificate responsibility vs cert-manager issuance.

This wording can be read as if this controller directly generates certificates. It creates a cert-manager Certificate CR and requests issuance with configured key parameters.

Suggested patch
-    - Creates cert-manager `Certificate` CRs for libvirt TLS
-    - Generates RSA 4096-bit certificates covering all node IPs/DNS names
+    - Creates cert-manager `Certificate` CRs for libvirt TLS
+    - Requests issuance with RSA 4096-bit key settings and SANs covering all node IPs/DNS names
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/controllers.md` around lines 95 - 96, Update the wording to make clear
the controller creates cert-manager Certificate CRs and requests issuance rather
than generating keys/certificates itself: change the lines describing behavior
to say the controller "creates a cert-manager Certificate CR with configured key
parameters (e.g., RSA 4096) and requests cert-manager to issue TLS certificates
covering node IPs/DNS names," referencing the controller and the cert-manager
Certificate CR to avoid implying direct certificate generation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/controllers.md`:
- Line 61: Update the misspelled controller filename `decomission_controller.go`
to `decommission_controller.go` and update any references to it (e.g., in docs
or imports referencing DecommissionController) so they point to the corrected
filename; ensure the exported type/name `DecommissionController` remains
unchanged and that any import paths or links in docs/controllers.md are updated
to the new filename to avoid broken references.
- Line 58: Fix the typo in the controller responsibility sentence: replace the
misspelled resource name "Eviciton" with the correct "Eviction" in the line that
reads "Executes VM migration off the hypervisor by reconciling on `Eviciton`
CRDs" so the docs reference the correct CRD name (`Eviction`) and avoid
confusion.
- Line 670: Remove the stray trailing quote from the kubectl command and update
or remove the deprecated annotation usage: change the command 'kubectl annotate
node <node-name>  node.machine.sapcloud.io/trigger-deletion-by-mcm=true'' to a
valid command without the trailing quote, and then either replace or delete the
guidance that instructs using the
node.machine.sapcloud.io/trigger-deletion-by-mcm annotation (which is deprecated
per Gardener MCM); if replacing, add the current supported mechanism for
triggering deletion or a clear deprecation note referencing
node.machine.sapcloud.io/trigger-deletion-by-mcm so readers know it is no longer
supported.

---

Duplicate comments:
In `@docs/controllers.md`:
- Around line 95-96: Update the wording to make clear the controller creates
cert-manager Certificate CRs and requests issuance rather than generating
keys/certificates itself: change the lines describing behavior to say the
controller "creates a cert-manager Certificate CR with configured key parameters
(e.g., RSA 4096) and requests cert-manager to issue TLS certificates covering
node IPs/DNS names," referencing the controller and the cert-manager Certificate
CR to avoid implying direct certificate generation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6a4078ae-c0c2-4d3f-9311-df835b18f7b8

📥 Commits

Reviewing files that changed from the base of the PR and between 9f13080 and d6c1049.

📒 Files selected for processing (1)
  • docs/controllers.md

- `"ha"`: Disables compute service but does **not** create an Eviction CR (the HA service handles evacuation)

- **EvictionController** (`eviction_controller.go`)
- Executes VM migration off the hypervisor by reconciling on `Eviciton` CRDs
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix typo in resource name (Eviction).

Eviciton is misspelled and appears in a controller responsibility sentence. Please correct it to avoid confusion in CRD naming.

Suggested patch
-  - Executes VM migration off the hypervisor by reconciling on `Eviciton` CRDs
+  - Executes VM migration off the hypervisor by reconciling on `Eviction` CRDs
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- Executes VM migration off the hypervisor by reconciling on `Eviciton` CRDs
- Executes VM migration off the hypervisor by reconciling on `Eviction` CRDs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/controllers.md` at line 58, Fix the typo in the controller
responsibility sentence: replace the misspelled resource name "Eviciton" with
the correct "Eviction" in the line that reads "Executes VM migration off the
hypervisor by reconciling on `Eviciton` CRDs" so the docs reference the correct
CRD name (`Eviction`) and avoid confusion.

- Executes VM migration off the hypervisor by reconciling on `Eviciton` CRDs
- Manages live and cold migration of instances

- **DecommissionController** (`decomission_controller.go`)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Controller filename appears misspelled.

decomission_controller.go likely should be decommission_controller.go. The current spelling may send readers to a non-existing file.

Suggested patch
-- **DecommissionController** (`decomission_controller.go`)
+- **DecommissionController** (`decommission_controller.go`)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/controllers.md` at line 61, Update the misspelled controller filename
`decomission_controller.go` to `decommission_controller.go` and update any
references to it (e.g., in docs or imports referencing DecommissionController)
so they point to the corrected filename; ensure the exported type/name
`DecommissionController` remains unchanged and that any import paths or links in
docs/controllers.md are updated to the new filename to avoid broken references.


```bash
# Mark node for termination (depends on MCM)
kubectl annotate node <node-name> node.machine.sapcloud.io/trigger-deletion-by-mcm=true'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

What is the correct kubectl annotate syntax to set node annotation node.machine.sapcloud.io/trigger-deletion-by-mcm=true on a node?

💡 Result:

kubectl annotate node <node-name> node.machine.sapcloud.io/trigger-deletion-by-mcm="true" --overwrite

This uses the kubectl annotate TYPE NAME KEY=VALUE form (and --overwrite if the annotation may already exist). [1]

Note: Gardener’s machine-controller-manager documents that node.machine.sapcloud.io/trigger-deletion-by-mcm is no longer supported on Node objects. [2]

Sources: [1] Kubernetes kubectl annotate reference, [2] Gardener MCM FAQ


🏁 Script executed:

# First, check if the file exists and get its total line count
wc -l docs/controllers.md

# Read around line 670 to see the context and exact issue
sed -n '665,675p' docs/controllers.md | cat -n

Repository: cobaltcore-dev/openstack-hypervisor-operator

Length of output: 563


Remove trailing quote and note that the annotation is deprecated.

Line 670 has a trailing quote making the command invalid, but more importantly, the node.machine.sapcloud.io/trigger-deletion-by-mcm annotation is no longer supported on Node objects per Gardener MCM. This documentation should either be updated to use a current mechanism or removed entirely.

Suggested patch
-kubectl annotate node <node-name>  node.machine.sapcloud.io/trigger-deletion-by-mcm=true'
+kubectl annotate node <node-name> node.machine.sapcloud.io/trigger-deletion-by-mcm="true" --overwrite
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
kubectl annotate node <node-name> node.machine.sapcloud.io/trigger-deletion-by-mcm=true'
kubectl annotate node <node-name> node.machine.sapcloud.io/trigger-deletion-by-mcm="true" --overwrite
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/controllers.md` at line 670, Remove the stray trailing quote from the
kubectl command and update or remove the deprecated annotation usage: change the
command 'kubectl annotate node <node-name> 
node.machine.sapcloud.io/trigger-deletion-by-mcm=true'' to a valid command
without the trailing quote, and then either replace or delete the guidance that
instructs using the node.machine.sapcloud.io/trigger-deletion-by-mcm annotation
(which is deprecated per Gardener MCM); if replacing, add the current supported
mechanism for triggering deletion or a clear deprecation note referencing
node.machine.sapcloud.io/trigger-deletion-by-mcm so readers know it is no longer
supported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants