Documentation of the controller states/conditions and their interaction#264
Documentation of the controller states/conditions and their interaction#264
Conversation
notandy
left a comment
There was a problem hiding this comment.
The file name is incorrect, since it includes way more than just an diagram.
Please consider splitting parts.
Also please put any docs into a docs/ directory, like it's common with Go project layouts
| 6. **DecommissionController** (`decomission_controller.go`) | ||
| - Handles hypervisor offboarding | ||
| - Cleans up OpenStack service and resource provider | ||
| - Runs during node termination |
There was a problem hiding this comment.
Which a litte bit more information. like how it monitors node termination and what are the steps.
controller-state-diagram.md
Outdated
|
|
||
| ### Auxiliary Controllers | ||
|
|
||
| 8. **HypervisorInstanceHaController** (`hypervisor_instance_ha_controller.go`) |
There was a problem hiding this comment.
this controller should not be in here
| - Creates cert-manager `Certificate` CRs for libvirt TLS | ||
| - Generates RSA 4096-bit certificates covering all node IPs/DNS names | ||
|
|
||
| ## Spec Fields |
There was a problem hiding this comment.
We should generate the spec documentation with go-docs directly from the spec. This is doomed to be outdated quickly
There was a problem hiding this comment.
Good idea, I haven't come around to that yet.
|
|
||
| 11. **NodeCertificateController** (`node_certificate_controller.go`) | ||
| - Creates cert-manager `Certificate` CRs for libvirt TLS | ||
| - Generates RSA 4096-bit certificates covering all node IPs/DNS names |
There was a problem hiding this comment.
I do not follow, which part?
Here is the RSA 4096-bit, and here it collects all IPs and DNS names of the node.
And the certificate is a cert-manager Certificate CR for libvirt TLS.
There was a problem hiding this comment.
The controller doesn't generate certificates, it requests them from cert-manager
There was a problem hiding this comment.
By creating the cert-manager Certificate Custom-Resource (CR).
| | `Spec.Maintenance` | string | HypervisorMaintenanceController | Triggers maintenance mode. Values: `""` (none), `manual`, `auto`, `ha`, `termination`. | | ||
| | `Spec.HighAvailability` | bool | HypervisorInstanceHaController | Enables or disables instance HA registration with `kvm-ha-service`. | | ||
| | `Spec.Aggregates` | []string | AggregatesController | OpenStack aggregates to apply once onboarding reaches the Handover phase. Set from the `nova.openstack.cloud.sap/aggregates` node annotation. | | ||
| | `Spec.CustomTraits` | []string | TraitsController | Placement API traits to sync. Set from the `nova.openstack.cloud.sap/custom-traits` node annotation. | |
There was a problem hiding this comment.
Why not document the spec/status fields directly in the code?
|
|
||
| | Condition | Description | Set By | Required for `Ready=True` | | ||
| |-----------|-------------|--------|---------------------------| | ||
| | `Ready` | Overall readiness of the hypervisor | HypervisorController, OnboardingController, HypervisorMaintenanceController, DecommissionController | *(this is the readiness condition itself)* | |
There was a problem hiding this comment.
We should document what readiness means in this case. Is it equivalent to "this node is schedulable"? So, do I need to look at other status conditions to determine it, or is this a summary condition that is always False when one of the required sub-conditions is unready. For example, Should this always be "False" when the hypervisor carries a disabled status condition set to "True"?
There was a problem hiding this comment.
Or more precisely, we should make it clear which sub-conditions lead to Ready being "False".
There was a problem hiding this comment.
Yeah, that is a common request. I'll have to integrate that.
| | `Offboarded` | Cleanup completed | DecommissionController | Absent — when `True`, Ready=False (reason: Offboarded); this is a terminal state | | ||
| | `AggregatesUpdated` | OpenStack aggregates synced | AggregatesController | `True` — required for onboarding to complete (which sets Ready=True); not directly re-checked at runtime | | ||
| | `TraitsUpdated` | Placement API traits synced | TraitsController | `True` — required for onboarding to complete (which sets Ready=True); not directly re-checked at runtime | | ||
| | `HaEnabled` | Instance HA enabled/disabled state | HypervisorInstanceHaController | `True` — required for onboarding to complete when `Spec.HighAvailability=true`; `False` required before GardenerNodeLifecycleController allows node deletion during offboarding | |
There was a problem hiding this comment.
I would say, usually it is not important to know which controller owns a status condition, as long as it is clear what the status condition means and why it was triggered, so consumers of this api can react appropriately and don't disagree in the resulting action.
controller-state-diagram.md
Outdated
| - `Maintenance` - In maintenance mode | ||
| - `Evicted` - VMs evacuated | ||
| - `Evicting` - VM evacuation in progress | ||
| - `Onboarding` - Initial setup in progress |
There was a problem hiding this comment.
I like the pattern "Evicted" and "Evicting". In this way we can always tell if the process has started and if it was completed. Based on this, the controller can also tell if it needs to restart the process, monitor an ongoing process, or do anything else.
Should we continue that pattern for Onboarding, Decommissioning, and Offboarding as well?
There was a problem hiding this comment.
Well, it depends on where you write the things.
"Evicting" and "Evicted" are used in the Condition of the type Ready as a reason on why the node is not ready.
The steady state after Onboarding is Ready, I am not sure changing it to Onboarded is helping there.
Using it as the Reason for of the long-running conditions would introduce a different pattern than the norm Running/Failed/Succeeded.
But there are doubts that it make sense to use long-running conditions at all. (PR.
|
|
||
| **Note:** Instance HA is disabled during `manual`/`auto`/`termination` maintenance once eviction completes and `Evicting=False` is set. After maintenance recovery (service re-enabled, `Evicting` condition removed), `HypervisorInstanceHaController` re-enables HA. For `ha` maintenance mode (no Eviction CR created), the `Evicting` condition is never set to `False`, so HA is not disabled by the controller — the HA service handles evacuation directly. | ||
|
|
||
| ### 3. Termination Flow (Ready -> Offboarded) |
There was a problem hiding this comment.
Does it make sense to document this here? To understand that logic, I would read the go code and hope these thoughts are documented there.
There was a problem hiding this comment.
Well, the problem with the go-code now is a bit that the code is not linearly readable anymore.
It greatly depends on the interplay of various controllers. Something, that I recognize as a downside.
My intend with the document here is, that one can identify in which controller the whole thing might be stuck, without having to understand the whole code.
This is a documentation of the various controllers and their interactions.
📝 WalkthroughWalkthroughA new comprehensive documentation file for the OpenStack Hypervisor Operator detailing controllers, their responsibilities, Kubernetes conditions, state diagrams, coordination rules, and operational flows. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment Tip CodeRabbit can approve the review once all CodeRabbit's comments are resolved.Enable the |
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
docs/controllers.md (1)
95-96:⚠️ Potential issue | 🟡 MinorClarify certificate responsibility vs cert-manager issuance.
This wording can be read as if this controller directly generates certificates. It creates a cert-manager
CertificateCR and requests issuance with configured key parameters.Suggested patch
- - Creates cert-manager `Certificate` CRs for libvirt TLS - - Generates RSA 4096-bit certificates covering all node IPs/DNS names + - Creates cert-manager `Certificate` CRs for libvirt TLS + - Requests issuance with RSA 4096-bit key settings and SANs covering all node IPs/DNS names🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/controllers.md` around lines 95 - 96, Update the wording to make clear the controller creates cert-manager Certificate CRs and requests issuance rather than generating keys/certificates itself: change the lines describing behavior to say the controller "creates a cert-manager Certificate CR with configured key parameters (e.g., RSA 4096) and requests cert-manager to issue TLS certificates covering node IPs/DNS names," referencing the controller and the cert-manager Certificate CR to avoid implying direct certificate generation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/controllers.md`:
- Line 61: Update the misspelled controller filename `decomission_controller.go`
to `decommission_controller.go` and update any references to it (e.g., in docs
or imports referencing DecommissionController) so they point to the corrected
filename; ensure the exported type/name `DecommissionController` remains
unchanged and that any import paths or links in docs/controllers.md are updated
to the new filename to avoid broken references.
- Line 58: Fix the typo in the controller responsibility sentence: replace the
misspelled resource name "Eviciton" with the correct "Eviction" in the line that
reads "Executes VM migration off the hypervisor by reconciling on `Eviciton`
CRDs" so the docs reference the correct CRD name (`Eviction`) and avoid
confusion.
- Line 670: Remove the stray trailing quote from the kubectl command and update
or remove the deprecated annotation usage: change the command 'kubectl annotate
node <node-name> node.machine.sapcloud.io/trigger-deletion-by-mcm=true'' to a
valid command without the trailing quote, and then either replace or delete the
guidance that instructs using the
node.machine.sapcloud.io/trigger-deletion-by-mcm annotation (which is deprecated
per Gardener MCM); if replacing, add the current supported mechanism for
triggering deletion or a clear deprecation note referencing
node.machine.sapcloud.io/trigger-deletion-by-mcm so readers know it is no longer
supported.
---
Duplicate comments:
In `@docs/controllers.md`:
- Around line 95-96: Update the wording to make clear the controller creates
cert-manager Certificate CRs and requests issuance rather than generating
keys/certificates itself: change the lines describing behavior to say the
controller "creates a cert-manager Certificate CR with configured key parameters
(e.g., RSA 4096) and requests cert-manager to issue TLS certificates covering
node IPs/DNS names," referencing the controller and the cert-manager Certificate
CR to avoid implying direct certificate generation.
| - `"ha"`: Disables compute service but does **not** create an Eviction CR (the HA service handles evacuation) | ||
|
|
||
| - **EvictionController** (`eviction_controller.go`) | ||
| - Executes VM migration off the hypervisor by reconciling on `Eviciton` CRDs |
There was a problem hiding this comment.
Fix typo in resource name (Eviction).
Eviciton is misspelled and appears in a controller responsibility sentence. Please correct it to avoid confusion in CRD naming.
Suggested patch
- - Executes VM migration off the hypervisor by reconciling on `Eviciton` CRDs
+ - Executes VM migration off the hypervisor by reconciling on `Eviction` CRDs📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - Executes VM migration off the hypervisor by reconciling on `Eviciton` CRDs | |
| - Executes VM migration off the hypervisor by reconciling on `Eviction` CRDs |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/controllers.md` at line 58, Fix the typo in the controller
responsibility sentence: replace the misspelled resource name "Eviciton" with
the correct "Eviction" in the line that reads "Executes VM migration off the
hypervisor by reconciling on `Eviciton` CRDs" so the docs reference the correct
CRD name (`Eviction`) and avoid confusion.
| - Executes VM migration off the hypervisor by reconciling on `Eviciton` CRDs | ||
| - Manages live and cold migration of instances | ||
|
|
||
| - **DecommissionController** (`decomission_controller.go`) |
There was a problem hiding this comment.
Controller filename appears misspelled.
decomission_controller.go likely should be decommission_controller.go. The current spelling may send readers to a non-existing file.
Suggested patch
-- **DecommissionController** (`decomission_controller.go`)
+- **DecommissionController** (`decommission_controller.go`)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/controllers.md` at line 61, Update the misspelled controller filename
`decomission_controller.go` to `decommission_controller.go` and update any
references to it (e.g., in docs or imports referencing DecommissionController)
so they point to the corrected filename; ensure the exported type/name
`DecommissionController` remains unchanged and that any import paths or links in
docs/controllers.md are updated to the new filename to avoid broken references.
|
|
||
| ```bash | ||
| # Mark node for termination (depends on MCM) | ||
| kubectl annotate node <node-name> node.machine.sapcloud.io/trigger-deletion-by-mcm=true' |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
What is the correct kubectl annotate syntax to set node annotation node.machine.sapcloud.io/trigger-deletion-by-mcm=true on a node?
💡 Result:
kubectl annotate node <node-name> node.machine.sapcloud.io/trigger-deletion-by-mcm="true" --overwriteThis uses the kubectl annotate TYPE NAME KEY=VALUE form (and --overwrite if the annotation may already exist). [1]
Note: Gardener’s machine-controller-manager documents that node.machine.sapcloud.io/trigger-deletion-by-mcm is no longer supported on Node objects. [2]
Sources: [1] Kubernetes kubectl annotate reference, [2] Gardener MCM FAQ
🏁 Script executed:
# First, check if the file exists and get its total line count
wc -l docs/controllers.md
# Read around line 670 to see the context and exact issue
sed -n '665,675p' docs/controllers.md | cat -nRepository: cobaltcore-dev/openstack-hypervisor-operator
Length of output: 563
Remove trailing quote and note that the annotation is deprecated.
Line 670 has a trailing quote making the command invalid, but more importantly, the node.machine.sapcloud.io/trigger-deletion-by-mcm annotation is no longer supported on Node objects per Gardener MCM. This documentation should either be updated to use a current mechanism or removed entirely.
Suggested patch
-kubectl annotate node <node-name> node.machine.sapcloud.io/trigger-deletion-by-mcm=true'
+kubectl annotate node <node-name> node.machine.sapcloud.io/trigger-deletion-by-mcm="true" --overwrite📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| kubectl annotate node <node-name> node.machine.sapcloud.io/trigger-deletion-by-mcm=true' | |
| kubectl annotate node <node-name> node.machine.sapcloud.io/trigger-deletion-by-mcm="true" --overwrite |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/controllers.md` at line 670, Remove the stray trailing quote from the
kubectl command and update or remove the deprecated annotation usage: change the
command 'kubectl annotate node <node-name>
node.machine.sapcloud.io/trigger-deletion-by-mcm=true'' to a valid command
without the trailing quote, and then either replace or delete the guidance that
instructs using the node.machine.sapcloud.io/trigger-deletion-by-mcm annotation
(which is deprecated per Gardener MCM); if replacing, add the current supported
mechanism for triggering deletion or a clear deprecation note referencing
node.machine.sapcloud.io/trigger-deletion-by-mcm so readers know it is no longer
supported.
This is a documentation of the various controllers and their interactions.
Summary by CodeRabbit