# Preservation of Machines

<!-- TOC -->

- [Preservation of Machines](#preservation-of-machines)
  - [Objective](#objective)
  - [Proposal](#proposal)
  - [State Diagrams](#state-diagrams)
  - [Use Cases](#use-cases)
  - [Points to Note](#points-to-note)

<!-- /TOC -->

## Objective

Currently, the Machine Controller Manager (MCM) moves Machines with errors to the `Unknown` phase and, after the configured `machineHealthTimeout`, to the `Failed` phase.
`Failed` machines are swiftly moved to the `Terminating` phase, during which the node is drained and the `Machine` object is deleted. This rapid cleanup prevents SREs/operators/support from conducting an analysis on the VM and makes finding the root cause of the failure more difficult.

Moreover, in cases where a node seems healthy but the workloads on it are facing issues, operators need to be able to cordon/drain the node and conduct their analysis without the cluster-autoscaler (CA) scaling down the node.

This document proposes enhancing MCM such that:
* VMs of machines are retained temporarily for analysis.
* There is a configurable limit on the number of machines that can be preserved automatically on failure (auto-preservation).
* There is a configurable limit on the duration for which machines are preserved.
* Users can specify which healthy machines they would like to preserve in case of failure, or for diagnosis in their current state (preventing scale-down by the CA).
* Users can request MCM to release a preserved machine, even before the timeout expires, so that MCM can transition the machine to either the `Running` or the `Terminating` phase, as the case may be.

Related Issue: https://github.com/gardener/machine-controller-manager/issues/1008

## Proposal

To achieve these objectives, the following changes are proposed:
1. Enhance the `machineControllerManager` configuration in the `ShootSpec` to specify the maximum number of machines to be auto-preserved and the duration for which these machines will be preserved (see the first sketch after this list).
   ```yaml
   machineControllerManager:
     autoPreserveFailedMax: 0
     machinePreserveTimeout: 72h
   ```
   * This configuration will be set per worker pool.
   * Since a Gardener worker pool can correspond to `1..N` MachineDeployments depending on the number of zones, `autoPreserveFailedMax` will be distributed across the N MachineDeployments.
   * `autoPreserveFailedMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
     * Example: if `autoPreserveFailedMax` is set to 2 and the worker pool has 2 zones, then at most 1 machine will be preserved per zone.
2. MCM will be modified to include a new sub-phase, `Preserved`, to indicate that the machine has been preserved by MCM (see the second sketch after this list for how this could surface on the `Machine` object).
3. Allow a user/operator to request preservation of a specific machine/node using the annotations `node.machine.sapcloud.io/preserve=now` and `node.machine.sapcloud.io/preserve=when-failed`.
4. When the annotation `node.machine.sapcloud.io/preserve=now` is added to a `Running` machine, the following will take place:
   - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent the CA from scaling it down.
   - `machine.CurrentStatus.PreserveExpiryTime` is set by MCM to `currentTime + machinePreserveTimeout`.
   - The machine's phase is changed to `Running:Preserved`.
   - After the timeout, the annotations `node.machine.sapcloud.io/preserve=now` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted, and `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. The machine phase is changed to `Running`, and the CA may scale down the node.
   - If a machine in `Running:Preserved` fails, it is moved to `Failed:Preserved`.
5. When the annotation `node.machine.sapcloud.io/preserve=when-failed` is added to a `Running` machine and the machine later moves to `Failed`, the following will take place:
   - The machine is drained of all pods except DaemonSet pods.
   - The machine's phase is changed to `Failed:Preserved`.
   - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent the CA from scaling it down.
   - `machine.CurrentStatus.PreserveExpiryTime` is set by MCM to `currentTime + machinePreserveTimeout`.
   - After the timeout, the annotations `node.machine.sapcloud.io/preserve=when-failed` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted, and `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. The phase is changed to `Terminating`.
6. When an un-annotated machine moves to the `Failed` phase and `autoPreserveFailedMax` is not breached:
   - Pods (other than DaemonSet pods) are drained.
   - The machine's phase is changed to `Failed:Preserved`.
   - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent the CA from scaling it down.
   - `machine.CurrentStatus.PreserveExpiryTime` is set by MCM to `currentTime + machinePreserveTimeout`.
   - After the timeout, the annotation `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is deleted, and `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. The phase is changed to `Terminating`.
   - Machines in the `Failed:Preserved` phase count towards enforcing `autoPreserveFailedMax`.
7. If a machine is currently in `Failed:Preserved` and its VM/node is found to be healthy before the timeout, the machine will be moved to `Running:Preserved`. After the timeout, it will be moved to `Running`.
   The rationale behind moving the machine to `Running:Preserved` rather than `Running` is to allow pods to be scheduled onto the healthy node again without the autoscaler scaling it down due to under-utilization.
8. A user/operator can request MCM to stop preserving a machine/node in the `Running:Preserved` or `Failed:Preserved` phase using the annotation `node.machine.sapcloud.io/preserve=false`.
   * MCM will move a machine thus annotated to either the `Running` or the `Terminating` phase, depending on the phase of the machine before it was preserved.
9. Machines of a MachineDeployment in the `Preserved` sub-phase will also be counted towards the replica count and towards the enforcement of the maximum number of machines allowed for the MachineDeployment.
10. MCM will be modified to perform the drain in the `Failed` phase rather than in `Terminating`.
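
A minimal sketch of how the proposed per-pool settings might look in the Shoot spec is shown below. It assumes the proposed fields sit alongside the existing `machineControllerManager` worker settings; the pool name, zones, and values are illustrative only.

```yaml
# Sketch only: autoPreserveFailedMax and machinePreserveTimeout are the fields
# proposed in this document; the surrounding worker-pool fields, names, and
# values are illustrative assumptions.
spec:
  provider:
    workers:
    - name: worker-pool-1               # hypothetical worker pool
      minimum: 3
      maximum: 6
      zones:
      - europe-west1-b
      - europe-west1-c
      machineControllerManager:
        machineHealthTimeout: 10m       # existing setting
        autoPreserveFailedMax: 2        # proposed: at most 2 auto-preserved machines (1 per zone here)
        machinePreserveTimeout: 72h     # proposed: preserved machines are released after 72h
```

With two zones, this example allows at most one auto-preserved machine per MachineDeployment, matching the distribution rule described in item 1.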
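
The exact representation on the `Machine` object is an implementation detail of this proposal; the following is only a sketch of what a preserved machine's status might look like, assuming the sub-phase is encoded in the phase string (e.g. `Failed:Preserved`) and a new `preserveExpiryTime` field is added under `currentStatus`. Field names, the machine name, and timestamps are assumptions.

```yaml
# Sketch only: preserveExpiryTime and the "Failed:Preserved" phase encoding are
# assumptions based on this proposal, not an existing API; the name and
# timestamps are illustrative.
apiVersion: machine.sapcloud.io/v1alpha1
kind: Machine
metadata:
  name: shoot--demo--worker-pool-1-z1-5d9f8        # hypothetical machine name
  annotations:
    node.machine.sapcloud.io/preserve: "when-failed"
status:
  currentStatus:
    phase: "Failed:Preserved"
    preserveExpiryTime: "2024-06-14T10:00:00Z"     # currentTime + machinePreserveTimeout
    lastUpdateTime: "2024-06-11T10:00:00Z"
```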

## State Diagrams

1. State Diagram for when a machine or its node is explicitly annotated for preservation:
   ```mermaid
   stateDiagram-v2
   state "Running" as R
   state "Running + Requested" as RR
   state "Running:Preserved" as RP
   state "Failed (node drained)" as F
   state "Failed:Preserved" as FP
   state "Terminating" as T
   [*]-->R
   R --> RR: annotated with preserve=when-failed
   RP-->R: on timeout or preserve=false
   RR --> F: on failure
   F --> FP
   FP --> T: on timeout or preserve=false
   FP --> RP: if node Healthy before timeout
   T --> [*]
   R-->RP: annotated with preserve=now
   RP-->F: if node/VM not healthy
   ```
2. State Diagram for when an un-annotated `Running` machine fails (Auto-preservation):
   ```mermaid
   stateDiagram-v2
   state "Running" as R
   state "Running:Preserved" as RP
   state "Failed (node drained)" as F
   state "Failed:Preserved" as FP
   state "Terminating" as T
   [*] --> R
   R-->F: on failure
   F --> FP: if autoPreserveFailedMax not breached
   F --> T: if autoPreserveFailedMax breached
   FP --> T: on timeout or preserve=false
   FP --> RP: if node Healthy before timeout
   RP --> R: on timeout or preserve=false
   T --> [*]
   ```

## Use Cases

### Use Case 1: Preservation Request for Analysing a Running Machine
**Scenario:** Workloads on a machine are failing. The operator wishes to diagnose the issue.
#### Steps:
1. Operator annotates the node with `node.machine.sapcloud.io/preserve=now` (see the sketch after these steps).
2. MCM preserves the machine and prevents the CA from scaling it down.
3. Operator analyzes the VM.
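
A sketch of what this could look like on the Node object is shown below; the node name is hypothetical, and the `scale-down-disabled` annotation is the one MCM is expected to add in response.

```yaml
# Sketch only: the node name is hypothetical. The operator sets the preserve
# annotation; MCM reacts by adding the scale-down-disabled annotation.
apiVersion: v1
kind: Node
metadata:
  name: worker-pool-1-z1-node-5d9f8
  annotations:
    node.machine.sapcloud.io/preserve: "now"                      # set by the operator
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"  # added by MCM
```

The same annotation key is used with the value `when-failed` (Use Case 2) and `false` (Use Case 4).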

### Use Case 2: Proactive Preservation Request
**Scenario:** Operator suspects a machine might fail and wants to ensure preservation for analysis.
#### Steps:
1. Operator annotates the node with `node.machine.sapcloud.io/preserve=when-failed`.
2. The machine fails later.
3. MCM preserves the machine.
4. Operator analyzes the VM.

### Use Case 3: Auto-Preservation
**Scenario:** A machine fails unexpectedly, with no prior annotation.
#### Steps:
1. The machine transitions to the `Failed` phase.
2. The machine is drained.
3. If `autoPreserveFailedMax` is not breached, the machine is moved to the `Failed:Preserved` phase by MCM.
4. After `machinePreserveTimeout`, the machine is terminated by MCM.

### Use Case 4: Early Release
**Scenario:** Operator has completed their analysis and no longer requires the machine to be preserved.
#### Steps:
1. The machine is in the `Running:Preserved` or `Failed:Preserved` phase.
2. Operator adds the annotation `node.machine.sapcloud.io/preserve=false` to the node (see the sketch after these steps).
3. MCM transitions the machine to `Running` or `Terminating` (for `Running:Preserved` or `Failed:Preserved` respectively), even though `machinePreserveTimeout` has not expired.
4. If the machine was in `Failed:Preserved`, capacity becomes available for auto-preservation.
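
As in Use Case 1, the release request is just an annotation on the Node; a minimal sketch with a hypothetical node name follows.

```yaml
# Sketch only: setting preserve=false asks MCM to release the machine before
# machinePreserveTimeout expires.
apiVersion: v1
kind: Node
metadata:
  name: worker-pool-1-z1-node-5d9f8   # hypothetical node name
  annotations:
    node.machine.sapcloud.io/preserve: "false"
```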

## Points to Note

1. During rolling updates, MCM will NOT honour machine preservation. A Machine that moves to the `Failed` phase will be replaced with a healthy one.
2. The hibernation policy will override machine preservation.
3. Consumers (with access to the shoot cluster) can annotate Nodes they would like to preserve.
4. Operators (with access to the control plane) can additionally annotate Machines that they would like to preserve. This can be used when a Machine does not have a backing Node and the operator wishes to preserve the backing VM.
5. If the backing Node object exists but does not have the preservation annotation, preservation annotations added on the Machine will be honoured.
6. However, if a backing Node exists for a Machine and has the preservation annotation, the Node's annotation value will override the Machine's annotation value and be synced to the Machine object.
7. If `autoPreserveFailedMax` is reduced in the Shoot spec, older machines are moved to the `Terminating` phase before newer ones.
8. In case of a scale-down of a MachineDeployment's replica count, `Preserved` machines will be the last to be scaled down. The replica count will always be honoured.
9. If the value of the annotation key `cluster-autoscaler.kubernetes.io/scale-down-disabled` for a machine in `Running:Preserved` is changed to `false` by a user, the value will be overwritten to `true` by MCM.
10. On an increase/decrease of the timeout, the new value will only apply to machines that enter the `Preserved` sub-phase after the change. Operators can always edit `machine.CurrentStatus.PreserveExpiryTime` to prolong the expiry time of existing `Preserved` machines.
11. [Modify the CA FAQ](https://github.com/gardener/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-prevent-cluster-autoscaler-from-scaling-down-a-particular-node) once the feature is developed to recommend `node.machine.sapcloud.io/preserve=now` instead of the currently suggested `cluster-autoscaler.kubernetes.io/scale-down-disabled=true`. This would:
    - harmonise the machine flow
    - shield users from the CA's internals
    - make the mechanism generic and no longer CA-specific
    - allow a timeout to be specified