Commit b5c4f47

Add proposal for temporary preservation of machines (#1031)

* Add proposal for preservation of failed machines
* Add limitations
* Address review comments
* Change mermaid layout from elk to default for github support
* Improve clarity
* Change proposal as per discussions
* Fix limitations
* Add state diagrams
* Rename file and proposal
* Update proposal to reflect changes decided in meeting
* Modify proposal to support use case for `preserve=when-failed`
* Add transition from Failed:Preserved to Running:Preserved
* Add rationale for transition between Preserved stages
* Change to autoPreserveFailedMax
* Update proposal to specify behaviour in case of conflicting annotation values

1 parent 1b5d502 commit b5c4f47

# Preservation of Machines

<!-- TOC -->

- [Preservation of Machines](#preservation-of-machines)
    - [Objective](#objective)
    - [Proposal](#proposal)
    - [State Diagrams](#state-diagrams)
    - [Use Cases](#use-cases)
    - [Points to Note](#points-to-note)

<!-- /TOC -->

## Objective

Currently, the Machine Controller Manager (MCM) moves Machines with errors to the `Unknown` phase and, after the configured `machineHealthTimeout`, to the `Failed` phase.
`Failed` machines are swiftly moved to the `Terminating` phase, during which the node is drained and the `Machine` object is deleted. This rapid cleanup prevents SREs/operators/support from conducting an analysis on the VM and makes finding the root cause of a failure more difficult.

Moreover, in cases where a node seems healthy but all the workloads on it are facing issues, operators need to be able to cordon/drain the node and conduct their analysis without the cluster-autoscaler (CA) scaling down the node.

This document proposes enhancing MCM such that:
* VMs of machines are retained temporarily for analysis.
* There is a configurable limit to the number of machines that can be preserved automatically on failure (auto-preservation).
* There is a configurable limit to the duration for which machines are preserved.
* Users can specify which healthy machines they would like to preserve in case of failure, or diagnose in their current state (preventing scale-down by the CA).
* Users can request MCM to release a preserved machine, even before the timeout expires, so that MCM can transition the machine to either the `Running` or the `Terminating` phase, as the case may be.

Related Issue: https://github.com/gardener/machine-controller-manager/issues/1008

## Proposal

In order to achieve the objectives mentioned above, the following is proposed:

1. Enhance the `machineControllerManager` configuration in the `ShootSpec` to specify the maximum number of machines to be auto-preserved and the duration for which these machines will be preserved.
    ```yaml
    machineControllerManager:
      autoPreserveFailedMax: 0
      machinePreserveTimeout: 72h
    ```
    * This configuration will be set per worker pool (see the sketch after this list).
    * Since a Gardener worker pool can correspond to `1..N` MachineDeployments depending on the number of zones, `autoPreserveFailedMax` will be distributed across the N MachineDeployments.
    * `autoPreserveFailedMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
    * Example: if `autoPreserveFailedMax` is set to 2 and the worker pool has 2 zones, then the maximum number of machines that will be preserved per zone is 1.
2. MCM will be modified to include a new sub-phase `Preserved` to indicate that the machine has been preserved by MCM.
3. Allow a user/operator to request preservation of a specific machine/node using the annotations `node.machine.sapcloud.io/preserve=now` and `node.machine.sapcloud.io/preserve=when-failed`.
4. When the annotation `node.machine.sapcloud.io/preserve=now` is added to a `Running` machine, the following will take place:
    - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent the CA from scaling it down.
    - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime + machinePreserveTimeout$.
    - The machine's phase is changed to `Running:Preserved`.
    - After the timeout, the `node.machine.sapcloud.io/preserve=now` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` annotations are deleted, and `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. The machine phase is changed to `Running` and the CA may delete the node.
    - If a machine in `Running:Preserved` fails, it is moved to `Failed:Preserved`.
5. When the annotation `node.machine.sapcloud.io/preserve=when-failed` is added to a `Running` machine and the machine goes to `Failed`, the following will take place:
    - The machine is drained of pods, except for DaemonSet pods.
    - The machine phase is changed to `Failed:Preserved`.
    - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent the CA from scaling it down.
    - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime + machinePreserveTimeout$.
    - After the timeout, the `node.machine.sapcloud.io/preserve=when-failed` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` annotations are deleted, and `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. The phase is changed to `Terminating`.
6. When an un-annotated machine goes to the `Failed` phase and `autoPreserveFailedMax` is not breached:
    - Pods (other than DaemonSet pods) are drained.
    - The machine's phase is changed to `Failed:Preserved`.
    - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent the CA from scaling it down.
    - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime + machinePreserveTimeout$.
    - After the timeout, the annotation `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is deleted, and `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. The phase is changed to `Terminating`.
    - Machines in the `Failed:Preserved` phase count towards enforcing `autoPreserveFailedMax`.
7. If a machine currently in `Failed:Preserved` is found to have a healthy VM/node before the timeout, it will be moved to `Running:Preserved`. After the timeout, it will be moved to `Running`.
    The rationale behind moving the machine to `Running:Preserved` rather than `Running` is to allow pods to get scheduled onto the healthy node again without the autoscaler scaling it down due to under-utilization.
8. A user/operator can request MCM to stop preserving a machine/node in the `Running:Preserved` or `Failed:Preserved` phase using the annotation `node.machine.sapcloud.io/preserve=false`.
    * MCM will move a machine thus annotated either to the `Running` phase or to `Terminating`, depending on the phase of the machine before it was preserved.
9. Machines of a MachineDeployment in the `Preserved` sub-phase will also be counted towards the replica count and in the enforcement of the maximum number of machines allowed for the MachineDeployment.
10. MCM will be modified to perform drain in `Failed` phase rather than `Terminating`.
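
A minimal sketch of how the per-worker-pool configuration from point 1 could sit in a Shoot spec, assuming the new fields live next to the existing `machineControllerManager` settings (such as `machineHealthTimeout`) under `spec.provider.workers`; the placement and values are illustrative, not a final API:

```yaml
spec:
  provider:
    workers:
      - name: worker-pool-1
        zones: ["eu-west-1a", "eu-west-1b"]   # 2 zones, so the pool maps to 2 MachineDeployments
        machineControllerManager:
          machineHealthTimeout: 10m     # existing setting, shown only for context
          autoPreserveFailedMax: 2      # proposed: at most 2 auto-preserved machines, i.e. 1 per zone here
          machinePreserveTimeout: 72h   # proposed: preserved machines are released after 72h
```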

## State Diagrams:

1. State Diagram for when a machine or its node is explicitly annotated for preservation:
```mermaid
stateDiagram-v2
state "Running" as R
state "Running + Requested" as RR
state "Running:Preserved" as RP
state "Failed
(node drained)" as F
state "Failed:Preserved" as FP
state "Terminating" as T
[*] --> R
R --> RR: annotated with preserve=when-failed
RP --> R: on timeout or preserve=false
RR --> F: on failure
F --> FP
FP --> T: on timeout or preserve=false
FP --> RP: if node Healthy before timeout
T --> [*]
R --> RP: annotated with preserve=now
RP --> F: if node/VM not healthy
```
94+
2. State Diagram for when an un-annotated `Running` machine fails (Auto-preservation):
95+
```mermaid
96+
stateDiagram-v2
97+
state "Running" as R
98+
state "Running:Preserved" as RP
99+
state "Failed
100+
(node drained)" as F
101+
state "Failed:Preserved" as FP
102+
state "Terminating" as T
103+
[*] --> R
104+
R-->F: on failure
105+
F --> FP: if autoPreserveFailedMax not breached
106+
F --> T: if autoPreserveFailedMax breached
107+
FP --> T: on timeout or value=false
108+
FP --> RP : if node Healthy before timeout
109+
RP --> R: on timeout or preserve=false
110+
T --> [*]
111+
```

## Use Cases:

### Use Case 1: Preservation Request for Analysing a Running Machine
**Scenario:** The workload on a machine is failing. The operator wishes to diagnose it.
#### Steps:
1. Operator annotates the node with `node.machine.sapcloud.io/preserve=now`.
2. MCM preserves the machine and prevents the CA from scaling it down.
3. Operator analyzes the VM.
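
On the Node, the request and MCM's reaction could look roughly as follows (the node name is hypothetical):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: shoot-worker-z1-6d7b8-xk2lp   # hypothetical node name
  annotations:
    node.machine.sapcloud.io/preserve: "now"                      # added by the operator
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"  # added by MCM to block CA scale-down
```

MCM would additionally move the backing machine to `Running:Preserved` and stamp `machine.CurrentStatus.PreserveExpiryTime`, as described in the proposal.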

### Use Case 2: Proactive Preservation Request
**Scenario:** Operator suspects a machine might fail and wants to ensure it is preserved for analysis.
#### Steps:
1. Operator annotates the node with `node.machine.sapcloud.io/preserve=when-failed`.
2. The machine fails later.
3. MCM preserves the machine.
4. Operator analyzes the VM.


### Use Case 3: Auto-Preservation
**Scenario:** A machine fails unexpectedly, with no prior annotation.
#### Steps:
1. The machine transitions to the `Failed` phase.
2. The machine is drained.
3. If `autoPreserveFailedMax` is not breached, the machine is moved to the `Failed:Preserved` phase by MCM.
4. After `machinePreserveTimeout`, the machine is terminated by MCM.
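
Illustratively, an auto-preserved Machine's status might then carry the new sub-phase and expiry time roughly as sketched below; the exact serialization of `Failed:Preserved` and of `PreserveExpiryTime` is an assumption here, not a settled API:

```yaml
status:
  currentStatus:
    phase: "Failed:Preserved"                     # proposed sub-phase
    preserveExpiryTime: "2025-01-04T10:00:00Z"    # currentTime + machinePreserveTimeout (72h in this example)
    lastUpdateTime: "2025-01-01T10:00:00Z"
```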

### Use Case 4: Early Release
**Scenario:** Operator has completed the analysis and no longer requires the machine to be preserved.
#### Steps:
1. The machine is in the `Running:Preserved` or `Failed:Preserved` phase.
2. Operator adds `node.machine.sapcloud.io/preserve=false` to the node.
3. MCM transitions the machine to `Running` or `Terminating` (for `Running:Preserved` or `Failed:Preserved` respectively), even though `machinePreserveTimeout` has not expired.
4. If the machine was in `Failed:Preserved`, capacity becomes available for auto-preservation.
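
The early-release request is again just an annotation on the Node (hypothetical node name):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: shoot-worker-z1-6d7b8-xk2lp   # hypothetical node name
  annotations:
    node.machine.sapcloud.io/preserve: "false"   # asks MCM to release the machine before the timeout
```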

## Points to Note

1. During rolling updates, MCM will NOT honour preservation of Machines. A Machine will be replaced with a healthy one if it moves to the `Failed` phase.
2. The hibernation policy will override machine preservation.
3. Consumers (with access to the shoot cluster) can annotate Nodes they would like to preserve.
4. Operators (with access to the control plane) can additionally annotate Machines that they would like to preserve. This feature can be used when a Machine does not have a backing Node and the operator wishes to preserve the backing VM (see the sketch after this list).
5. If the backing Node object exists but does not have the preservation annotation, preservation annotations added on the Machine will be honoured.
6. However, if a backing Node exists for a Machine and has the preservation annotation, the Node's annotation value will override the Machine's annotation value and be synced to the Machine object.
7. If `autoPreserveFailedMax` is reduced in the Shoot spec, older machines are moved to the `Terminating` phase before newer ones.
8. In case of a scale-down of a MachineDeployment's replica count, `Preserved` machines will be the last to be scaled down. The replica count will always be honoured.
9. If a user changes the value of the `cluster-autoscaler.kubernetes.io/scale-down-disabled` annotation to `false` for a machine in `Running:Preserved`, the value will be overwritten to `true` by MCM.
10. On an increase/decrease of the timeout, the new value will only apply to machines that go into the `Preserved` sub-phase after the change. Operators can always edit `machine.CurrentStatus.PreserveExpiryTime` to prolong the expiry time of existing `Preserved` machines.
11. [Modify the CA FAQ](https://github.com/gardener/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-prevent-cluster-autoscaler-from-scaling-down-a-particular-node) once the feature is developed, to recommend `node.machine.sapcloud.io/preserve=now` instead of the currently suggested `cluster-autoscaler.kubernetes.io/scale-down-disabled=true`. This would:
    - harmonise the machine flow
    - shield users from the CA's internals
    - make the mechanism generic and no longer CA-specific
    - allow a timeout to be specified
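
As a sketch for note 4, an operator with control-plane access could place the same annotation directly on the `Machine` object; the machine name and namespace below are hypothetical:

```yaml
apiVersion: machine.sapcloud.io/v1alpha1
kind: Machine
metadata:
  name: shoot-worker-z1-6d7b8-xk2lp    # hypothetical machine name
  namespace: shoot--project--cluster   # hypothetical control-plane (seed) namespace
  annotations:
    node.machine.sapcloud.io/preserve: "when-failed"   # preserve this machine's VM if it fails
```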
