Commit 1c8d4a8

Merge pull request #102520 from skopacz1/OSDOCS-17158
OSDOCS#17158: node replacement procedure

2 parents 440b0f0 + f9707aa

9 files changed: +819, -0 lines changed
_topic_maps/_topic_map.yml

Lines changed: 2 additions & 0 deletions

@@ -2805,6 +2805,8 @@ Topics:
 # File: nodes-nodes-graceful-shutdown
 - Name: Managing the maximum number of pods per node
   File: nodes-nodes-managing-max-pods
+- Name: Replacing a failed bare-metal control plane node without BMC credentials
+  File: nodes-nodes-replace-control-plane
 - Name: Using the Node Tuning Operator
   File: nodes-node-tuning-operator
 - Name: Remediating, fencing, and maintaining nodes

Lines changed: 85 additions & 0 deletions

@@ -0,0 +1,85 @@
// Module included in the following assemblies:
//
// * nodes/nodes/nodes-nodes-replace-control-plane.adoc

:_mod-docs-content-type: PROCEDURE
[id="add-new-etcd-member_{context}"]
= Adding the new etcd member

Finish adding the new control plane node by adding the new etcd member to the cluster.

.Procedure

. Add the new etcd member to the cluster by performing the following steps in a single bash shell session:

.. Find the IP of the new control plane node by running the following command:
+
[source,terminal]
----
$ oc get nodes -owide -l node-role.kubernetes.io/control-plane
----
+
Make note of the node's IP address for later use.
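+
Optionally, to print only the internal IP address of the new node, you can use a JSONPath query such as the following, where `<new_node>` is a placeholder for the new node's name:
+
[source,terminal]
----
$ oc get node <new_node> -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}'
----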

.. List the etcd pods by running the following command:
+
[source,terminal]
----
$ oc get -n openshift-etcd pods -l k8s-app=etcd -o wide
----

.. Connect to one of the running etcd pods by running the following command. The etcd pod on the new node is expected to be in a `CrashLoopBackOff` state, so choose a pod that is running on one of the other control plane nodes.
+
[source,terminal]
----
$ oc rsh -n openshift-etcd <running_pod>
----
+
Replace `<running_pod>` with the name of a running pod shown in the previous step.

.. View the etcd member list by running the following command:
+
[source,terminal]
----
sh-4.2# etcdctl member list -w table
----

.. Add the new control plane etcd member by running the following command:
+
[source,terminal]
----
sh-4.2# etcdctl member add <new_node> --peer-urls="https://<ip_address>:2380"
----
+
where:

`<new_node>`:: Specifies the name of the new control plane node.
`<ip_address>`:: Specifies the IP address of the new node.
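+
For example, using the sample node name and IP address that appear elsewhere in this replacement procedure (both values are illustrative), the command might resemble the following:
+
.Example command
[source,terminal]
----
sh-4.2# etcdctl member add openshift-control-plane-2 --peer-urls="https://192.168.20.11:2380"
----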

.. Exit the rsh shell by running the following command:
+
[source,terminal]
----
sh-4.2# exit
----

. Force an etcd redeployment by running the following command:
+
[source,terminal]
----
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "single-master-recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----

. Turn the etcd quorum guard back on by running the following command:
+
[source,terminal]
----
$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": null}}'
----

. Monitor the cluster Operator rollout by running the following command:
+
[source,terminal]
----
$ watch oc get co
----
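+
Optionally, after the cluster Operators report as available, reuse the earlier pod listing command to confirm that the etcd pod on the new node is no longer in a `CrashLoopBackOff` state:
+
[source,terminal]
----
$ oc get -n openshift-etcd pods -l k8s-app=etcd
----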

Lines changed: 140 additions & 0 deletions

@@ -0,0 +1,140 @@
// Module included in the following assemblies:
//
// * nodes/nodes/nodes-nodes-replace-control-plane.adoc

:_mod-docs-content-type: PROCEDURE
[id="create-new-machine_{context}"]
= Creating the new control plane node

Begin adding the new control plane node by creating a `BareMetalHost` object and booting the new node.

.Procedure

. Edit the `bmh_affected.yaml` file that you previously saved:
+
--
.. Remove the following metadata items from the file:
+
* `creationTimestamp`
* `generation`
* `resourceVersion`
* `uid`

.. Remove the `status` section of the file.
--
+
The resulting file should resemble the following example:
+
.Example `bmh_affected.yaml` file
[source,yaml]
----
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  labels:
    installer.openshift.io/role: control-plane
  name: openshift-control-plane-2
  namespace: openshift-machine-api
spec:
  automatedCleaningMode: disabled
  bmc:
    address:
    credentialsName:
    disableCertificateVerification: true
  bootMACAddress: ab:cd:ef:ab:cd:ef
  bootMode: UEFI
  externallyProvisioned: true
  online: true
  rootDeviceHints:
    deviceName: /dev/disk/by-path/pci-0000:04:00.0-nvme-1
  userData:
    name: master-user-data-managed
    namespace: openshift-machine-api
----

. Create the `BareMetalHost` object using the `bmh_affected.yaml` file by running the following command:
+
[source,terminal]
----
$ oc create -f bmh_affected.yaml
----
+
The following warning is expected upon creation of the `BareMetalHost` object:
+
[source,terminal]
----
Warning: metadata.finalizers: "baremetalhost.metal3.io": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
----
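+
Optionally, confirm that the new `BareMetalHost` object was created by listing the hosts in the `openshift-machine-api` namespace:
+
[source,terminal]
----
$ oc get bmh -n openshift-machine-api
----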

. Extract the control plane ignition secret by running the following command:
+
[source,terminal]
----
$ oc extract secret/master-user-data-managed \
  -n openshift-machine-api \
  --keys=userData \
  --to=- \
  | sed '/^userData/d' > new_controlplane.ign
----
+
This command also removes the starting `userData` line of the ignition secret.
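+
Optionally, if the `jq` utility is available on your workstation, you can run a quick sanity check that the extracted file contains valid Ignition JSON:
+
[source,terminal]
----
$ jq -e '.ignition.version' new_controlplane.ign
----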

. Create an Nmstate YAML file named `new_controlplane_nmstate.yaml` for the new node's network configuration, using the following example for reference:
+
.Example Nmstate YAML file
[source,yaml]
----
interfaces:
- name: eno1
  type: ethernet
  state: up
  mac-address: "ab:cd:ef:01:02:03"
  ipv4:
    enabled: true
    address:
    - ip: 192.168.20.11
      prefix-length: 24
    dhcp: false
  ipv6:
    enabled: false
dns-resolver:
  config:
    search:
    - iso.sterling.home
    server:
    - 192.168.20.8
routes:
  config:
  - destination: 0.0.0.0/0
    metric: 100
    next-hop-address: 192.168.20.1
    next-hop-interface: eno1
    table-id: 254
----
+
[NOTE]
====
If you installed your cluster using the Agent-based Installer, you can use the failed node's `networkConfig` section in the `agent-config.yaml` file from the original cluster deployment as a starting point for the new control plane node's Nmstate file. For example, the following command extracts the `networkConfig` section for the first control plane node:

[source,terminal]
----
$ yq '.hosts[0].networkConfig' agent-config.yaml > new_controlplane_nmstate.yaml
----
====

. Create the customized {op-system-first} live ISO by running the following command:
+
[source,terminal]
----
$ coreos-installer iso customize rhcos-live.x86_64.iso \
  --dest-ignition new_controlplane.ign \
  --network-nmstate new_controlplane_nmstate.yaml \
  --dest-device /dev/disk/by-path/<device_path> \
  -f
----
+
Replace `<device_path>` with the by-path identifier of the disk on the new node where {op-system} will be installed when the node boots the ISO.
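+
This is typically the same device that the `rootDeviceHints` field references in the example `BareMetalHost` object shown earlier, for example `pci-0000:04:00.0-nvme-1`.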

. Boot the new control plane node with the customized {op-system} live ISO.

. Approve the certificate signing requests (CSRs) to join the new node to the cluster.
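+
For example, one common way to approve the CSRs is to list the pending requests and approve each one by name. Expect a second set of kubelet serving CSRs to appear shortly after you approve the first set; `<csr_name>` is a placeholder:
+
[source,terminal]
----
$ oc get csr
$ oc adm certificate approve <csr_name>
----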

Lines changed: 116 additions & 0 deletions

@@ -0,0 +1,116 @@
// Module included in the following assemblies:
//
// * nodes/nodes/nodes-nodes-replace-control-plane.adoc

:_mod-docs-content-type: PROCEDURE
[id="deleting-machine_{context}"]
= Deleting the machine of the unhealthy etcd member

Finish removing the failed control plane node by deleting the machine of the unhealthy etcd member.

.Procedure

. Ensure that the Bare Metal Operator is available by running the following command:
+
[source,terminal]
----
$ oc get clusteroperator baremetal
----
+
.Example output
[source,terminal]
----
NAME        VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
baremetal   4.20.0    True        False         False      3d15h
----

. Save the `BareMetalHost` object of the affected node to a file for later use by running the following command:
+
[source,terminal]
----
$ oc get -n openshift-machine-api bmh <node_name> -o yaml > bmh_affected.yaml
----
+
Replace `<node_name>` with the name of the affected node, which usually matches the associated `BareMetalHost` name.

. View the YAML file of the saved `BareMetalHost` object by running the following command, and ensure the content is correct:
+
[source,terminal]
----
$ cat bmh_affected.yaml
----

. Remove the affected `BareMetalHost` object by running the following command:
+
[source,terminal]
----
$ oc delete -n openshift-machine-api bmh <node_name>
----
+
Replace `<node_name>` with the name of the affected node.

. List all machines by running the following command and identify the machine associated with the affected node:
+
[source,terminal]
----
$ oc get machines -n openshift-machine-api -o wide
----
+
.Example output
[source,terminal]
----
NAME                             PHASE     TYPE   REGION   ZONE   AGE     NODE                        PROVIDERID                                                                                                STATE
examplecluster-control-plane-0   Running                          3h11m   openshift-control-plane-0   baremetalhost:///openshift-machine-api/openshift-control-plane-0/da1ebe11-3ff2-41c5-b099-0aa41222964e   externally provisioned
examplecluster-control-plane-1   Running                          3h11m   openshift-control-plane-1   baremetalhost:///openshift-machine-api/openshift-control-plane-1/d9f9acbc-329c-475e-8d81-03b20280a3e1   externally provisioned
examplecluster-control-plane-2   Running                          3h11m   openshift-control-plane-2   baremetalhost:///openshift-machine-api/openshift-control-plane-2/3354bdac-61d8-410f-be5b-6a395b056135   externally provisioned
examplecluster-compute-0         Running                          165m    openshift-compute-0         baremetalhost:///openshift-machine-api/openshift-compute-0/3d685b81-7410-4bb3-80ec-13a31858241f         provisioned
examplecluster-compute-1         Running                          165m    openshift-compute-1         baremetalhost:///openshift-machine-api/openshift-compute-1/0fdae6eb-2066-4241-91dc-e7ea72ab13b9         provisioned
----

. Delete the machine of the unhealthy member by running the following command:
+
[source,terminal]
----
$ oc delete machine -n openshift-machine-api <machine_name>
----
+
Replace `<machine_name>` with the machine name associated with the affected node.
+
.Example command
[source,terminal]
----
$ oc delete machine -n openshift-machine-api examplecluster-control-plane-2
----
+
[NOTE]
====
After you remove the `BareMetalHost` and `Machine` objects, the machine controller automatically deletes the `Node` object.
====

. If the machine deletion is delayed for any reason, or the command does not complete, force the deletion by removing the finalizer field from the `Machine` object.
+
[WARNING]
====
Do not interrupt the machine deletion by pressing `Ctrl+C`. Allow the command to run to completion, and open a new terminal window to edit and remove the finalizer field.
====

.. In a new terminal window, edit the machine configuration by running the following command:
+
[source,terminal]
----
$ oc edit machine -n openshift-machine-api examplecluster-control-plane-2
----

.. Delete the following lines from the `Machine` custom resource, and then save the updated file:
+
[source,yaml]
----
finalizers:
- machine.machine.openshift.io
----
+
.Example output
[source,terminal]
----
machine.machine.openshift.io/examplecluster-control-plane-2 edited
----
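
Optionally, after the deletion completes, confirm that the machine and node of the failed control plane member are no longer listed:

[source,terminal]
----
$ oc get machines -n openshift-machine-api -o wide
$ oc get nodes
----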
