
Commit 5acab3c

bergerhoffer authored and openshift-cherrypick-robot committed
OSDOCS-16981: CQA updates for AI workloads book intro and LWS docs
1 parent 92cbf2b commit 5acab3c

14 files changed: +80 -33 lines changed

ai_workloads/index.adoc
Lines changed: 1 addition & 0 deletions

@@ -7,6 +7,7 @@ include::_attributes/common-attributes.adoc[]
 
 toc::[]
 
+[role="_abstract"]
 {product-title} provides a secure, scalable foundation for running artificial intelligence (AI) workloads across training, inference, and data science workflows.
 
 // Operators for running AI workloads

ai_workloads/leader_worker_set/index.adoc
Lines changed: 3 additions & 0 deletions

@@ -7,6 +7,9 @@ include::_attributes/common-attributes.adoc[]
 
 toc::[]
 
+[role="_abstract"]
+Use the {lws-operator} to manage multi-node AI/ML inference deployments efficiently. The {lws-operator} treats groups of pods as one unit to simplify scaling, recovery, and updates for large workloads.
+
 Using large language models (LLMs) for AI/ML inference often requires significant compute resources, and workloads typically must be sharded across multiple nodes. This can make deployments complex, creating challenges around scaling, recovery from failures, and efficient pod placement.
 
 The {lws-operator} simplifies these multi-node deployments by treating a group of pods as a single, coordinated unit. It manages the lifecycle of each pod in the group, scales the entire group together, and performs updates and failure recovery at the group level to ensure consistency.

ai_workloads/leader_worker_set/lws-managing.adoc
Lines changed: 1 addition & 0 deletions

@@ -7,6 +7,7 @@ include::_attributes/common-attributes.adoc[]
 
 toc::[]
 
+[role="_abstract"]
 You can use the {lws-operator} to manage distributed inference workloads and process large-scale inference requests efficiently.
 
 // Installing the {lws-operator}

ai_workloads/leader_worker_set/lws-release-notes.adoc
Lines changed: 3 additions & 0 deletions

@@ -7,6 +7,9 @@ include::_attributes/common-attributes.adoc[]
 
 toc::[]
 
+[role="_abstract"]
+Review the {lws-operator} release notes to track its development and learn what is new and changed with each release.
+
 You can use the {lws-operator} to manage distributed inference workloads and process large-scale inference requests efficiently.
 
 These release notes track the development of the {lws-operator}.

ai_workloads/leader_worker_set/lws-uninstalling.adoc
Lines changed: 2 additions & 1 deletion

@@ -7,7 +7,8 @@ include::_attributes/common-attributes.adoc[]
 
 toc::[]
 
-You can remove the {lws-operator} from {product-title} by uninstalling the Operator and removing its related resources.
+[role="_abstract"]
+If you no longer need the {lws-operator} in your cluster, you can uninstall the Operator and remove its related resources.
 
 // Uninstalling the {lws-operator}
 include::modules/lws-uninstall.adoc[leveloffset=+1]

modules/ai-operators.adoc
Lines changed: 1 addition & 0 deletions

@@ -6,6 +6,7 @@
 [id="ai-operators_{context}"]
 = Operators for running AI workloads
 
+[role="_abstract"]
 You can use Operators to run artificial intelligence (AI) and machine learning (ML) workloads on {product-title}. With Operators, you can build a customized environment that meets your specific AI/ML requirements while continuing to use {product-title} as the core platform for your applications.
 
 {product-title} provides several Operators that can help you run AI workloads:

modules/ai-rhoai.adoc
Lines changed: 1 addition & 0 deletions

@@ -8,6 +8,7 @@
 
 // TODO: This needs approval from RHOAI team before it can be included
 
+[role="_abstract"]
 If your organization requires an integrated environment to develop, train, serve, test, and monitor AI/ML models and applications, consider {rhoai-full}.
 
 {rhoai-full} is a platform for data scientists and developers of artificial intelligence and machine learning (AI/ML) applications. {rhoai-full} builds on {product-title} and provides a preconfigured set of tools, accelerators, and other features to manage the full AI/ML lifecycle. This approach reduces the need to assemble and maintain individual Operators or components for AI workloads.

modules/lws-about.adoc
Lines changed: 3 additions & 0 deletions

@@ -6,6 +6,9 @@
 [id="lws-about_{context}"]
 = About the {lws-operator}
 
+[role="_abstract"]
+Use the {lws-operator} to deploy groups of pods as a single, manageable unit. This helps you to deploy large AI/ML inference workloads, such as sharded large language models (LLMs).
+
 The {lws-operator} is based on the link:https://lws.sigs.k8s.io/[LeaderWorkerSet] open source project. `LeaderWorkerSet` is a custom Kubernetes API that can be used to deploy a group of pods as a unit. This is useful for artificial intelligence (AI) and machine learning (ML) inference workloads, where large language models (LLMs) are sharded across multiple nodes.
 
 With the `LeaderWorkerSet` API, pods are grouped into units consisting of one leader and multiple workers, all managed together as a single entity. Each pod in a group has a unique pod identity. Pods within a group are created in parallel and share identical lifecycle stages. Rollouts, rolling updates, and pod failure restarts are performed as a group.
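The pod identity scheme described above can be sketched in a few lines. This is an illustrative sketch, not code from this commit: the naming pattern is inferred from the verification output shown later in this diff (`my-lws-0`, `my-lws-0-1`, and so on), where each group has one leader pod named `<name>-<group>` and its workers are named `<name>-<group>-<index>`:

```python
# Sketch (not part of this commit): derive the pod names a LeaderWorkerSet
# produces, matching the verification output shown in this diff.
# Leaders are "<name>-<group>"; workers are "<name>-<group>-<index>".

def lws_pod_names(name: str, replicas: int, size: int) -> list[str]:
    """Expected pod names for a LeaderWorkerSet; size includes the leader."""
    pods: list[str] = []
    for group in range(replicas):
        pods.append(f"{name}-{group}")  # leader pod of this group
        pods.extend(f"{name}-{group}-{i}" for i in range(1, size))  # workers
    return pods

print(lws_pod_names("my-lws", replicas=2, size=3))
# → ['my-lws-0', 'my-lws-0-1', 'my-lws-0-2', 'my-lws-1', 'my-lws-1-1', 'my-lws-1-2']
```

With `replicas: 2` and `size: 3`, as in the example manifest in this commit, this yields the six pod names listed in the `oc get pods` output.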

modules/lws-arch.adoc
Lines changed: 4 additions & 1 deletion

@@ -6,7 +6,10 @@
 [id="lws-arch_{context}"]
 = LeaderWorkerSet architecture
 
-The following diagram shows how the `LeaderWorkerSet` API organizes groups of pods into a single unit, with one pod as the leader and the rest as the workers, to coordinate distributed workloads:
+[role="_abstract"]
+Review the LeaderWorkerSet architecture to learn how the `LeaderWorkerSet` API organizes groups of pods into a single unit, with one pod as the leader and the rest as the workers, to coordinate distributed workloads.
+
+The following diagram describes the LeaderWorkerSet architecture:
 
 .Leader worker set architecture
 image::587_OpenShift_lws_0925.png[Leader worker set architecture]

modules/lws-config.adoc
Lines changed: 52 additions & 28 deletions

@@ -6,6 +6,7 @@
 [id="lws-config_{context}"]
 = Deploying a leader worker set
 
+[role="_abstract"]
 You can use the {lws-operator} to deploy a leader worker set to assist with managing distributed workloads across nodes.
 
 .Prerequisites
@@ -29,20 +30,20 @@ apiVersion: leaderworkerset.x-k8s.io/v1
 kind: LeaderWorkerSet
 metadata:
   generation: 1
-  name: my-lws <1>
-  namespace: my-namespace <2>
+  name: my-lws
+  namespace: my-namespace
 spec:
   leaderWorkerTemplate:
-    leaderTemplate: <3>
+    leaderTemplate:
       metadata: {}
       spec:
         containers:
         - image: nginxinc/nginx-unprivileged:1.27
           name: leader
           resources: {}
-    restartPolicy: RecreateGroupOnPodRestart <4>
-    size: 3 <5>
-    workerTemplate: <6>
+    restartPolicy: RecreateGroupOnPodRestart
+    size: 3
+    workerTemplate:
       metadata: {}
       spec:
         containers:
@@ -53,24 +54,45 @@ spec:
             protocol: TCP
           resources: {}
   networkConfig:
-    subdomainPolicy: Shared <7>
-  replicas: 2 <8>
+    subdomainPolicy: Shared
+  replicas: 2
   rolloutStrategy:
     rollingUpdateConfiguration:
-      maxSurge: 1 <9>
+      maxSurge: 1
       maxUnavailable: 1
     type: RollingUpdate
   startupPolicy: LeaderCreated
 ----
-<1> Specify the name of the leader worker set resource.
-<2> Specify the namespace for the leader worker set to run in.
-<3> Specify the pod template for the leader pods.
-<4> Specify the restart policy for when pod failures occur. Allowed values are `RecreateGroupOnPodRestart` to restart the whole group or `None` to not restart the group.
-<5> Specify the number of pods to create for each group, including the leader pod. For example, a value of `3` creates 1 leader pod and 2 worker pods. The default value is `1`.
-<6> Specify the pod template for the worker pods.
-<7> Specify the policy to use when creating the headless service. Allowed values are `UniquePerReplica` or `Shared`. The default value is `Shared`.
-<8> Specify the number of replicas, or leader-worker groups. The default value is `1`.
-<9> Specify the maximum number of replicas that can be scheduled above the `replicas` value during rolling updates. The value can be specified as an integer or a percentage.
++
+where:
+
+`metadata.name`::
+Specifies the name of the leader worker set resource.
+
+`metadata.namespace`::
+Specifies the namespace for the leader worker set to run in.
+
+`spec.leaderWorkerTemplate.leaderTemplate`::
+Specifies the pod template for the leader pods.
+
+`spec.leaderWorkerTemplate.restartPolicy`::
+Specifies the restart policy for when pod failures occur. Allowed values are `RecreateGroupOnPodRestart` to restart the whole group or `None` to not restart the group.
+
+`spec.leaderWorkerTemplate.size`::
+Specifies the number of pods to create for each group, including the leader pod. For example, a value of `3` creates 1 leader pod and 2 worker pods. The default value is `1`.
+
+`spec.leaderWorkerTemplate.workerTemplate`::
+Specifies the pod template for the worker pods.
+
+`spec.networkConfig.subdomainPolicy`::
+Specifies the policy to use when creating the headless service. Allowed values are `UniquePerReplica` or `Shared`. The default value is `Shared`.
+
+`spec.replicas`::
+Specifies the number of replicas, or leader-worker groups. The default value is `1`.
+
+`spec.rolloutStrategy.rollingUpdateConfiguration.maxSurge`::
+Specifies the maximum number of replicas that can be scheduled above the `replicas` value during rolling updates. The value can be specified as an integer or a percentage.
+
 +
 For more information on all available fields to configure, see link:https://lws.sigs.k8s.io/docs/reference/leaderworkerset.v1/[LeaderWorkerSet API] upstream documentation.
 
@@ -94,15 +116,16 @@ $ oc get pods -n my-namespace
 [source,terminal]
 ----
 NAME         READY   STATUS    RESTARTS   AGE
-my-lws-0     1/1     Running   0          4s <1>
+my-lws-0     1/1     Running   0          4s
 my-lws-0-1   1/1     Running   0          3s
 my-lws-0-2   1/1     Running   0          3s
-my-lws-1     1/1     Running   0          7s <2>
+my-lws-1     1/1     Running   0          7s
 my-lws-1-1   1/1     Running   0          6s
 my-lws-1-2   1/1     Running   0          6s
 ----
-<1> The leader pod for the first group.
-<2> The leader pod for the second group.
++
+** `my-lws-0` is the leader pod for the first group.
+** `my-lws-1` is the leader pod for the second group.
 
 . Review the stateful sets by running the following command:
 +
@@ -115,10 +138,11 @@ $ oc get statefulsets
 [source,terminal]
 ----
 NAME       READY   AGE
-my-lws     4/4     111s <1>
-my-lws-0   2/2     57s <2>
-my-lws-1   2/2     60s <3>
+my-lws     4/4     111s
+my-lws-0   2/2     57s
+my-lws-1   2/2     60s
 ----
-<1> The leader stateful set for all leader-worker groups.
-<2> The worker stateful set for the first group.
-<3> The worker stateful set for the second group.
++
+** `my-lws` is the leader stateful set for all leader-worker groups.
+** `my-lws-0` is the worker stateful set for the first group.
+** `my-lws-1` is the worker stateful set for the second group.
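The `size` and `replicas` semantics documented in the field descriptions above can be sanity-checked with a short sketch. This is an assumption-labeled illustration, not code from this commit: it encodes the counts implied by the example manifest (`replicas: 2` groups of `size: 3` pods each, one shared leader stateful set, plus one worker stateful set per group), which match the six pods and three stateful sets in the verification output:

```python
# Sketch (assumption for illustration, not part of this commit): object counts
# implied by the example manifest. "size" includes the leader pod, so total
# pods = replicas * size; stateful sets = one shared leader set plus one
# worker set per leader-worker group.

def lws_counts(replicas: int, size: int) -> dict[str, int]:
    return {
        "pods": replicas * size,       # every pod across all groups
        "statefulsets": 1 + replicas,  # leader set + one worker set per group
    }

print(lws_counts(replicas=2, size=3))
# → {'pods': 6, 'statefulsets': 3}
```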
