{product-title} provides a secure, scalable foundation for running artificial intelligence (AI) workloads across training, inference, and data science workflows.
Use the {lws-operator} to manage multi-node AI/ML inference deployments efficiently. The {lws-operator} treats groups of pods as one unit to simplify scaling, recovery, and updates for large workloads.

Using large language models (LLMs) for AI/ML inference often requires significant compute resources, and workloads typically must be sharded across multiple nodes. This can make deployments complex, creating challenges around scaling, recovery from failures, and efficient pod placement.

The {lws-operator} simplifies these multi-node deployments by treating a group of pods as a single, coordinated unit. It manages the lifecycle of each pod in the group, scales the entire group together, and performs updates and failure recovery at the group level to ensure consistency.
modules/ai-operators.adoc

[id="ai-operators_{context}"]
= Operators for running AI workloads

[role="_abstract"]
You can use Operators to run artificial intelligence (AI) and machine learning (ML) workloads on {product-title}. With Operators, you can build a customized environment that meets your specific AI/ML requirements while continuing to use {product-title} as the core platform for your applications.

{product-title} provides several Operators that can help you run AI workloads:
modules/ai-rhoai.adoc

// TODO: This needs approval from RHOAI team before it can be included

[role="_abstract"]
If your organization requires an integrated environment to develop, train, serve, test, and monitor AI/ML models and applications, consider {rhoai-full}.

{rhoai-full} is a platform for data scientists and developers of artificial intelligence and machine learning (AI/ML) applications. {rhoai-full} builds on {product-title} and provides a preconfigured set of tools, accelerators, and other features to manage the full AI/ML lifecycle. This approach reduces the need to assemble and maintain individual Operators or components for AI workloads.
modules/lws-about.adoc

[id="lws-about_{context}"]
= About the {lws-operator}

[role="_abstract"]
Use the {lws-operator} to deploy groups of pods as a single, manageable unit. This helps you deploy large AI/ML inference workloads, such as sharded large language models (LLMs).

The {lws-operator} is based on the link:https://lws.sigs.k8s.io/[LeaderWorkerSet] open source project. `LeaderWorkerSet` is a custom Kubernetes API that you can use to deploy a group of pods as a unit. This is useful for artificial intelligence (AI) and machine learning (ML) inference workloads, where large language models (LLMs) are sharded across multiple nodes.

With the `LeaderWorkerSet` API, pods are grouped into units that consist of one leader and multiple workers, all managed together as a single entity. Each pod in a group has a unique pod identity. Pods within a group are created in parallel and share identical lifecycle stages. Rollouts, rolling updates, and pod failure restarts are performed as a group.
modules/lws-arch.adoc

[id="lws-arch_{context}"]
= LeaderWorkerSet architecture

[role="_abstract"]
Review the LeaderWorkerSet architecture to learn how the `LeaderWorkerSet` API organizes groups of pods into a single unit, with one pod as the leader and the rest as the workers, to coordinate distributed workloads.

The following diagram describes the LeaderWorkerSet architecture:

.Leader worker set architecture
image::587_OpenShift_lws_0925.png[Leader worker set architecture]
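The following is a minimal sketch of a `LeaderWorkerSet` resource that illustrates the fields described below. It assumes the upstream `leaderworkerset.x-k8s.io/v1` API version and a placeholder container image; the `replicas` and `size` values match the example output later in this module (two groups of three pods each). You could apply a manifest like this with `oc apply -f <filename>.yaml` and then verify the resulting pods and stateful sets as described in the steps that follow:

[source,yaml]
----
apiVersion: leaderworkerset.x-k8s.io/v1   # assumed upstream API version
kind: LeaderWorkerSet
metadata:
  name: my-lws                  # matches the pod names in the example output
  namespace: my-namespace
spec:
  replicas: 2                   # two leader-worker groups
  leaderWorkerTemplate:
    size: 3                     # 1 leader pod and 2 worker pods per group
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: registry.example.com/ai/inference-server:latest   # placeholder image
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: registry.example.com/ai/inference-server:latest   # placeholder image
  networkConfig:
    subdomainPolicy: Shared
----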
where:

`metadata.name`::
Specifies the name of the leader worker set resource.

`metadata.namespace`::
Specifies the namespace for the leader worker set to run in.

`spec.leaderWorkerTemplate.leaderTemplate`::
Specifies the pod template for the leader pods.

`spec.leaderWorkerTemplate.restartPolicy`::
Specifies the restart policy for when pod failures occur. Allowed values are `RecreateGroupOnPodRestart` to restart the whole group or `None` to not restart the group.

`spec.leaderWorkerTemplate.size`::
Specifies the number of pods to create for each group, including the leader pod. For example, a value of `3` creates 1 leader pod and 2 worker pods. The default value is `1`.

`spec.leaderWorkerTemplate.workerTemplate`::
Specifies the pod template for the worker pods.

`spec.networkConfig.subdomainPolicy`::
Specifies the policy to use when creating the headless service. Allowed values are `UniquePerReplica` or `Shared`. The default value is `Shared`.

`spec.replicas`::
Specifies the number of replicas, or leader-worker groups. The default value is `1`.

Specifies the maximum number of replicas that can be scheduled above the `replicas` value during rolling updates. The value can be specified as an integer or a percentage.

For more information about all available fields, see the upstream link:https://lws.sigs.k8s.io/docs/reference/leaderworkerset.v1/[LeaderWorkerSet API] documentation.
. Review the pods by running the `oc get pods -n my-namespace` command:
+
.Example output
[source,terminal]
----
NAME         READY   STATUS    RESTARTS   AGE
my-lws-0     1/1     Running   0          4s
my-lws-0-1   1/1     Running   0          3s
my-lws-0-2   1/1     Running   0          3s
my-lws-1     1/1     Running   0          7s
my-lws-1-1   1/1     Running   0          6s
my-lws-1-2   1/1     Running   0          6s
----
+
* `my-lws-0` is the leader pod for the first group.
* `my-lws-1` is the leader pod for the second group.
. Review the stateful sets by running the `oc get statefulsets` command:
+
.Example output
[source,terminal]
----
NAME       READY   AGE
my-lws     4/4     111s
my-lws-0   2/2     57s
my-lws-1   2/2     60s
----
+
* `my-lws` is the leader stateful set for all leader-worker groups.
* `my-lws-0` is the worker stateful set for the first group.
* `my-lws-1` is the worker stateful set for the second group.