
[WIP] OSDOCS-19990: Resource fair sharing #113457

Open
StephenJamesSmith wants to merge 1 commit into openshift:main from StephenJamesSmith:OSDOCS-19990


@StephenJamesSmith StephenJamesSmith commented Jun 16, 2026

Contributor

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 16, 2026

openshift-ci-robot commented Jun 16, 2026


@StephenJamesSmith: This pull request references OSDOCS-19990 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Admission Fair Sharing (Kueue) Integration for Multi-Tenant Resource Fairness

Version: 4.22+

Jira: https://redhat.atlassian.net/browse/OSDOCS-19990

Preview:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 16, 2026

ocpdocs-previewbot commented Jun 16, 2026


🤖 Tue Jun 16 14:57:00 - Prow CI generated the docs preview:

https://113457--ocpdocs-pr.netlify.app/openshift-enterprise/latest/ai_workloads/kueue/admission-fair-sharing.html

+
`resourceWeights`:: Assigns weights to resources. The higher the weight, the higher the penalty.

`usageHalfLifeTimeSeconds`:: The time in seconds after which the current usage will decrease by half. In other words, controls how long the past consumption should impact future admission.

Collaborator


🤖 [error] RedHat.TermsErrors: Use 'for example' or 'that is' rather than 'In other words'. For more information, see RedHat.TermsErrors.
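For reviewers checking the `usageHalfLifeTimeSeconds` wording above, here is a minimal sketch of what a half-life means numerically. It assumes plain exponential decay, which is what the field description implies; the function name and the sample values are illustrative and are not taken from the Kueue code base, whose actual computation also depends on the sampling interval.

```python
def decayed_usage(usage, elapsed_seconds, half_life_seconds):
    """Return the remaining usage after exponential decay with the given half-life."""
    if half_life_seconds <= 0:
        raise ValueError("half_life_seconds must be positive")
    # After one half-life the usage is halved, after two it is quartered, and so on.
    return usage * 0.5 ** (elapsed_seconds / half_life_seconds)


half_life = 432000  # illustrative value: 5 days in seconds
for days in (0, 5, 10, 20):
    remaining = decayed_usage(100.0, days * 86400, half_life)
    print(f"after {days:>2} days: {remaining:.2f}")
```

A longer half-life makes past consumption count against a queue for longer; a shorter one lets a queue recover its admission priority sooner.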


openshift-ci Bot commented Jun 16, 2026


@StephenJamesSmith: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Given this upstream limitation, isolating CPU as the sole scoring factor by setting a memory weight of `0` is a reliable approach for deterministic fair sharing behavior.

The following example contains `admissionFairSharing.resourceWeights` settings for mixed CPU, memory, and GPU weights:

Contributor Author


@MaysaMacedo Wondering if we should use the example Option A from kubernetes-sigs/kueue#10434?

Contributor


We can't because that is not implemented yet.

. Choose the `configuration` type you want to use:
+
* `Default`: Uses {kueue-name} predefined values.
* `Custom`: Uses {kueue-name} predefined values.

Contributor


Custom allows for the user to specify their own desired values.

Contributor Author


fixed.

+
[source,yaml]
----
config:

Contributor


Instead, we can recommend that the user run the following command to apply the default configuration:

oc patch kueue.kueue.openshift.io/cluster --type=merge -p \
  '{"spec":{"config":{"admissionFairSharing":{"configuration":"Default"}}}}'

Contributor Author


done.


There's a small correction, Stephen. The command would be:

oc patch kueue.kueue.openshift.io/cluster --type=merge -p \
  '{"spec":{"config":{"admissionFairSharing":{"configuration":"Default","custom":null}}}}'
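The `"custom":null` in the corrected command matters because `--type=merge` applies a JSON merge patch (RFC 7386), in which a `null` value deletes the key and an omitted key is left untouched. The following is a minimal sketch of those semantics, not the API server's implementation; the `current` object is a hypothetical existing spec used only for illustration.

```python
import copy


def json_merge_patch(target, patch):
    """Apply an RFC 7386 JSON merge patch to target and return a new object."""
    if not isinstance(patch, dict):
        # A non-object patch replaces the target wholesale.
        return copy.deepcopy(patch)
    result = copy.deepcopy(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            # null means "remove this member".
            result.pop(key, None)
        else:
            result[key] = json_merge_patch(result.get(key), value)
    return result


# Hypothetical existing spec that already carries a Custom configuration.
current = {"spec": {"config": {"admissionFairSharing": {
    "configuration": "Custom",
    "custom": {"usageHalfLifeTimeSeconds": 432000},
}}}}

with_null = {"spec": {"config": {"admissionFairSharing": {
    "configuration": "Default", "custom": None,
}}}}
without_null = {"spec": {"config": {"admissionFairSharing": {
    "configuration": "Default",
}}}}

print(json_merge_patch(current, with_null))     # custom block removed
print(json_merge_patch(current, without_null))  # stale custom block remains
```

Without the `null`, switching from `Custom` back to `Default` would leave the old `custom` block in place, which is presumably what the correction guards against.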

configuration: Default
----
+
* For `Custom` configuration use the following command:

Contributor


Suggested change
- * For `Custom` configuration use the following command:
+ * For `Custom` configuration you can adapt the following command with your desired values:

Contributor Author


Changed to "Use the following command to create a Custom configuration that applies values that you specify:"

[source,terminal]
----
oc patch kueue.kueue.openshift.io/cluster --type=merge -p \

Contributor


unnecessary space

Contributor Author


changed.


[role="_abstract"]
Use Admission fair sharing to fairly distribute workloads across LocalQueues that share a single ClusterQueue.
This feature balances workload admission by prioritizing workloads from tenants that have used fewer resources historically. It tracks usage over time with a configurable decay function and applies admission penalties when workloads are admitted.

Contributor


Suggested change
- This feature balances workload admission by prioritizing workloads from tenants that have used fewer resources historically. It tracks usage over time with a configurable decay function and applies admission penalties when workloads are admitted.
+ This feature balances workload admission by prioritizing workloads from LocalQueues that have used fewer resources historically. It tracks usage over time with a configurable decay function and applies admission penalties when workloads are admitted.

Contributor Author


changed.
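To make the abstract's claim concrete, here is a toy sketch of "prioritizing workloads from LocalQueues that have used fewer resources historically". It assumes the ordering key is a weighted sum of each queue's tracked usage; the queue names, usage figures, and the `weighted_usage` helper are hypothetical, and this is an illustration of the idea rather than Kueue's scheduling code.

```python
def weighted_usage(usage, weights):
    """Sum each resource's tracked usage multiplied by its configured weight."""
    return sum(weights[resource] * amount for resource, amount in usage.items())


# Hypothetical historical usage per LocalQueue, in CPU cores.
queues = {
    "team-a": {"cpu": 40.0},
    "team-b": {"cpu": 10.0},
    "team-c": {"cpu": 25.0},
}
weights = {"cpu": 1.0}

# Queues with lower weighted usage are considered first for admission.
order = sorted(queues, key=lambda name: weighted_usage(queues[name], weights))
print(order)
```

Under these assumptions `team-b` is considered first because it has consumed the least, and `team-a` last.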

[id="setting-resource-weights_{context}"]
= Setting resource weights

[role="_abstract"]

Contributor


Will get back to this.

= Setting resource weights

[role="_abstract"]
Resource weights define the relative importance of different resource types (CPU, memory, GPU) when calculating admission penalties. Queues that consume resources with higher weights receive larger penalties, reducing their priority for future workload admission.

Contributor


Instead of adding a section specific to resourceWeights, I was thinking we could just add a note where you explained what each field of the configuration is, with something like:

When using Admission Fair Sharing, the resourceWeights for any resource whose Kubernetes quantity is expressed in bytes — such as memory — must be scaled down to compensate for the internal byte representation. Without this adjustment, the raw byte value of these resources will numerically dominate human-scale resources, such as CPU cores, by several orders of magnitude, effectively making their weights meaningless. For example, if you would like to specify the value of 1.0 for the memory weight, you would need to instead specify 9.31e-10, which corresponds to 1.0 / 1,073,741,824.
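The arithmetic in the proposed note can be checked with a short sketch. It assumes the penalty is a weighted sum of raw quantities (cores for CPU, bytes for memory); the helper name and the sample workload of 4 cores and 16 GiB are hypothetical and only serve to show the scale difference.

```python
GIB = 1024 ** 3  # 1,073,741,824 bytes


def scaled_byte_weight(desired_weight, unit_bytes=GIB):
    """Scale a human-scale weight down for a resource measured in bytes."""
    return desired_weight / unit_bytes


memory_weight = scaled_byte_weight(1.0)
print(f"{memory_weight:.3g}")  # 9.31e-10

# Hypothetical usage: 4 CPU cores and 16 GiB of memory as raw quantities.
cpu_cores = 4
memory_bytes = 16 * GIB

unscaled = 1.0 * cpu_cores + 1.0 * memory_bytes
scaled = 1.0 * cpu_cores + memory_weight * memory_bytes

print(unscaled)  # the memory term swamps the CPU term
print(scaled)    # 20.0, CPU and memory now contribute on a comparable scale
```

With an unscaled memory weight of `1.0`, the CPU contribution is lost in the roughly 1.7e10 total; with the scaled weight, 16 GiB counts as 16 units next to 4 cores.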

Contributor


@anahas-redhat @kannon92 let me know what you guys think about it


I agree with removing this, also because the example below uses GPUs, which may be out of context without a DRA explanation. Added more details here: #113457 (comment)

@MaysaMacedo

Contributor

@StephenJamesSmith In the description of the PR you mentioned this is for Version: 4.22+. However, that should be Version: 4.18+. Can you adapt that? Thank you

Use Admission fair sharing to fairly distribute workloads across LocalQueues that share a single ClusterQueue.
This feature balances workload admission by prioritizing workloads from tenants that have used fewer resources historically. It tracks usage over time with a configurable decay function and applies admission penalties when workloads are admitted.

The shared ClusterQueue causes resource starvation between tenants, creating a high risk of resource starvation for the tenants. Admission fair sharing addresses this issue by meeting the following requirements:


"high risk of resource starvation" — redundant phrasing
Suggestion: "When multiple tenants share a single ClusterQueue, some tenants risk resource starvation. Admission fair sharing addresses this by..."


Improve service predictability:: Guarantee each tenant gets a consistent share of resources, reducing latency spikes and preventing starvation.

Enable scalable governance:: Use dynamic, usage-based allocation instead of complex static quotas.

@anahas-redhat anahas-redhat Jun 16, 2026


I think this can be changed because Admission fair sharing does not replace quotas. It works alongside ClusterQueue quotas. Maybe something like:
"Complement static quotas with dynamic, usage-based admission ordering that adapts as tenant demand changes."

nvidia.com/gpu.count : 50
----

In this example, ....................


Do you want to add some explanation here?


[source,yaml]
----
admissionFairSharing:

@anahas-redhat anahas-redhat Jun 16, 2026


This format does not work because time fields are in seconds downstream, the resource name is not valid, and the format is wrong.
If we want to detail about GPUs, the user would need to first create a DeviceClass (considering Nvidia, from your example):

  "spec": {
    "config": {
      "resources": {
        "deviceClassMappings": [{
          "name": "nvidia-gpus",
          "deviceClassNames": ["gpu.nvidia.com"]
        }]
      }
    }
  }
}'

And then configure Kueue Operand like this (considering the time in your example):

oc patch kueue.kueue.openshift.io/cluster --type=merge -p '{
  "spec": {
    "config": {
      "admissionFairSharing": {
        "configuration": "Custom",
        "custom": {
          "usageHalfLifeTimeSeconds": 432000,
          "usageSamplingIntervalSeconds": 300,
          "resourceWeights": [
            {"name": "cpu", "weight": "1"},
            {"name": "memory", "weight": "4"},
            {"name": "nvidia-gpus", "weight": "50"}
          ]
        }
      }
    }
  }
}'

I guess we agreed to talk about Admission Fair Sharing on the Kueue + DRA documents, right? Because it may be out of context for the user to add these concepts here.
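For anyone adapting the `Custom` patch above with different values, the payload can be generated rather than hand-edited, which avoids quoting and bracket mistakes. This is a minimal sketch that mirrors the field names and string-typed weights shown in the command above; `build_custom_patch` is a hypothetical helper, and the script only prints the command instead of running it.

```python
import json
import shlex


def build_custom_patch(half_life_seconds, sampling_interval_seconds, resource_weights):
    """Build the merge-patch body for a Custom admissionFairSharing configuration."""
    return {"spec": {"config": {"admissionFairSharing": {
        "configuration": "Custom",
        "custom": {
            "usageHalfLifeTimeSeconds": half_life_seconds,
            "usageSamplingIntervalSeconds": sampling_interval_seconds,
            # Weights are emitted as strings, matching the example above.
            "resourceWeights": [
                {"name": name, "weight": str(weight)}
                for name, weight in resource_weights.items()
            ],
        },
    }}}}


patch = build_custom_patch(432000, 300, {"cpu": 1, "memory": 4, "nvidia-gpus": 50})
payload = json.dumps(patch)

# Print the command for review; shlex.quote handles the shell quoting.
print("oc patch kueue.kueue.openshift.io/cluster --type=merge -p " + shlex.quote(payload))
```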
