-
Notifications
You must be signed in to change notification settings - Fork 495
[API]: feat(roles): add support for role dependencies to enforce startup order #1836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @googs1025, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
TODO: add unit test and integration test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a valuable feature for managing role dependencies to enforce startup order. The changes are well-structured, covering API definitions, CRD updates, and controller logic. The implementation correctly uses topological sorting to order roles and adds readiness checks before scaling. My review focuses on improving code clarity, efficiency, and maintainability. I've identified some areas with duplicated code and opportunities for loop optimizations. Addressing these points will make the new logic more robust and easier to maintain.
| for i, role := range roleSet.Spec.Roles { | ||
| if !isRoleDependenciesReady(roleSet, &role) { | ||
| break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using &role where role is a for...range loop variable is a common pitfall in Go. It takes the address of the loop variable, which is reused across iterations. While it might work here because it's used within the loop's scope, it's safer and clearer to get a pointer to the slice element directly to avoid potential bugs.
Additionally, this block of code for dependency checking and scaling is duplicated across RollingManagerSequential, RollingManagerParallel, and RollingManagerInterleave. Consider refactoring it into a shared helper function to improve maintainability and avoid repeating the same logic.
| for i, role := range roleSet.Spec.Roles { | |
| if !isRoleDependenciesReady(roleSet, &role) { | |
| break | |
| for i := range roleSet.Spec.Roles { | |
| if !isRoleDependenciesReady(roleSet, &roleSet.Spec.Roles[i]) { | |
| break |
| for i, role := range roleSet.Spec.Roles { | ||
| if !isRoleDependenciesReady(roleSet, &role) { | ||
| break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to my other comment, using &role with a for...range loop variable can be error-prone. It's better to reference the slice element directly to get a stable pointer.
| for i, role := range roleSet.Spec.Roles { | |
| if !isRoleDependenciesReady(roleSet, &role) { | |
| break | |
| for i := range roleSet.Spec.Roles { | |
| if !isRoleDependenciesReady(roleSet, &roleSet.Spec.Roles[i]) { | |
| break |
| for i, role := range roleSet.Spec.Roles { | ||
| if !isRoleDependenciesReady(roleSet, &role) { | ||
| break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to my other comment, using &role with a for...range loop variable can be error-prone. It's better to reference the slice element directly to get a stable pointer.
| for i, role := range roleSet.Spec.Roles { | |
| if !isRoleDependenciesReady(roleSet, &role) { | |
| break | |
| for i := range roleSet.Spec.Roles { | |
| if !isRoleDependenciesReady(roleSet, &roleSet.Spec.Roles[i]) { | |
| break |
| for _, name := range getRoleNames(roles) { | ||
| graph[name] = []string{} | ||
| inDegree[name] = 0 | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The call to getRoleNames(roles) is redundant. You are iterating over roles to create a slice of names, and then iterating over that slice to initialize the maps. You can achieve the same result by iterating over the roleNames map which is already available and contains all role names as keys.
This change would also allow for the removal of the getRoleNames helper function, simplifying the code.
| for _, name := range getRoleNames(roles) { | |
| graph[name] = []string{} | |
| inDegree[name] = 0 | |
| } | |
| for name := range roleNames { | |
| graph[name] = []string{} | |
| inDegree[name] = 0 | |
| } |
e7f3694 to
d1a9db4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
|
test in local: apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
name: vllm-sim-pd
spec:
replicas: 2
stateful: true
template:
spec:
roles:
- name: ingress # No dependencies
replicas: 1
stateful: true
template: { ... }
- name: prefill # Depends on decode (intentionally reversed for testing)
dependencies: ["decode"]
replicas: 2
stateful: true
template: { ... }
- name: decode # Depends on ingress
dependencies: ["ingress"]
replicas: 2
stateful: true
template: { ... }
- name: postprocess # Depends on decode + prefill
dependencies: ["decode", "prefill"]
replicas: 2
stateful: true
template: { ... }➜ aibrix-vllm-smi kubectl get pods -w
NAME READY STATUS RESTARTS AGE
vllm-sim-pd-roleset-fflkd-ingress-769bbc-0 0/1 Init:0/1 0 2s
vllm-sim-pd-roleset-p7hdg-ingress-769bbc-0 0/1 Init:0/1 0 2s
vllm-sim-pd-roleset-p7hdg-ingress-769bbc-0 0/1 Init:0/1 0 6s
vllm-sim-pd-roleset-fflkd-ingress-769bbc-0 0/1 Init:0/1 0 10s
vllm-sim-pd-roleset-p7hdg-ingress-769bbc-0 0/1 PodInitializing 0 15s
vllm-sim-pd-roleset-p7hdg-ingress-769bbc-0 1/1 Running 0 16s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-1 0/2 Pending 0 0s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-0 0/2 Pending 0 0s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-1 0/2 Pending 0 0s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-0 0/2 Pending 0 0s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-1 0/2 Init:0/1 0 0s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-0 0/2 Init:0/1 0 0s
vllm-sim-pd-roleset-fflkd-ingress-769bbc-0 0/1 PodInitializing 0 19s
vllm-sim-pd-roleset-fflkd-ingress-769bbc-0 1/1 Running 0 20s
vllm-sim-pd-roleset-fflkd-decode-67c59b-0 0/2 Pending 0 0s
vllm-sim-pd-roleset-fflkd-decode-67c59b-1 0/2 Pending 0 0s
vllm-sim-pd-roleset-fflkd-decode-67c59b-0 0/2 Pending 0 0s
vllm-sim-pd-roleset-fflkd-decode-67c59b-1 0/2 Pending 0 0s
vllm-sim-pd-roleset-fflkd-decode-67c59b-0 0/2 Init:0/1 0 0s
vllm-sim-pd-roleset-fflkd-decode-67c59b-1 0/2 Init:0/1 0 0s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-1 0/2 Init:0/1 0 5s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-0 0/2 Init:0/1 0 9s
vllm-sim-pd-roleset-fflkd-decode-67c59b-1 0/2 Init:0/1 0 9s
vllm-sim-pd-roleset-fflkd-decode-67c59b-0 0/2 Init:0/1 0 14s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-1 0/2 PodInitializing 0 26s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-1 2/2 Running 0 27s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-0 0/2 PodInitializing 0 30s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-0 2/2 Running 0 31s
vllm-sim-pd-roleset-p7hdg-prefill-d64758-1 0/1 Pending 0 0s
vllm-sim-pd-roleset-p7hdg-prefill-d64758-0 0/1 Pending 0 0s
vllm-sim-pd-roleset-p7hdg-prefill-d64758-1 0/1 Pending 0 0s
vllm-sim-pd-roleset-p7hdg-prefill-d64758-0 0/1 Pending 0 0s
vllm-sim-pd-roleset-p7hdg-prefill-d64758-1 0/1 ContainerCreating 0 0s
vllm-sim-pd-roleset-p7hdg-prefill-d64758-0 0/1 ContainerCreating 0 0s
vllm-sim-pd-roleset-p7hdg-prefill-d64758-0 1/1 Running 0 1s
vllm-sim-pd-roleset-p7hdg-prefill-d64758-1 1/1 Running 0 1s
vllm-sim-pd-roleset-p7hdg-postprocess-594f48-0 0/1 Pending 0 0s
vllm-sim-pd-roleset-p7hdg-postprocess-594f48-1 0/1 Pending 0 0s
vllm-sim-pd-roleset-p7hdg-postprocess-594f48-0 0/1 Pending 0 0s
vllm-sim-pd-roleset-p7hdg-postprocess-594f48-1 0/1 Pending 0 0s
vllm-sim-pd-roleset-p7hdg-postprocess-594f48-0 0/1 ContainerCreating 0 0s
vllm-sim-pd-roleset-p7hdg-postprocess-594f48-1 0/1 ContainerCreating 0 0s
vllm-sim-pd-roleset-fflkd-decode-67c59b-1 0/2 PodInitializing 0 30s
vllm-sim-pd-roleset-fflkd-decode-67c59b-1 2/2 Running 0 31s
vllm-sim-pd-roleset-p7hdg-postprocess-594f48-0 1/1 Running 0 5s
vllm-sim-pd-roleset-fflkd-decode-67c59b-0 0/2 PodInitializing 0 33s
vllm-sim-pd-roleset-fflkd-decode-67c59b-0 2/2 Running 0 34s
vllm-sim-pd-roleset-fflkd-prefill-d64758-1 0/1 Pending 0 0s
vllm-sim-pd-roleset-fflkd-prefill-d64758-0 0/1 Pending 0 0s
vllm-sim-pd-roleset-fflkd-prefill-d64758-1 0/1 Pending 0 0s
vllm-sim-pd-roleset-fflkd-prefill-d64758-0 0/1 Pending 0 0s
vllm-sim-pd-roleset-fflkd-prefill-d64758-1 0/1 ContainerCreating 0 0s
vllm-sim-pd-roleset-fflkd-prefill-d64758-0 0/1 ContainerCreating 0 0s
vllm-sim-pd-roleset-fflkd-prefill-d64758-1 1/1 Running 0 1s
vllm-sim-pd-roleset-fflkd-prefill-d64758-0 1/1 Running 0 1s
vllm-sim-pd-roleset-fflkd-postprocess-594f48-1 0/1 Pending 0 0s
vllm-sim-pd-roleset-fflkd-postprocess-594f48-0 0/1 Pending 0 0s
vllm-sim-pd-roleset-fflkd-postprocess-594f48-1 0/1 Pending 0 0s
vllm-sim-pd-roleset-fflkd-postprocess-594f48-0 0/1 Pending 0 0s
vllm-sim-pd-roleset-fflkd-postprocess-594f48-1 0/1 ContainerCreating 0 0s
vllm-sim-pd-roleset-fflkd-postprocess-594f48-0 0/1 ContainerCreating 0 0s
vllm-sim-pd-roleset-p7hdg-postprocess-594f48-1 1/1 Running 0 9s
vllm-sim-pd-roleset-fflkd-postprocess-594f48-0 1/1 Running 0 6s
vllm-sim-pd-roleset-fflkd-postprocess-594f48-1 1/1 Running 0 9s
|
d1a9db4 to
760d475
Compare
760d475 to
c741fe3
Compare
done |
c741fe3 to
83bd690
Compare
| // topologicalSortRolesFromSpec sorts roles by their dependencies using Kahn's algorithm. | ||
| // Returns sorted []RoleSpec, or error if circular dependency exists. | ||
| // TODO(CYJiang): validate role dependencies during StormService creation in webhook to prevent circular dependencies. | ||
| func (r *StormServiceReconciler) topologicalSortRolesFromSpec(roles []orchestrationv1alpha1.RoleSpec) ([]orchestrationv1alpha1.RoleSpec, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be validated in stormservice webhook
Signed-off-by: CYJiang <googs1025@gmail.com>
83bd690 to
972ffcc
Compare
| if !isRoleDependenciesReady(roleSet, &role) { | ||
| klog.Infof("Skipping role %s/%s: dependencies not ready (will retry in next reconcile)", | ||
| roleSet.Namespace, role.Name) | ||
| continue // interleaved mode: skip only this role, others may proceed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interleaved mode so we continue
| for _, role := range sortedRoles { | ||
| if !isRoleDependenciesReady(roleSet, &role) { | ||
| klog.Infof("Rollout of role %s blocked: dependencies not ready. Stopping sequential update.", role.Name) | ||
| break // enforce strict sequential order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sequential mode so we break
| if !isRoleDependenciesReady(roleSet, &role) { | ||
| klog.Infof("Skipping rollout for role %s: "+ | ||
| "dependencies not ready (parallel update continues for other roles)", role.Name) | ||
| continue // parallel mode, skip only this role |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parallel mode so we continue
|
I left some comments in original ticket. Let's firstly check whether this is necessary |
Pull Request Description
[Please provide a clear and concise description of your changes here]
Related Issues
Resolves: #1834
Important: Before submitting, please complete the description above and review the checklist below.
Contribution Guidelines (Expand for Details)
We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:
Pull Request Title Format
Your PR title should start with one of these prefixes to indicate the nature of the change:
[Bug]: Corrections to existing functionality[CI]: Changes to build process or CI pipeline[Docs]: Updates or additions to documentation[API]: Modifications to aibrix's API or interface[CLI]: Changes or additions to the Command Line Interface[Misc]: For changes not covered above (use sparingly)Note: For changes spanning multiple categories, use multiple prefixes in order of importance.
Submission Checklist
By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.