Skip to content

Conversation

@googs1025
Copy link
Collaborator

Pull Request Description

[Please provide a clear and concise description of your changes here]

Related Issues

Resolves: #1834

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @googs1025, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the aibrix orchestration system by introducing robust support for role dependencies. It enables users to define a precise startup order for roles within a StormService, guaranteeing that critical services are not initiated until their prerequisite components are fully operational. This feature improves the stability and predictability of complex deployments by preventing race conditions and ensuring a controlled application startup sequence.

Highlights

  • API Extension for Role Dependencies: A new Dependencies field has been added to the RoleSpec within the StormService and RoleSet APIs. This field allows users to explicitly define a list of other roles that must be ready before the current role can start, establishing a clear startup order.
  • Enforced Startup Order in Rollout Managers: The RollingManager implementations (Sequential, Parallel, Interleave) have been updated to incorporate dependency checks. Before scaling or rolling out any role, the system now verifies that all its declared dependencies are fully ready, ensuring proper service initialization.
  • Topological Sorting for Role Processing: A topological sort algorithm (Kahn's algorithm) has been introduced to order roles based on their dependencies during StormService creation. This ensures that roles are processed in a valid sequence and includes detection for circular dependencies, preventing potential deadlocks.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@googs1025
Copy link
Collaborator Author

TODO: add unit test and integration test

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a valuable feature for managing role dependencies to enforce startup order. The changes are well-structured, covering API definitions, CRD updates, and controller logic. The implementation correctly uses topological sorting to order roles and adds readiness checks before scaling. My review focuses on improving code clarity, efficiency, and maintainability. I've identified some areas with duplicated code and opportunities for loop optimizations. Addressing these points will make the new logic more robust and easier to maintain.

Comment on lines +42 to +44
for i, role := range roleSet.Spec.Roles {
if !isRoleDependenciesReady(roleSet, &role) {
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using &role where role is a for...range loop variable is a common pitfall in Go. It takes the address of the loop variable, which is reused across iterations. While it might work here because it's used within the loop's scope, it's safer and clearer to get a pointer to the slice element directly to avoid potential bugs.

Additionally, this block of code for dependency checking and scaling is duplicated across RollingManagerSequential, RollingManagerParallel, and RollingManagerInterleave. Consider refactoring it into a shared helper function to improve maintainability and avoid repeating the same logic.

Suggested change
for i, role := range roleSet.Spec.Roles {
if !isRoleDependenciesReady(roleSet, &role) {
break
for i := range roleSet.Spec.Roles {
if !isRoleDependenciesReady(roleSet, &roleSet.Spec.Roles[i]) {
break

Comment on lines +112 to +118
for i, role := range roleSet.Spec.Roles {
if !isRoleDependenciesReady(roleSet, &role) {
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to my other comment, using &role with a for...range loop variable can be error-prone. It's better to reference the slice element directly to get a stable pointer.

Suggested change
for i, role := range roleSet.Spec.Roles {
if !isRoleDependenciesReady(roleSet, &role) {
break
for i := range roleSet.Spec.Roles {
if !isRoleDependenciesReady(roleSet, &roleSet.Spec.Roles[i]) {
break

Comment on lines +158 to +169
for i, role := range roleSet.Spec.Roles {
if !isRoleDependenciesReady(roleSet, &role) {
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to my other comment, using &role with a for...range loop variable can be error-prone. It's better to reference the slice element directly to get a stable pointer.

Suggested change
for i, role := range roleSet.Spec.Roles {
if !isRoleDependenciesReady(roleSet, &role) {
break
for i := range roleSet.Spec.Roles {
if !isRoleDependenciesReady(roleSet, &roleSet.Spec.Roles[i]) {
break

Comment on lines +102 to +105
for _, name := range getRoleNames(roles) {
graph[name] = []string{}
inDegree[name] = 0
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The call to getRoleNames(roles) is redundant. You are iterating over roles to create a slice of names, and then iterating over that slice to initialize the maps. You can achieve the same result by iterating over the roleNames map which is already available and contains all role names as keys.

This change would also allow for the removal of the getRoleNames helper function, simplifying the code.

Suggested change
for _, name := range getRoleNames(roles) {
graph[name] = []string{}
inDegree[name] = 0
}
for name := range roleNames {
graph[name] = []string{}
inDegree[name] = 0
}

@googs1025 googs1025 force-pushed the dependency branch 5 times, most recently from e7f3694 to d1a9db4 Compare December 10, 2025 03:04
@googs1025 googs1025 requested a review from Copilot December 10, 2025 03:05
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@googs1025
Copy link
Collaborator Author

test in local:

apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: vllm-sim-pd
spec:
  replicas: 2
  stateful: true
  template:
    spec:
      roles:
        - name: ingress          # No dependencies
          replicas: 1
          stateful: true
          template: { ... }

        - name: prefill          # Depends on decode (intentionally reversed for testing)
          dependencies: ["decode"]
          replicas: 2
          stateful: true
          template: { ... }

        - name: decode           # Depends on ingress
          dependencies: ["ingress"]
          replicas: 2
          stateful: true
          template: { ... }

        - name: postprocess      # Depends on decode + prefill
          dependencies: ["decode", "prefill"]
          replicas: 2
          stateful: true
          template: { ... }
➜  aibrix-vllm-smi kubectl get pods -w
NAME                                         READY   STATUS     RESTARTS      AGE
vllm-sim-pd-roleset-fflkd-ingress-769bbc-0   0/1     Init:0/1   0             2s
vllm-sim-pd-roleset-p7hdg-ingress-769bbc-0   0/1     Init:0/1   0             2s


vllm-sim-pd-roleset-p7hdg-ingress-769bbc-0   0/1     Init:0/1   0             6s
vllm-sim-pd-roleset-fflkd-ingress-769bbc-0   0/1     Init:0/1   0             10s
vllm-sim-pd-roleset-p7hdg-ingress-769bbc-0   0/1     PodInitializing   0             15s
vllm-sim-pd-roleset-p7hdg-ingress-769bbc-0   1/1     Running           0             16s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-1    0/2     Pending           0             0s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-0    0/2     Pending           0             0s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-1    0/2     Pending           0             0s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-0    0/2     Pending           0             0s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-1    0/2     Init:0/1          0             0s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-0    0/2     Init:0/1          0             0s
vllm-sim-pd-roleset-fflkd-ingress-769bbc-0   0/1     PodInitializing   0             19s
vllm-sim-pd-roleset-fflkd-ingress-769bbc-0   1/1     Running           0             20s
vllm-sim-pd-roleset-fflkd-decode-67c59b-0    0/2     Pending           0             0s
vllm-sim-pd-roleset-fflkd-decode-67c59b-1    0/2     Pending           0             0s
vllm-sim-pd-roleset-fflkd-decode-67c59b-0    0/2     Pending           0             0s
vllm-sim-pd-roleset-fflkd-decode-67c59b-1    0/2     Pending           0             0s
vllm-sim-pd-roleset-fflkd-decode-67c59b-0    0/2     Init:0/1          0             0s
vllm-sim-pd-roleset-fflkd-decode-67c59b-1    0/2     Init:0/1          0             0s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-1    0/2     Init:0/1          0             5s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-0    0/2     Init:0/1          0             9s
vllm-sim-pd-roleset-fflkd-decode-67c59b-1    0/2     Init:0/1          0             9s


vllm-sim-pd-roleset-fflkd-decode-67c59b-0    0/2     Init:0/1          0             14s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-1    0/2     PodInitializing   0             26s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-1    2/2     Running           0             27s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-0    0/2     PodInitializing   0             30s
vllm-sim-pd-roleset-p7hdg-decode-67c59b-0    2/2     Running           0             31s
vllm-sim-pd-roleset-p7hdg-prefill-d64758-1   0/1     Pending           0             0s
vllm-sim-pd-roleset-p7hdg-prefill-d64758-0   0/1     Pending           0             0s
vllm-sim-pd-roleset-p7hdg-prefill-d64758-1   0/1     Pending           0             0s
vllm-sim-pd-roleset-p7hdg-prefill-d64758-0   0/1     Pending           0             0s
vllm-sim-pd-roleset-p7hdg-prefill-d64758-1   0/1     ContainerCreating   0             0s
vllm-sim-pd-roleset-p7hdg-prefill-d64758-0   0/1     ContainerCreating   0             0s
vllm-sim-pd-roleset-p7hdg-prefill-d64758-0   1/1     Running             0             1s
vllm-sim-pd-roleset-p7hdg-prefill-d64758-1   1/1     Running             0             1s
vllm-sim-pd-roleset-p7hdg-postprocess-594f48-0   0/1     Pending             0             0s
vllm-sim-pd-roleset-p7hdg-postprocess-594f48-1   0/1     Pending             0             0s
vllm-sim-pd-roleset-p7hdg-postprocess-594f48-0   0/1     Pending             0             0s
vllm-sim-pd-roleset-p7hdg-postprocess-594f48-1   0/1     Pending             0             0s
vllm-sim-pd-roleset-p7hdg-postprocess-594f48-0   0/1     ContainerCreating   0             0s
vllm-sim-pd-roleset-p7hdg-postprocess-594f48-1   0/1     ContainerCreating   0             0s
vllm-sim-pd-roleset-fflkd-decode-67c59b-1        0/2     PodInitializing     0             30s
vllm-sim-pd-roleset-fflkd-decode-67c59b-1        2/2     Running             0             31s
vllm-sim-pd-roleset-p7hdg-postprocess-594f48-0   1/1     Running             0             5s
vllm-sim-pd-roleset-fflkd-decode-67c59b-0        0/2     PodInitializing     0             33s
vllm-sim-pd-roleset-fflkd-decode-67c59b-0        2/2     Running             0             34s
vllm-sim-pd-roleset-fflkd-prefill-d64758-1       0/1     Pending             0             0s
vllm-sim-pd-roleset-fflkd-prefill-d64758-0       0/1     Pending             0             0s
vllm-sim-pd-roleset-fflkd-prefill-d64758-1       0/1     Pending             0             0s
vllm-sim-pd-roleset-fflkd-prefill-d64758-0       0/1     Pending             0             0s
vllm-sim-pd-roleset-fflkd-prefill-d64758-1       0/1     ContainerCreating   0             0s
vllm-sim-pd-roleset-fflkd-prefill-d64758-0       0/1     ContainerCreating   0             0s
vllm-sim-pd-roleset-fflkd-prefill-d64758-1       1/1     Running             0             1s
vllm-sim-pd-roleset-fflkd-prefill-d64758-0       1/1     Running             0             1s
vllm-sim-pd-roleset-fflkd-postprocess-594f48-1   0/1     Pending             0             0s
vllm-sim-pd-roleset-fflkd-postprocess-594f48-0   0/1     Pending             0             0s
vllm-sim-pd-roleset-fflkd-postprocess-594f48-1   0/1     Pending             0             0s
vllm-sim-pd-roleset-fflkd-postprocess-594f48-0   0/1     Pending             0             0s
vllm-sim-pd-roleset-fflkd-postprocess-594f48-1   0/1     ContainerCreating   0             0s
vllm-sim-pd-roleset-fflkd-postprocess-594f48-0   0/1     ContainerCreating   0             0s
vllm-sim-pd-roleset-p7hdg-postprocess-594f48-1   1/1     Running             0             9s
vllm-sim-pd-roleset-fflkd-postprocess-594f48-0   1/1     Running             0             6s
vllm-sim-pd-roleset-fflkd-postprocess-594f48-1   1/1     Running             0             9s

@googs1025
Copy link
Collaborator Author

TODO: add unit test and integration test

done

// topologicalSortRolesFromSpec sorts roles by their dependencies using Kahn's algorithm.
// Returns sorted []RoleSpec, or error if circular dependency exists.
// TODO(CYJiang): validate role dependencies during StormService creation in webhook to prevent circular dependencies.
func (r *StormServiceReconciler) topologicalSortRolesFromSpec(roles []orchestrationv1alpha1.RoleSpec) ([]orchestrationv1alpha1.RoleSpec, error) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be validated in stormservice webhook

if !isRoleDependenciesReady(roleSet, &role) {
klog.Infof("Skipping role %s/%s: dependencies not ready (will retry in next reconcile)",
roleSet.Namespace, role.Name)
continue // interleaved mode: skip only this role, others may proceed
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interleaved mode so we continue

for _, role := range sortedRoles {
if !isRoleDependenciesReady(roleSet, &role) {
klog.Infof("Rollout of role %s blocked: dependencies not ready. Stopping sequential update.", role.Name)
break // enforce strict sequential order
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sequential mode so we break

if !isRoleDependenciesReady(roleSet, &role) {
klog.Infof("Skipping rollout for role %s: "+
"dependencies not ready (parallel update continues for other roles)", role.Name)
continue // parallel mode, skip only this role
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parallel mode so we continue

@googs1025 googs1025 assigned googs1025 and Jeffwan and unassigned googs1025 Dec 11, 2025
@Jeffwan
Copy link
Collaborator

Jeffwan commented Dec 11, 2025

I left some comments in original ticket. Let's firstly check whether this is necessary

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC]: Add dependencies field in StormService

2 participants