Skip to content

Initial commit for dynamic and super slicing demo on TPU7x#1271

Open
donmccasland wants to merge 1 commit into
mainfrom
dynamic-slicing
Open

Initial commit for dynamic and super slicing demo on TPU7x#1271
donmccasland wants to merge 1 commit into
mainfrom
dynamic-slicing

Conversation

@donmccasland
Copy link
Copy Markdown
Member

DO NOT MERGE - working effort

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive suite of scripts and templates to set up, deploy, monitor, and tear down a Gemma 3 Disaggregated Serving Environment on GKE with TPUs. The review feedback highlights critical issues that need to be addressed: fixing Python syntax errors and socket recreation bugs in the workload deployment script, reducing excessive CPU and memory requests for the JobSet, Kueue, and Slice controllers to ensure they can be scheduled on the default node pool, and dynamically generating the GKE access token in the environment setup to prevent authentication expiration.

Comment on lines +89 to +98
python3 -c "
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
while True:
try:
s.connect(('127.0.0.1', 6379))
break
except:
time.sleep(1)
"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The socket object s is created outside the while True loop. In Python, once a connection attempt fails, the socket is left in an invalid state. Subsequent calls to s.connect() will fail immediately with OSError (e.g., EISCONN or EINVAL) instead of retrying. Recreate the socket inside the loop to ensure robust retries.

Suggested change
python3 -c "
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
while True:
try:
s.connect(('127.0.0.1', 6379))
break
except:
time.sleep(1)
"
python3 -c "
import socket, time
while True:
try:
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 6379))
s.close()
break
except:
time.sleep(1)
"

while True:
try:
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((''\${HEAD_NODE}'', 6379))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Using double single quotes ''\${HEAD_NODE}'' inside the Python command string will result in invalid Python syntax (e.g., ''gemma3-serving-vllm-0-0.gemma3-serving-vllm''), causing the worker pods to crash immediately with a SyntaxError. Use single quotes around the variable to allow proper shell expansion into a valid Python string literal.

Suggested change
s.connect((''\${HEAD_NODE}'', 6379))
s.connect(('\$HEAD_NODE', 6379))

Comment on lines +273 to +282
python3 -c "
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
while True:
try:
s.connect(('127.0.0.1', 6379))
break
except:
time.sleep(1)
"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The socket object s is created outside the while True loop. In Python, once a connection attempt fails, the socket is left in an invalid state. Subsequent calls to s.connect() will fail immediately with OSError instead of retrying. Recreate the socket inside the loop to ensure robust retries.

Suggested change
python3 -c "
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
while True:
try:
s.connect(('127.0.0.1', 6379))
break
except:
time.sleep(1)
"
python3 -c "
import socket, time
while True:
try:
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 6379))
s.close()
break
except:
time.sleep(1)
"

while True:
try:
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((''\${HEAD_NODE}'', 6379))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Using double single quotes ''\${HEAD_NODE}'' inside the Python command string will result in invalid Python syntax, causing the worker pods to crash immediately with a SyntaxError. Use single quotes around the variable to allow proper shell expansion into a valid Python string literal.

Suggested change
s.connect((''\${HEAD_NODE}'', 6379))
s.connect(('\$HEAD_NODE', 6379))

Comment on lines +39 to +40
--set controller.resources.requests.cpu=4 \
--set controller.resources.requests.memory=16Gi \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The default GKE node pool only provisions a single e2-standard-4 node (4 vCPUs, 16GB RAM). Requesting 4 vCPUs and 16GiB of memory for the JobSet controller will prevent it from scheduling alongside other system pods. Reduce the resource requests to fit within the default node pool.

Suggested change
--set controller.resources.requests.cpu=4 \
--set controller.resources.requests.memory=16Gi \
--set controller.resources.requests.cpu=500m \
--set controller.resources.requests.memory=512Mi \

Comment on lines +51 to +52
--set controllerManager.manager.resources.requests.cpu=16 \
--set controllerManager.manager.resources.requests.memory=64Gi \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The default GKE node pool only provisions a single e2-standard-4 node (4 vCPUs, 16GB RAM). Requesting 16 vCPUs and 64GiB of memory for the Kueue controller manager will make it completely unschedulable, leaving the Kueue pods in a permanent Pending state. Reduce the resource requests to fit within the default node pool.

Suggested change
--set controllerManager.manager.resources.requests.cpu=16 \
--set controllerManager.manager.resources.requests.memory=64Gi \
--set controllerManager.manager.resources.requests.cpu=500m \
--set controllerManager.manager.resources.requests.memory=512Mi \

Comment on lines +358 to +364
resources:
limits:
cpu: 12000m
memory: 32Gi
requests:
cpu: 8000m
memory: 16Gi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The default GKE node pool only provisions a single e2-standard-4 node (4 vCPUs, 16GB RAM). Requesting 8 vCPUs and 16GiB of memory for the Slice Controller will make it unschedulable, leaving the controller in a permanent Pending state. Reduce the resource requests to fit within the default node pool.

        resources:
          limits:
            cpu: 2000m
            memory: 4Gi
          requests:
            cpu: 500m
            memory: 512Mi

Comment on lines +157 to +165
# Generate fresh access token
TOKEN="$(gcloud auth print-access-token)"

# Append to env.sh
cat <<EOF >> env.sh
export SERVER="https://${PRIVATE_IP}"
export NODE_VM_NAME="${VM_NAME}"
export TOKEN="${TOKEN}"
EOF
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Hardcoding a static OAuth token in env.sh will cause authentication failures after 1 hour when the token expires. Instead, configure env.sh to dynamically generate a fresh token using gcloud auth print-access-token every time it is sourced.

Suggested change
# Generate fresh access token
TOKEN="$(gcloud auth print-access-token)"
# Append to env.sh
cat <<EOF >> env.sh
export SERVER="https://${PRIVATE_IP}"
export NODE_VM_NAME="${VM_NAME}"
export TOKEN="${TOKEN}"
EOF
# Append to env.sh
cat <<EOF >> env.sh
export SERVER="https://${PRIVATE_IP}"
export NODE_VM_NAME="${VM_NAME}"
export TOKEN="\$(gcloud auth print-access-token)"
EOF

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant