Initial commit for dynamic and super slicing demo on TPU7x#1271
Initial commit for dynamic and super slicing demo on TPU7x#1271donmccasland wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive suite of scripts and templates to set up, deploy, monitor, and tear down a Gemma 3 Disaggregated Serving Environment on GKE with TPUs. The review feedback highlights critical issues that need to be addressed: fixing Python syntax errors and socket recreation bugs in the workload deployment script, reducing excessive CPU and memory requests for the JobSet, Kueue, and Slice controllers to ensure they can be scheduled on the default node pool, and dynamically generating the GKE access token in the environment setup to prevent authentication expiration.
| python3 -c " | ||
| import socket, time | ||
| s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) | ||
| while True: | ||
| try: | ||
| s.connect(('127.0.0.1', 6379)) | ||
| break | ||
| except: | ||
| time.sleep(1) | ||
| " |
There was a problem hiding this comment.
The socket object s is created outside the while True loop. In Python, once a connection attempt fails, the socket is left in an invalid state. Subsequent calls to s.connect() will fail immediately with OSError (e.g., EISCONN or EINVAL) instead of retrying. Recreate the socket inside the loop to ensure robust retries.
| python3 -c " | |
| import socket, time | |
| s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) | |
| while True: | |
| try: | |
| s.connect(('127.0.0.1', 6379)) | |
| break | |
| except: | |
| time.sleep(1) | |
| " | |
| python3 -c " | |
| import socket, time | |
| while True: | |
| try: | |
| s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) | |
| s.connect(('127.0.0.1', 6379)) | |
| s.close() | |
| break | |
| except: | |
| time.sleep(1) | |
| " |
| while True: | ||
| try: | ||
| s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) | ||
| s.connect((''\${HEAD_NODE}'', 6379)) |
There was a problem hiding this comment.
Using double single quotes ''\${HEAD_NODE}'' inside the Python command string will result in invalid Python syntax (e.g., ''gemma3-serving-vllm-0-0.gemma3-serving-vllm''), causing the worker pods to crash immediately with a SyntaxError. Use single quotes around the variable to allow proper shell expansion into a valid Python string literal.
| s.connect((''\${HEAD_NODE}'', 6379)) | |
| s.connect(('\$HEAD_NODE', 6379)) |
| python3 -c " | ||
| import socket, time | ||
| s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) | ||
| while True: | ||
| try: | ||
| s.connect(('127.0.0.1', 6379)) | ||
| break | ||
| except: | ||
| time.sleep(1) | ||
| " |
There was a problem hiding this comment.
The socket object s is created outside the while True loop. In Python, once a connection attempt fails, the socket is left in an invalid state. Subsequent calls to s.connect() will fail immediately with OSError instead of retrying. Recreate the socket inside the loop to ensure robust retries.
| python3 -c " | |
| import socket, time | |
| s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) | |
| while True: | |
| try: | |
| s.connect(('127.0.0.1', 6379)) | |
| break | |
| except: | |
| time.sleep(1) | |
| " | |
| python3 -c " | |
| import socket, time | |
| while True: | |
| try: | |
| s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) | |
| s.connect(('127.0.0.1', 6379)) | |
| s.close() | |
| break | |
| except: | |
| time.sleep(1) | |
| " |
| while True: | ||
| try: | ||
| s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) | ||
| s.connect((''\${HEAD_NODE}'', 6379)) |
There was a problem hiding this comment.
Using double single quotes ''\${HEAD_NODE}'' inside the Python command string will result in invalid Python syntax, causing the worker pods to crash immediately with a SyntaxError. Use single quotes around the variable to allow proper shell expansion into a valid Python string literal.
| s.connect((''\${HEAD_NODE}'', 6379)) | |
| s.connect(('\$HEAD_NODE', 6379)) |
| --set controller.resources.requests.cpu=4 \ | ||
| --set controller.resources.requests.memory=16Gi \ |
There was a problem hiding this comment.
The default GKE node pool only provisions a single e2-standard-4 node (4 vCPUs, 16GB RAM). Requesting 4 vCPUs and 16GiB of memory for the JobSet controller will prevent it from scheduling alongside other system pods. Reduce the resource requests to fit within the default node pool.
| --set controller.resources.requests.cpu=4 \ | |
| --set controller.resources.requests.memory=16Gi \ | |
| --set controller.resources.requests.cpu=500m \ | |
| --set controller.resources.requests.memory=512Mi \ |
| --set controllerManager.manager.resources.requests.cpu=16 \ | ||
| --set controllerManager.manager.resources.requests.memory=64Gi \ |
There was a problem hiding this comment.
The default GKE node pool only provisions a single e2-standard-4 node (4 vCPUs, 16GB RAM). Requesting 16 vCPUs and 64GiB of memory for the Kueue controller manager will make it completely unschedulable, leaving the Kueue pods in a permanent Pending state. Reduce the resource requests to fit within the default node pool.
| --set controllerManager.manager.resources.requests.cpu=16 \ | |
| --set controllerManager.manager.resources.requests.memory=64Gi \ | |
| --set controllerManager.manager.resources.requests.cpu=500m \ | |
| --set controllerManager.manager.resources.requests.memory=512Mi \ |
| resources: | ||
| limits: | ||
| cpu: 12000m | ||
| memory: 32Gi | ||
| requests: | ||
| cpu: 8000m | ||
| memory: 16Gi |
There was a problem hiding this comment.
The default GKE node pool only provisions a single e2-standard-4 node (4 vCPUs, 16GB RAM). Requesting 8 vCPUs and 16GiB of memory for the Slice Controller will make it unschedulable, leaving the controller in a permanent Pending state. Reduce the resource requests to fit within the default node pool.
resources:
limits:
cpu: 2000m
memory: 4Gi
requests:
cpu: 500m
memory: 512Mi| # Generate fresh access token | ||
| TOKEN="$(gcloud auth print-access-token)" | ||
|
|
||
| # Append to env.sh | ||
| cat <<EOF >> env.sh | ||
| export SERVER="https://${PRIVATE_IP}" | ||
| export NODE_VM_NAME="${VM_NAME}" | ||
| export TOKEN="${TOKEN}" | ||
| EOF |
There was a problem hiding this comment.
Hardcoding a static OAuth token in env.sh will cause authentication failures after 1 hour when the token expires. Instead, configure env.sh to dynamically generate a fresh token using gcloud auth print-access-token every time it is sourced.
| # Generate fresh access token | |
| TOKEN="$(gcloud auth print-access-token)" | |
| # Append to env.sh | |
| cat <<EOF >> env.sh | |
| export SERVER="https://${PRIVATE_IP}" | |
| export NODE_VM_NAME="${VM_NAME}" | |
| export TOKEN="${TOKEN}" | |
| EOF | |
| # Append to env.sh | |
| cat <<EOF >> env.sh | |
| export SERVER="https://${PRIVATE_IP}" | |
| export NODE_VM_NAME="${VM_NAME}" | |
| export TOKEN="\$(gcloud auth print-access-token)" | |
| EOF |
DO NOT MERGE - working effort