
Add Job workload support to CRUD benchmarking framework#1133

Draft
diamondpowell wants to merge 10 commits into main from dipowell/crud-jobs

Conversation


@diamondpowell diamondpowell commented Apr 14, 2026

Summary

Adds Job workload support to the CRUD benchmarking framework, the third and final planned workload method. Unlike deployments and statefulsets, which run indefinitely, Jobs are run-to-completion workloads: success means the pods terminated cleanly (succeeded > 0), and a failure raises immediately rather than self-healing.

Branch cleanup note: Rebased and squashed for reviewability. Independent of StatefulSet PR (#1132).

Changes

modules/python/crud/workload_templates/job.yml

  • New K8s manifest template using batch/v1 API with restartPolicy: Never
  • Uses a JOB_COMPLETIONS placeholder; parallelism is left unset, so Kubernetes defaults it to 1 and completions run sequentially
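
A minimal sketch of what a placeholder-based template like job.yml and its render step might look like. Only the batch/v1 API, restartPolicy: Never, and the JOB_COMPLETIONS placeholder come from the PR; every other field name, the container image, and the render helper are assumptions.

```python
# Hypothetical template sketch; field names beyond the ones the PR
# mentions (batch/v1, restartPolicy: Never, JOB_COMPLETIONS) are guesses.
JOB_TEMPLATE = """\
apiVersion: batch/v1
kind: Job
metadata:
  name: {JOB_NAME}
spec:
  completions: {JOB_COMPLETIONS}
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        command: ["true"]
"""

def render_job_manifest(name: str, completions: int) -> str:
    # parallelism is deliberately omitted; Kubernetes then defaults it to 1
    return JOB_TEMPLATE.format(JOB_NAME=name, JOB_COMPLETIONS=completions)
```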

modules/python/crud/azure/node_pool_crud.py

  • Add create_job() — same loop pattern as other workloads
  • Uses complete condition instead of available/ready since Jobs terminate after completion
  • No wait_for_pods_ready — pods exit after job finishes
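
The loop pattern described above can be sketched roughly as follows. The helper names on `client` (apply_manifest, wait_for_job_completed) come from the PR description, but their exact signatures, the name prefix, and the return convention are assumptions.

```python
import logging

logger = logging.getLogger(__name__)

def create_jobs(client, count, name_prefix="crud-job"):
    """Sketch of the create_job() loop; signatures are assumptions."""
    if client is None:  # mirrors the test_create_job_no_client case
        logger.error("no Kubernetes client configured; cannot create jobs")
        return False
    all_ok = True
    for i in range(count):
        name = f"{name_prefix}-{i}"
        client.apply_manifest(kind="Job", name=name)
        # Jobs terminate after completion, so wait on the 'complete'
        # condition rather than available/ready -- and skip
        # wait_for_pods_ready, since the pods exit when the Job finishes.
        if not client.wait_for_job_completed(name):
            logger.error("job %s did not complete", name)
            all_ok = False
    return all_ok
```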

modules/python/crud/main.py

  • Add jobs subparser with --node-pool-name, --number-of-jobs, --completions, --manifest-dir
  • Add elif command == "jobs" routing in handle_workload_operations
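
A hypothetical shape for that subparser is below; the flag names come from the PR description, but the defaults and required settings are assumptions. (Note that per the commit list further down, a later commit renames the subcommand to 'job' and moves --number-of-jobs to a shared --count flag.)

```python
import argparse

def build_parser():
    """Illustrative 'jobs' subparser; defaults are assumptions."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)
    jobs = sub.add_parser("jobs", help="create Job workloads on a node pool")
    jobs.add_argument("--node-pool-name", required=True)
    jobs.add_argument("--number-of-jobs", type=int, default=1)
    jobs.add_argument("--completions", type=int, default=1)
    jobs.add_argument("--manifest-dir", default=None)
    return parser
```

handle_workload_operations would then route on the parsed `command` attribute, with an `elif command == "jobs"` branch alongside the existing workloads.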

modules/python/clients/kubernetes_client.py

  • Add _check_job_condition and _is_job_condition_met — checks completion_time + succeeded count
  • Add wait_for_job_completed with 5-min timeout and 30s polling
  • Add Job kind support to apply_manifest, update_manifest, delete_manifest

steps/engine/crud/k8s/execute.yml

  • Add jobs script block calling python3 main.py jobs
  • Add number_of_jobs and completions parameters

steps/topology/k8s-crud-gpu/execute-crud.yml

  • Wire number_of_jobs and completions through to engine step

modules/python/clients/aks_client.py

  • Fix: set gpu_profile driver to "None" for non-GPU node pools

Tests

test_azure_node_pool_crud.py:

  • test_create_job_success
  • test_create_job_failure
  • test_create_job_no_client
  • test_create_job_partial_success

test_kubernetes_client.py:

  • test_wait_for_condition_job_success — Job completes successfully
  • test_wait_for_condition_job_timeout — Job does not complete within the timeout
  • test_wait_for_condition_job_not_found — not found, returns failure
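
The partial-success case in particular likely follows the usual mock pattern, sketched below. `run_job_workload` is a stand-in for the real NodePoolCRUD.create_job under test (an assumption), and the MagicMock side_effect models one job failing mid-run.

```python
from unittest.mock import MagicMock

def run_job_workload(client, count):
    # Stand-in for the code under test: apply each Job, wait, count wins.
    completed = 0
    for i in range(count):
        client.apply_manifest(kind="Job", name=f"job-{i}")
        if client.wait_for_job_completed(f"job-{i}"):
            completed += 1
    return completed

def test_partial_success():
    client = MagicMock()
    client.wait_for_job_completed.side_effect = [True, False, True]
    # the loop keeps going past the failed job and reports 2 of 3 complete
    assert run_job_workload(client, 3) == 2
    assert client.apply_manifest.call_count == 3
```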

Dependencies

Based on test-refactor (PR #879) — must merge first. Independent of StatefulSet PR (#1132).

diamondpowell force-pushed the dipowell/crud-jobs branch 2 times, most recently from 84f388c to 1da0a77 on April 21, 2026
diamondpowell force-pushed the dipowell/crud-jobs branch 3 times, most recently from 787e4d6 to 5a0978d on May 5, 2026
diamondpowell and others added 8 commits May 14, 2026 12:09
Add create_job() to NodePoolCRUD that deploys K8s Jobs onto node pools.
Unlike deployments/statefulsets which run indefinitely, Jobs are
run-to-completion workloads — success means the pod terminated cleanly
(succeeded > 0), failure raises immediately (no self-healing).

- Add 'jobs' subcommand to handle_workload_operations() in main.py
  with --number-of-jobs and --completions args
- Add job.yml workload template with configurable completions and
  node affinity via label_selector
- Add _check_job_condition and _is_job_condition_met to
  kubernetes_client.py — checks completion_time + succeeded count
- Add wait_for_job_completed with 5-min timeout and 30s polling
- Job kind support in apply/update/delete manifest methods

Add job execution step to the k8s CRUD engine pipeline between
deployment and scale-down. Parameters (number_of_jobs, completions)
flow from pipeline matrix → topology → engine step → main.py.

- Add jobs script block to steps/engine/crud/k8s/execute.yml
- Pass number_of_jobs and completions through topology execute-crud.yml
- Jobs run after deployment, before scale-down + delete

Add comprehensive test coverage for create_job and job wait_for_condition:

- test_create_job_success: single job completes successfully
- test_create_job_failure: job fails to complete
- test_create_job_partial_success: continues on individual failures
- test_job_wait_for_condition: validates _check_job_condition and
  _is_job_condition_met for 'complete' and 'failed' states
- test_wait_for_job_completed: timeout and polling behavior
- Tests cover Job-specific semantics (succeeded count, completion_time,
  failed + no active pods)

- Extract _apply_job helper (matches _apply_deployment pattern)
- Use os.path for default template path instead of hardcoded string
- Use per-job labels to avoid selector collision
- Remove redundant outer try/except

- Use workload_common_parser for shared args (--count, --manifest-dir, etc.)
- Add hasattr guard for cloud provider compatibility
- Use args.count instead of args.number_of_jobs
- Rename subcommand from 'jobs' to 'job' (matches K8s resource type)
- Update pipeline YAML to use count parameter
- Wrap job pipeline step inside Azure cloud gate (matches deployment)
- Use ${MANIFEST_DIR:+--manifest-dir} conditional (matches deployment pattern)
