
Add Job workload support to CRUD benchmarking framework#1133

Draft
diamondpowell wants to merge 10 commits into main from dipowell/crud-jobs

Conversation


@diamondpowell diamondpowell commented Apr 14, 2026

Summary

Adds Job workload support to the CRUD benchmarking framework, the third and final planned workload method. Unlike deployments and statefulsets, which run indefinitely, Jobs are run-to-completion workloads: success means the pods terminated cleanly (succeeded > 0), and a failure raises immediately rather than self-healing.

Branch cleanup note: Rebased and squashed for reviewability. Independent of StatefulSet PR (#1132).

Changes

modules/python/crud/workload_templates/job.yml

  • New K8s manifest template using batch/v1 API with restartPolicy: Never
  • Uses a JOB_COMPLETIONS placeholder; parallelism is left unset, so Kubernetes defaults it to 1 and completions run sequentially
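
A minimal sketch of what a placeholder-based template like job.yml and its render step might look like. Only the batch/v1 API, restartPolicy: Never, and the JOB_COMPLETIONS placeholder come from the PR; every other field name, the container image, and the render helper are assumptions.

```python
# Hypothetical template sketch; field names beyond the ones the PR
# mentions (batch/v1, restartPolicy: Never, JOB_COMPLETIONS) are guesses.
JOB_TEMPLATE = """\
apiVersion: batch/v1
kind: Job
metadata:
  name: {JOB_NAME}
spec:
  completions: {JOB_COMPLETIONS}
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        command: ["true"]
"""

def render_job_manifest(name: str, completions: int) -> str:
    # parallelism is deliberately omitted; Kubernetes then defaults it to 1
    return JOB_TEMPLATE.format(JOB_NAME=name, JOB_COMPLETIONS=completions)
```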

modules/python/crud/azure/node_pool_crud.py

  • Add create_job() — same loop pattern as other workloads
  • Uses complete condition instead of available/ready since Jobs terminate after completion
  • No wait_for_pods_ready — pods exit after job finishes
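
The loop pattern described above can be sketched roughly as follows. The helper names on `client` (apply_manifest, wait_for_job_completed) come from the PR description, but their exact signatures, the name prefix, and the return convention are assumptions.

```python
import logging

logger = logging.getLogger(__name__)

def create_jobs(client, count, name_prefix="crud-job"):
    """Sketch of the create_job() loop; signatures are assumptions."""
    if client is None:  # mirrors the test_create_job_no_client case
        logger.error("no Kubernetes client configured; cannot create jobs")
        return False
    all_ok = True
    for i in range(count):
        name = f"{name_prefix}-{i}"
        client.apply_manifest(kind="Job", name=name)
        # Jobs terminate after completion, so wait on the 'complete'
        # condition rather than available/ready -- and skip
        # wait_for_pods_ready, since the pods exit when the Job finishes.
        if not client.wait_for_job_completed(name):
            logger.error("job %s did not complete", name)
            all_ok = False
    return all_ok
```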

modules/python/crud/main.py

  • Add jobs subparser with --node-pool-name, --number-of-jobs, --completions, --manifest-dir
  • Add elif command == "jobs" routing in handle_workload_operations
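
A hypothetical shape for that subparser is below; the flag names come from the PR description, but the defaults and required settings are assumptions. (Note that per the commit list further down, a later commit renames the subcommand to 'job' and moves --number-of-jobs to a shared --count flag.)

```python
import argparse

def build_parser():
    """Illustrative 'jobs' subparser; defaults are assumptions."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)
    jobs = sub.add_parser("jobs", help="create Job workloads on a node pool")
    jobs.add_argument("--node-pool-name", required=True)
    jobs.add_argument("--number-of-jobs", type=int, default=1)
    jobs.add_argument("--completions", type=int, default=1)
    jobs.add_argument("--manifest-dir", default=None)
    return parser
```

handle_workload_operations would then route on the parsed `command` attribute, with an `elif command == "jobs"` branch alongside the existing workloads.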

modules/python/clients/kubernetes_client.py

  • Add _check_job_condition and _is_job_condition_met — checks completion_time + succeeded count
  • Add wait_for_job_completed with 5-min timeout and 30s polling
  • Add Job kind support to apply_manifest, update_manifest, delete_manifest

steps/engine/crud/k8s/execute.yml

  • Add jobs script block calling python3 main.py jobs
  • Add number_of_jobs and completions parameters

steps/topology/k8s-crud-gpu/execute-crud.yml

  • Wire number_of_jobs and completions through to engine step

modules/python/clients/aks_client.py

  • Fix: set gpu_profile driver to "None" for non-GPU node pools

Tests

test_azure_node_pool_crud.py:

  • test_create_job_success
  • test_create_job_failure
  • test_create_job_no_client
  • test_create_job_partial_success

test_kubernetes_client.py:

  • test_wait_for_condition_job_success — Job completes successfully
  • test_wait_for_condition_job_timeout — Job does not complete within the timeout
  • test_wait_for_condition_job_not_found — not found, returns failure
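
The partial-success case in particular likely follows the usual mock pattern, sketched below. `run_job_workload` is a stand-in for the real NodePoolCRUD.create_job under test (an assumption), and the MagicMock side_effect models one job failing mid-run.

```python
from unittest.mock import MagicMock

def run_job_workload(client, count):
    # Stand-in for the code under test: apply each Job, wait, count wins.
    completed = 0
    for i in range(count):
        client.apply_manifest(kind="Job", name=f"job-{i}")
        if client.wait_for_job_completed(f"job-{i}"):
            completed += 1
    return completed

def test_partial_success():
    client = MagicMock()
    client.wait_for_job_completed.side_effect = [True, False, True]
    # the loop keeps going past the failed job and reports 2 of 3 complete
    assert run_job_workload(client, 3) == 2
    assert client.apply_manifest.call_count == 3
```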

Dependencies

Based on test-refactor (PR #879) — must merge first. Independent of StatefulSet PR (#1132).

diamondpowell force-pushed the dipowell/crud-jobs branch 2 times, most recently from 84f388c to 1da0a77 on April 21, 2026
diamondpowell force-pushed the dipowell/crud-jobs branch 3 times, most recently from 787e4d6 to 5a0978d on May 5, 2026
diamondpowell and others added 8 commits May 14, 2026 12:09
Add create_job() to NodePoolCRUD that deploys K8s Jobs onto node pools.
Unlike deployments/statefulsets which run indefinitely, Jobs are
run-to-completion workloads — success means the pod terminated cleanly
(succeeded > 0), failure raises immediately (no self-healing).

- Add 'jobs' subcommand to handle_workload_operations() in main.py
  with --number-of-jobs and --completions args
- Add job.yml workload template with configurable completions and
  node affinity via label_selector
- Add _check_job_condition and _is_job_condition_met to
  kubernetes_client.py — checks completion_time + succeeded count
- Add wait_for_job_completed with 5-min timeout and 30s polling
- Job kind support in apply/update/delete manifest methods

Add job execution step to the k8s CRUD engine pipeline between
deployment and scale-down. Parameters (number_of_jobs, completions)
flow from pipeline matrix → topology → engine step → main.py.

- Add jobs script block to steps/engine/crud/k8s/execute.yml
- Pass number_of_jobs and completions through topology execute-crud.yml
- Jobs run after deployment, before scale-down + delete

Add comprehensive test coverage for create_job and job wait_for_condition:

- test_create_job_success: single job completes successfully
- test_create_job_failure: job fails to complete
- test_create_job_partial_success: continues on individual failures
- test_job_wait_for_condition: validates _check_job_condition and
  _is_job_condition_met for 'complete' and 'failed' states
- test_wait_for_job_completed: timeout and polling behavior
- Tests cover Job-specific semantics (succeeded count, completion_time,
  failed + no active pods)

- Extract _apply_job helper (matches _apply_deployment pattern)
- Use os.path for default template path instead of hardcoded string
- Use per-job labels to avoid selector collision
- Remove redundant outer try/except

- Use workload_common_parser for shared args (--count, --manifest-dir, etc.)
- Add hasattr guard for cloud provider compatibility
- Use args.count instead of args.number_of_jobs
- Rename subcommand from 'jobs' to 'job' (matches K8s resource type)
- Update pipeline YAML to use count parameter
- Wrap job pipeline step inside Azure cloud gate (matches deployment)
- Use ${MANIFEST_DIR:+--manifest-dir} conditional (matches deployment pattern)
