
Add developer doc for job lifecycle events#4935

Open
dejanzele wants to merge 2 commits into armadaproject:master from dejanzele:docs/job-lifecycle-events

@dejanzele dejanzele commented May 28, 2026

Member

Adds a developer reference for the events and state transitions across a job run's lifecycle.

The doc covers the multi-cluster topology and the two transports (Pulsar and the gRPC lease stream) that carry events between the control plane and executors, the job-level and run-level state machines, and the internal proto event vocabulary alongside its mapping to the external API event vocabulary. It then walks through step-by-step flows for the four terminal cases: succeeded, failed (both organic terminal-phase and executor-issue-handler paths), preempted, and cancelled.

The preempt section documents the current double-emission behavior. Each preempted run produces two JobPreemptedEvent messages on the external stream and an overwrite in the job_run_errors row that replaces the scheduler's preemption description with the executor's generic "Run preempted" text. Operators integrating with the event stream need to know about this.
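
Until the double emission is resolved, a consumer of the external stream may want to collapse the repeated preemption event itself. The following is a minimal sketch of one way to do that, assuming each event exposes a job ID and a run ID. The `Event` type, its field names, and `Deduper` are hypothetical stand-ins, not Armada's API types.

```go
package main

import "fmt"

// Event is a hypothetical, simplified view of an external API event.
// The field names are assumptions made for this sketch.
type Event struct {
	Kind  string // e.g. "JobPreemptedEvent"
	JobID string
	RunID string
}

// Deduper remembers which (job, run) pairs have already produced a
// preemption event so that the second emission can be dropped.
type Deduper struct {
	seen map[string]struct{}
}

// NewDeduper returns an empty Deduper.
func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// Accept reports whether the event should be processed. Only preemption
// events are deduplicated; every other kind passes through unchanged.
func (d *Deduper) Accept(e Event) bool {
	if e.Kind != "JobPreemptedEvent" {
		return true
	}
	key := e.JobID + "/" + e.RunID
	if _, dup := d.seen[key]; dup {
		return false
	}
	d.seen[key] = struct{}{}
	return true
}

func main() {
	d := NewDeduper()
	stream := []Event{
		{Kind: "JobPreemptedEvent", JobID: "job-1", RunID: "run-1"},
		{Kind: "JobPreemptedEvent", JobID: "job-1", RunID: "run-1"}, // second emission
		{Kind: "JobPreemptedEvent", JobID: "job-1", RunID: "run-2"},
	}
	for _, e := range stream {
		fmt.Println(e.RunID, d.Accept(e))
	}
}
```

The `seen` map grows without bound in this sketch; a long-lived consumer would likely evict entries once a job reaches a terminal state.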

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

greptile-apps Bot commented May 28, 2026

Contributor

Greptile Summary

This PR adds a new developer reference doc (docs/developer/job-lifecycle-events.md) covering the event-driven architecture, state machines, and step-by-step terminal flows for Armada jobs. The document is cross-verified against internal/server/event/conversion/conversions.go and internal/scheduler/jobdb/job.go.

  • Topology, transport, and state-machine sections accurately reflect the codebase, including the scheduler's boolean-flag model, Pulsar dual-publication path, and the jobdb heartbeat-based lease-expiry logic.
  • Event vocabulary and conversion tables are largely correct, though the note on the JobErrors → JobFailedEvent row overstates field coverage: FailureCategory and FailureSubcategory are only copied for PodError reasons in the conversion layer.
  • The preemption "Known issues" section is a valuable operator-facing disclosure of the double-emission behavior and the job_run_errors overwrite.
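
The field-coverage point in the second bullet can be illustrated with a small sketch of a conversion that copies the failure category only on the PodError branch. Every type and function below is a hypothetical simplification written for this illustration; it is not the code in `conversions.go`, and the category strings are placeholders.

```go
package main

import "fmt"

// PodError is a hypothetical stand-in for the internal PodError reason.
type PodError struct {
	Message            string
	FailureCategory    string
	FailureSubcategory string
}

// InternalError is a hypothetical stand-in for one internal error entry.
// PodError is nil when the failure has some other reason.
type InternalError struct {
	PodError *PodError
	Other    string // description used for non-PodError reasons
}

// JobFailedEvent is a hypothetical stand-in for the external event.
type JobFailedEvent struct {
	JobID              string
	Reason             string
	FailureCategory    string
	FailureSubcategory string
}

// ToJobFailedEvent mirrors the behavior the review describes: the category
// fields are populated only when the reason is a PodError.
func ToJobFailedEvent(jobID string, e InternalError) JobFailedEvent {
	ev := JobFailedEvent{JobID: jobID}
	if e.PodError != nil {
		ev.Reason = e.PodError.Message
		ev.FailureCategory = e.PodError.FailureCategory
		ev.FailureSubcategory = e.PodError.FailureSubcategory
		return ev
	}
	// Any other reason: the category fields stay empty.
	ev.Reason = e.Other
	return ev
}

func main() {
	pod := InternalError{PodError: &PodError{
		Message:            "container exited",
		FailureCategory:    "example-category",
		FailureSubcategory: "example-subcategory",
	}}
	other := InternalError{Other: "some non-pod failure"}
	fmt.Printf("%+v\n", ToJobFailedEvent("job-1", pod))
	fmt.Printf("%+v\n", ToJobFailedEvent("job-1", other))
}
```

A consumer that groups failures by category would therefore need a fallback bucket for events whose category fields are empty.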

Confidence Score: 5/5

Documentation-only change with no effect on runtime behavior; safe to merge.

The change adds a single Markdown file. All described flows and state transitions are cross-checked against the actual conversion and jobdb source files and are accurate, with one minor note in the conversion table that slightly overstates field coverage for non-PodError failure reasons.

docs/developer/job-lifecycle-events.md — conversion table note on FailureCategory/FailureSubcategory coverage

Important Files Changed

| Filename | Overview |
|---|---|
| docs/developer/job-lifecycle-events.md | New developer reference document covering job and run lifecycle events, state machines, event vocabulary, and four terminal flow walk-throughs (succeeded, failed, preempted, cancelled). The conversion table note for JobErrors → JobFailedEvent overstates which failure types carry FailureCategory/FailureSubcategory; those fields are only populated for the PodError branch in FromInternalJobErrors. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([Job Submitted]) --> Q[Queued]
    Q --> L[Leased]
    L --> P[Pending]
    P --> R[Running]

    R --> S1[JobRunSucceeded\nexecutor]
    S1 --> S2[JobSucceeded\nscheduler]
    S2 --> SUCC([Succeeded])

    R --> F1A[JobRunErrors PodError\nexecutor - Path A]
    P --> F1B[JobRunErrors PodError\nexecutor - Path B\nissue handler]
    F1A --> F2[JobErrors Terminal\nscheduler]
    F1B --> F2
    F2 --> FAIL([Failed])
    F2 --> REQ([Requeued → Queued])

    R --> PR1[JobRunPreempted + JobRunErrors\nscheduler decision]
    PR1 --> PR2[JobErrors JobRunPreemptedError\nscheduler]
    PR1 --> PR3[JobRunPreempted duplicate\nexecutor via ReportEvents]
    PR2 --> PRMP([Preempted])

    Q & L & P & R --> C1[CancelJob\nserver]
    C1 --> C2[JobRunCancelled + CancelledJob\nscheduler]
    C2 --> CANC([Cancelled])
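
The state-level edges of the flowchart above can be written down as a transition table, which makes the reachable states easy to check. This is a sketch derived only from the diagram: intermediate event nodes are collapsed into direct state-to-state edges, and the `State` type and helper names are illustrative rather than taken from the Armada codebase.

```go
package main

import "fmt"

// State names follow the flowchart nodes.
type State string

const (
	Queued    State = "Queued"
	Leased    State = "Leased"
	Pending   State = "Pending"
	Running   State = "Running"
	Succeeded State = "Succeeded"
	Failed    State = "Failed"
	Preempted State = "Preempted"
	Cancelled State = "Cancelled"
)

// transitions lists, for each non-terminal state, the states the flowchart
// allows next. Pending and Running can return to Queued via the requeue edge.
var transitions = map[State][]State{
	Queued:  {Leased, Cancelled},
	Leased:  {Pending, Cancelled},
	Pending: {Running, Failed, Queued, Cancelled},
	Running: {Succeeded, Failed, Queued, Preempted, Cancelled},
}

// CanTransition reports whether the diagram contains an edge from -> to.
func CanTransition(from, to State) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

// IsTerminal reports whether a state has no outgoing edges in the table.
// Unknown states are also reported as terminal by this simple check.
func IsTerminal(s State) bool {
	_, ok := transitions[s]
	return !ok
}

func main() {
	fmt.Println(CanTransition(Pending, Failed))    // Path B: issue handler
	fmt.Println(CanTransition(Queued, Preempted))  // no such edge in the diagram
	fmt.Println(IsTerminal(Succeeded), IsTerminal(Running))
}
```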

Reviews (11): Last reviewed commit: "Reformat per-flow state-transition heade..."

Comment thread docs/developer/job-lifecycle-events.md
Comment thread docs/developer/job-lifecycle-events.md
@dejanzele force-pushed the docs/job-lifecycle-events branch 8 times, most recently from 44bcedb to 7903765, May 29, 2026 10:49
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele force-pushed the docs/job-lifecycle-events branch from 7903765 to 09a154e, May 29, 2026 11:11