Add developer doc for job lifecycle events #4935
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Greptile Summary
This PR adds a new developer reference doc (
Confidence Score: 5/5
Documentation-only change with no effect on runtime behavior; safe to merge.
The change adds a single Markdown file. All described flows and state transitions are cross-checked against the actual conversion and jobdb source files and are accurate, with one minor note: the conversion table slightly overstates field coverage for non-PodError failure reasons.
docs/developer/job-lifecycle-events.md: conversion table note on FailureCategory/FailureSubcategory coverage
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A([Job Submitted]) --> Q[Queued]
Q --> L[Leased]
L --> P[Pending]
P --> R[Running]
R --> S1[JobRunSucceeded\nexecutor]
S1 --> S2[JobSucceeded\nscheduler]
S2 --> SUCC([Succeeded])
R --> F1A[JobRunErrors PodError\nexecutor - Path A]
P --> F1B[JobRunErrors PodError\nexecutor - Path B\nissue handler]
F1A --> F2[JobErrors Terminal\nscheduler]
F1B --> F2
F2 --> FAIL([Failed])
F2 --> REQ([Requeued → Queued])
R --> PR1[JobRunPreempted + JobRunErrors\nscheduler decision]
PR1 --> PR2[JobErrors JobRunPreemptedError\nscheduler]
PR1 --> PR3[JobRunPreempted duplicate\nexecutor via ReportEvents]
PR2 --> PRMP([Preempted])
Q & L & P & R --> C1[CancelJob\nserver]
C1 --> C2[JobRunCancelled + CancelledJob\nscheduler]
C2 --> CANC([Cancelled])
Reviews (11): Last reviewed commit: "Reformat per-flow state-transition heade..."
44bcedb to 7903765
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
7903765 to 09a154e
Adds a developer reference for the events and state transitions across a job run's lifecycle.
The doc covers the multi-cluster topology and the two transports (Pulsar and the gRPC lease stream) that carry events between the control plane and executors, the job-level and run-level state machines, and the internal proto event vocabulary alongside its mapping to the external API event vocabulary. It then walks through step-by-step flows for the four terminal cases: succeeded, failed (both organic terminal-phase and executor-issue-handler paths), preempted, and cancelled.
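As a rough illustration of the job-level state machine, the sketch below encodes the transitions drawn in the review flowchart as a lookup table. The state names come from the flowchart; the `JobState` enum, the `ALLOWED` table, and the `can_transition` helper are hypothetical names used only for this example and are not Armada's jobdb API.

```python
from enum import Enum


class JobState(Enum):
    # State names taken from the review flowchart.
    QUEUED = "Queued"
    LEASED = "Leased"
    PENDING = "Pending"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    PREEMPTED = "Preempted"
    CANCELLED = "Cancelled"


# Hypothetical transition table derived from the flowchart edges.
# Terminal states have no outgoing transitions and are therefore absent as keys.
ALLOWED = {
    JobState.QUEUED: {JobState.LEASED, JobState.CANCELLED},
    JobState.LEASED: {JobState.PENDING, JobState.CANCELLED},
    JobState.PENDING: {
        JobState.RUNNING,
        JobState.FAILED,     # failure path B (executor issue handler)
        JobState.QUEUED,     # requeue after a failed run
        JobState.CANCELLED,
    },
    JobState.RUNNING: {
        JobState.SUCCEEDED,
        JobState.FAILED,     # failure path A (terminal pod phase)
        JobState.QUEUED,     # requeue after a failed run
        JobState.PREEMPTED,
        JobState.CANCELLED,
    },
}


def can_transition(src: JobState, dst: JobState) -> bool:
    """Return True if the flowchart contains an edge from src to dst."""
    return dst in ALLOWED.get(src, set())
```

A table like this keeps terminal states implicit: any state without an entry in `ALLOWED` rejects every transition.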
The preemption section documents the current double-emission behavior. Each preempted run produces two JobPreemptedEvent messages on the external stream and an overwrite in the job_run_errors row that replaces the scheduler's preemption description with the executor's generic "Run preempted" text. Operators integrating with the event stream need to be aware of both effects.
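One way a consumer might tolerate the duplicate preemption message is to drop repeats per run. The sketch below is a minimal example under stated assumptions: events are modelled as plain dicts with hypothetical `type`, `job_id`, and `run_id` keys, which may not match the real external API message layout, and `dedupe_preempted` is an illustrative helper rather than part of any Armada client.

```python
from typing import Iterable, Iterator


def dedupe_preempted(events: Iterable[dict]) -> Iterator[dict]:
    """Yield events in order, dropping repeated JobPreemptedEvent entries per run.

    The dict keys used here ("type", "job_id", "run_id") are assumptions
    made for this sketch, not the actual event schema.
    """
    seen = set()  # (job_id, run_id) pairs already reported as preempted
    for event in events:
        if event.get("type") == "JobPreemptedEvent":
            key = (event.get("job_id"), event.get("run_id"))
            if key in seen:
                continue  # second emission for the same run: skip it
            seen.add(key)
        yield event
```

The `seen` set grows with the number of preempted runs, so a long-lived consumer would likely want to bound or expire it.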