Skip to content

Emit JobRunTerminated event when executor observes pod terminal phase#4920

Open
dejanzele wants to merge 1 commit into
armadaproject:masterfrom
dejanzele:job-run-terminated
Open

Emit JobRunTerminated event when executor observes pod terminal phase#4920
dejanzele wants to merge 1 commit into
armadaproject:masterfrom
dejanzele:job-run-terminated

Conversation

@dejanzele

@dejanzele dejanzele commented May 19, 2026

Copy link
Copy Markdown
Member

When the scheduler preempts a run, the pod takes a variable grace period (5s to 5min depending on TerminationGracePeriodSeconds) to actually exit. preempted_timestamp records when the scheduler made the decision, but nothing records when the kubelet finished stopping the containers, so any future retry-policy consumer has no per-run signal to wait on.

This PR adds that signal. It does not consume it; the retry-gate is a separate follow-up.

What this PR does

Adds a new JobRunTerminated event emitted by the executor when it observes a pod reach PodSucceeded or PodFailed. The scheduleringester writes runs.terminated_timestamp from this event when the column is NULL, so the value reflects the kubelet-observed stop time rather than the scheduler decision time.

To preserve the existing Failed-state histogram, this PR also adds a failed_timestamp column (migration 039). MarkRunsFailed writes to failed_timestamp instead of terminated_timestamp, and state_metrics.go reads FailedTime() for the failed branch. The semantic split:

  • failed_timestamp: when armada decided the run failed (scheduler event-time).
  • terminated_timestamp: when the kubelet observed the pod stop (executor-observed).

For organic Succeeded or Cancelled runs the existing writers fill terminated_timestamp first and the JobRunTerminated write is a no-op. For preempted or organic-failed runs the column is NULL after the initial write, so JobRunTerminated fills it with the real kubelet-observed time.

@greptile-apps

greptile-apps Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR introduces a new JobRunTerminated proto event and the infrastructure to emit, ingest, and store it, filling terminated_timestamp with the kubelet-observed pod stop time rather than the scheduler decision time. A companion failed_timestamp column (migration 039) separates "when Armada decided the run failed" from "when the kubelet stopped the pod", preserving existing metrics semantics.

  • Executor side: attemptToReportJobRunTerminatedEvent is called inline from the informer Add/Update handlers and from a new reconciliation loop in ReportMissingJobEvents; the kubelet FinishedAt time flows through LatestContainerFinishedAtevent.CreatedMarkRunsTerminated.
  • Scheduler-ingester side: handleJobRunTerminated produces MarkRunsTerminated, which uses an IS NULL guard (runsSetTimestampIfNullStmt) so succeeded/cancelled runs keep their existing timestamp, while preempted and failed runs get the kubelet-observed time filled in.
  • Metrics side: the Failed histogram branch now reads FailedTime() (decision time) and falls back to TerminatedTime() for pre-migration rows where failed_timestamp is NULL.

Confidence Score: 5/5

This PR is safe to merge. The IS NULL guard correctly arbitrates between all writers of terminated_timestamp, the FailedTime/TerminatedTime semantic split is sound, and the pre-migration fallback preserves existing histogram behaviour.

The changes are well-scoped and internally consistent. The IS NULL guard in runsSetTimestampIfNullStmt correctly defers to earlier writers (MarkRunsSucceeded, MarkJobsCancelled) while letting the kubelet-observed time fill in for preempted and organically-failed runs. The FailedTime() → TerminatedTime() fallback in state_metrics.go handles pre-migration rows cleanly. LatestContainerFinishedAt documents its node-loss edge case and falls back safely. Tests cover the main guard paths and reconciliation logic.

No files require special attention.

Important Files Changed

Filename Overview
internal/executor/reporter/event.go Adds CreateJobRunTerminatedEvent using LatestContainerFinishedAt for the event timestamp — correctly reads kubelet-reported FinishedAt rather than wall-clock time.
internal/executor/util/pod_util.go Adds LatestContainerFinishedAt (scans container and init-container statuses, falls back to time.Now() for node-loss pods) and HasJobRunTerminatedBeenReported; well-documented and tested.
internal/executor/service/job_state_reporter.go Wires attemptToReportJobRunTerminatedEvent into Add/UpdateFunc handlers and adds a reconciliation loop in ReportMissingJobEvents; idempotent via reported_terminated annotation.
internal/scheduleringester/schedulerdb.go Changes MarkRunsFailed to write failed_timestamp instead of terminated_timestamp; adds MarkRunsTerminated case using runsSetTimestampIfNullStmt with a clean IS NULL guard.
internal/scheduler/metrics/state_metrics.go Failed-histogram branch now uses FailedTime() with TerminatedTime() fallback for pre-migration rows; correctly preserves semantics across the migration boundary.
internal/scheduler/database/migrations/039_add_runs_failed_timestamp.sql Adds failed_timestamp nullable timestamptz column with IF NOT EXISTS guard; safe to apply.
internal/scheduleringester/dbops.go Adds MarkRunsTerminated map type with Merge, CanBeAppliedBefore, and GetOperation implementations consistent with sibling operations.
internal/scheduleringester/instructions.go Routes JobRunTerminated events to handleJobRunTerminated which emits MarkRunsTerminated using event.Created (the kubelet FinishedAt time) as the stored timestamp.
internal/scheduler/metrics.go Fixes inflated queued-time window: falls back to FailedTime() when TerminatedTime() is NULL (between MarkRunsFailed and MarkRunsTerminated arrival).
internal/scheduleringester/schedulerdb_test.go New TestMarkRunsTerminated covers preempted, succeeded, and failed seed states; MarkRunsFailed assertion updated to check failed_timestamp instead of terminated_timestamp.

Sequence Diagram

sequenceDiagram
    participant K8s as Kubernetes/Kubelet
    participant Exec as Executor
    participant Kafka as Kafka/Pulsar
    participant Ingest as SchedulerIngester
    participant DB as Scheduler DB

    K8s->>Exec: Pod → PodSucceeded/PodFailed (informer event)
    Exec->>Exec: reportCurrentStatus/reportStatusUpdate
    Exec->>Exec: attemptToReportJobRunTerminatedEvent
    Exec->>Kafka: "JobRunTerminated {Created=LatestContainerFinishedAt}"

    Note over Exec: Separately, scheduler also emits:
    Exec->>Kafka: JobRunErrors (for failed/preempted runs)

    Kafka->>Ingest: JobRunErrors
    Ingest->>DB: "MarkRunsFailed → failed=true, failed_timestamp=event_time"

    Kafka->>Ingest: JobRunTerminated
    Ingest->>DB: "MarkRunsTerminated → terminated_timestamp=kubelet_time (WHERE IS NULL)"

    Note over DB: Succeeded runs: MarkRunsSucceeded already set terminated_timestamp → IS NULL blocks
    Note over DB: Failed/preempted runs: terminated_timestamp was NULL → IS NULL allows kubelet time in
Loading

Reviews (24): Last reviewed commit: "Emit JobRunTerminated event when executo..." | Re-trigger Greptile

Comment thread internal/executor/reporter/event.go
@dejanzele dejanzele force-pushed the job-run-terminated branch 6 times, most recently from 750bc95 to 915e9ee Compare May 20, 2026 13:34
Comment thread internal/scheduleringester/schedulerdb.go Outdated
@dejanzele dejanzele force-pushed the job-run-terminated branch 3 times, most recently from 03170a8 to fbe4c54 Compare May 20, 2026 16:45
@dejanzele

Copy link
Copy Markdown
Member Author

@greptileai

@dejanzele dejanzele force-pushed the job-run-terminated branch 6 times, most recently from ff69b00 to a3d367d Compare May 21, 2026 11:34
@dejanzele dejanzele force-pushed the job-run-terminated branch from a3d367d to c9fda95 Compare May 21, 2026 17:52
Comment thread internal/scheduler/metrics/state_metrics.go
@dejanzele dejanzele force-pushed the job-run-terminated branch 2 times, most recently from 6b55e34 to 5ff969a Compare May 22, 2026 09:41
@dejanzele

Copy link
Copy Markdown
Member Author

@greptileai

@dejanzele dejanzele force-pushed the job-run-terminated branch 3 times, most recently from 2731e41 to 6ae99e8 Compare May 22, 2026 10:53
dejanzele added a commit that referenced this pull request May 22, 2026
`TestEventServer_GetJobSetEvents_ErrorIfMissing` (4 subtests) and
`TestEventServer_GetJobSetEvents_Permissions` (3 subtests) share a
single 5s `armadacontext.WithTimeout` across all their subtests. Each
subtest spins up a fresh lookout DB and runs every migration via
`withEventServer` and `WithLookoutDb`, so the per-subtest cost is not
trivial.

On a slow CI runner the cumulative cost blew the 5s budget. We hit this
on PR #4920 where 4 subtests of `_ErrorIfMissing` took 5.74s combined
and the last one died with `context deadline exceeded` inside
`PostgresQueueRepository.upsertQueue`.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the job-run-terminated branch from 6ae99e8 to 540b066 Compare May 22, 2026 12:23
@dejanzele

Copy link
Copy Markdown
Member Author

@greptileai

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the job-run-terminated branch from 540b066 to 39b4c26 Compare May 27, 2026 07:40
nikola-jokic pushed a commit that referenced this pull request Jun 4, 2026
`TestEventServer_GetJobSetEvents_ErrorIfMissing` (4 subtests) and
`TestEventServer_GetJobSetEvents_Permissions` (3 subtests) share a
single 5s `armadacontext.WithTimeout` across all their subtests. Each
subtest spins up a fresh lookout DB and runs every migration via
`withEventServer` and `WithLookoutDb`, so the per-subtest cost is not
trivial.

On a slow CI runner the cumulative cost blew the 5s budget. We hit this
on PR #4920 where 4 subtests of `_ErrorIfMissing` took 5.74s combined
and the last one died with `context deadline exceeded` inside
`PostgresQueueRepository.upsertQueue`.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant