Making scheduling cycle optionally run asynchronously by JamesMurkin · Pull Request #4952 · armadaproject/armada

JamesMurkin · 2026-06-09T20:00:54Z

This PR lets the scheduler run the scheduling loop in one of two modes:

Synchronous
- The existing approach: the scheduling algo is called directly by the main loop, which blocks until scheduling is complete.
Asynchronous
- The new approach: the scheduling algo runs on a separate goroutine, and the main loop simply triggers it and retrieves the result when it is ready.

Motivation

The reason for this PR is that most job state transitions run through the scheduler (pending, running, succeeded, failed, preempted, cancelled, etc.), but they are delayed when the scheduling algo takes a long time, resulting in poor UX.

By splitting the scheduling algo into a separate goroutine whose result is merged in when ready, we get a far more responsive system:

- Improves UX: the time between a cancel request and the cancel showing up in events and Lookout can be reduced from 30-60s to ideally 1-2s.
- Allows downstream systems to process next steps faster, as they receive job state updates in a more timely manner.

Caveats

When the scheduling algo runs, it works from a snapshot of the state taken when it started. As time passes, that snapshot inevitably becomes more and more out of date.

This means we can end up scheduling a job that has since been cancelled on the main loop.

There is reconciliation code to remove decisions on jobs that have since finished.
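
As a rough illustration of that reconciliation step, the sketch below filters a stale result against the current state. It is not the code in async.go: the `decision` type, the `finished` set, and the `reconcile` signature are assumptions made for the example. The whole-gang drop mirrors the behaviour described in the review summary for this PR.

```go
package main

import "fmt"

// decision is a hypothetical stand-in for one scheduling decision.
type decision struct {
	JobID  string
	GangID string // empty when the job is not part of a gang
}

// reconcile drops decisions for jobs that have finished since the snapshot
// was taken. If any member of a gang is dropped, the whole gang is dropped.
func reconcile(decisions []decision, finished map[string]bool) []decision {
	staleGangs := map[string]bool{}
	for _, d := range decisions {
		if finished[d.JobID] && d.GangID != "" {
			staleGangs[d.GangID] = true
		}
	}
	kept := make([]decision, 0, len(decisions))
	for _, d := range decisions {
		if finished[d.JobID] || staleGangs[d.GangID] {
			continue // stale: the job, or a job in its gang, has finished
		}
		kept = append(kept, d)
	}
	return kept
}

func main() {
	decisions := []decision{
		{JobID: "a"},
		{JobID: "b", GangID: "g1"},
		{JobID: "c", GangID: "g1"},
		{JobID: "d"},
	}
	// Jobs c and d were cancelled on the main loop while the algo was running.
	finished := map[string]bool{"c": true, "d": true}
	for _, d := range reconcile(decisions, finished) {
		fmt.Println("keep", d.JobID)
	}
}
```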

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>


greptile-apps · 2026-06-12T13:23:31Z

Greptile Summary

This PR introduces an optional asynchronous scheduling mode that decouples the scheduling algorithm from the main scheduler loop, allowing job state transitions (cancel, fail, preempt) to flow through without waiting for the scheduling algo to complete. When async mode is enabled, the background goroutine operates on a DryRunTxn snapshot and results are reconciled against current state before being applied.

- AsyncSchedulingRunner implements a clear state machine (Idle → RunRequested → Running → ResultReady → Idle) with mutex-protected state, a buffered wake channel, and a Reset() that cancels in-flight runs and blocks until completion, ensuring leadership failovers produce a clean slate.
- scheduler.go drives the async lifecycle correctly: Reset() on every become-leader transition, Trigger() only after a clean committed cycle, and result consumption gated on shouldSchedule to avoid hammering GetSchedulerResult.
- Reconciliation in async.go filters stale scheduled/preempted decisions (including whole-gang drops) against the current txn, and a new schedulingDuration histogram tracks the algo's wall-clock time independently from the main cycle time.

Confidence Score: 5/5

The async scheduling mode is safe to merge; the state machine, cancellation path, and reconciliation logic are all well-reasoned and well-tested.

The core state machine in AsyncSchedulingRunner is correct and race-free. Reset() blocks until any in-flight scheduling run has finished before returning, ensuring the next Trigger always schedules against committed state. The two findings in the reconcile path are defensive-coding concerns against non-standard SchedulingAlgo implementations; the production FairSchedulingAlgo invariants prevent them from firing today. Test coverage for the async lifecycle, leadership transitions, and reconciliation logic is thorough.

internal/scheduler/scheduling/runner/async.go: the reconcile helper updatePreemptedJobs and the UpdateFairShares call in reconcile() assume AdditionalSchedulingInfo/EvictorResult/SchedulingContext are non-nil; nil guards are worth adding for future-proofing.
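
A nil guard of the kind suggested here might look like the following sketch. The type names are borrowed from the comment above, but the field layout, the `Info` field, and both helper functions are assumptions made for illustration; the real Armada definitions are not shown on this page and may differ.

```go
package main

import "fmt"

// Hypothetical, heavily reduced shapes. The field layout below is an
// assumption for illustration, not the real Armada definitions.
type SchedulingContext struct{ FairShares map[string]float64 }

type EvictorResult struct{ EvictedJobIDs []string }

type AdditionalSchedulingInfo struct {
	EvictorResult     *EvictorResult
	SchedulingContext *SchedulingContext
}

type SchedulerResult struct{ Info *AdditionalSchedulingInfo }

// evictedJobIDs returns nil instead of panicking when any link is nil.
func evictedJobIDs(res *SchedulerResult) []string {
	if res == nil || res.Info == nil || res.Info.EvictorResult == nil {
		return nil
	}
	return res.Info.EvictorResult.EvictedJobIDs
}

// updateFairShares reports whether the update was applied.
func updateFairShares(res *SchedulerResult, shares map[string]float64) bool {
	if res == nil || res.Info == nil || res.Info.SchedulingContext == nil {
		return false
	}
	res.Info.SchedulingContext.FairShares = shares
	return true
}

func main() {
	empty := &SchedulerResult{} // what a non-standard algo might return
	fmt.Println(evictedJobIDs(empty) == nil, updateFairShares(empty, nil))

	full := &SchedulerResult{Info: &AdditionalSchedulingInfo{
		EvictorResult:     &EvictorResult{EvictedJobIDs: []string{"job-x"}},
		SchedulingContext: &SchedulingContext{},
	}}
	fmt.Println(evictedJobIDs(full), updateFairShares(full, map[string]float64{"queue-a": 0.5}))
}
```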

Important Files Changed

Filename	Overview
internal/scheduler/scheduling/runner/async.go	New async runner with a well-structured state machine (Idle→RunRequested→Running→ResultReady→Idle); Reset() correctly cancels in-flight runs and blocks until they complete. Two missing nil guards in the reconcile path could panic with non-standard SchedulingAlgo implementations.
internal/scheduler/scheduler.go	Refactors cycle() to return schedulingAttempted flag; Trigger() correctly fires only after a clean committed cycle as leader; Reset() on leadership transitions correctly discards stale async state.
internal/scheduler/scheduler_test.go	Good coverage added: TestScheduler_AsyncRunner exercises the two-cycle consume pattern, TestRun_AsyncRunnerResetOnLeadershipChange verifies Reset fires on become-leader transitions. testSchedulingAlgo refactored to use atomic.Int64 for thread safety.
internal/scheduler/scheduling/runner/async_test.go	Thorough unit tests covering state transitions, Reset on in-flight runs, gang reconciliation, and context cancellation.
internal/scheduler/metrics/cycle_metrics.go	Adds schedulingDuration histogram properly wired into describe/collect. New metric measures algo wall-clock time independently from the main-loop cycle time.

Sequence Diagram

sequenceDiagram
    participant ML as Main Loop
    participant AR as AsyncRunner
    participant BG as Background Goroutine
    participant DB as JobDb

    ML->>AR: Reset() [on become-leader]
    AR-->>ML: (state → Idle)

    ML->>AR: "GetSchedulerResult() [shouldSchedule=true, no result]"
    AR-->>ML: "nil, nil (state=Idle)"
    ML->>AR: Trigger() [after clean commit]
    AR->>BG: wake (state → RunRequested → Running)

    BG->>DB: DryRunTxn()
    BG->>BG: schedulingAlgo.Schedule(runCtx, txn)
    BG-->>AR: finishRun(result) [state → ResultReady]

    ML->>AR: GetSchedulerResult() [next shouldSchedule cycle]
    AR->>AR: reconcile(result, currentTxn)
    AR->>AR: upsertSchedulerResult(txn, result)
    AR-->>ML: "*SchedulerResult (state → Idle)"
    ML->>ML: Publish events, commit txn
    ML->>AR: Trigger() [after clean commit]
    AR->>BG: wake (state → RunRequested → Running)

    Note over ML,AR: On error or leadership loss
    ML->>AR: Reset() [become-leader / error path]
    AR->>BG: cancel() [if Running]
    BG-->>AR: done closed
    AR-->>ML: (state → Idle, result discarded)


JamesMurkin added 22 commits May 1, 2026 10:55

- 46c99b5 WIP
- 8e0610b Merge branch 'master' into async_scheduling_poc
- 57dd5e8 Make async scheduling configurable + code structure
- c4119ee Reset() call on runner
- 7c1fe58 WIP - Adding end to end flow
- f58f8b2 Merge branch 'master' into async_scheduling_poc (conflicts: internal/scheduler/scheduler.go, internal/scheduler/scheduler_test.go, internal/scheduler/scheduling/scheduling_algo.go)
- 51ff957 Don't skip errored pools
- 0bdec77 Logging
- 6b6d39c Context tests
- edc792a Tests
- af267b3 Refactor + fix cycle time metrics
- 1ec32a3 Tidy runner tests + comments
- 187970f Fix interface
- 2e9798a Tidy async
- 5e29751 tidy
- 454c70f tidy
- a1e9f1d Lint fix
- a838bef Tidy tests + minor fixes
- 058302c Further test cleanup
- 11250d0 Review comments
- ea8706d Fix tests

JamesMurkin marked this pull request as ready for review June 12, 2026 13:17

greptile-apps Bot reviewed Jun 12, 2026

Comment thread internal/scheduler/scheduler.go

Comment thread internal/scheduler/scheduler_test.go Outdated

JamesMurkin added 2 commits June 12, 2026 14:28

- 382b0c4 Code review
- 034325e Move trigger to a more reliable location and fix tests

geaere approved these changes Jun 15, 2026

- d66b55a Merge branch 'master' into async_scheduling_poc

JamesMurkin enabled auto-merge (squash) June 15, 2026 16:16

JamesMurkin merged commit f6f899f into master Jun 15, 2026
16 of 17 checks passed

JamesMurkin deleted the async_scheduling_poc branch June 15, 2026 16:30

JamesMurkin merged 25 commits into master from async_scheduling_poc
