agent: fail the job if process spawning fails by emilyalbini · Pull Request #102 · oxidecomputer/buildomat

emilyalbini · 2026-04-20T11:21:45Z

Note: This PR is stacked on top of #101.

While working on post tasks, I discovered that the agent doesn't mark the job as failed if spawning a task or a diagnostics script fails. We have never hit this in practice because we spawn /bin/bash for both of them, and the chances of that path not existing on the worker are negligible. It is far more likely to happen for post tasks, as we spawn the command requested by the user directly, and that command might not exist.

To confirm that the agent doesn't mark the job as failed, I patched it to try to spawn a missing binary rather than /bin/bash:

job 01KPN9E2F6K0FCXC6P1JKETF98 submitted
polling for job output...
STATE CHANGE:  -> queued
STATE CHANGE: queued -> running
|=| job assigned to worker 01KPN9E30QVKRF09BA9YV30BEK [factory aws-GotGCZDz, i-03ce969a962af0c5a] (queued for 32 s)
|T| starting task 0: "default"
JobEvent { payload: "ERROR: exec: No such file or directory (os error 2)", seq: 3, stream: "agent", task: None, time: 2026-04-20T11:12:32.081619121Z, time_remote: Some(2026-04-20T11:12:31.991713448Z) }
|W| found 0 output files
STATE CHANGE: running -> completed
|=| task 0 was incomplete, marked failed

Note how the state is changed to completed rather than failed. Task 0 being marked as failed doesn't change the state of the job as a whole.
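
For illustration only, here is a minimal sketch of the failure mode described above. It is not the agent's actual code: `JobOutcome` and `run_task` are hypothetical names. The only real API used is `std::process::Command`, whose `spawn` returns an `Err` when the binary does not exist. The point is that the caller has to translate that error into a job-level failure explicitly, since otherwise the job simply runs to completion.

```rust
use std::process::Command;

/// Hypothetical job-level outcome; the real agent's types differ.
#[derive(Debug, PartialEq)]
enum JobOutcome {
    Completed,
    Failed(String),
}

/// Spawn `program` and map both spawn errors and unsuccessful exits to `Failed`.
fn run_task(program: &str) -> JobOutcome {
    let mut child = match Command::new(program).spawn() {
        Ok(child) => child,
        // A missing binary surfaces here, before any process exists.
        Err(e) => return JobOutcome::Failed(format!("exec: {e}")),
    };
    match child.wait() {
        Ok(status) if status.success() => JobOutcome::Completed,
        Ok(status) => JobOutcome::Failed(format!("exited with {status}")),
        Err(e) => JobOutcome::Failed(format!("wait: {e}")),
    }
}

fn main() {
    // The path is deliberately one that should not exist on any worker.
    println!("{:?}", run_task("/nonexistent/binary"));
}
```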

emilyalbini · 2026-04-20T11:22:47Z

-                {
-                    stage = Stage::PreDiagnostics(c, None);
-                } else {
-                    stage = Stage::Ready;


I assume this behavior was wrong, and that we shouldn't mark the worker as ready if the pre-diagnostics script fails to execute.

jclulow · 2026-05-01T06:33:25Z

-                {
-                    stage = Stage::PreDiagnostics(c, None);
-                } else {
-                    stage = Stage::Ready;


It was, I believe, a deliberate choice. The pre-diagnostic scripts were something I added while debugging various bits of AWS-induced malaise, but I was generally leaning toward failing open in this case if we weren't able to get the script started.

It seems fine to tighten it up, though; we'll just need to be careful not to misconfigure the pre-diag stuff in such a way that we hit this condition.

jclulow · 2026-05-01T06:40:19Z

-                    if let Some(c) = cw.start_diag_script("post", script).await
-                    {
-                        stage = Stage::PostDiagnostics(c, None);
-                    } else {
-                        cw.diagnostics_complete(false).await;
-                        stage = Stage::Complete;


I think not failing here if the post diagnostics couldn't get underway was definitely more of a deliberate choice, FWIW, as it means a job that had succeeded might then fail for what didn't seem like great reasons, etc. It's probably fine to tighten up, though, and we'll see how it goes!
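
The trade-off discussed in this thread can be sketched roughly as follows. This is a simplified model under stated assumptions, not the agent's state machine: `Policy`, `after_spawn`, and the `Failed` stage are hypothetical, and the real `Stage` variants carry data that is omitted here.

```rust
/// Hypothetical policy for what to do when a diagnostics script cannot start.
#[derive(Debug, PartialEq)]
enum Policy {
    /// Old behaviour: carry on as if nothing went wrong.
    FailOpen,
    /// Tightened behaviour: treat the spawn failure as a job failure.
    FailClosed,
}

/// Simplified stand-in for the agent's stages (payloads omitted).
#[derive(Debug, PartialEq)]
enum Stage {
    PostDiagnostics,
    Complete,
    Failed,
}

/// Pick the next stage given whether the script started and the policy in force.
fn after_spawn(started: bool, policy: Policy) -> Stage {
    match (started, policy) {
        (true, _) => Stage::PostDiagnostics,
        (false, Policy::FailOpen) => Stage::Complete,
        (false, Policy::FailClosed) => Stage::Failed,
    }
}

fn main() {
    println!("{:?}", after_spawn(false, Policy::FailOpen));
    println!("{:?}", after_spawn(false, Policy::FailClosed));
}
```

Under the fail-closed policy, a misconfigured diagnostics script turns an otherwise successful job into a failed one, which is the risk raised in the comment above.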

emilyalbini requested a review from jclulow April 20, 2026 11:21

emilyalbini mentioned this pull request Apr 20, 2026

agent: implement post tasks #103

Open

jclulow approved these changes May 1, 2026

emilyalbini force-pushed the ea-ptqwnqpswsuv branch 2 times, most recently from c8f7e71 to 7f32072 Compare May 4, 2026 09:03

emilyalbini force-pushed the ea-xsxwxqsvylmq branch from 632feca to 83cd528 Compare May 4, 2026 09:09

agent: fail the job if process spawning fails

7fbcbb2

emilyalbini force-pushed the ea-xsxwxqsvylmq branch from 83cd528 to 7fbcbb2 Compare May 4, 2026 09:17

emilyalbini changed the base branch from ea-ptqwnqpswsuv to main May 4, 2026 09:17

emilyalbini merged commit 28684c6 into main May 4, 2026
4 checks passed

emilyalbini deleted the ea-xsxwxqsvylmq branch May 4, 2026 09:40
