Skip to content

agent: fail the job if process spawning fails#102

Merged
emilyalbini merged 1 commit intomainfrom
ea-xsxwxqsvylmq
May 4, 2026
Merged

agent: fail the job if process spawning fails#102
emilyalbini merged 1 commit intomainfrom
ea-xsxwxqsvylmq

Conversation

@emilyalbini
Copy link
Copy Markdown
Member

Note: This PR is stacked on top of #101.


While working on post tasks, I discovered that the agent doesn't mark the job as failed if spawning tasks or diagnostics fail. We have never hit this in practice as we spawn /bin/bash for both of them, and the chances of that path not existing on the worker are negligible. This is far more likely to happen for post tasks, as we spawn the command requested by the user directly, and the command might not exist.

To test the agent didn't mark the job as failed, I patched the agent to try to spawn a missing binary rather than /bin/bash:

job 01KPN9E2F6K0FCXC6P1JKETF98 submitted
polling for job output...
STATE CHANGE:  -> queued
STATE CHANGE: queued -> running
|=| job assigned to worker 01KPN9E30QVKRF09BA9YV30BEK [factory aws-GotGCZDz, i-03ce969a962af0c5a] (queued for 32 s)
|T| starting task 0: "default"
JobEvent { payload: "ERROR: exec: No such file or directory (os error 2)", seq: 3, stream: "agent", task: None, time: 2026-04-20T11:12:32.081619121Z, time_remote: Some(2026-04-20T11:12:31.991713448Z) }
|W| found 0 output files
STATE CHANGE: running -> completed
|=| task 0 was incomplete, marked failed

Note how the state is changed to completed rather than failed. Task 0 being marked as failed doesn't change the state of the job as a whole.

@emilyalbini emilyalbini requested a review from jclulow April 20, 2026 11:21
Comment thread agent/src/main.rs
{
stage = Stage::PreDiagnostics(c, None);
} else {
stage = Stage::Ready;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this behavior was wrong, and that we shouldn't mark the worker as ready if the pre-diagnostics script fails to execute.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was, I believe, a deliberate choice. The pre-diagnostic scripts were something I added while debugging various bits of AWS-induced malaise, but I was generally leaning to failing open in this case if we weren't able to get it started.

It seems fine to tighten it up, though; we'll just need to be careful not to misconfigure the pre-diag stuff in such a way that we hit this condition.

Comment thread agent/src/main.rs
{
stage = Stage::PreDiagnostics(c, None);
} else {
stage = Stage::Ready;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was, I believe, a deliberate choice. The pre-diagnostic scripts were something I added while debugging various bits of AWS-induced malaise, but I was generally leaning to failing open in this case if we weren't able to get it started.

It seems fine to tighten it up, though; we'll just need to be careful not to misconfigure the pre-diag stuff in such a way that we hit this condition.

Comment thread agent/src/main.rs
Comment on lines -1803 to -1808
if let Some(c) = cw.start_diag_script("post", script).await
{
stage = Stage::PostDiagnostics(c, None);
} else {
cw.diagnostics_complete(false).await;
stage = Stage::Complete;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not failing here if the post diagnostics couldn't get underway was definitely more of a deliberate choice, FWIW, as it means a job that had succeeded might then fail for what didn't seem like great reasons, etc. It's probably fine to tighten up, though, and we'll see how it goes!

@emilyalbini emilyalbini force-pushed the ea-ptqwnqpswsuv branch 2 times, most recently from c8f7e71 to 7f32072 Compare May 4, 2026 09:03
@emilyalbini emilyalbini changed the base branch from ea-ptqwnqpswsuv to main May 4, 2026 09:17
@emilyalbini emilyalbini merged commit 28684c6 into main May 4, 2026
4 checks passed
@emilyalbini emilyalbini deleted the ea-xsxwxqsvylmq branch May 4, 2026 09:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants