agent: fail the job if process spawning fails#102
Conversation
| { | ||
| stage = Stage::PreDiagnostics(c, None); | ||
| } else { | ||
| stage = Stage::Ready; |
There was a problem hiding this comment.
I assume this behavior was wrong, and that we shouldn't mark the worker as ready if the pre-diagnostics script fails to execute.
There was a problem hiding this comment.
It was, I believe, a deliberate choice. The pre-diagnostic scripts were something I added while debugging various bits of AWS-induced malaise, but I was generally leaning to failing open in this case if we weren't able to get it started.
It seems fine to tighten it up, though; we'll just need to be careful not to misconfigure the pre-diag stuff in such a way that we hit this condition.
| { | ||
| stage = Stage::PreDiagnostics(c, None); | ||
| } else { | ||
| stage = Stage::Ready; |
There was a problem hiding this comment.
It was, I believe, a deliberate choice. The pre-diagnostic scripts were something I added while debugging various bits of AWS-induced malaise, but I was generally leaning to failing open in this case if we weren't able to get it started.
It seems fine to tighten it up, though; we'll just need to be careful not to misconfigure the pre-diag stuff in such a way that we hit this condition.
| if let Some(c) = cw.start_diag_script("post", script).await | ||
| { | ||
| stage = Stage::PostDiagnostics(c, None); | ||
| } else { | ||
| cw.diagnostics_complete(false).await; | ||
| stage = Stage::Complete; |
There was a problem hiding this comment.
I think not failing here if the post diagnostics couldn't get underway was definitely more of a deliberate choice, FWIW, as it means a job that had succeeded might then fail for what didn't seem like great reasons, etc. It's probably fine to tighten up, though, and we'll see how it goes!
c8f7e71 to
7f32072
Compare
632feca to
83cd528
Compare
83cd528 to
7fbcbb2
Compare
Note: This PR is stacked on top of #101.
While working on post tasks, I discovered that the agent doesn't mark the job as failed if spawning tasks or diagnostics fail. We have never hit this in practice as we spawn
/bin/bashfor both of them, and the chances of that path not existing on the worker are negligible. This is far more likely to happen for post tasks, as we spawn the command requested by the user directly, and the command might not exist.To test the agent didn't mark the job as failed, I patched the agent to try to spawn a missing binary rather than
/bin/bash:Note how the state is changed to
completedrather thanfailed. Task 0 being marked as failed doesn't change the state of the job as a whole.