report: include failure_reason for manual kills and marking run dead;… #2102
feat(observe): Include failure_reason when jobs are manually killed or marked dead
Why this change?
Debugging and monitoring are difficult when manually killed jobs or runs marked 'dead' only update their status without providing a human-readable explanation. UI users and automated scraping tools cannot tell a timeout from an application failure or a deliberate operator action.
This change introduces a failure_reason field for these scenarios, so the report records why a job was terminated.
What I changed
This PR ensures a failure_reason is populated and pushed to the results server in three scenarios, and adds a unit test for the first:
report.py: The try_mark_run_dead() function now accepts a reason argument. If provided, this reason is injected into the job info as the failure_reason.
kill.py: When a user kills a job or run, the system now pushes a status='dead' update that includes a failure_reason (e.g., "killed by ").
__init__.py: When package checks fail (which also pushes status='dead'), the specific error message is now included as the failure_reason.
test_report_dead_reason.py: A new unit test is added to mock the reporter and verify that calling try_mark_run_dead(..., reason=...) correctly includes the failure_reason in the final report.
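The pattern described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the `FakeReporter` class, its `report_job` method, and the simplified `try_mark_run_dead(reporter, run_name, job_ids, reason=None)` signature are all assumptions made for the example.

```python
from unittest import mock


class FakeReporter:
    """Hypothetical stand-in for the results-server reporter."""

    def report_job(self, run_name, job_id, job_info):
        raise NotImplementedError  # replaced by a mock in the test


def try_mark_run_dead(reporter, run_name, job_ids, reason=None):
    """Mark every listed job dead, attaching `reason` when one is given.

    Simplified, assumed signature; the real function may look different.
    """
    for job_id in job_ids:
        job_info = {"job_id": job_id, "status": "dead"}
        if reason:
            # Only add the field when the caller supplied an explanation.
            job_info["failure_reason"] = reason
        reporter.report_job(run_name, job_id, job_info)


def test_reason_is_reported():
    # Mock the reporter so nothing is sent to a real results server.
    reporter = mock.Mock(spec=FakeReporter)
    try_mark_run_dead(reporter, "example-run", ["1"], reason="killed by operator")
    reporter.report_job.assert_called_once_with(
        "example-run",
        "1",
        {"job_id": "1", "status": "dead", "failure_reason": "killed by operator"},
    )
```

Keeping `reason` optional means existing callers that pass no reason would continue to push a plain status='dead' update under this sketch.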
Tests performed (local)
Installed the package in editable mode in a new venv.
Ran the new unit test successfully:
pytest -q test_report_dead_reason.py
1 passed