Skip to content

Conversation

@Annmool
Copy link

@Annmool Annmool commented Nov 3, 2025

feat(observe): Include failure_reason when jobs are manually killed or marked dead
Why this change?

Debugging and monitoring are difficult when manually killed jobs or runs marked 'dead' only update their status without providing a human-readable explanation. This makes it impossible for UI users or automated scraping tools to distinguish between a timeout, an application failure, or a deliberate operator action.

This change introduces a failure_reason field for these scenarios, which significantly improves observability and provides crucial context for why a job was terminated.

What I changed

This PR ensures a failure_reason is populated and pushed to the results server in three key scenarios:

report.py: The try_mark_run_dead() function now accepts a reason argument. If provided, this reason is injected into the job info as the failure_reason.

kill.py: When a user kills a job or run, the system now pushes a status='dead' update that includes a failure_reason (e.g., "killed by ").

init.py: When package checks fail (which also pushes status='dead'), the specific error message is now included as the failure_reason.

test_report_dead_reason.py: A new unit test is added to mock the reporter and verify that calling try_mark_run_dead(..., reason=...) correctly includes the failure_reason in the final report.

Tests performed (local)

Installed the package in editable mode in a new venv.

Ran the new unit test successfully:

pytest -q test_report_dead_reason.py
1 passed

@Annmool Annmool requested a review from a team as a code owner November 3, 2025 06:33
@Annmool Annmool requested review from amathuria and kamoltat and removed request for a team November 3, 2025 06:33
@Annmool
Copy link
Author

Annmool commented Nov 4, 2025

Hey @kamoltat, @amathuria, and @djgalloway, my PR is ready—please take a look when you can.

# codebase so tooling can pick it up.
job_info['failure_reason'] = reason

reporter.report_job(run_name, job_id, job_info=job_info)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you still need dead=True here.
Otherwise it'll be False:
def report_job(self, run_name, job_id, job_info=None, dead=False)
and once you do that job status is already updated in report_job():

 if dead and get_status(job_info) is None:
            set_status(job_info, 'dead')

I think we can stick to just updating the failure reason here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, got it. I’ll make the suggested changes accordingly.

@Annmool Annmool requested a review from amathuria November 7, 2025 13:41
@Annmool
Copy link
Author

Annmool commented Nov 12, 2025

Steps performed

->Searched codebase for where jobs are marked dead and where job info is pushed (try_push_job_info, reporter.report_job, kill, report.py).

->Implemented try_mark_run_dead(run_name, reason=None) in report.py to inject job_info['failure_reason'] = reason when provided.

->Updated kill.py:

kill_job() now pushes dict(status='dead', failure_reason='killed by user').

kill_run() calls report.try_mark_run_dead(run_name, reason='killed by user').

->Updated package-failure path (task/internal/init.py): push status='dead' with failure_reason=msg.

->Added unit test teuthology/test/test_report_dead_reason.py that mocks ResultsReporter and asserts failure_reason is included when try_mark_run_dead(..., reason=...) is called.

After the suggestion

->Implemented the change in report.py: call reporter.report_job(..., dead=True) (i.e., include dead=True flag when reporting the job).

->Re-ran the focused unit test (test_report_dead_reason.py) to ensure behavior unchanged → passed.

->Re-ran the full test suite → all tests passed locally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants