report: include failure_reason for manual kills and marking run dead;… #2102

Annmool · 2025-11-03T06:33:47Z

feat(observe): Include failure_reason when jobs are manually killed or marked dead
Why this change?

Debugging and monitoring are difficult when manually killed jobs or runs marked 'dead' only update their status without providing a human-readable explanation. This makes it impossible for UI users or automated scraping tools to distinguish between a timeout, an application failure, or a deliberate operator action.

This change adds a failure_reason field for these scenarios, so the results server records why a job was terminated and not only that it ended up dead.

What I changed

This PR ensures a failure_reason is populated and pushed to the results server in three scenarios (the first three items below), and adds a unit test:

report.py: The try_mark_run_dead() function now accepts a reason argument. If provided, this reason is injected into the job info as the failure_reason.

kill.py: When a user kills a job or run, the system now pushes a status='dead' update that includes a failure_reason (e.g., "killed by ").

__init__.py: When package checks fail (which also pushes status='dead'), the specific error message is now included as the failure_reason.

test_report_dead_reason.py: A new unit test is added to mock the reporter and verify that calling try_mark_run_dead(..., reason=...) correctly includes the failure_reason in the final report.
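As a rough illustration of the update shape described above, here is a minimal sketch. It does not reproduce the teuthology code: push_job_info and dead_update are hypothetical stand-ins, the job IDs are made up, and the package error text is only an example. The point is that each path sends status='dead' together with a reason string.

```python
# Hypothetical stand-in for the call that pushes job info to the results
# server; it only records what would be sent.
pushed = []


def push_job_info(job_config, extra_info):
    """Record the update instead of sending it over HTTP."""
    pushed.append((job_config["job_id"], dict(extra_info)))


def dead_update(reason=None):
    """Build a status update; attach failure_reason only when a reason is given."""
    update = {"status": "dead"}
    if reason:
        update["failure_reason"] = reason
    return update


# Manual kill: the reason names the operator action.
push_job_info({"job_id": "101"}, dead_update("killed by user"))

# Package check failure: the reason is the error message itself.
msg = "package foo not found for the requested version"  # illustrative text
push_job_info({"job_id": "102"}, dead_update(msg))

for job_id, update in pushed:
    print(job_id, update)
```

With a payload like this, a consumer of the results server can tell a deliberate kill from a package failure by reading failure_reason alone.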

Tests performed (local)

Installed the package in editable mode in a new venv.

Ran the new unit test successfully:

pytest -q test_report_dead_reason.py
1 passed

… add test

Annmool · 2025-11-04T13:38:07Z

Hey @kamoltat, @amathuria, and @djgalloway, my PR is ready—please take a look when you can.

amathuria · 2025-11-07T05:29:10Z

teuthology/report.py

+                    # codebase so tooling can pick it up.
+                    job_info['failure_reason'] = reason
+
+                reporter.report_job(run_name, job_id, job_info=job_info)


I think you still need dead=True here.
Otherwise it defaults to False:
def report_job(self, run_name, job_id, job_info=None, dead=False)
Once you pass dead=True, the job status is already updated inside report_job():

if dead and get_status(job_info) is None: set_status(job_info, 'dead')

So I think we can stick to just setting the failure reason here.

Sure, got it. I’ll make the suggested changes accordingly.
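The review point above can be sketched as follows. This is a simplified model, not the real ResultsReporter: SketchReporter, get_status, and set_status here are stand-ins written for illustration, and only the report_job signature and the "if dead and get_status(job_info) is None" check are taken from the comment. It shows why the caller only needs to supply failure_reason when dead=True is passed.

```python
def get_status(info):
    """Stand-in: return the recorded status, or None when none is set."""
    return info.get("status")


def set_status(info, status):
    """Stand-in: record the status on the job info dict."""
    info["status"] = status


class SketchReporter:
    """Simplified model of a reporter; it collects reports in memory."""

    def __init__(self):
        self.sent = []

    def report_job(self, run_name, job_id, job_info=None, dead=False):
        info = dict(job_info or {})
        # Same check as quoted in the review: fill in 'dead' only if the
        # caller did not already set a status.
        if dead and get_status(info) is None:
            set_status(info, "dead")
        self.sent.append((run_name, job_id, info))
        return info


reporter = SketchReporter()
reason = {"failure_reason": "killed by user"}

with_flag = reporter.report_job("run-a", "1", job_info=reason, dead=True)
without_flag = reporter.report_job("run-a", "2", job_info=reason)

print(with_flag)
print(without_flag)
```

In this model the call without dead=True sends the reason but no status, which is the gap the reviewer pointed out.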

Annmool · 2025-11-12T17:46:44Z

Steps performed

->Searched codebase for where jobs are marked dead and where job info is pushed (try_push_job_info, reporter.report_job, kill, report.py).

->Implemented try_mark_run_dead(run_name, reason=None) in report.py to inject job_info['failure_reason'] = reason when provided.

->Updated kill.py:

kill_job() now pushes dict(status='dead', failure_reason='killed by user').

kill_run() calls report.try_mark_run_dead(run_name, reason='killed by user').

->Updated package-failure path (task/internal/__init__.py): push status='dead' with failure_reason=msg.

->Added unit test teuthology/test/test_report_dead_reason.py that mocks ResultsReporter and asserts failure_reason is included when try_mark_run_dead(..., reason=...) is called.

After the suggestion

->Implemented the suggested change in report.py: reporter.report_job(...) is now called with dead=True.

->Re-ran the focused unit test (test_report_dead_reason.py) to confirm the behavior is unchanged → passed.

->Re-ran the full test suite → all tests passed locally.
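For readers who want to see the shape of such a test, here is a hedged sketch using unittest.mock. The try_mark_run_dead below is a simplified stand-in that takes the reporter as an argument (an assumption made so it can be mocked without patching); the actual function in report.py and the actual test file may differ. The run name, job IDs, and statuses are made up.

```python
from unittest import mock


def try_mark_run_dead(reporter, run_name, reason=None):
    """Simplified stand-in: mark unfinished jobs dead, attaching a reason."""
    for job in reporter.get_jobs(run_name, fields=["status"]):
        if job["status"] in ("pass", "fail", "dead"):
            continue  # finished jobs are left alone
        job_info = {}
        if reason:
            job_info["failure_reason"] = reason
        # dead=True lets report_job() set the status itself.
        reporter.report_job(run_name, job["job_id"], job_info=job_info, dead=True)


def test_failure_reason_is_reported():
    reporter = mock.Mock()
    reporter.get_jobs.return_value = [
        {"job_id": "1", "status": "running"},
        {"job_id": "2", "status": "pass"},
    ]

    try_mark_run_dead(reporter, "run-a", reason="killed by user")

    # Only the unfinished job is reported, with the reason and dead=True.
    reporter.report_job.assert_called_once_with(
        "run-a", "1", job_info={"failure_reason": "killed by user"}, dead=True
    )


test_failure_reason_is_reported()
```

Asserting on the exact keyword arguments catches both regressions discussed in this thread: a missing failure_reason and a missing dead=True.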

report: include failure_reason for manual kills and marking run dead;…

26c44ec

… add test

Annmool requested a review from a team as a code owner November 3, 2025 06:33

Annmool requested review from amathuria and kamoltat and removed request for a team November 3, 2025 06:33

ci: sanitize docker image tag; fix flake8 unused import in test

4e2e59d

amathuria requested changes Nov 7, 2025

chore: remove explanatory comments from modified reporting paths

f86da1e

Annmool requested a review from amathuria November 7, 2025 13:41
