worker: remove TodoWrite tool by anders-heimer · Pull Request #176 · sashiko-dev/sashiko

anders-heimer · 2026-05-11T17:57:26Z

Upstream review-prompts (vendored in third_party/prompts/) instruct the model to call TodoWrite for each analysis task, assuming Claude Code's native task-tracking tool. Sashiko's local todowrite implementation merely appended a line to a TODO.md file in the worktree and returned; it had no analytical effect on the review.

In practice this caused the model to spend a large fraction of its per-stage turn budget issuing TodoWrite calls. In one observed Stage 3 run, 27 of 46 turns were TodoWrite invocations, which pushed the stage toward the max_interactions limit and left insufficient budget for the actual callstack analysis.

Remove the TodoWrite tool entirely: the declaration in get_declarations_generic, the dispatcher entry, the todowrite method, the now-unused AsyncWriteExt import, and the companion unit test. The tool-name normalization test is updated to use a tool that still exists (list_dir). The DESIGN_REVIEW_WORKER.md references to todo_write and the Todo List state are dropped. Once the tool is no longer advertised, the upstream prompt instruction "Add each Task into a TodoWrite" becomes a harmless no-op.

Measured on a 9-entry Linux-kernel benchmark set with the Kiro CLI provider running claude-sonnet-4.5:

Detection rate: 4/9 (44%) -> 6/9 (67%).
Average turns per review: 99 -> 55.7 (-43.8%).
Average output tokens per review: 17306 -> 14428 (-16.6%).
Two previously missed patches (cifs SendReceive response-buffer leak, nci command workqueue use-after-free) are now detected correctly; the nci review dropped from 159 turns with no findings to 29 turns with a correct finding.

No regressions were observed: every patch detected before is still detected, with lower turn counts and comparable or smaller output.

rgushchin · 2026-05-11T19:18:52Z

Thank you for the idea! I generally agree with the direction, but last time I tested it, it showed some regression. I'll run it with gemini and on a 1000 dataset - 10 is way too noisy for this.

anders-heimer · 2026-05-12T12:25:41Z

Thank you for your quick feedback!

It will be interesting to see the results from a bigger dataset.

anders-heimer · 2026-05-19T13:41:12Z

Hi @rgushchin ,

I have done some more benchmarking, again with the tiny benchmarking, usingQwen3-Coder-30B-A3B-Instruct-Q
5_K_M on an L40S. As you say the measurements are noisy, but the improvements makes sense theoretically at least.

Metric	After (TodoWrite removed)	Before
Detected	4/9	3/9
Missed	5/9	6/9
Runtime	5,430s (1.5h)	20,825s (5.8h)
Avg duration/patch	603s	2,314s
Total turns	949	1,334
Avg turns/patch	105.4	148.2
Format validation failures	343	306
Aborted reviews	24	23
Full review restarts	20	18

Br, Anders

derekbarbosa · 2026-05-19T17:43:01Z

Hey Anders,

The changes look good, and the results are promising!

You mentioned that this was done with the tiny benchmark, correct? Would you kindly elaborate?

How many trials of the tiny benchmark were performed (before/after)
Did you store the output results of the model's findings anywhere? I see detected & missed as entries there, but I am also curious if the review "output" response blabbed on about anything unrelated in either case (false positives)?

anders-heimer · 2026-05-19T18:41:40Z

Hi Derek,

I only ran one A and B comparison per model (Sonnet and Qwen3-Coder-30B). I ran benchmarks/benchmark_tiny.json all four times.

I only had the db from the last run saved. With "unrelated" I mean the findings it didn't match the specific known bug being tested for.

Per-patch unrelated findings:

Target bug	Before	After
btrfs: stripe_uptodate_bitmap memleak	0	2
cgroup: NULL deref on dmem.max write	5	10
cgroup: RCU usage without read lock	5	9
nfc: workqueue UAF after rfkill	5	7
t7xx: MAX_SKB_FRAGS overflow	3	15
smb: response buffer memleak	1	2
smb: SendReceive() buffer leak	0	0
nvme: swapped sgl parameters	0	0
flex_proportions: deadlock	0	0

The before findings are fewer but more specific. Example from t7xx (3 findings total):
"In t7xx_dpmaif_rx_buf_alloc(), the error path at line 264 uses while (--i > 0) which skips unmapping index 0..."

The after findings are more numerous but vaguer. Example from t7xx (15 findings total):
"Hardware register access without power domain/clk checks"
"Potential integer overflow in ring buffer calculations"

It looks to me that removing TodoWrite freed up turn budget by not wasting it on no-ops, but in my run the model used that budget to produce more findings rather than deeper analysis.

Br, Anders Heimer

derekbarbosa · 2026-05-20T12:13:59Z

thank you for the expanded findings!

The before findings are fewer but more specific. Example from t7xx (3 findings total):
"In t7xx_dpmaif_rx_buf_alloc(), the error path at line 264 uses while (--i > 0) which skips unmapping index 0..."

The after findings are more numerous but vaguer. Example from t7xx (15 findings total):
"Hardware register access without power domain/clk checks"
"Potential integer overflow in ring buffer calculations"

this is interesting behavior, if I had to guess, the TodoWrite specified what the finding was, and then spent time aggregating/validating them?

Do you have any access to the logs, if they exist? (example on prod instance via the "View Raw Log" button: https://sashiko.dev/#/log/54060)

It looks to me that removing TodoWrite freed up turn budget by not wasting it on no-ops, but in my run the model used that budget to produce more findings rather than deeper analysis.

Now I'm curious -- if adding a shim to prompts.rs, preferably one of the review stages (say stages 8/9/10) may "massage" out more precise findings. Something akin to "please be specific: elaborate what functions and what line numbers trigger the error"

anders-heimer · 2026-05-21T13:13:46Z

Thanks Derek, I had missed that aspect.

Unfortunately, the initial logs from the runs I mentioned are gone. I used Kiro to collect whatever information I still had lying around, hence the odd format. Those notes are attached, I think it shows what you are after.

From the logs, it does not look like TodoWrite was discovering bugs directly. Most TodoWrite calls were checklist/status churn. A few calls did contain concrete candidate-like text, but that only seems useful within the same stage unless the model also emits the candidate as a structured concern. As I understand it, later stages do not consume TODO.md as persistent review state.

I tested your theory by preserving the useful part in the finding pipeline by emitting concrete suspected bugs with structured location data, for later stages. In the tiny benchmark I ran with this change, that did preserve structured locations for the final findings.

The downside is that the richer structured output appears to put more pressure on the JSON path, at least with Kiro. In my partial run it caused more malformed-output/schema failures and timeout churn. I need to look in to that more in detail.
review-7-patchset-1.html
review-2-patchset-5.html
review-8-patchset-7.html
review-1-patchset-3.html
review-5-patchset-8.html
review-6-patchset-4.html
review-3-patchset-6.html
review-4-patchset-2.html
review-1-patchset-1.html
review-2-patchset-2.html
review-5-patchset-1.html
review-6-patchset-6.html

derekbarbosa · 2026-05-23T13:24:25Z

Hi Anders!

structured output appears to put more pressure on the JSON path

Thank you for the logs. I see what you mean by malformed output in some of those results. Specifically referencing review-2-patchset-5 in this case.

Running these experiments with these minor tweaks seems... annoying, to say the least.

prompts.rs has some try_parse used in a match arm in json_request as some sort of validation.

        match try_parse(content, &validate) {
            Ok(v) => return Some(v),
            Err(e) => {
                warn!("{}: {}, retrying with correction", label, e);
                let mut retry_req = retry_base;
                retry_req.messages.push(AiMessage {
                    role: AiRole::Assistant,
                    content: Some(content.to_string()),
                    thought: None,
                    thought_signature: None,
                    tool_calls: None,
                    tool_call_id: None,
                });
                retry_req.messages.push(AiMessage {
                    role: AiRole::User,
                    content: Some(format!(
                        "Your response is not valid: {}\nRespond with ONLY valid JSON conforming to the schema. No markdown, no explanation.",
                        e
                    )),
                    thought: None,
                    thought_signature: None,
                    tool_calls: None,
                    tool_call_id: None,
                });

I have been thinking about expanding the tool calls to include something like jsonschema with jq to validate the JSON output from the model, or if that would introduce more "pressure" or churn.

In review-3-patchset-6, the generated "reasoning" text is also needlessly verbose.

it is somewhat of a joke, but we could consider editing the prompt headers to reflect something like this:

Respond tersely. Cut all filler, keep technical substance.
- Drop filler (just, really, basically, actually).
- Drop pleasantries (sure, certainly, happy to).
- No hedging. Fragments fine. Short synonyms.
- Technical terms stay exact. Code blocks unchanged.
- Pattern: [thing] [action] [reason]. [next step].

maybe that would alleviate the "pressure" on your JSON path.

Anyway, the findings look good so far -- it would be nice if we could strike a balance between completely eliminating TodoWrite and still maintaining some detail in the findings.

Cheers,

anders-heimer · 2026-05-24T20:01:04Z

btrfs_raid_56_before.html
btrfs_raid_56_after.html
Hi Derek,

I have been thinking about expanding the tool calls to include something like jsonschema with jq to validate >the JSON output from the model, or if that would introduce more "pressure" or churn.

I think something like that would be an improvement. Especially some of the smaller models struggle with producing JSON output. However in my recent runs with Sonnet, there has been only a few JSON errors, but the review always recovered though.

In review-3-patchset-6, the generated "reasoning" text is also needlessly verbose.

it is somewhat of a joke, but we could consider editing the prompt headers to reflect something like this:

Seriously, that would be interesting to test out.

I have updated the PR and the two mechanical goals are working. TodoWrite is no longer used, and saved findings now preserve structured locations. Look at the attached html pages. In the current run I see 0 TodoWrite calls and where findings had empty locations, they all have the location preserved.

Br, Anders

derekbarbosa · 2026-05-25T17:32:58Z

Hey Anders,

Thanks for the changes -- results are good.
Tomorrow I will be reviewing a few more PRs -- would you be so kind to collect more data with this current revision?
Sorry for the tedium, we just need to ensure that we don't see any degradation on the deployed instance.

Cheers,

anders-heimer · 2026-05-25T20:31:31Z

Hi Derek,

I did run the tiny-benchmark with the updated PR. This time the judge verdict was that 5 out of 9 was detected.
This compared to 3 detected (baseline) and 4 detected (only TodoWrite removal).

I have another run ongoing, maybe I will manage to do two until tomorrow.

2.html
3.html
5.html
7.html
10.html
11.html
12.html

Br, Anders

anders-heimer · 2026-05-26T19:44:11Z

Hi,

I have run two more tiny benchmarks, no new issues. Both detected 5 out of 9.

Br, Anders

derekbarbosa · 2026-05-26T20:29:40Z

Hi Anders,

I quickly scanned the attached .html files -- thank you for uploading them, they LGTM.

Overall, the changes appear to be a net positive. Like Roman said, data points on the scale of hundreds/thousands as opposed to just 10 may make a better case, but the results so far have been convincing.

I will append the output of /review-pr here. Please treat it as an ACK from me.

✦ The review of commit 1d3cfe2ce23e3059a9c577bf4eaf0d6786fd989c is complete. The changes successfully remove the TodoWrite tool while ensuring that candidate locations are preserved and
  promoted through the multi-stage review pipeline.

  Summary of Findings

  🟢 Design Alignment
   - Removal of TodoWrite: The tool was a "black hole" as the worker never read the written file back. Promoting candidate locations into the JSON concerns array ensures this metadata
     survives between stages.
   - Compatibility Mitigation: The addition of the "TodoWrite compatibility" instruction in src/worker/prompts.rs is an excellent defensive measure for vendored prompts that might still
     attempt to use the deleted tool.

  🟡 Complexity / Logic
   - Specificity Requirement: The new requirements for locations (adding line, why_this_location_matters, and changing function/symbol to function_or_symbol) provide much more actionable
     data for the LKML report generation.
   - Prompt Verbosity: The JSON examples in the prompts have become quite large. While this is necessary for AI few-shot performance and schema adherence, it may increase token
     consumption.

  🔴 Safety / Bugs
   - None identified. The cleanup of the dispatcher and tests in src/worker/tools.rs was thorough.

  Review Report


  ┌──────────┬──────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
  │ Severity │ Category     │ Description                                                                                                                                        │
  ├──────────┼──────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 🟢       │ Design       │ Alignment: TodoWrite removal aligns with the goal of keeping the cross-stage pipeline data-driven via JSON rather than side-effecting local files. │
  │ 🟢       │ Architecture │ Compatibility: The prompt-level override for TodoWrite ensures vendored kernel prompts don't fail when attempting to use the removed tool.         │
  │ 🔵       │ Style        │ Naming: Consistent update of function/symbol to function_or_symbol across all stage prompts.                                                       │
  └──────────┴──────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘


  No actionable code changes are required for this PR. The implementation is idiomatic and follows the project standards defined in GEMINI.md.

rgushchin · 2026-05-26T21:45:50Z

I'm going to run it through 1k benchmark. As I said, overall I'm supportive on the idea, but it might require corresponding changes in prompts plus maybe something else, let's see. And tests on 10 patches are not really conclusive, unfortunately.

rgushchin · 2026-05-27T20:58:11Z

I ran the full benchmark and it's a parity. Please, rebase and I gonna merge it. Thanks!

before:
2026-05-19T15:17:38.777457Z INFO benchmark: Total Entries: 999
2026-05-19T15:17:38.777480Z INFO benchmark: Detected (Exact): 460
2026-05-19T15:17:38.777499Z INFO benchmark: Partially Detected: 68
2026-05-19T15:17:38.777518Z INFO benchmark: Missed: 458
2026-05-19T15:17:38.777536Z INFO benchmark: Not Reviewed/Found: 12
2026-05-19T15:17:38.777556Z INFO benchmark: Skipped (No Description): 1
2026-05-19T15:17:38.777575Z INFO benchmark: Total Concerns (Before Stage 8): 13906
2026-05-19T15:17:38.777596Z INFO benchmark: Total Findings (Final Report): 4214

after:
2026-05-27T20:55:22.904059Z INFO benchmark: Total Entries: 999
2026-05-27T20:55:22.904083Z INFO benchmark: Detected (Exact): 466
2026-05-27T20:55:22.904102Z INFO benchmark: Partially Detected: 62
2026-05-27T20:55:22.904123Z INFO benchmark: Missed: 458
2026-05-27T20:55:22.904141Z INFO benchmark: Not Reviewed/Found: 12
2026-05-27T20:55:22.904159Z INFO benchmark: Skipped (No Description): 1
2026-05-27T20:55:22.904180Z INFO benchmark: Total Concerns (Before Stage 8): 14021
2026-05-27T20:55:22.904200Z INFO benchmark: Total Findings (Final Report): 4160

TodoWrite only existed to satisfy vendored prompts. It appended to TODO.md, but the worker never read that file after a stage. Later stages only receive serialized concerns/findings. Removing it can still affect review quality because tool-call arguments remain in the stage-local transcript. TodoWrite may have acted as scratch space for candidate bug locations before the stage emitted final JSON. Preserve that signal explicitly: treat TodoWrite instructions as an internal checklist, emit concrete candidates as concerns with locations, and require stages 8-10 to carry those locations forward. This promotes candidate locations into the cross-stage pipeline earlier, which can increase false-positives. Stage 8/9 remain responsible for deduplicating, verifying, and filtering candidates before they become reported findings. Remove the advertised tool, dispatcher entry, implementation, and stale design/test references. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Anders Heimer <anders.heimer@est.tech>

anders-heimer · 2026-05-28T12:30:51Z

Thanks Roman and Derek for the reviews, benchmark run, and feedback.

I rebased the PR onto current main.

Br, Anders

anders-heimer mentioned this pull request May 19, 2026

Required optimization on local review with claude cli #184

Open

anders-heimer force-pushed the remove-todo-write-tool branch from 1b67fa5 to 052b1dc Compare May 21, 2026 04:05

anders-heimer force-pushed the remove-todo-write-tool branch from 052b1dc to 1d3cfe2 Compare May 24, 2026 19:45

derekbarbosa approved these changes May 26, 2026

View reviewed changes

anders-heimer force-pushed the remove-todo-write-tool branch from e2edb31 to c2b7e3e Compare May 28, 2026 09:52

rgushchin merged commit dc400a5 into sashiko-dev:main May 28, 2026
3 checks passed

anders-heimer deleted the remove-todo-write-tool branch May 28, 2026 14:57

Conversation

anders-heimer commented May 11, 2026

Uh oh!

rgushchin commented May 11, 2026

Uh oh!

anders-heimer commented May 12, 2026

Uh oh!

anders-heimer commented May 19, 2026

Uh oh!

derekbarbosa commented May 19, 2026

Uh oh!

anders-heimer commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

derekbarbosa commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

anders-heimer commented May 21, 2026

Uh oh!

derekbarbosa commented May 23, 2026

Uh oh!

anders-heimer commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

derekbarbosa commented May 25, 2026

Uh oh!

anders-heimer commented May 25, 2026

Uh oh!

anders-heimer commented May 26, 2026

Uh oh!

derekbarbosa commented May 26, 2026

Uh oh!

rgushchin commented May 26, 2026

Uh oh!

rgushchin commented May 27, 2026

Uh oh!

anders-heimer commented May 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

anders-heimer commented May 19, 2026 •

edited

Loading

derekbarbosa commented May 20, 2026 •

edited

Loading

anders-heimer commented May 24, 2026 •

edited

Loading