Skip to content

worker: remove TodoWrite tool#176

Merged
rgushchin merged 1 commit into
sashiko-dev:mainfrom
anders-heimer:remove-todo-write-tool
May 28, 2026
Merged

worker: remove TodoWrite tool#176
rgushchin merged 1 commit into
sashiko-dev:mainfrom
anders-heimer:remove-todo-write-tool

Conversation

@anders-heimer
Copy link
Copy Markdown
Contributor

Upstream review-prompts (vendored in third_party/prompts/) instruct the model to call TodoWrite for each analysis task, assuming Claude Code's native task-tracking tool. Sashiko's local todowrite implementation merely appended a line to a TODO.md file in the worktree and returned; it had no analytical effect on the review.

In practice this caused the model to spend a large fraction of its per-stage turn budget issuing TodoWrite calls. In one observed Stage 3 run, 27 of 46 turns were TodoWrite invocations, which pushed the stage toward the max_interactions limit and left insufficient budget for the actual callstack analysis.

Remove the TodoWrite tool entirely: the declaration in get_declarations_generic, the dispatcher entry, the todowrite method, the now-unused AsyncWriteExt import, and the companion unit test. The tool-name normalization test is updated to use a tool that still exists (list_dir). The DESIGN_REVIEW_WORKER.md references to todo_write and the Todo List state are dropped. Once the tool is no longer advertised, the upstream prompt instruction "Add each Task into a TodoWrite" becomes a harmless no-op.

Measured on a 9-entry Linux-kernel benchmark set with the Kiro CLI provider running claude-sonnet-4.5:

  • Detection rate: 4/9 (44%) -> 6/9 (67%).
  • Average turns per review: 99 -> 55.7 (-43.8%).
  • Average output tokens per review: 17306 -> 14428 (-16.6%).
  • Two previously missed patches (cifs SendReceive response-buffer leak, nci command workqueue use-after-free) are now detected correctly; the nci review dropped from 159 turns with no findings to 29 turns with a correct finding.

No regressions were observed: every patch detected before is still detected, with lower turn counts and comparable or smaller output.

@rgushchin
Copy link
Copy Markdown
Member

Thank you for the idea! I generally agree with the direction, but last time I tested it, it showed some regression. I'll run it with gemini and on a 1000 dataset - 10 is way too noisy for this.

@anders-heimer
Copy link
Copy Markdown
Contributor Author

Thank you for your quick feedback!

It will be interesting to see the results from a bigger dataset.

@anders-heimer
Copy link
Copy Markdown
Contributor Author

Hi @rgushchin ,

I have done some more benchmarking, again with the tiny benchmarking, usingQwen3-Coder-30B-A3B-Instruct-Q
5_K_M on an L40S. As you say the measurements are noisy, but the improvements makes sense theoretically at least.

Metric After (TodoWrite removed) Before
Detected 4/9 3/9
Missed 5/9 6/9
Runtime 5,430s (1.5h) 20,825s (5.8h)
Avg duration/patch 603s 2,314s
Total turns 949 1,334
Avg turns/patch 105.4 148.2
Format validation failures 343 306
Aborted reviews 24 23
Full review restarts 20 18

Br, Anders

@derekbarbosa
Copy link
Copy Markdown
Collaborator

Hey Anders,

The changes look good, and the results are promising!

You mentioned that this was done with the tiny benchmark, correct? Would you kindly elaborate?

  • How many trials of the tiny benchmark were performed (before/after)
  • Did you store the output results of the model's findings anywhere? I see detected & missed as entries there, but I am also curious if the review "output" response blabbed on about anything unrelated in either case (false positives)?

@anders-heimer
Copy link
Copy Markdown
Contributor Author

anders-heimer commented May 19, 2026

Hi Derek,

I only ran one A and B comparison per model (Sonnet and Qwen3-Coder-30B). I ran benchmarks/benchmark_tiny.json all four times.

I only had the db from the last run saved. With "unrelated" I mean the findings it didn't match the specific known bug being tested for.

Per-patch unrelated findings:

Target bug Before After
btrfs: stripe_uptodate_bitmap memleak 0 2
cgroup: NULL deref on dmem.max write 5 10
cgroup: RCU usage without read lock 5 9
nfc: workqueue UAF after rfkill 5 7
t7xx: MAX_SKB_FRAGS overflow 3 15
smb: response buffer memleak 1 2
smb: SendReceive() buffer leak 0 0
nvme: swapped sgl parameters 0 0
flex_proportions: deadlock 0 0

The before findings are fewer but more specific. Example from t7xx (3 findings total):
"In t7xx_dpmaif_rx_buf_alloc(), the error path at line 264 uses while (--i > 0) which skips unmapping index 0..."

The after findings are more numerous but vaguer. Example from t7xx (15 findings total):
"Hardware register access without power domain/clk checks"
"Potential integer overflow in ring buffer calculations"

It looks to me that removing TodoWrite freed up turn budget by not wasting it on no-ops, but in my run the model used that budget to produce more findings rather than deeper analysis.

Br, Anders Heimer

@derekbarbosa
Copy link
Copy Markdown
Collaborator

derekbarbosa commented May 20, 2026

thank you for the expanded findings!

The before findings are fewer but more specific. Example from t7xx (3 findings total):
"In t7xx_dpmaif_rx_buf_alloc(), the error path at line 264 uses while (--i > 0) which skips unmapping index 0..."

The after findings are more numerous but vaguer. Example from t7xx (15 findings total):
"Hardware register access without power domain/clk checks"
"Potential integer overflow in ring buffer calculations"

this is interesting behavior, if I had to guess, the TodoWrite specified what the finding was, and then spent time aggregating/validating them?

Do you have any access to the logs, if they exist? (example on prod instance via the "View Raw Log" button: https://sashiko.dev/#/log/54060)

It looks to me that removing TodoWrite freed up turn budget by not wasting it on no-ops, but in my run the model used that budget to produce more findings rather than deeper analysis.

Now I'm curious -- if adding a shim to prompts.rs, preferably one of the review stages (say stages 8/9/10) may "massage" out more precise findings. Something akin to "please be specific: elaborate what functions and what line numbers trigger the error"

@anders-heimer anders-heimer force-pushed the remove-todo-write-tool branch from 1b67fa5 to 052b1dc Compare May 21, 2026 04:05
@anders-heimer
Copy link
Copy Markdown
Contributor Author

Thanks Derek, I had missed that aspect.

Unfortunately, the initial logs from the runs I mentioned are gone. I used Kiro to collect whatever information I still had lying around, hence the odd format. Those notes are attached, I think it shows what you are after.

From the logs, it does not look like TodoWrite was discovering bugs directly. Most TodoWrite calls were checklist/status churn. A few calls did contain concrete candidate-like text, but that only seems useful within the same stage unless the model also emits the candidate as a structured concern. As I understand it, later stages do not consume TODO.md as persistent review state.

I tested your theory by preserving the useful part in the finding pipeline by emitting concrete suspected bugs with structured location data, for later stages. In the tiny benchmark I ran with this change, that did preserve structured locations for the final findings.

The downside is that the richer structured output appears to put more pressure on the JSON path, at least with Kiro. In my partial run it caused more malformed-output/schema failures and timeout churn. I need to look in to that more in detail.
review-7-patchset-1.html
review-2-patchset-5.html
review-8-patchset-7.html
review-1-patchset-3.html
review-5-patchset-8.html
review-6-patchset-4.html
review-3-patchset-6.html
review-4-patchset-2.html
review-1-patchset-1.html
review-2-patchset-2.html
review-5-patchset-1.html
review-6-patchset-6.html

@derekbarbosa
Copy link
Copy Markdown
Collaborator

Hi Anders!

structured output appears to put more pressure on the JSON path

Thank you for the logs. I see what you mean by malformed output in some of those results. Specifically referencing review-2-patchset-5 in this case.

Running these experiments with these minor tweaks seems... annoying, to say the least.

prompts.rs has some try_parse used in a match arm in json_request as some sort of validation.

        match try_parse(content, &validate) {
            Ok(v) => return Some(v),
            Err(e) => {
                warn!("{}: {}, retrying with correction", label, e);
                let mut retry_req = retry_base;
                retry_req.messages.push(AiMessage {
                    role: AiRole::Assistant,
                    content: Some(content.to_string()),
                    thought: None,
                    thought_signature: None,
                    tool_calls: None,
                    tool_call_id: None,
                });
                retry_req.messages.push(AiMessage {
                    role: AiRole::User,
                    content: Some(format!(
                        "Your response is not valid: {}\nRespond with ONLY valid JSON conforming to the schema. No markdown, no explanation.",
                        e
                    )),
                    thought: None,
                    thought_signature: None,
                    tool_calls: None,
                    tool_call_id: None,
                });

I have been thinking about expanding the tool calls to include something like jsonschema with jq to validate the JSON output from the model, or if that would introduce more "pressure" or churn.

In review-3-patchset-6, the generated "reasoning" text is also needlessly verbose.

it is somewhat of a joke, but we could consider editing the prompt headers to reflect something like this:

Respond tersely. Cut all filler, keep technical substance.
- Drop filler (just, really, basically, actually).
- Drop pleasantries (sure, certainly, happy to).
- No hedging. Fragments fine. Short synonyms.
- Technical terms stay exact. Code blocks unchanged.
- Pattern: [thing] [action] [reason]. [next step].

maybe that would alleviate the "pressure" on your JSON path.

Anyway, the findings look good so far -- it would be nice if we could strike a balance between completely eliminating TodoWrite and still maintaining some detail in the findings.

Cheers,

@anders-heimer anders-heimer force-pushed the remove-todo-write-tool branch from 052b1dc to 1d3cfe2 Compare May 24, 2026 19:45
@anders-heimer
Copy link
Copy Markdown
Contributor Author

anders-heimer commented May 24, 2026

btrfs_raid_56_before.html
btrfs_raid_56_after.html
Hi Derek,

I have been thinking about expanding the tool calls to include something like jsonschema with jq to validate >the JSON output from the model, or if that would introduce more "pressure" or churn.

I think something like that would be an improvement. Especially some of the smaller models struggle with producing JSON output. However in my recent runs with Sonnet, there has been only a few JSON errors, but the review always recovered though.

In review-3-patchset-6, the generated "reasoning" text is also needlessly verbose.

it is somewhat of a joke, but we could consider editing the prompt headers to reflect something like this:

Seriously, that would be interesting to test out.

I have updated the PR and the two mechanical goals are working. TodoWrite is no longer used, and saved findings now preserve structured locations. Look at the attached html pages. In the current run I see 0 TodoWrite calls and where findings had empty locations, they all have the location preserved.

Br, Anders

@derekbarbosa
Copy link
Copy Markdown
Collaborator

Hey Anders,

Thanks for the changes -- results are good.
Tomorrow I will be reviewing a few more PRs -- would you be so kind to collect more data with this current revision?
Sorry for the tedium, we just need to ensure that we don't see any degradation on the deployed instance.

Cheers,

@anders-heimer
Copy link
Copy Markdown
Contributor Author

Hi Derek,

I did run the tiny-benchmark with the updated PR. This time the judge verdict was that 5 out of 9 was detected.
This compared to 3 detected (baseline) and 4 detected (only TodoWrite removal).

I have another run ongoing, maybe I will manage to do two until tomorrow.

2.html
3.html
5.html
7.html
10.html
11.html
12.html

Br, Anders

@anders-heimer
Copy link
Copy Markdown
Contributor Author

Hi,

I have run two more tiny benchmarks, no new issues. Both detected 5 out of 9.

Br, Anders

@derekbarbosa
Copy link
Copy Markdown
Collaborator

Hi Anders,

I quickly scanned the attached .html files -- thank you for uploading them, they LGTM.

Overall, the changes appear to be a net positive. Like Roman said, data points on the scale of hundreds/thousands as opposed to just 10 may make a better case, but the results so far have been convincing.

I will append the output of /review-pr here. Please treat it as an ACK from me.

✦ The review of commit 1d3cfe2ce23e3059a9c577bf4eaf0d6786fd989c is complete. The changes successfully remove the TodoWrite tool while ensuring that candidate locations are preserved and
  promoted through the multi-stage review pipeline.

  Summary of Findings

  🟢 Design Alignment
   - Removal of TodoWrite: The tool was a "black hole" as the worker never read the written file back. Promoting candidate locations into the JSON concerns array ensures this metadata
     survives between stages.
   - Compatibility Mitigation: The addition of the "TodoWrite compatibility" instruction in src/worker/prompts.rs is an excellent defensive measure for vendored prompts that might still
     attempt to use the deleted tool.

  🟡 Complexity / Logic
   - Specificity Requirement: The new requirements for locations (adding line, why_this_location_matters, and changing function/symbol to function_or_symbol) provide much more actionable
     data for the LKML report generation.
   - Prompt Verbosity: The JSON examples in the prompts have become quite large. While this is necessary for AI few-shot performance and schema adherence, it may increase token
     consumption.

  🔴 Safety / Bugs
   - None identified. The cleanup of the dispatcher and tests in src/worker/tools.rs was thorough.

  Review Report


  ┌──────────┬──────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
  │ Severity │ Category     │ Description                                                                                                                                        │
  ├──────────┼──────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 🟢       │ Design       │ Alignment: TodoWrite removal aligns with the goal of keeping the cross-stage pipeline data-driven via JSON rather than side-effecting local files. │
  │ 🟢       │ Architecture │ Compatibility: The prompt-level override for TodoWrite ensures vendored kernel prompts don't fail when attempting to use the removed tool.         │
  │ 🔵       │ Style        │ Naming: Consistent update of function/symbol to function_or_symbol across all stage prompts.                                                       │
  └──────────┴──────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘


  No actionable code changes are required for this PR. The implementation is idiomatic and follows the project standards defined in GEMINI.md.

@rgushchin
Copy link
Copy Markdown
Member

I'm going to run it through 1k benchmark. As I said, overall I'm supportive on the idea, but it might require corresponding changes in prompts plus maybe something else, let's see. And tests on 10 patches are not really conclusive, unfortunately.

@rgushchin
Copy link
Copy Markdown
Member

I ran the full benchmark and it's a parity. Please, rebase and I gonna merge it. Thanks!

before:
2026-05-19T15:17:38.777457Z INFO benchmark: Total Entries: 999
2026-05-19T15:17:38.777480Z INFO benchmark: Detected (Exact): 460
2026-05-19T15:17:38.777499Z INFO benchmark: Partially Detected: 68
2026-05-19T15:17:38.777518Z INFO benchmark: Missed: 458
2026-05-19T15:17:38.777536Z INFO benchmark: Not Reviewed/Found: 12
2026-05-19T15:17:38.777556Z INFO benchmark: Skipped (No Description): 1
2026-05-19T15:17:38.777575Z INFO benchmark: Total Concerns (Before Stage 8): 13906
2026-05-19T15:17:38.777596Z INFO benchmark: Total Findings (Final Report): 4214

after:
2026-05-27T20:55:22.904059Z INFO benchmark: Total Entries: 999
2026-05-27T20:55:22.904083Z INFO benchmark: Detected (Exact): 466
2026-05-27T20:55:22.904102Z INFO benchmark: Partially Detected: 62
2026-05-27T20:55:22.904123Z INFO benchmark: Missed: 458
2026-05-27T20:55:22.904141Z INFO benchmark: Not Reviewed/Found: 12
2026-05-27T20:55:22.904159Z INFO benchmark: Skipped (No Description): 1
2026-05-27T20:55:22.904180Z INFO benchmark: Total Concerns (Before Stage 8): 14021
2026-05-27T20:55:22.904200Z INFO benchmark: Total Findings (Final Report): 4160

TodoWrite only existed to satisfy vendored prompts. It appended to
TODO.md, but the worker never read that file after a stage. Later
stages only receive serialized concerns/findings.

Removing it can still affect review quality because tool-call arguments
remain in the stage-local transcript. TodoWrite may have acted as
scratch space for candidate bug locations before the stage emitted
final JSON.

Preserve that signal explicitly: treat TodoWrite instructions as an
internal checklist, emit concrete candidates as concerns with locations,
and require stages 8-10 to carry those locations forward.

This promotes candidate locations into the cross-stage pipeline earlier,
which can increase false-positives. Stage 8/9 remain responsible for
deduplicating, verifying, and filtering candidates before they become
reported findings.

Remove the advertised tool, dispatcher entry, implementation, and stale
design/test references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Anders Heimer <anders.heimer@est.tech>
@anders-heimer anders-heimer force-pushed the remove-todo-write-tool branch from e2edb31 to c2b7e3e Compare May 28, 2026 09:52
@anders-heimer
Copy link
Copy Markdown
Contributor Author

Thanks Roman and Derek for the reviews, benchmark run, and feedback.

I rebased the PR onto current main.

Br, Anders

@rgushchin rgushchin merged commit dc400a5 into sashiko-dev:main May 28, 2026
3 checks passed
@anders-heimer anders-heimer deleted the remove-todo-write-tool branch May 28, 2026 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants