worker: remove TodoWrite tool#176
Conversation
|
Thank you for the idea! I generally agree with the direction, but last time I tested it, it showed some regression. I'll run it with gemini and on a 1000 dataset - 10 is way too noisy for this. |
|
Thank you for your quick feedback! It will be interesting to see the results from a bigger dataset. |
|
Hi @rgushchin , I have done some more benchmarking, again with the tiny benchmarking, usingQwen3-Coder-30B-A3B-Instruct-Q
Br, Anders |
|
Hey Anders, The changes look good, and the results are promising! You mentioned that this was done with the tiny benchmark, correct? Would you kindly elaborate?
|
|
Hi Derek, I only ran one A and B comparison per model (Sonnet and Qwen3-Coder-30B). I ran benchmarks/benchmark_tiny.json all four times. I only had the db from the last run saved. With "unrelated" I mean the findings it didn't match the specific known bug being tested for. Per-patch unrelated findings:
The before findings are fewer but more specific. Example from t7xx (3 findings total): The after findings are more numerous but vaguer. Example from t7xx (15 findings total): It looks to me that removing TodoWrite freed up turn budget by not wasting it on no-ops, but in my run the model used that budget to produce more findings rather than deeper analysis. Br, Anders Heimer |
|
thank you for the expanded findings!
this is interesting behavior, if I had to guess, the TodoWrite specified what the finding was, and then spent time aggregating/validating them? Do you have any access to the logs, if they exist? (example on prod instance via the "View Raw Log" button: https://sashiko.dev/#/log/54060)
Now I'm curious -- if adding a shim to |
1b67fa5 to
052b1dc
Compare
|
Thanks Derek, I had missed that aspect. Unfortunately, the initial logs from the runs I mentioned are gone. I used Kiro to collect whatever information I still had lying around, hence the odd format. Those notes are attached, I think it shows what you are after. From the logs, it does not look like TodoWrite was discovering bugs directly. Most TodoWrite calls were checklist/status churn. A few calls did contain concrete candidate-like text, but that only seems useful within the same stage unless the model also emits the candidate as a structured concern. As I understand it, later stages do not consume TODO.md as persistent review state. I tested your theory by preserving the useful part in the finding pipeline by emitting concrete suspected bugs with structured location data, for later stages. In the tiny benchmark I ran with this change, that did preserve structured locations for the final findings. The downside is that the richer structured output appears to put more pressure on the JSON path, at least with Kiro. In my partial run it caused more malformed-output/schema failures and timeout churn. I need to look in to that more in detail. |
|
Hi Anders!
Thank you for the logs. I see what you mean by malformed output in some of those results. Specifically referencing review-2-patchset-5 in this case. Running these experiments with these minor tweaks seems... annoying, to say the least.
I have been thinking about expanding the tool calls to include something like In review-3-patchset-6, the generated "reasoning" text is also needlessly verbose. it is somewhat of a joke, but we could consider editing the prompt headers to reflect something like this: maybe that would alleviate the "pressure" on your JSON path. Anyway, the findings look good so far -- it would be nice if we could strike a balance between completely eliminating TodoWrite and still maintaining some detail in the findings. Cheers, |
052b1dc to
1d3cfe2
Compare
|
btrfs_raid_56_before.html
I think something like that would be an improvement. Especially some of the smaller models struggle with producing JSON output. However in my recent runs with Sonnet, there has been only a few JSON errors, but the review always recovered though.
Seriously, that would be interesting to test out. I have updated the PR and the two mechanical goals are working. TodoWrite is no longer used, and saved findings now preserve structured locations. Look at the attached html pages. In the current run I see 0 TodoWrite calls and where findings had empty locations, they all have the location preserved. Br, Anders |
|
Hey Anders, Thanks for the changes -- results are good. Cheers, |
|
Hi Derek, I did run the tiny-benchmark with the updated PR. This time the judge verdict was that 5 out of 9 was detected. I have another run ongoing, maybe I will manage to do two until tomorrow. 2.html Br, Anders |
|
Hi, I have run two more tiny benchmarks, no new issues. Both detected 5 out of 9. Br, Anders |
|
Hi Anders, I quickly scanned the attached .html files -- thank you for uploading them, they LGTM. Overall, the changes appear to be a net positive. Like Roman said, data points on the scale of hundreds/thousands as opposed to just 10 may make a better case, but the results so far have been convincing. I will append the output of |
|
I'm going to run it through 1k benchmark. As I said, overall I'm supportive on the idea, but it might require corresponding changes in prompts plus maybe something else, let's see. And tests on 10 patches are not really conclusive, unfortunately. |
|
I ran the full benchmark and it's a parity. Please, rebase and I gonna merge it. Thanks! before: after: |
TodoWrite only existed to satisfy vendored prompts. It appended to TODO.md, but the worker never read that file after a stage. Later stages only receive serialized concerns/findings. Removing it can still affect review quality because tool-call arguments remain in the stage-local transcript. TodoWrite may have acted as scratch space for candidate bug locations before the stage emitted final JSON. Preserve that signal explicitly: treat TodoWrite instructions as an internal checklist, emit concrete candidates as concerns with locations, and require stages 8-10 to carry those locations forward. This promotes candidate locations into the cross-stage pipeline earlier, which can increase false-positives. Stage 8/9 remain responsible for deduplicating, verifying, and filtering candidates before they become reported findings. Remove the advertised tool, dispatcher entry, implementation, and stale design/test references. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Anders Heimer <anders.heimer@est.tech>
e2edb31 to
c2b7e3e
Compare
|
Thanks Roman and Derek for the reviews, benchmark run, and feedback. I rebased the PR onto current main. Br, Anders |
Upstream review-prompts (vendored in third_party/prompts/) instruct the model to call TodoWrite for each analysis task, assuming Claude Code's native task-tracking tool. Sashiko's local todowrite implementation merely appended a line to a TODO.md file in the worktree and returned; it had no analytical effect on the review.
In practice this caused the model to spend a large fraction of its per-stage turn budget issuing TodoWrite calls. In one observed Stage 3 run, 27 of 46 turns were TodoWrite invocations, which pushed the stage toward the max_interactions limit and left insufficient budget for the actual callstack analysis.
Remove the TodoWrite tool entirely: the declaration in get_declarations_generic, the dispatcher entry, the todowrite method, the now-unused AsyncWriteExt import, and the companion unit test. The tool-name normalization test is updated to use a tool that still exists (list_dir). The DESIGN_REVIEW_WORKER.md references to todo_write and the Todo List state are dropped. Once the tool is no longer advertised, the upstream prompt instruction "Add each Task into a TodoWrite" becomes a harmless no-op.
Measured on a 9-entry Linux-kernel benchmark set with the Kiro CLI provider running claude-sonnet-4.5:
No regressions were observed: every patch detected before is still detected, with lower turn counts and comparable or smaller output.