docs(problems): Add debugging problem doc for agentic systems #2549
E2E tests did not run. E2E tests run automatically for org/repo members and collaborators on pull requests. For other contributors, a maintainer must add the See E2E testing guide for details.
PR Summary by Qodo: Add debugging taxonomy & provenance guidance for agentic workflows
Files changed (3)
Code Review by Qodo
1. debugging.md lacks options
## Interpretation logging

[JSONL reasoning traces](../ADRs/0021-jsonl-reasoning-trace-exposure.md) capture the raw record of what an agent did (every prompt, completion, and tool call). But finding *why* an instruction produced wrong behavior in a transcript is forensic work: the agent's interpretation is implicit in its actions, scattered across the trace, and mixed with task-specific reasoning.

Interpretation logging adds a structured layer: the agent states its reading of key instructions as an explicit checklist *before* acting, separate from the free-form reasoning in the JSONL. For example: "I interpret 'ensure backward compatibility' to mean: public API endpoints return the same response schema." This produces an artifact that can be compared against the instruction set directly, without reading the full transcript. The trade-off is cost: it adds tokens to every run, so it may be worth enabling only selectively (for high-stakes roles, or for probationary periods after instruction changes).

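A minimal sketch of what one interpretation-log entry could look like. The record shape is an assumption for illustration: the `InterpretationEntry` class, its field names, and the helper functions are hypothetical and are not part of fullsend or the JSONL trace format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class InterpretationEntry:
    """One line of a hypothetical interpretation checklist."""
    instruction: str  # the instruction text as given to the agent
    reading: str      # the agent's stated reading, recorded before it acts


def render_checklist(entries):
    """Serialize entries as JSON Lines, one entry per line."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in entries)


def unlogged_instructions(instructions, entries):
    """Return instructions for which the agent stated no reading at all."""
    logged = {e.instruction for e in entries}
    return [i for i in instructions if i not in logged]


entries = [
    InterpretationEntry(
        instruction="ensure backward compatibility",
        reading="public API endpoints return the same response schema",
    )
]
instructions = ["ensure backward compatibility", "add a regression test"]
missing = unlogged_instructions(instructions, entries)
```

A present entry only shows that the agent stated a reading, not that it acted on it, so this checklist is a starting point for comparison rather than proof of compliance.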
This is most useful for [interpretation bugs](#interpretation-bugs), where the agent's reading diverged from the intended meaning. It is less useful for [spec bugs](#spec-bugs), where the text itself is wrong: the agent's logged reading would match any human's. And as the next section shows, it cannot catch faults the agent doesn't know it introduced.

## Agents skipping instructions

The fault categories above assume the agent *engaged* with its instructions (followed them, misread them, or was given bad ones). There is a worse failure mode: the agent reads a clear instruction, silently judges part of it irrelevant to the immediate task, and acts on a narrowed version without reporting the narrowing.

This has been observed in practice. An agent instructed "before running any commands or writing any code, look for an existing venv/" skipped the step when the task was a pure code edit. When asked why, it explained that it read "before writing any code" but "mentally filtered it as 'before running any commands.'" In a separate conversation, an agent told "always work in a new clean git worktree" read the instruction, noted it, and didn't act on it. When confronted, it answered that "Nothing is unclear about it" and that "I simply failed to follow it." A third case: an agent instructed "every commit must pass pytest and nox -s lint independently" reported the task complete without running either.

The pattern: the instruction was unambiguous, the agent understood it, and it silently dropped the parts it judged unnecessary. It did not flag the skip. When pressed, it could not provide reasoning: in one case admitting it "simply failed," in another describing an unconscious filter it could not have reported in advance because it didn't register the narrowing as a decision.

### Compounding effect on multi-agent chains

In a standalone task, a human eventually notices the gap. In a multi-agent chain, nobody does: Agent A's narrowed output becomes Agent B's complete input.

Consider a pipeline: the triage agent reads an issue that says "fix the nil dereference and add a regression test to prevent recurrence." The triage agent silently drops "add a regression test" from the work item, the same way the agent above dropped "or writing any code" from its instruction. The code agent implements the nil check fix. The review agent evaluates against the work item and approves. The regression test never materializes. No agent made an error relative to *its input*. The fault exists only in the gap between the triage agent's instructions and what it actually passed on, a gap invisible to every downstream agent.

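The chain above can be made concrete with a toy simulation. This is an illustrative sketch only: the three agent functions and the requirement strings are hypothetical stand-ins, not fullsend components. It shows the review passing relative to its own input while a requirement is missing relative to the original issue.

```python
# Requirements stated in the original issue (illustrative strings).
ISSUE = ["fix the nil dereference", "add a regression test"]


def triage(issue_requirements):
    """Hypothetical triage agent that silently narrows the issue."""
    return [r for r in issue_requirements if r != "add a regression test"]


def code_agent(work_item):
    """Hypothetical code agent: delivers exactly what the work item lists."""
    return set(work_item)


def review(work_item, delivered):
    """Hypothetical review agent: approves if the work item is fully covered."""
    return all(r in delivered for r in work_item)


work_item = triage(ISSUE)
delivered = code_agent(work_item)

# Judged against its own input, the review passes.
approved = review(work_item, delivered)

# Judged against the original issue, a requirement is missing.
dropped = [r for r in ISSUE if r not in delivered]
```

The only comparison that exposes the fault is the last one, which reaches back past the work item to the text the triage agent was originally given.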
This is harder to debug than any category in the taxonomy. Code bugs leave traces in test failures. Spec bugs can be found by re-reading the spec. Interpretation bugs can be surfaced by comparing agent behavior against human expectations. Silent narrowing leaves no artifact: the agent didn't misinterpret the instruction, it dropped part of it without recording the drop. Even [interpretation logging](#interpretation-logging) may not catch it, because the agent didn't register the narrowing as a decision worth logging. You can log what an agent thinks it interpreted. You cannot log a judgment it doesn't know it made.

### Detection

Because the acting agent cannot report a narrowing it didn't register, detection requires an external comparison: each agent's *output* checked against its *full instruction set*, not against its stated interpretation and not against the downstream agent's input.

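As a rough sketch of that external comparison, assume each instruction can be paired with the artifacts that would demonstrate compliance. The `Instruction` class, the `evidence` field, and the artifact names below are all hypothetical; real instructions rarely map to evidence this cleanly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    text: str
    evidence: frozenset  # hypothetical artifact names that would show compliance


def unaddressed(full_instruction_set, observed_artifacts):
    """Compare output artifacts against the full instruction set.

    Deliberately ignores the agent's stated interpretation and the
    downstream agent's input; only instructions and outputs are compared.
    """
    observed = set(observed_artifacts)
    return [i.text for i in full_instruction_set if not i.evidence <= observed]


instructions = [
    Instruction(
        "always work in a new clean git worktree",
        frozenset({"worktree-created"}),
    ),
    Instruction(
        "every commit must pass pytest and nox -s lint independently",
        frozenset({"pytest-passed", "lint-passed"}),
    ),
]
observed = ["worktree-created", "pytest-passed"]
gaps = unaddressed(instructions, observed)
```

Here the lint requirement surfaces as a gap even though the agent never mentioned skipping it, because the check never consults the agent's own account.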
In fullsend's [repo-as-coordinator](agent-architecture.md#interaction-model-the-repo-as-coordinator) model, this means reconstructing what each agent was told from its configuration (system prompt, CLAUDE.md version, instruction commit) and comparing that against what it produced (PR comments, status checks, labels, code). The [OTEL tracing infrastructure](../ADRs/0050-distributed-tracing-instrumentation.md) and [JSONL reasoning traces](../ADRs/0021-jsonl-reasoning-trace-exposure.md) provide the raw data for both sides of the comparison.

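One possible way to hold both sides of that comparison together is a small provenance record per agent run. The sketch below is an assumption, not fullsend's actual data model: the `RunProvenance` class, its field names, and the example identifiers are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RunProvenance:
    """Hypothetical record pairing what an agent was told with what it produced."""
    system_prompt_sha256: str
    claude_md_version: str    # identifier format is an assumption
    instruction_commit: str
    outputs: tuple = field(default_factory=tuple)  # PR comments, status checks, labels, code


def fingerprint(text: str) -> str:
    """Stable digest of a system prompt, so runs can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def same_instructions(a: RunProvenance, b: RunProvenance) -> bool:
    """True when two runs were configured with identical instructions."""
    return (a.system_prompt_sha256, a.claude_md_version, a.instruction_commit) == (
        b.system_prompt_sha256, b.claude_md_version, b.instruction_commit)


# Placeholder identifiers for illustration only.
run_a = RunProvenance(fingerprint("You are the triage agent."), "v1", "commit-a")
run_b = RunProvenance(fingerprint("You are the triage agent."), "v1", "commit-a",
                      outputs=("label:bug",))
run_c = RunProvenance(fingerprint("You are the review agent."), "v1", "commit-a")
```

Keeping the instruction side as fingerprints and identifiers lets a debugger first confirm that two runs were told the same thing before comparing what they produced.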
The comparison itself ("did this agent act on all of its instructions, or silently drop some?") is likely an LLM-judged task: feed a verification agent the acting agent's instruction set and its output, and ask it to identify instructions that were not addressed. This has the same trust problem identified in [testing-agents.md](testing-agents.md): an LLM evaluating another LLM's compliance may have its own blind spots. But unlike interpretation logging, it does not depend on the acting agent's self-awareness. The verification agent reads the instructions fresh, without the task context that led the acting agent to judge some parts irrelevant.

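A hedged sketch of how such a verification step might be wired up. The prompt wording, the reply format, and the `judge` callable are assumptions made for illustration; `fake_judge` is a stand-in for a real model call and no particular LLM API is implied.

```python
from typing import Callable, List


def build_verification_prompt(instructions: List[str], output: str) -> str:
    """Give the verifier only the instructions and the output, no task context."""
    numbered = "\n".join(f"{n}. {text}" for n, text in enumerate(instructions, 1))
    return (
        "You are verifying another agent's work.\n"
        "Instructions it was given:\n"
        f"{numbered}\n\n"
        "Output it produced:\n"
        f"{output}\n\n"
        "Reply with the numbers of instructions that were not addressed, "
        "comma-separated, or NONE."
    )


def find_unaddressed(instructions: List[str], output: str,
                     judge: Callable[[str], str]) -> List[str]:
    """Ask the injected judge which instructions the output did not address."""
    reply = judge(build_verification_prompt(instructions, output)).strip()
    if reply.upper() == "NONE":
        return []
    picked = [int(tok) for tok in reply.split(",") if tok.strip().isdigit()]
    return [instructions[n - 1] for n in picked if 1 <= n <= len(instructions)]


def fake_judge(prompt: str) -> str:
    """Stand-in for a model call; always flags instruction 2."""
    return "2"


instructions = ["fix the nil dereference", "add a regression test"]
result = find_unaddressed(instructions, "Added a nil check in handler.go", fake_judge)
```

Injecting the judge as a plain callable keeps the comparison testable without a model, though the verdicts of a real judge would still need their own spot checks given the trust problem noted above.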
1. debugging.md lacks options 📘 Rule violation ⌂ Architecture
The new problem doc docs/problems/debugging.md describes approaches (e.g., interpretation logging and an LLM-judged verification step) but does not explicitly present at least two distinct options with clearly labeled trade-offs sections. This violates the requirement that problem docs compare alternatives rather than implying a single prescribed path.
Agent Prompt
## Issue description
`docs/problems/debugging.md` is a new problem doc, but it does not present at least two explicit options/approaches with clearly labeled trade-offs (pros/cons/risks). The doc currently discusses techniques inline without an options comparison structure.
## Issue Context
Compliance requires each new/modified file under `docs/problems/` to describe multiple options with trade-offs, and not read as a single mandated solution.
## Fix Focus Areas
- docs/problems/debugging.md[65-96]
> but does not explicitly present at least two distinct options
Looks like the repo has 2 types of problem docs:
- solution approaches, e.g. `intent-representation.md`, `operational-observability.md`
- problem domain taxonomy & analysis, e.g. `human-factors.md`, `cross-run-memory.md`, `repo-readiness.md`, etc.
So there already exists precedent for problem statement docs. I'll skip this one unless explicitly required.
🤖 Finished Review · ✅ Success · Started 10:29 AM UTC · Completed 10:40 AM UTC
Looks good to me
Previous run · Review · Findings · Medium
Labels: Documentation-only PR adding a new problem document and cross-references.
🤖 Finished Review · ✅ Success · Started 12:19 PM UTC · Completed 12:30 PM UTC
The pattern: the instruction was unambiguous, the agent understood it, and it silently dropped the parts it judged unnecessary. It did not flag the skip. When pressed, it could not provide reasoning: in one case admitting it "simply failed," in another describing an unconscious filter it could not have reported in advance because it didn't register the narrowing as a decision.
Are LLMs actually fundamentally able to provide any reasoning behind their decisions? Isn't every mistake an LLM makes basically caused by either an issue with context or just bad luck?
Well, not really. I think the fundamental distinction here is between looking only at the outcome of an agentic loop and actually caring about the process too. Once you start caring about the process, I think there are 2 angles to look at the issue at hand:
- ignoring instructions due to a configuration conflict, i.e. somewhere in your configuration (AGENTS.md, SKILL.md, spec, MEMORY.md, etc.) there is a particular combination of instructions that conflict, but the agent won't tell you and it'll just start dropping whatever it feels is not mandatory. As long as the instruction conflict resides in your configuration, when asked, the agent is more or less able to provide the reason behind its decisions. However, the story is very different when...
- instructions conflict with model-embedded instructions. I empirically concluded that there are combinations of instructions that do not play nicely with whatever the model training instructions were, e.g. an agent defaulting to `cd <dir> && git <command>` instead of doing `git -C` even when explicitly instructed to do so. Then the agent isn't really capable of answering why it ignored a particular instruction, because it's deeply embedded into its behaviour, and that is exactly what the document section at hand is trying to point out: there are certain embedded behaviours that may conflict with your instructions if you're trying to bend the agent to follow a particular meticulous process rather than focusing solely on the outcome.
ralphbean left a comment
This is super crisp and clear. Great contribution; thank you.
Head branch was pushed to by a user without write access
Site preview. Preview: https://8ba14511-site.fullsend-ai.workers.dev Commit:
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Hello! I want to merge this, you need to fix the commits to have |
Existing problem docs cover testing agents before deployment and
collecting runtime traces, but neither addresses the methodology
for classifying faults after detection. When a multi-agent pipeline
produces the wrong outcome, the fault might be in code, in a spec,
in how an agent interpreted its instructions, or in something worse:
the agent silently dropping parts of clear instructions without
knowing it did so.
This doc introduces a fault taxonomy (code bugs, spec bugs,
interpretation bugs) and a provenance chain for walking backward
from a bad outcome. It also documents observed cases where agents
skipped unambiguous instructions and could not explain why: a
failure mode that leaves no artifact and compounds silently across
agent handoffs.
Cross-references to this new doc are also added to related docs:
- intent-representation.md
- operational-observability.md
Assisted-By: Claude Opus 4.6
Signed-off-by: Erik Skultety <eskultet@redhat.com>
Link the debugging problem doc from the index.
Signed-off-by: Erik Skultety <eskultet@redhat.com>
🤖 Finished Retro · ✅ Success · Started 8:39 AM UTC · Completed 8:48 AM UTC
Retro: PR #2549 (Debugging problem doc)
Workflow went well. The review agent caught a real issue (missing README link per AGENTS.md rules) on its first pass, the author fixed it, and the second pass correctly approved. Human reviewers provided substantive feedback and the PR merged cleanly.
No new proposals. All identified improvement opportunities are covered by existing open issues (#2105, #1046, #1480, #2616).