docs(problems): Add debugging problem doc for agentic systems by eskultety · Pull Request #2549 · fullsend-ai/fullsend

eskultety · 2026-06-23T10:26:02Z

Existing problem docs cover testing agents before deployment and collecting runtime traces, but neither addresses the methodology for classifying faults after detection. When a multi-agent pipeline produces the wrong outcome, the fault might be in code, in a spec, in how an agent interpreted its instructions, or in something worse: the agent silently dropping parts of clear instructions without knowing it did so.

This doc introduces a fault taxonomy (code bugs, spec bugs, interpretation bugs), a provenance chain for walking backward from a bad outcome, and documents observed cases where agents skipped unambiguous instructions and could not explain why — a failure mode that leaves no artifact and compounds silently across agent handoffs.

Cross-references to this new doc also added to other related docs:
- intent-representation.md
- operational-observability.md

github-actions · 2026-06-23T10:26:14Z

E2E tests did not run

E2E tests run automatically for org/repo members and collaborators on pull requests.

For other contributors, a maintainer must add the ok-to-test label after the latest push.

See E2E testing guide for details.

qodo-code-review · 2026-06-23T10:27:10Z

PR Summary by Qodo

Add debugging taxonomy & provenance guidance for agentic workflows
📝 Documentation 🕐 20-40 Minutes

Description

• Add a new debugging methodology doc for classifying failures in agentic pipelines
• Define fault taxonomy and a provenance chain to localize where the failure originated
• Cross-link debugging guidance from intent representation and operational observability docs

Diagram

graph TD
IR["Intent Representation doc"] --> D["Debugging doc"] --> OO["Operational Observability doc"] --> TA["Testing Agents doc"]
D --> ADR21["JSONL traces ADR"] --> ADR50["OTEL tracing ADR"]
D --> AA["Agent Architecture doc"]

High-Level Assessment

The following are alternative approaches to this PR:

1. Fold debugging methodology into Operational Observability

➕ Keeps 'collect data' and 'use data' guidance in one place
➕ Reduces navigation overhead for new readers
➖ Blurs infrastructure (telemetry) vs. diagnosis methodology
➖ Would significantly bloat an already broad observability doc

2. Create a shorter 'Debugging checklist' and keep taxonomy minimal

➕ Faster to operationalize during incidents
➕ Lower maintenance burden than a longer narrative doc
➖ Loses important nuance (spec vs. interpretation vs. silent narrowing)
➖ Harder to justify/teach the categorization without explanations/examples

3. Treat 'silent narrowing' as an Observability/Tracing requirement (ADR-first)

➕ Directly ties the failure mode to concrete system requirements and instrumentation
➕ Encourages implementable acceptance criteria
➖ Prematurely constrains solutions; the PR’s goal is conceptual taxonomy
➖ Moves problem framing into ADRs, which are less discoverable than problem docs

Recommendation: Keep this as a dedicated problem doc (current approach). It cleanly separates methodology (fault classification and provenance reasoning) from observability infrastructure, while still cross-linking the relevant data sources and related problem areas. If adoption reveals frequent usage during incidents, consider a follow-up that extracts a one-page checklist summary without removing the deeper taxonomy.

Files changed (3) +120 / -0

Documentation (3) +120 / -0

debugging.md: Add debugging taxonomy and provenance-based fault localization guide +111/-0

• Introduces a new problem doc that defines a fault taxonomy (code/spec/interpretation) and a provenance chain for localizing failures in multi-agent workflows. Adds guidance on interpretation logging and documents the high-risk failure mode where agents silently skip unambiguous instructions, including detection ideas via instruction/output comparison.

docs/problems/debugging.md

intent-representation.md: Document the 'wrong-spec problem' and link to debugging taxonomy +8/-0

• Adds a new section describing how formally valid specs can be substantively wrong and why agents are less likely to catch this via intuition/institutional context. Links readers to the new debugging doc for broader fault taxonomy and compounding failure modes.

docs/problems/intent-representation.md

operational-observability.md: Cross-reference debugging methodology from observability relationships section +1/-0

• Adds a relationship entry pointing to the new debugging doc, clarifying that observability provides the data while debugging provides the classification methodology. Highlights the 'agents skipping instructions' case as one where trace inspection alone may be insufficient.

docs/problems/operational-observability.md

qodo-code-review · 2026-06-23T10:28:36Z

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (1) 📜 Skill insights (0)

Context used

✅ Compliance rules (platform): 51 rules

✅ Skills: writing-user-docs, writing-adrs

1. debugging.md lacks options 📘 Rule violation ⌂ Architecture

Description

The new problem doc docs/problems/debugging.md describes approaches (e.g., interpretation logging
and an LLM-judged verification step) but does not explicitly present at least two distinct options
with clearly labeled trade-offs sections. This violates the requirement that problem docs compare
alternatives rather than implying a single prescribed path.

Code

docs/problems/debugging.md[R65-96]

+## Interpretation logging
+
+[JSONL reasoning traces](../ADRs/0021-jsonl-reasoning-trace-exposure.md) capture the raw record of what an agent did — every prompt, completion, and tool call. But finding *why* an instruction produced wrong behavior in a transcript is forensic work: the agent's interpretation is implicit in its actions, scattered across the trace, and mixed with task-specific reasoning.
+
+Interpretation logging adds a structured layer: the agent states its reading of key instructions as an explicit checklist *before* acting — separate from the free-form reasoning in the JSONL. E.g.: "I interpret 'ensure backward compatibility' to mean: public API endpoints return the same response schema." This produces an artifact that can be compared against the instruction set directly, without reading the full transcript. The trade-off is cost — it adds tokens to every run and may be worth enabling selectively (high-stakes roles, or probationary periods after instruction changes).
+
+This is most useful for [interpretation bugs](#interpretation-bugs), where the agent's reading diverged from the intended meaning. It is less useful for [spec bugs](#spec-bugs), where the text itself is wrong — the agent's logged reading would match any human's. And as the next section shows, it cannot catch faults the agent doesn't know it introduced.
+
+## Agents skipping instructions
+
+The fault categories above assume the agent *engaged* with its instructions — followed them, misread them, or was given bad ones. There is a worse failure mode: the agent reads a clear instruction, silently judges part of it irrelevant to the immediate task, and acts on a narrowed version without reporting the narrowing.
+
+This has been observed in practice. An agent instructed "before running any commands or writing any code, look for an existing venv/" skipped the step when the task was a pure code edit. When asked why, it explained that it read "before writing any code" but "mentally filtered it as 'before running any commands.'" In a separate conversation, an agent told "always work in a new clean git worktree" read the instruction, noted it, and didn't act on it. When confronted: "Nothing is unclear about it — I simply failed to follow it." A third case: an agent instructed "every commit must pass pytest and nox -s lint independently" reported the task complete without running either.
+
+The pattern: the instruction was unambiguous, the agent understood it, and it silently dropped the parts it judged unnecessary. It did not flag the skip. When pressed, it could not provide reasoning — in one case admitting it "simply failed," in another describing an unconscious filter it could not have reported in advance because it didn't register the narrowing as a decision.
+
+### Compounding effect on multi-agent chains
+
+In a standalone task, a human eventually notices the gap. In a multi-agent chain, nobody does — Agent A's narrowed output becomes Agent B's complete input.
+
+Consider a pipeline: the triage agent reads an issue that says "fix the nil dereference and add a regression test to prevent recurrence." The triage agent silently drops "add a regression test" from the work item — the same way the agent above dropped "or writing any code" from its instruction. The code agent implements the nil check fix. The review agent evaluates against the work item and approves. The regression test never materializes. No agent made an error relative to *its input*. The fault exists only in the gap between the triage agent's instructions and what it actually passed on — a gap invisible to every downstream agent.
+
+This is harder to debug than any category in the taxonomy. Code bugs leave traces in test failures. Spec bugs can be found by re-reading the spec. Interpretation bugs can be surfaced by comparing agent behavior against human expectations. Silent narrowing leaves no artifact — the agent didn't misinterpret the instruction, it dropped part of it without recording the drop. Even [interpretation logging](#interpretation-logging) may not catch it, because the agent didn't register the narrowing as a decision worth logging. You can log what an agent thinks it interpreted. You cannot log a judgment it doesn't know it made.
+
+### Detection
+
+Because the acting agent cannot report a narrowing it didn't register, detection requires an external comparison: each agent's *output* checked against its *full instruction set* — not against its stated interpretation, and not against the downstream agent's input.
+
+In fullsend's [repo-as-coordinator](agent-architecture.md#interaction-model-the-repo-as-coordinator) model, this means reconstructing what each agent was told from its configuration (system prompt, CLAUDE.md version, instruction commit) and comparing that against what it produced (PR comments, status checks, labels, code). The [OTEL tracing infrastructure](../ADRs/0050-distributed-tracing-instrumentation.md) and [JSONL reasoning traces](../ADRs/0021-jsonl-reasoning-trace-exposure.md) provide the raw data for both sides of the comparison.
+
+The comparison itself — "did this agent act on all of its instructions, or silently drop some?" — is likely an LLM-judged task: feed a verification agent the acting agent's instruction set and its output, ask it to identify instructions that were not addressed. This has the same trust problem identified in [testing-agents.md](testing-agents.md) — an LLM evaluating another LLM's compliance may have its own blind spots. But unlike interpretation logging, it does not depend on the acting agent's self-awareness. The verification agent reads the instructions fresh, without the task context that led the acting agent to judge some parts irrelevant.
+
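The external comparison described in the Detection section can be sketched as simple bookkeeping around whatever judge (human or LLM) decides what was addressed. This is a minimal sketch, not the doc's method: it assumes instructions have already been split into atomic requirements with IDs, and that some verifier has produced the list of IDs the agent's output covers. The `Requirement` class, the `R1`/`R2` IDs, and `find_dropped` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """One atomic requirement extracted from an instruction (hypothetical schema)."""
    req_id: str
    text: str


def find_dropped(requirements, addressed_ids):
    """Return requirements whose IDs never appear in the agent's output record."""
    seen = set(addressed_ids)
    return [r for r in requirements if r.req_id not in seen]


# The triage example from the doc: the issue asks for two things.
issue_requirements = [
    Requirement("R1", "fix the nil dereference"),
    Requirement("R2", "add a regression test to prevent recurrence"),
]

# Hypothetical work item emitted by the triage agent: it only carries R1.
work_item_addressed = ["R1"]

dropped = find_dropped(issue_requirements, work_item_addressed)
for r in dropped:
    print(f"dropped: {r.req_id} {r.text}")
```

The point of the design is that the check runs against the agent's full instruction set rather than against the narrowed work item, so the missing regression test surfaces at the triage handoff instead of disappearing downstream.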

Relevance

⭐⭐ Medium
No clear prior enforcement of “2 options + trade-offs”; reviewers sometimes push problem docs away
from prescriptive designs (PR #770).
PR-#770
PR-#5

ⓘ Recommendations generated based on similar findings in past PRs

Evidence

PR Compliance ID 1062035 requires problem docs to include at least two distinct options with
explicit trade-offs. In docs/problems/debugging.md, the text discusses interpretation logging and
an LLM-judged verification approach, but does not structure them as multiple options with clearly
labeled trade-offs sections (e.g., Pros/Cons per option).

Rule 1062035: Problem docs must present multiple options with trade-offs, not a single prescribed solution
docs/problems/debugging.md[65-96]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`docs/problems/debugging.md` is a new problem doc, but it does not present at least two explicit options/approaches with clearly labeled trade-offs (pros/cons/risks). The doc currently discusses techniques inline without an options comparison structure.

## Issue Context
Compliance requires each new/modified file under `docs/problems/` to describe multiple options with trade-offs, and not read as a single mandated solution.

## Fix Focus Areas
- docs/problems/debugging.md[65-96]

qodo-code-review · 2026-06-23T10:28:37Z


> but does not explicitly present at least two distinct options

Looks like the repo has 2 types of problem docs:

- solution approaches, e.g. intent-representation.md, operational-observability.md
- problem domain taxonomy & analysis, e.g. human-factors.md, cross-run-memory.md, repo-readiness.md etc.

So there already exists precedent for problem statement docs. I'll skip this one unless explicitly required.

fullsend-ai-review · 2026-06-23T10:29:35Z

🤖 Finished Review · ✅ Success · Started 10:29 AM UTC · Completed 10:40 AM UTC
Commit: 18454fb · View workflow run →

fullsend-ai-review · 2026-06-23T10:40:07Z

Looks good to me

Previous run

Review

Findings

Medium

[incomplete-implementation] README.md — New problem document docs/problems/debugging.md added but not linked from README.md. AGENTS.md explicitly states: "When adding new problem areas, create a new file in docs/problems/ and link it from README.md." The PR adds the new problem doc with proper cross-references in intent-representation.md and operational-observability.md, but omits the corresponding entry in the README's problem document index.
Remediation: Add a bullet for debugging.md to the docs/problems/ list in README.md, following the pattern of other problem documents, e.g.: - [Debugging Agentic Workflows](docs/problems/debugging.md) — Fault taxonomy and classification for multi-agent systems

Labels: Documentation-only PR adding a new problem document and cross-references.
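The AGENTS.md rule cited in this finding (every file in docs/problems/ must be linked from README.md) is mechanical enough to check with a script. The following is a rough sketch under assumptions: it treats a doc as linked only when the README contains a markdown link whose target is exactly `docs/problems/<name>`, so links with anchors or relative prefixes would need extra handling. The function names and the sample README link titles are hypothetical.

```python
import re
from pathlib import Path


def unlinked_problem_docs(readme_text, problem_doc_names, prefix="docs/problems/"):
    """Return problem doc filenames with no markdown link target in the README."""
    # Collect link targets of the form [text](target).
    targets = set(re.findall(r"\]\(([^)\s]+)\)", readme_text))
    return sorted(
        name for name in problem_doc_names
        if f"{prefix}{name}" not in targets
    )


def check_repo(root):
    """Run the check against a checkout rooted at `root` (layout assumed)."""
    root = Path(root)
    readme = (root / "README.md").read_text(encoding="utf-8")
    docs = [p.name for p in (root / "docs" / "problems").glob("*.md")]
    return unlinked_problem_docs(readme, docs)


# Illustrative README excerpt; the link titles are made up for this example.
readme_sample = """
- [Intent Representation](docs/problems/intent-representation.md)
- [Operational Observability](docs/problems/operational-observability.md)
"""
missing = unlinked_problem_docs(
    readme_sample,
    ["debugging.md", "intent-representation.md", "operational-observability.md"],
)
print(missing)
```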

fullsend-ai-review · 2026-06-23T12:19:49Z

🤖 Finished Review · ✅ Success · Started 12:19 PM UTC · Completed 12:30 PM UTC
Commit: c94cf83 · View workflow run →

tnevrlka · 2026-06-24T15:33:11Z

+
+This has been observed in practice. An agent instructed "before running any commands or writing any code, look for an existing venv/" skipped the step when the task was a pure code edit. When asked why, it explained that it read "before writing any code" but "mentally filtered it as 'before running any commands.'" In a separate conversation, an agent told "always work in a new clean git worktree" read the instruction, noted it, and didn't act on it. When confronted: "Nothing is unclear about it — I simply failed to follow it." A third case: an agent instructed "every commit must pass pytest and nox -s lint independently" reported the task complete without running either.
+
+The pattern: the instruction was unambiguous, the agent understood it, and it silently dropped the parts it judged unnecessary. It did not flag the skip. When pressed, it could not provide reasoning — in one case admitting it "simply failed," in another describing an unconscious filter it could not have reported in advance because it didn't register the narrowing as a decision.


Are LLMs actually fundamentally able to provide any reasoning behind their decisions? Isn't every issue an LLM has basically caused by either an issue with context or just bad luck?

Well, not really. I think the fundamental difference is whether you only look at the outcome of an agentic loop or actually care about the process too. Once you start caring about the process, I think there are 2 angles to look at the issue at hand:

1. Ignoring instructions due to a configuration conflict, i.e. somewhere in your configuration (AGENTS.md, SKILL.md, spec, MEMORY.md, etc.) there is a particular combination of instructions that conflict, but the agent won't tell you and will just start dropping whatever it feels is not mandatory. As long as the instruction conflict resides in your configuration, the agent is more or less able to provide the reason behind its decisions when asked. However, the story is very different when...

2. Instructions conflict with model-embedded instructions. I empirically concluded that there are combinations of instructions that do not play nicely with whatever the model's training instructions were, e.g. an agent defaulting to cd <dir> && git <command> instead of git -C even when explicitly instructed to use the latter. Then the agent isn't really capable of answering why it ignored a particular instruction, because the behaviour is deeply embedded, and that is exactly what the document section at hand is trying to point out: there are certain embedded behaviours that may conflict with your instructions if you're trying to bend the agent to follow a particular meticulous process rather than focusing solely on the outcome.
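Process-level instructions like the `git -C` one are a case where compliance can be checked from the trace without asking the agent anything. A minimal sketch, assuming the shell commands an agent ran can be extracted from its tool-call trace as plain strings; the sample commands and the function name are hypothetical, and the regex does not handle quoted paths containing spaces.

```python
import re

# Matches a `cd <dir> && git ...` invocation, the pattern the agent fell back to.
CD_THEN_GIT = re.compile(r"\bcd\s+\S+\s*&&\s*git\b")


def commands_violating_git_c(commands):
    """Return shell commands that use `cd <dir> && git` instead of `git -C <dir>`."""
    return [cmd for cmd in commands if CD_THEN_GIT.search(cmd)]


# Hypothetical commands pulled from an agent's tool-call trace.
trace_commands = [
    "git -C services/api status",
    "cd services/api && git log --oneline -5",
    "pytest -q",
]
violations = commands_violating_git_c(trace_commands)
print(violations)
```

A check like this only covers instructions that leave a recognizable footprint in commands; it says nothing about why the agent deviated.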

ralphbean

This is super crisp and clear. Great contribution; thank you.

github-actions · 2026-06-26T14:32:45Z

Site preview

Preview: https://8ba14511-site.fullsend-ai.workers.dev

Commit: f4f397308171f7b93698200824bc9cfa34c18ea1

codecov · 2026-06-26T14:34:22Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

rh-hemartin · 2026-07-01T08:12:08Z

Hello! I want to merge this, you need to fix the commits to have docs(problems): as a prefix in order to appease the commit gods.
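The prefix requirement above can be checked before pushing. This is a sketch only: it assumes the repository expects Conventional Commits subjects with a mandatory scope, in the form `type(scope): subject`, which is inferred from this comment and not from a documented rule. The `has_scoped_prefix` helper is a hypothetical name.

```python
import re

# Conventional-commit subject with a mandatory scope: type(scope): subject
SCOPED_PREFIX = re.compile(r"^(?P<type>[a-z]+)\((?P<scope>[a-z0-9_-]+)\)!?: \S")


def has_scoped_prefix(subject, expected_type=None, expected_scope=None):
    """Return True if `subject` starts with type(scope): and matches the expectations."""
    match = SCOPED_PREFIX.match(subject)
    if match is None:
        return False
    if expected_type is not None and match["type"] != expected_type:
        return False
    if expected_scope is not None and match["scope"] != expected_scope:
        return False
    return True


subjects = [
    "docs: Add debugging problem doc for agentic systems",
    "docs(problems): Add debugging problem doc for agentic systems",
    "docs(readme): link debugging.md",
]
for subject in subjects:
    print(has_scoped_prefix(subject, expected_type="docs"), subject)
```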

Existing problem docs cover testing agents before deployment and collecting runtime traces, but neither addresses the methodology for classifying faults after detection. When a multi-agent pipeline produces the wrong outcome, the fault might be in code, in a spec, in how an agent interpreted its instructions, or in something worse: the agent silently dropping parts of clear instructions without knowing it did so.

This doc introduces a fault taxonomy (code bugs, spec bugs, interpretation bugs), a provenance chain for walking backward from a bad outcome, and documents observed cases where agents skipped unambiguous instructions and could not explain why, a failure mode that leaves no artifact and compounds silently across agent handoffs.

Cross-references to this new doc also added to other related docs:
- intent-representation.md
- operational-observability.md

Assisted-By: Claude Opus 4.6
Signed-off-by: Erik Skultety <eskultet@redhat.com>

Link the debugging problem doc from the index.
Signed-off-by: Erik Skultety <eskultet@redhat.com>

fullsend-ai-retro · 2026-07-01T08:39:10Z

🤖 Finished Retro · ✅ Success · Started 8:39 AM UTC · Completed 8:48 AM UTC
Commit: f4f3973 · View workflow run →

fullsend-ai-retro · 2026-07-01T08:47:59Z

Retro: PR #2549 — Debugging problem doc

Workflow went well. The review agent caught a real issue (missing README link per AGENTS.md rules) on its first pass, the author fixed it, and the second pass correctly approved. Human reviewers provided substantive feedback and the PR merged cleanly.

Timeline

Jun 22 — eskultety opens PR adding docs/problems/debugging.md and cross-references
Jun 23 10:29 — Review agent (run 28019586579) flags medium finding: debugging.md not linked from README.md
Jun 23 ~12:00 — Author pushes fix adding README link
Jun 23 12:19 — Review agent re-runs (28025505702), approves with "Looks good to me"
Jun 24 — Human reviewer ralphbean approves; tnevrlka asks substantive question about LLM reasoning
Jul 1 08:12 — rh-hemartin requests commit prefix fix to docs(problems):
Jul 1 08:23 — Author force-pushes with corrected prefixes (3 synchronize events, none routed to review — correct behavior since code unchanged)
Jul 1 08:35 — PR merged

Observations

Review quality was good: the AGENTS.md-based README link check was a legitimate catch
Commit prefix gap: the review agent did not flag the non-conforming commit prefix, but this is already tracked in "Review agent should escalate conventional commit prefix mismatches to changes_requested" (#2105)
Empty approval body: the APPROVED review had no body text, already tracked in "Review agent APPROVED GitHub review should include a brief body, not be empty" (#1046)
Shallow docs content review: the review agent checked structural rules but didn't engage with document quality; already tracked in "Review agent provides shallow feedback on documentation PRs" (#1480) and "Review agent should assess content quality and redundancy in agent-facing documents" (#2616)

No new proposals. All identified improvement opportunities are covered by existing open issues (#2105, #1046, #1480, #2616).

qodo-code-review Bot reviewed Jun 23, 2026

fullsend-ai-review Bot added labels requires-manual-review (Review requires human judgment) and component/docs (User-facing documentation) Jun 23, 2026

eskultety force-pushed the debugging-doc branch from 18454fb to c94cf83 Compare June 23, 2026 12:16

fullsend-ai-review Bot approved these changes Jun 23, 2026

fullsend-ai-review Bot added label ready-for-merge (All reviewers approved, ready to merge) and removed label requires-manual-review (Review requires human judgment) Jun 23, 2026

tnevrlka reviewed Jun 24, 2026

ralphbean approved these changes Jun 24, 2026

ralphbean enabled auto-merge June 24, 2026 21:09

auto-merge was automatically disabled June 25, 2026 07:19
Head branch was pushed to by a user without write access

eskultety force-pushed the debugging-doc branch from c94cf83 to 9776971 Compare June 25, 2026 07:19

github-actions Bot deployed to site-preview June 26, 2026 14:32 View deployment

rh-hemartin changed the title ~~docs: Add debugging problem doc for agentic systems~~ docs(problems): Add debugging problem doc for agentic systems Jul 1, 2026

eskultety force-pushed the debugging-doc branch from 9776971 to 42e4dc4 Compare July 1, 2026 08:23

eskultety requested a review from a team as a code owner July 1, 2026 08:23

eskultety force-pushed the debugging-doc branch from 42e4dc4 to c7763e5 Compare July 1, 2026 08:23

eskultety added 2 commits July 1, 2026 10:25

docs(readme): link debugging.md

f4f3973

Link the debugging problem doc from the index. Signed-off-by: Erik Skultety <eskultet@redhat.com>

eskultety force-pushed the debugging-doc branch from c7763e5 to f4f3973 Compare July 1, 2026 08:25

rh-hemartin enabled auto-merge July 1, 2026 08:27

github-actions Bot deployed to site-preview July 1, 2026 08:29 View deployment

rh-hemartin added this pull request to the merge queue Jul 1, 2026

Merged via the queue into fullsend-ai:main with commit 917903b Jul 1, 2026
14 checks passed

