Spec 0034: failed-delivery observability for confirmation-based producers (#4179)#4180
Conversation
Add Spec 0034 capturing requirements for surfacing failed message delivery from confirmation-based producers (Kafka/RMQ): Warning log, standalone OTel span linked to the original publish span, and circuit breaker trip on confirmation failure — addressing issue #4179. Requirements approved after three adversarial review rounds (7 -> 4 -> 2 findings). Breaker/log topic uses the per-message wire topic (message.Header.Topic) for exact parity with the non-confirmation path; the trace-context/wire-topic plumbing mechanism (extend OnMessagePublished contract vs correlation map vs header propagation) is deferred to ADR 0063. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review — PR #4180 (Spec 0034: failed-delivery observability)Scope check: This PR is specification artifacts only — no production code ( ✅ Strengths
|
| Finding | Status in requirements.md |
|---|---|
| 1. C-6 contradicts C-8 wire-topic decision (82) | ✅ C-6 rewritten ("captured producer yields only the static Publication.Topic…") |
| 2. NFR-4 orphan (no AC) (66) | ✅ AC-15 added |
3. C-8 mis-attributes RoutingKey? to :44 (55) |
✅ reworded to "interface overload takes RoutingKey" |
4. FR-10 overstates what Send param delivers (50) |
✅ note added (param only makes it available inside the producer) |
| 5. AC-10 "background threads" (30) | ✅ reworded to "gateway-dependent threads" |
So the spec is actually in better shape than the review doc it ships with claims. Anyone reading review-requirements.md in isolation sees "NEEDS WORK / 2 unresolved findings" against an approved spec. Recommendation: either commit the (presumably clean) round-4 review that justified approval, or add a one-line header to review-requirements.md noting "all findings below resolved in requirements.md; superseded — see .requirements-approved." Right now the approval evidence and the newest committed review disagree.
🔸 Minor
- Absolute line numbers will drift. The doc embeds many
file.cs:NNNreferences (OutboxProducerMediator.cs:998/:1168/:741,KafkaMessagePublisher.cs:52-62, etc.). They're correct today, but the moment implementation lands they'll be off. Consider anchoring to method names (which the doc mostly already does) and treating line numbers as illustrative. - For the ADR phase (not a defect here): the async callback is registered as
OnMessagePublished += async delegate(...)against anAction<bool,string>— i.e. async void. That makes NFR-4 / AC-15 (error isolation) genuinely load-bearing: an unobserved exception on that path can tear down the thread-pool thread (Kafka invokes the callback viaTask.Run). Worth calling out explicitly in the design so the try/catch boundary isn't treated as optional.
Verdict
As a requirements spec this is thorough, accurate, and well-grounded in the actual code — clearly the product of the multi-round adversarial process the repo mandates. The only thing I'd fix before merge is the stale review-requirements.md vs. approved-state mismatch so the artifact set is self-consistent. Everything substantive is ready to carry into ADR 0063, where the C-7 mechanism (a/b/c) is the real open decision.
🤖 Generated with Claude Code
ADR 0063 designs observability + circuit-breaker behaviour for failed publishes reported asynchronously via OnMessagePublished(success:false). Decision (mechanism (a), extend the contract — single ADR): - Enrich the ISupportPublishConfirmation.OnMessagePublished contract to Action<PublishConfirmationResult> carrying success, message id, wire topic (message.Header.Topic) and the original publish ActivityContext. - Producer self-sources the publish span from Activity.Current (set to S1 by CreateProducerSpan, AsyncLocal so race-free), captured synchronously inside Send/SendAsync — NO Send-signature change, so the ~10 non-confirmation gateways are untouched (meets FR-10's intent with a strictly smaller blast radius). - Mediator false-branch: Warning log (FR-1), standalone OTel span linked to S1 (FR-2), TripTopic(wire topic) (FR-3), no mark-dispatched / no bubble (FR-4), error-isolated (NFR-4). Kafka FR-6/FR-8; RMQ verify+enrich. Reviewed adversarially over three rounds (Critical -> High -> PASS); findings in specs/0034-failed-delivery-context/review-design.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Review — Spec 0034: failed-delivery observability (PR 4180) This is a spec/documentation-only PR (no production code), so I reviewed it for internal consistency, requirements/design alignment, and the soundness of the design's load-bearing assumptions. Overall a strong, careful spec — the ADR in particular is excellent: it uses an RDD information-holder (
A few things to reconcile before moving to 1. PR description undersells the actual scope (please update) The body says Specification artifacts only and the What this PR contains list stops at 2. Requirements / design drift: FR-10 / AC-14 / AC-2 (resolve before tasks) The ADR's Decision deliberately abandons FR-10's
When 3. Approval marker vs the committed review file disagree
4. Design risk for the implementation phase (not a spec blocker) The mechanism's correctness rests on a narrow, implicit, ordering- and config-dependent invariant:
Minor
None of the above blocks the spec direction — items 2 and 3 are the ones I would reconcile before Generated with Claude Code |
Per review feedback, store a named internal PendingConfirmation record (MessageId, Topic, Context) keyed by delivery tag rather than widening Dictionary<ulong, string> into a loose tuple. Keeps the delivery-tag map focused — it is the only bridge between RabbitMQ's delivery-tag-based, async, batched-confirm protocol and Brighter's message identity + telemetry context. The record is internal (no public-API obligation). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review — Spec 0034 / ADR 0063 (failed-delivery observability)This PR is specification + ADR only (no production code): a new requirements doc, ADR 0063, two adversarial-review records, and tracking dotfiles. So I reviewed it as a design artifact — primarily by verifying the ADR's codebase claims against the live source, since the whole design rests on a handful of them. ✅ Codebase claims verified accurateI spot-checked ~12 of the specific line/behaviour references the ADR depends on; every one matches
The accuracy and FR→AC→C traceability here are excellent — this is a high-quality spec. Design soundnessThe central decision — source the publish span from Issues / suggestions (for the design→tasks transition)
None of these block the design phase; (1) is the only one I'd fix before calling the artifacts final. Nice work overall. 🤖 Generated with Claude Code |
ADR 0063 passes a second clean adversarial round after the PendingConfirmation record refinement. Folded the two cosmetic nits: RMQ removal touch-point now cites RemovePendingConfirmations/RemoveConfirmationsLocked (:419/:447), and the resilience-extension path includes the Extensions/ segment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…req review Per PR #4180 wider-review feedback: - FR-10/AC-14 superseded: ADR 0063 adopts C-7 mechanism (a) refined to source the publish span from Activity.Current (no Send-signature parameter). Banners added so /spec:tasks does not derive a RequestContext?-parameter test from a contract that was deliberately not built. AC-2/AC-2b/FR-2 example reworded to the Activity.Current capture (observable assertions unchanged). - review-requirements.md: added a RESOLVED/SUPERSEDED banner — the committed file was the round-3 NEEDS WORK record, contradicting .requirements-approved; all its findings were folded into requirements.md before approval. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review — Spec 0034 / ADR 0063 (design-phase, docs-only)Reviewed the full diff and independently verified the load-bearing codebase facts the design rests on. This is a specification + ADR PR with no production code, which is the correct posture for this point in the ✅ Facts I re-verified (all accurate)
|
…ducer PO-authorised scope expansion: make InMemoryMessageProducer a third ISupportPublishConfirmation implementer so the full failed-delivery flow is testable in-process with a production provider rather than a test-only fake (consistent with Brighter treating in-memory providers as real providers). - Opt-in async-confirm switch (default off = today's synchronous (true,id)); on = Channel + worker pump raising the enriched callback off-thread (exercises the cross-thread NFR-3 path like Kafka's Task.Run). - Failure-injection hook drives the false path end-to-end in-process (FR-1/2/3/5, AC-5/10/15). - Same Activity.Current capture-before-enqueue invariant; worker lifecycle drained on dispose; NFR-3/NFR-4 apply to the pump. - Orphaned Action<bool,Id> event (no external subscribers) becomes the enriched Action<PublishConfirmationResult>. - PublishConfirmationResult.MessageId retyped string -> Id (Id.Empty for FR-5; implicit string<->Id keeps Kafka/RMQ passing strings). - OOS-3/NFR-5 reconciled: the InMemory switch toggles provider confirm timing/outcome for dev+tests, not the always-on observability behaviour. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review — Spec 0034 / ADR 0063 (failed-delivery observability)🤖 Automated review by Claude. This is a spec + design-only PR (requirements + ADR 0063, no production code), so I focused on (a) whether the ADR's load-bearing code claims are actually true, and (b) the soundness of the design and its fit with the project's guardrails. Verdict: strong, unusually well-grounded design. A few process/scope items worth resolving before
|
Tidy requirements & ADR 0063 for a fresh reader, and adopt symmetric success/failure confirmation-span tracing (PO decision 2026-06-11). Clarity pass: - Remove FR-10 (the optional RequestContext?-on-Send mechanism) and all mechanism prose from C-7/C-8/ACs in requirements; the spec now states the need (publish context reachable at callback), not the how. - ADR: delete the "Relationship to FR-10" self-debate; move the rejected Send-parameter approach into Alternatives Considered alongside the correlation-map and header-propagation options. Symmetric decision: - FR-2 now emits the standalone linked confirmation span on BOTH success and failure branches. On success it parents the MarkDispatched DB span, fixing today's orphaned/null-parent Outbox span (C-6). Warning log and breaker trip stay failure-only; success delivery semantics frozen. - OOS-4 narrowed to delivery semantics; AC-13 rewritten; NFR-2/AC-12 generalized; ground-truth item documents the orphan defect. - ADR Decision, architecture diagram, pseudocode, components, consequences, and critical files updated to match. Refresh PROMPT.md resume state (resolved decisions, R5 still owed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
InMemoryMessageProducer's channel-based drain does not compile on netstandard2.0, which lacks System.Threading.Channels in-box. Add the backfill package reference (already in central package management) scoped to the netstandard2.0 target. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code Review — PR #4180: Failed-delivery observability for confirmation-based producersReviewed the production changes (~511 LOC across 11 Below are observations — mostly minor — that may be worth a look before merge. Strengths
Observations / possible issues
Test coverageCoverage is thorough and the test names follow the Nice work overall; the items above are refinements rather than blockers. 🤖 Generated with Claude Code |
Code Review — Spec 0034 (failed-delivery observability)Reviewed the production changes across the mediator, tracer, Kafka/RMQ/InMemory producers, and the circuit breaker, plus the test suite. Overall this is high-quality, carefully-scoped work: the blast radius is genuinely small (only the two A few observations, mostly minor: 1. Confirmation span is
|
…Ok status Both confirmation callbacks closed the span with raw Activity.Dispose(), which stamps the end from the wall clock (not the tracer's TimeProvider) and never sets Ok on success spans. Use _tracer.EndSpan() like every other span in the mediator. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er concurrency CoolDown did a non-atomic read-modify-write (_trippedTopics[key] -= 1) plus a second indexer read for the eviction check. A concurrent TripTopic could be clobbered by the stale write, and a concurrent CoolDown removing the key first made the second read throw KeyNotFoundException. Use AddOrUpdate for an atomic decrement and a value-conditional remove keyed on the computed value, so a fresh trip is never lost and the read-after- remove fault is gone. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The two PublishResults raises used a fire-and-forget Task.Run whose task was discarded, so a faulting subscriber became an unobserved Task exception. Route both through a RaisePublishConfirmation helper that wraps the invoke in try/catch and logs at Warning, keeping the fault contained on the broker thread. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code Review — Spec 0034: failed-delivery observability for confirmation-based producersReviewed the production changes across the mediator, tracer, circuit breaker, and the Kafka / RMQ / InMemory producers, plus the supporting tests. This is a careful, well-documented PR — the inline comments explaining why (threading constraints, NFR references, parity with the non-confirmation path) are genuinely excellent and made the review tractable. Overall it looks solid; none of the points below are blocking. Strengths
Points worth a look (minor / non-blocking)
Verification noteI reviewed statically and reasoned through the threading/concurrency paths; I did not run the full build or transport test suites here (Kafka/RMQ integration tests need broker containers). Recommend confirming the Core + InMemory unit suites and the new TestCorrelator-based Kafka log assertions are green in CI. Nice work on a tricky, observability-sensitive change. 🤖 Generated with Claude Code |
…ename async cast Document in CreateConfirmationSpan that the forced-root span self-clears Activity.Current to its null parent on EndSpan (so it cannot dangle at the disposed span) and that the threading model prevents any leak into reused-thread work. Rename the async callback's ISupportPublishConfirmation cast from producerSync to confirmingProducer to match the async producer. No behavioural change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Code Review — Spec 0034: failed-delivery observability (#4180) Reviewed the production changes across the core library, Kafka, RMQ (sync + async), and the in-memory producer, plus a representative sample of the test suite. This is a high-quality PR: the full requirements to ADR to adversarial-review to TDD trail is present, the concurrency/tracing invariants are documented inline at the point they matter, and test coverage is genuinely strong (distinct wire-vs-publication-topic assertions, broker-backed Kafka/RMQ tests, and concurrency races driven by A few observations, none of them blockers: 1. Kafka: persisted-with-missing-id now degrades to 2. 3. 4. Things verified correct:
Nice work — the design rationale embedded in the comments made this a pleasant review. The points above are refinements, not gates. 🤖 Generated with Claude Code |
…ed-without-id Per PR review: lower ConfirmationObservabilityError to Warning (nothing is lost — the message is safely in the Outbox for Sweeper retry) and rename it ConfirmationObservabilityFault so the name no longer implies the level; likewise rename Kafka's ErrorRaisingPublishConfirmation (already Warning) to PublishConfirmationRaiseFault. Add a Warning (PersistedReportMissingId) when a Kafka Persisted report carries no message id, so the degraded Id.Empty path — which leaves the message un-dispatched for the Sweeper rather than marked sent — is diagnosable rather than silent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cContext Sync Dispose blocked the async drain with GetAwaiter().GetResult(), which can deadlock when called from a thread carrying a single-threaded synchronization context while the async pump is enabled. Pump the drain on a dedicated single-threaded BrighterAsyncContext instead (the idiom already used by ChannelAsync.Dispose), and document preferring DisposeAsync. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code Review — Spec 0034: failed-delivery observability for confirmation-based producersThorough, well-documented PR. The change is genuinely useful (failed confirmations were silently dropped) and the design stays faithful to the deliberate "don't bubble, let the Sweeper retry" model. The inline comments explaining the why (AsyncLocal capture invariant, async-void safety, Strengths
Observations / suggestions
Build / verificationReviewed statically (couldn't run Overall: clean, careful, well-tested work. Items 1–3 are the ones I'd ask the author to look at, and only #3 (a confirming test for the bulk-path change) feels worth gating on for safety. 🤖 Generated with Claude Code |
…firming test Responds to the code review on PR #4180: - Lock the BulkDispatchAsync guard change with a focused core test: a producer that is both bulk and confirming (InMemoryMessageProducer) now enters the bulk branch and defers MarkDispatched to its confirmation rather than throwing "No async bulk message producer defined.". Success fact proves dispatch via confirmation (pre-change it threw, leaving the batch outstanding); failure fact proves no inline MarkDispatched on a nacked confirmation and a wire-topic trip. - InMemoryMessageProducer success confirmations now carry message.Header.Topic on both the sync (Enqueue) and async (DrainAsync) paths, mirroring Kafka/RMQ which carry the wire topic on ack and nack. - InMemoryMessageProducer.Dispose short-circuits when the async pump never started, keeping the default sync-confirmation path a true no-op instead of spinning up a BrighterAsyncContext on every dispose. Full core suite green: 775 passed, 0 failed, 7 skipped (net9.0). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for the thorough review — addressed in 2233fa5. #3 — bulk-confirming guard (the gating item). Added
#1 — InMemory success path drops the topic. Fixed — success confirmations now carry #2 — #4 / #5 are notes only and already covered by the existing XML remark on Full core suite green: 775 passed, 0 failed, 7 skipped (net9.0). |
Code Review — Spec 0034: failed-delivery observability for confirmation-based producersI reviewed the production changes across the mediator, tracer, circuit breaker, and all four producer implementations (Kafka, RMQ sync/async, InMemory), plus the test suite. This is a high-quality PR: the design is sound, the concurrency reasoning is careful and well-documented, the change is appropriately scoped (the only contract change is Verified correct
Minor observations (non-blocking)
Process notePer CLAUDE.md, production code in a spec PR should arrive via the TDD/ Overall: well-engineered, thoroughly tested, and the self-conducted adversarial reviews clearly paid off. The observations above are polish, not blockers. 🤖 Generated with Claude Code |
|
Code Review — Spec 0034: failed-delivery observability for confirmation-based producers Reviewed the full diff (production + tests + spec docs). High-quality, carefully-reasoned PR. The blast radius is well-contained (only the two ISupportPublishConfirmation producers see a contract change), the concurrency reasoning is documented inline at the call sites, and test coverage is genuinely thorough — concurrency races, fault isolation, throwing test doubles, and empty-id degradation are all exercised. Feedback below, ordered by importance. Nothing is a blocker; most are questions/consistency notes. Strengths
Questions / consistency notes
Minor
Tests Coverage is strong and naming follows the repo convention (When_...should...). The throwing/gating doubles and the concurrent-same-topic / cooldown-atomicity tests are the right adversarial cases. When_observability_throws_should_isolate_and_still_trip pins the NFR-4 contract directly. Note: I could not run dotnet build / the suite in this review environment (sandbox restrictions), so the above is static review only. Please ensure CI is green, particularly the Kafka/RMQ integration tests that depend on live brokers. Overall: well-structured, well-documented, and faithful to the "do not bubble, let the sweeper retry" design. The items above are clarifications and consistency tweaks rather than defects. 🤖 Generated with Claude Code |
PublishFailurePredicate is init-only on InMemoryMessageProducer (deliberate, so it can't be flipped at runtime), which broke the inline assignment in the bulk-dispatch test. Move the confirmation-failure case into its own fixture whose constructor seeds the always-nack predicate via the object initializer, preserving the init-only contract. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Gates Failed
Prevent hotspot decline
(1 hotspot with Overall Code Complexity)
Enforce critical code health rules
(1 file with Bumpy Road Ahead)
Enforce advisory code health rules
(1 file with Overall Code Complexity)
Our agent can fix these. Install it.
Gates Passed
1 Quality Gates Passed
Reason for failure
| Prevent hotspot decline | Violations | Code Health Impact | |
|---|---|---|---|
| OutboxProducerMediator.cs | 1 rule in this hotspot | 6.97 → 6.56 | Suppress |
| Enforce critical code health rules | Violations | Code Health Impact | |
|---|---|---|---|
| KafkaMessageProducer.cs | 1 critical rule | 8.75 → 8.61 | Suppress |
| Enforce advisory code health rules | Violations | Code Health Impact | |
|---|---|---|---|
| OutboxProducerMediator.cs | 1 advisory rule | 6.97 → 6.56 | Suppress |
Quality Gate Profile: Clean Code Collective
Install CodeScene MCP: safeguard and uplift AI-generated code. Catch issues early with our IDE extension and CLI tool.
Code Review — PR #4180 (Spec 0034: failed-delivery observability)Reviewed the ~624 LOC of production changes across the 12 Strengths
Minor findings1. Asymmetric subscriber-throw guarding (RMQ vs Kafka). 2. Misleading success log on the degraded Kafka 3. 4. Two trip paths now coexist in bulk dispatch. Removing Questions / nits
Overall: clean, conservative, and faithful to the spec/ADR trajectory. The findings above are polish, not blockers. 🤖 Generated with Claude Code |
Summary
Implements Spec 0034 — making failed message delivery from confirmation-based producers (Kafka, RabbitMQ) observable, addressing issue #4179. This PR carries the full workflow: requirements → design (ADR 0063) → tasks → implementation (the production code is included; see "Production changes" below).
Today, when
OnMessagePublished(success: false, ...)fires, the failure is silently dropped: nothing logged, no telemetry, the circuit breaker is never tripped for confirmation-based producers, and Kafka discards the failed message id and theProduceExceptionreason/code. This PR closes those gaps without changing the deliberate "don't bubble, let the Sweeper retry" design.Production changes (~511 LOC across 11
src/files)ISupportPublishConfirmation/PublishConfirmationResult—OnMessagePublishedis enriched toAction<PublishConfirmationResult>(success, message id, wire topic, publishActivityContext). The only interface contract that changes, affecting just the twoISupportPublishConfirmationproducers.OutboxProducerMediator— the confirmation callback emits a standalone settle span first (linked to the publish span), logs a Warning on failure, and trips the circuit breaker on the per-message wire topic (parity with the non-confirmation!sentpath). Observability is isolated intry/catch(NFR-4); the span is disposed infinally(NFR-2).BrighterTracer.CreateConfirmationSpan(+IAmABrighterTracer) — creates the short-lived settle span carrying anActivityLinkto the original publish span; degrades to no-link when the context is absent.KafkaMessageProducer/KafkaMessagePublisher) — logs the previously-swallowedProduceException(reason + code) at Warning and propagates the failed message id throughPublishResults.RmqMessageProducerasync + sync) — confirmation carries the id, wire topic and publish link on ack/nack.InMemoryMessageProducer— opt-in async confirmation pump (used in tests and early/local development).InMemoryOutboxCircuitBreaker—ConcurrentDictionary+TryRemoveto address the documented concurrency race.Requirements highlights
Activitycarrying anActivityLinkto the original publish span (graceful degradation to no-Link when the context isn't reachable).message.Header.Topic) — exact parity with the non-confirmation!sentpath, including reply/rewritten topics.ProduceException(reason + code); propagate the failed message id throughPublishResults.RequestContext?on the producerSendmethods; ADR 0063 supersedes this — see below.ADR 0063 — the design decision
The open C-7/C-8 question (how the wire topic + original trace context reach the
OnMessagePublishedcallback) is resolved as mechanism (a): extend the contract, refined:OnMessagePublishedtoAction<PublishConfirmationResult>(success, message id, wire topic, publishActivityContext) — the only interface contract that changes, affecting just the twoISupportPublishConfirmationproducers (Kafka, RMQ).Send-signature change. The producer self-sources the publish span fromActivity.Current(set to S1 byCreateProducerSpan,AsyncLocalso race-free), captured synchronously insideSend/SendAsync. This meets FR-10's intent with a strictly smaller blast radius (the ~10 non-confirmation gateways are untouched) — so FR-10/AC-14 are marked superseded inrequirements.md.Review trajectory
producer.Span, fixed by sourcingActivity.Current) → PASS → PASS. Findings inreview-design.md.Closes #4179.
🤖 Generated with Claude Code