
feat!: make queue and reconciliation leader-loss-aware #1130

Draft
kimpenhaus wants to merge 32 commits into main from bugfix/clear-queue-when-no-leadership

Conversation

@kimpenhaus

Collaborator
  • stops reconciliation on leadership loss

closes #784

@kimpenhaus kimpenhaus marked this pull request as ready for review May 30, 2026 11:15
@kimpenhaus kimpenhaus marked this pull request as draft June 2, 2026 13:33
@kimpenhaus kimpenhaus marked this pull request as ready for review June 3, 2026 07:22
kimpenhaus added 13 commits June 3, 2026 09:22
- Introduced `ValidateRegistrations` setting in `OperatorSettings` to enable validation of DI registrations on host startup.
- Added `OperatorRegistrationValidator` to ensure required components are registered for each managed entity, preventing silent misconfigurations.
- Implemented `OperatorRegistrationRegistry` to track managed entities and their associated services.
- Updated documentation with usage details and examples for registration validation.
- Added comprehensive unit tests to cover all validation scenarios.
…plication cache consistency

- Added checks to preserve deduplication cache state if enqueue fails due to leadership loss.
- Introduced tests for drop scenarios: updates, deletions, and retry behavior.
- Updated `EntityQueueBackgroundService` to use correct cancellation token for error retries.
- Improved logging to trace dropped enqueues.
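The drop handling in this commit can be pictured with a minimal sketch. It assumes the deduplication cache maps an entity UID to the last generation that was enqueued, and that the queue reports a dropped enqueue by returning `false`. `DedupingEnqueuer`, `OnEvent`, and the `tryEnqueue` delegate are hypothetical names for illustration, not the types touched by this PR.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not the PR's watcher or queue implementation.
public sealed class DedupingEnqueuer
{
    // Assumption: the cache stores the last enqueued generation per entity UID.
    private readonly Dictionary<string, long> _lastEnqueuedGeneration = new();
    private readonly Func<string, bool> _tryEnqueue;

    // tryEnqueue returns false when the enqueue is dropped (for example, no leadership).
    public DedupingEnqueuer(Func<string, bool> tryEnqueue) => _tryEnqueue = tryEnqueue;

    // Returns true only if the event was actually enqueued.
    public bool OnEvent(string uid, long generation)
    {
        if (_lastEnqueuedGeneration.TryGetValue(uid, out var seen) && seen >= generation)
        {
            return false; // Already enqueued for this generation: deduplicated.
        }

        if (!_tryEnqueue(uid))
        {
            // Dropped: leave the cache untouched so a later event for the
            // same generation is not mistaken for a duplicate.
            return false;
        }

        _lastEnqueuedGeneration[uid] = generation;
        return true;
    }
}
```

Updating the cache only after a successful enqueue is the point of the sketch: if the cache were written first, an event dropped during a leadership gap would be deduplicated away once leadership returns.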
…isposal

- Made `StartAsync` idempotent to avoid duplicate processing loops under concurrent leadership signals.
- Added lifecycle lock to synchronize start/stop state transitions.
- Fixed `Dispose` and `DisposeAsync` to unsubscribe from leadership elector callbacks.
- Updated `DisposeAsync` to follow the asynchronous disposal pattern and release shared resources.
- Introduced additional tests to validate idempotency and proper disposal behavior.
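A minimal sketch of what an idempotent start guarded by a lifecycle lock can look like. `IdempotentLoopService`, `LoopsStarted`, and the 50 ms polling loop are hypothetical and stand in for the real processing loop; this is not the code of `EntityQueueBackgroundService`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration only; not the PR's EntityQueueBackgroundService.
public sealed class IdempotentLoopService
{
    private readonly object _lifecycleLock = new();
    private CancellationTokenSource? _cts;
    private Task? _loop;

    // Exposed so the sketch can show that repeated starts create only one loop.
    public int LoopsStarted { get; private set; }

    public Task StartAsync(CancellationToken cancellationToken)
    {
        lock (_lifecycleLock)
        {
            // A second start signal while a loop is still active is a no-op.
            if (_loop is { IsCompleted: false })
            {
                return Task.CompletedTask;
            }

            _cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
            LoopsStarted++;
            _loop = RunLoopAsync(_cts.Token);
        }

        return Task.CompletedTask;
    }

    public async Task StopAsync()
    {
        Task? loop;
        CancellationTokenSource? cts;

        // Detach the current loop under the lock, then wait outside of it.
        lock (_lifecycleLock)
        {
            loop = _loop;
            cts = _cts;
            _loop = null;
            _cts = null;
        }

        if (loop is null || cts is null)
        {
            return;
        }

        cts.Cancel();
        try
        {
            await loop.ConfigureAwait(false);
        }
        catch (OperationCanceledException)
        {
            // Expected when the loop observes the cancellation.
        }
        finally
        {
            // Dispose the token source only after its loop has ended.
            cts.Dispose();
        }
    }

    private static async Task RunLoopAsync(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            // Placeholder for dequeue-and-reconcile work.
            await Task.Delay(TimeSpan.FromMilliseconds(50), token).ConfigureAwait(false);
        }
    }
}
```

The lock covers only the state transition, never the wait for the loop to finish, so a stop that takes long to drain does not block a concurrent leadership signal.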
…ership flaps

- Updated `EntityQueueBackgroundService` to assign a fresh `CancellationTokenSource` for each processing loop, ensuring proper disposal only after the loop ends.
- Refactored `_cts` handling to avoid disposing a token source still observed by a previously running loop.
- Enhanced DI validation to correctly handle open-generic service registrations with generic constraints.
- Added unit tests for leadership flap scenarios and DI validation improvements.
@kimpenhaus kimpenhaus marked this pull request as draft June 21, 2026 19:18
… prevent token disposal during in-flight reconciliations

- Made `ReconcileAsync` fully asynchronous in multiple integration tests to align with updated queue behavior.
- Refactored `EntityQueueBackgroundService` to manage multiple active processing loops, ensuring proper disposal and cancellation.
- Introduced safeguards against `ObjectDisposedException` when a token is accessed during in-flight reconciliations.
- Added timeout to drain in-flight reconciliations during disposal to prevent indefinite blocking.
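One way to bound the drain of in-flight reconciliations is to race the combined task against a delay. The sketch below is an assumption about the general shape, not the PR's code; `InFlightDrain` and `DrainAsync` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical illustration only; not the PR's disposal logic.
public static class InFlightDrain
{
    // Returns true if every in-flight task finished (successfully or not)
    // within the timeout, and false if the timeout elapsed first.
    public static async Task<bool> DrainAsync(IReadOnlyCollection<Task> inFlight, TimeSpan timeout)
    {
        var all = Task.WhenAll(inFlight);

        // Task.WhenAny completes as soon as either task completes and never
        // throws for a faulted task, so a failed reconciliation cannot abort
        // the drain.
        var finished = await Task.WhenAny(all, Task.Delay(timeout)).ConfigureAwait(false);

        return ReferenceEquals(finished, all);
    }
}
```

A caller in a disposal path could log a warning when `DrainAsync` returns `false` and then continue, so that disposal never blocks indefinitely. For brevity the sketch does not cancel the delay timer when the drain finishes early.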
@kimpenhaus kimpenhaus changed the title feat: make queue and reconciliation leader-aware feat!: make queue and reconciliation leader-loss-aware Jun 21, 2026
… lifecycle management

- Replaced duplicated lifecycle handling logic in `EntityQueueBackgroundService` and `ResourceWatcher` with the new `RestartableHostedService` base class.
- Simplified start/stop mechanics by centralizing idempotent loop execution and cancellation handling in `RestartableHostedService`.
- Updated disposal methods to align with the asynchronous disposal pattern, ensuring proper resource cleanup.
- Adjusted integration tests to accommodate changes in background service behavior.
… `LeaderElectionType`

- Updated documentation to explain how `LeaderElectionType` affects the queue-consumer service configuration, including behavior for `None`, `Single`, and `Custom` types.
- Clarified scheduling state management and leadership-loss protection mechanisms.
…ElectionSubscription`

- Introduced `LeaderElectionSubscription` to manage leadership callbacks consistently across services.
- Simplified elector subscription/unsubscription logic in `LeaderAwareResourceWatcher` and `EntityQueueBackgroundService`.
- Updated `RestartableHostedService` to support non-blocking stop behavior (`RequestStopAsync`).
- Enhanced async disposal flow to ensure handlers are unsubscribed, preventing lingering references.
- Added tests for leadership transitions, idle draining, and disposal correctness.
…handling

- Fixed `EntityQueueBackgroundService` to reset `_running` state for proper restart after unexpected loop exits.
- Added handling for unexpected loop faults with explicit logging via `OnLoopFaulted`.
- Enhanced `LeaderAwareEntityQueueBackgroundService` to record reconciliation metrics for leader-elected consumers using `OperatorMetrics`.
- Introduced comprehensive tests for loop restarting and metrics recording.
… handling

- Updated `EntityCache` logic to scope removal by entity type, preserving unrelated entries during leadership loss.
- Improved `RestartableHostedService` to restart loops with exponential backoff on faults, preventing silent service failures.
- Refactored `LeaderElectionBackgroundService` to cancel backoff promptly on shutdown, ensuring graceful disposal.
- Added new tests for loop restart behavior, entity-specific cache clearing, and backoff timing during shutdown.
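The restart-with-backoff behavior can be sketched roughly as follows. The doubling schedule, the cap, and the names `BackoffRestart`, `NextDelay`, and `RunAsync` are assumptions for illustration; the PR does not state the actual delays used by `RestartableHostedService`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration only; not the PR's RestartableHostedService.
public static class BackoffRestart
{
    // Delay before the next restart: baseDelay doubles with every consecutive
    // fault and is capped at maxDelay.
    public static TimeSpan NextDelay(int consecutiveFaults, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        if (consecutiveFaults <= 0)
        {
            return TimeSpan.Zero;
        }

        var milliseconds = Math.Min(
            baseDelay.TotalMilliseconds * Math.Pow(2, consecutiveFaults - 1),
            maxDelay.TotalMilliseconds);

        return TimeSpan.FromMilliseconds(milliseconds);
    }

    // Runs the loop and restarts it after a fault, waiting longer each time.
    public static async Task RunAsync(
        Func<CancellationToken, Task> loop,
        TimeSpan baseDelay,
        TimeSpan maxDelay,
        CancellationToken token)
    {
        var consecutiveFaults = 0;

        while (!token.IsCancellationRequested)
        {
            try
            {
                await loop(token).ConfigureAwait(false);
                return; // The loop ended on its own without a fault.
            }
            catch (OperationCanceledException) when (token.IsCancellationRequested)
            {
                return; // Shutdown was requested; do not restart.
            }
            catch (Exception)
            {
                consecutiveFaults++; // A real service would log the fault here.
            }

            try
            {
                // Passing the token lets shutdown cancel the backoff promptly.
                var delay = NextDelay(consecutiveFaults, baseDelay, maxDelay);
                await Task.Delay(delay, token).ConfigureAwait(false);
            }
            catch (OperationCanceledException)
            {
                return;
            }
        }
    }
}
```

With a base delay of one second and a cap of thirty seconds, this schedule yields 1 s, 2 s, 4 s, and so on, until it reaches 30 s. Resetting the fault counter after a healthy run is left out of the sketch.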
…lt tolerance

- Introduced startup validation to ensure FusionCache tagging remains enabled for resource watcher caches, preventing runtime failures due to misconfiguration.
- Enhanced cache cleanup logic during leadership loss to handle exceptions gracefully, ensuring safety-critical stops proceed unaffected.
- Improved error handling in leader-sensitive services to prevent propagation of faults into elector callbacks.
- Added comprehensive unit tests for tagging validation, cache cleanup behavior, and leadership fault tolerance.
…adership flap handling

- Added unit tests for `RestartableHostedService` to verify backoff escalation in crash loops and reset behavior after healthy runs.
- Introduced a test for leadership flap handling, ensuring concurrent loops drain gracefully without faults.
- Applied `[Trait("Area", "LeaderLoss")]` to categorize relevant tests.
…or` and fix warning format in docs

- Cleaned up comments in `OperatorRegistrationValidator`, removing redundant explanations and correcting phrasing.
- Fixed markdown formatting for warnings in caching documentation to ensure proper rendering.
…ant comments in `RestartableHostedService` and `ResourceWatcher`
…OrUpdate` method

- Replaced positional arguments with named arguments to improve readability and maintainability.
- Adjusted log messages for consistency and conciseness.
Development

Successfully merging this pull request may close these issues.

[bug]: Leadership election faulty when network timeout issues present
