Fix Worker Transport Stability #18

Merged: brwyatt merged 6 commits into main from 14-transport-disconnection on Mar 8, 2026


Conversation

brwyatt (Owner) commented Mar 8, 2026

Description

Fix and improve the stability of the Worker transport connection, especially on RabbitMQ. This introduces a small refactor and resolves problems with connection handling and with async generators.

Additionally, this lowers the default assignment timeout to 10 seconds and "collapses" queued messages for the same job so we don't flap the job state unnecessarily. Only the last queued message for a job is treated as the state we should act on.

Fixes #14

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Documentation update
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Breaking Changes

  • If this is a breaking change, please describe the migration path or dual-mode operation plan.

Non-breaking, unless someone wants the longer assignment timeouts.

Checklist:

Code Quality

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have checked that my changes do not introduce new linting errors

Testing

  • I have added tests for critical code paths and functionality
  • New and existing unit tests pass locally with my changes

Architecture & Philosophy

  • Async/Await: Verified all I/O is non-blocking.
  • Database: Verified proper handling of database types across engines (e.g., ULID storage as TEXT in SQLite).
  • Statelessness: Ensured changes align with stateless/path-blind philosophy.
  • Plugins: If adding a new plugin, confirmed it belongs in the core repo (vs external package).
  • Models: Used Pydantic V2 for models/config.

brwyatt added 6 commits March 5, 2026 20:13
... and shorten re-assignment interval.

Fixes some problems where connections wouldn't really reconnect or would
spin up duplicate connections while the original one reconnected,
causing "fun" conflicts. The solution here is, more or less, "just let
the underlying library handle it", which removes a lot of our manual
handling attempts that were conflicting with it. It also consolidates a
lot of the underlying connection handling.
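
A rough sketch of the "consolidate and don't fight the library" idea, not the code from this PR. It assumes the client library reconnects on its own, so our side only guarantees that a single connection object is ever created and shared. `SharedConnection` and the `connect` factory are hypothetical names used for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class SharedConnection(Generic[T]):
    """Hands out one shared connection; never opens a second one.

    Assumption: the object returned by `connect` recovers from drops by
    itself (library-managed reconnect), so there is no manual retry here.
    """

    def __init__(self, connect: Callable[[], Awaitable[T]]) -> None:
        self._connect = connect  # hypothetical connection factory
        self._conn: Optional[T] = None
        self._lock = asyncio.Lock()

    async def get(self) -> T:
        # The lock makes concurrent callers wait for the first connect
        # instead of each opening their own (duplicate) connection.
        async with self._lock:
            if self._conn is None:
                self._conn = await self._connect()
            return self._conn
```

Every consumer and publisher would call `get()` rather than connecting on its own, which is what keeps a slow reconnect from racing with a freshly created duplicate.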

The idea here is to "condense" job messages so we aren't flapping jobs.
For example, if we drop from the queue and a job is assigned, then
cancelled, then re-assigned to us, we don't start it, stop it (sending a
cancellation!), and then start the (now-cancelled!) job. Instead we pay
attention to the current state that the coordinator has sent us, and act
on that ONLY.
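
To illustrate the collapsing described above, here is a minimal sketch under assumed names. `JobMessage`, its `action` values, and `collapse` are hypothetical and not taken from the project; the only point is that a backlog is reduced to the last message per job before anything is acted on.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class JobMessage:
    job_id: str
    action: str  # hypothetical values, e.g. "assign" or "cancel"


def collapse(backlog: Iterable[JobMessage]) -> List[JobMessage]:
    """Keep only the most recent message for each job.

    Jobs are returned in the order of their most recent message.
    """
    latest: Dict[str, JobMessage] = {}
    for msg in backlog:
        # Remove any older entry first so the re-inserted key moves to
        # the end of the dict (dicts preserve insertion order).
        latest.pop(msg.job_id, None)
        latest[msg.job_id] = msg
    return list(latest.values())
```

For a backlog of assign, cancel, assign for the same job, only the final assign survives, so the worker starts the job once instead of starting, cancelling, and restarting it.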

The problem here was timing out the async generator function, which...
broke things after the timeout, meaning we never got further messages.
Oops.
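
A small standalone reproduction of this class of bug, as a sketch rather than the project's code. Wrapping `__anext__()` in `asyncio.wait_for` cancels the pending step on timeout; the cancellation is raised inside the generator and finishes it, so every later read sees `StopAsyncIteration`. One possible way around it (not necessarily what this PR does) is to keep the pending step alive as a task and use `asyncio.wait`, which does not cancel on timeout. `messages` is a hypothetical stand-in for the transport's message stream.

```python
import asyncio


async def messages():
    """Stand-in message stream; each message is slower than the timeout."""
    for i in range(3):
        await asyncio.sleep(0.1)
        yield i


async def read_with_wait_for(timeout):
    """Broken: the first timeout cancels and finishes the generator."""
    gen = messages()
    received = []
    while True:
        try:
            received.append(await asyncio.wait_for(gen.__anext__(), timeout))
        except asyncio.TimeoutError:
            continue  # the generator is already dead at this point
        except StopAsyncIteration:
            break
    return received


async def read_with_pending_task(timeout):
    """Safer: a timeout leaves the in-flight __anext__() task running."""
    gen = messages()
    received = []
    pending = None
    while True:
        if pending is None:
            pending = asyncio.ensure_future(gen.__anext__())
        done, _ = await asyncio.wait({pending}, timeout=timeout)
        if not done:
            continue  # timed out; periodic work could go here
        try:
            received.append(pending.result())
        except StopAsyncIteration:
            break
        finally:
            pending = None  # request the next item on the next loop
    return received
```

With a timeout shorter than the message interval, the first reader returns no messages at all, while the second still receives every message.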
brwyatt added this to the 0.1.2 milestone Mar 8, 2026
brwyatt self-assigned this Mar 8, 2026
brwyatt linked an issue Mar 8, 2026 that may be closed by this pull request
5 tasks
brwyatt (Owner, Author) commented Mar 8, 2026

I've been testing and running this in my environment (which is how I actually caught the problem with timing out the async generators), and it has been running stably.

brwyatt merged commit f145ed3 into main Mar 8, 2026
4 checks passed
brwyatt deleted the 14-transport-disconnection branch March 8, 2026 19:55


Development

Successfully merging this pull request may close these issues.

Worker can silently disconnect from RabbitMQ and fail to reconnect in some cases
