
Persist pending confirms in SQLite#11

Open
it-education-md wants to merge 1 commit into entrius:test from it-education-md:feat/persist-pending-confirms-sqlite

Conversation

@it-education-md

Summary

Persist PendingConfirmQueue to local SQLite so queued confirms survive validator restarts.

Problem

PendingConfirmQueue is currently in-memory only. That works while a validator process stays alive, but queued confirms are lost on restart.

This matters when a source transaction has been accepted but is still waiting for confirmations. If several validators restart during that window, each restarted validator drops its local pending confirms and stops contributing toward the later vote_initiate() step for those items.

The most realistic case is a coordinated restart, such as a deploy or auto-update affecting multiple validators at once. In that case, pending confirms can be wiped across enough validators that progress toward initiation stalls until the user retries or validators re-accept the transaction.

What Changed

  • keeps the existing PendingConfirmQueue() caller usage
  • swaps the queue backing store from in-memory state to local SQLite
  • initializes the store with a single CREATE TABLE IF NOT EXISTS
  • purges expired entries on read using the validator's current block
  • keeps the existing one-entry-per-miner queue semantics
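The store setup and purge-on-read steps above might look roughly like this. This is a sketch only: the `pending_confirms` table name, its columns, and the `<=` expiry comparison are assumptions for illustration, not the PR's actual schema.

```python
import sqlite3

def init_store(conn: sqlite3.Connection) -> None:
    # Single idempotent schema setup; safe to run on every validator startup.
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pending_confirms ("
            " miner_hotkey TEXT PRIMARY KEY,"   # one entry per miner
            " payload TEXT NOT NULL,"
            " expiry_block INTEGER NOT NULL)"
        )

def list_pending(conn: sqlite3.Connection, current_block: int) -> list:
    # Purge expired entries on read, keyed to the validator's current block,
    # then return whatever is still pending.
    with conn:
        conn.execute(
            "DELETE FROM pending_confirms WHERE expiry_block <= ?",
            (current_block,),
        )
        return conn.execute(
            "SELECT miner_hotkey, payload, expiry_block "
            "FROM pending_confirms ORDER BY miner_hotkey"
        ).fetchall()
```

The `PRIMARY KEY` on `miner_hotkey` is what preserves the one-entry-per-miner semantics: an `INSERT OR REPLACE` for an existing hotkey overwrites rather than appends.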

Tests

Added targeted coverage for:

  • persistence across queue instances
  • overwrite behavior for the same miner
  • enqueue/remove lifecycle
  • purge-on-read for expired entries
  • basic thread safety for enqueue/remove

Verification:

pytest -q tests/test_pending_confirm_queue.py
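The persistence-across-instances case reduces to the fact that a fresh connection to the same database file sees earlier writes. A minimal standalone demonstration (table and column names assumed, not taken from the PR):

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "pending_confirms.db")

# First "queue instance": create the table and enqueue one confirm.
conn1 = sqlite3.connect(db_path)
with conn1:
    conn1.execute(
        "CREATE TABLE IF NOT EXISTS pending_confirms "
        "(miner_hotkey TEXT PRIMARY KEY, payload TEXT, expiry_block INTEGER)"
    )
    conn1.execute(
        "INSERT OR REPLACE INTO pending_confirms VALUES (?, ?, ?)",
        ("miner-1", "tx-abc", 100),
    )
conn1.close()  # simulate the validator process exiting

# Second "instance": a fresh connection to the same file sees the confirm.
conn2 = sqlite3.connect(db_path)
rows = conn2.execute(
    "SELECT miner_hotkey, payload FROM pending_confirms"
).fetchall()
conn2.close()
print(rows)  # [('miner-1', 'tx-abc')]
```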

@LandynDev (Collaborator) left a comment


Thanks for the contribution — the PR description is clear and the intent is solid.
The only alternative for preventing this kind of loss would be coordinated halt-and-update windows, so this is a good contribution.

A few things to address before this is merged:

  1. I don't think we need the Protocol abstraction. PendingConfirmStore as a separate Protocol with an injectable store parameter is more indirection than we need — there's only one implementation and we don't anticipate another. Fold the SQLite logic directly into PendingConfirmQueue, same as the original kept the dict logic inline. Tests can pass db_path to the queue constructor instead.

  2. Connection leaks can happen. _get_connection() creates a new sqlite3.connect() on every call. The with conn: context manager handles transaction commit/rollback but does not close the connection — so each upsert, remove, list_all etc. leaks an open file handle. Open a single connection in init, reuse it across calls (protected by the lock), and close it if/when the queue is torn down.

  3. I notice the Lock on some methods but not others. upsert(), remove(), and purge_expired() acquire self._lock, but list_all(), has(), and size() don't. Either the lock is needed for thread safety or it isn't — the inconsistency makes it hard to reason about correctness. With a single persistent connection, just lock every public method, same pattern as the original code.

  4. Simplify upsert(). The dynamic SQL construction (iterating field tuples to build column names, placeholders, and update clauses) is ~25 lines for a fixed schema. A hardcoded INSERT OR REPLACE INTO pending_confirms (...) VALUES (?, ?, ...) does the same thing and is easier to read. The separate SELECT existence check + SELECT COUNT(*) before the upsert can also be simplified: one count query, then the insert.

  5. Thread safety test is order-dependent. The assertion assert removed_hotkeys == ['miner-1', 'miner-2'] relies on a 10ms sleep in the writer to guarantee miner-1 is removed before miner-2. That's a timing assumption: on a loaded machine or slow CI runner both items could be enqueued before the remover processes either, making removal order non-deterministic. Can we use set(removed_hotkeys) == {'miner-1', 'miner-2'} instead to prevent edge cases?

  6. While we're at it, can we drop MAX_QUEUE_SIZE? The queue is keyed by miner_hotkey (one entry per miner), so it's naturally bounded by the number of miners on the subnet. With SQLite on disk the cap is no longer needed, and it could silently reject valid confirms if the subnet grows past 50 active miners. This is really just debt: remove the constant and the size check in enqueue(), and have enqueue() always return True (or just return None, whichever fits the exception handling better).
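Pulling points 2, 3, 4, and 6 together, the queue might be restructured roughly as follows. This is a sketch under the reviewer's suggestions; the schema, method names, and constructor signature are assumptions rather than the PR's actual code.

```python
import sqlite3
import threading

class PendingConfirmQueue:
    """SQLite-backed pending-confirm queue (illustrative sketch)."""

    def __init__(self, db_path: str):
        self._lock = threading.Lock()
        # Point 2: open one connection here and reuse it; no per-call connect().
        self._conn = sqlite3.connect(db_path, check_same_thread=False)
        with self._conn:
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS pending_confirms "
                "(miner_hotkey TEXT PRIMARY KEY, payload TEXT, expiry_block INTEGER)"
            )

    def enqueue(self, miner_hotkey: str, payload: str, expiry_block: int) -> None:
        # Point 3: every public method takes the lock.
        # Point 4: hardcoded upsert instead of dynamic SQL construction.
        # Point 6: no MAX_QUEUE_SIZE check; the PRIMARY KEY bounds the table.
        with self._lock, self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO pending_confirms VALUES (?, ?, ?)",
                (miner_hotkey, payload, expiry_block),
            )

    def remove(self, miner_hotkey: str) -> None:
        with self._lock, self._conn:
            self._conn.execute(
                "DELETE FROM pending_confirms WHERE miner_hotkey = ?",
                (miner_hotkey,),
            )

    def size(self) -> int:
        with self._lock:
            return self._conn.execute(
                "SELECT COUNT(*) FROM pending_confirms"
            ).fetchone()[0]

    def close(self) -> None:
        # Point 2: close the single connection on teardown.
        with self._lock:
            self._conn.close()
```

For point 5, the assertion in the thread-safety test becomes order-independent as `assert set(removed_hotkeys) == {'miner-1', 'miner-2'}`.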

