Persist pending confirms in SQLite#11
Conversation
LandynDev
left a comment
There was a problem hiding this comment.
Thanks for the contribution — the PR description is clear and the intent is solid.
The only way to prevent the potential loss of data would require halting + update windows, so this is a good contribution.
A few things to address before this is merged:
-
I don't think we need the Protocol abstraction. PendingConfirmStore as a separate Protocol with an injectable store parameter is more indirection than we need — there's only one implementation and we don't anticipate another. Fold the SQLite logic directly into PendingConfirmQueue, same as the original kept the dict logic inline. Tests can pass db_path to the queue constructor instead.
-
Connection leaks can happen.
_get_connection()creates a newsqlite3.connect()on every call. The with conn: context manager handles transaction commit/rollback but does not close the connection — so each upsert, remove, list_all etc. leaks an open file handle. Open a single connection in init, reuse it across calls (protected by the lock), and close it if/when the queue is torn down. -
I notice the Lock on some methods but not others.
upsert(),remove(), andpurge_expired()acquireself._lock, butlist_all(),has(), andsize()don't. Either the lock is needed for thread safety or it isn't — the inconsistency makes it hard to reason about correctness. With a single persistent connection, just lock every public method, same pattern as the original code. -
Simplify
upsert(). The dynamic SQL construction (iterating field tuples to build column names, placeholders, and update clauses) is ~25 lines for a fixed
schema. A hardcodedINSERT OR REPLACE INTO pending_confirms (...) VALUES (?, ?, ...)does the same thing and is easier to read. The separateSELECT
existence check +SELECT COUNT(*)before the upsert can also be simplified — one count query, then the insert. -
Thread safety test is order-dependent. The assertion
assert removed_hotkeys == ['miner-1', 'miner-2']relies on a 10ms sleep in the writer to guarantee
miner-1 is removed before miner-2. That's a timing assumption — on a loaded machine or slow CI runner both items could be enqueued before the remover
processes either, making removal order non-deterministic. Can we useset(removed_hotkeys) == {'miner-1', 'miner-2'}instead to prevent edge cases? -
While we're at it with these changes, can we drop
MAX_QUEUE_SIZE. The queue is keyed by miner_hotkey — one entry per miner, so it's naturally bounded by the number of miners on the subnet. With SQLite on disk it's no longer needed and could silently reject valid confirms if the subnet grows past 50 active miners. This is really just debt, can you remove the constant, the size check inenqueue(), and haveenqueue()always returnTrue(or just returnNonemay be better, depends on exception handling).
Summary
Persist
PendingConfirmQueueto local SQLite so queued confirms survive validator restarts.Problem
PendingConfirmQueueis currently in-memory only. That works while a validator process stays alive, but queued confirms are lost on restart.This matters when a source transaction has been accepted but is still waiting for confirmations. If several validators restart during that window, each restarted validator drops its local pending confirms and stops contributing toward the later
vote_initiate()step for those items.The most realistic case is a coordinated restart, such as a deploy or auto-update affecting multiple validators at once. In that case, pending confirms can be wiped across enough validators that progress toward initiation stalls until the user retries or validators re-accept the transaction.
What Changed
PendingConfirmQueue()caller usageCREATE TABLE IF NOT EXISTSTests
Added targeted coverage for:
Verification: