
feat(peer): provisional peer verification subsystem #143

Open
adequatelimited wants to merge 1 commit into master from feature/provisional-peers


@adequatelimited
Collaborator

Architectural Overview: Provisional Peer Verification

Background: What We Had Before

The Mochimo node maintains a Recent Peer List (Rplist, 64 entries) used for all network operations -- peer discovery, quorum formation, block propagation, and chain synchronization. Previously, when the node received a peer list from another node via OP_SEND_IPL, those IP addresses were added directly to Rplist via addrecent() -- no verification that the IPs were actually running Mochimo nodes.

This created two problems:

  1. Stale peer propagation: Nodes that went offline months ago remain in peer lists indefinitely. Every node shares its Rplist with every peer that asks, so stale IPs propagate across the entire network. A significant portion of advertised peers on the current network are unreachable.

  2. IP flooding attack surface: A malicious node could respond to OP_GET_IPL with fabricated IP addresses, filling the requesting node's Rplist with garbage. The node would then waste time trying to contact unreachable IPs during quorum formation and sync operations, and would propagate those garbage IPs to other nodes that ask for its peer list.

What Changed

Peer IPs received from network responses now go through a provisional verification pipeline before being added to Rplist. The pipeline has three stages:

Stage 1 -- Intake (addprovisional): IPs from OP_SEND_IPL responses are placed in a provisional list (4096 entries) instead of Rplist. Each entry records the candidate IP, the source IP that advertised it, and a status field. Before appending, the function deduplicates against existing provisional entries and Rplist, and checks the source's reputation.
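The intake step can be sketched roughly as follows. This is a minimal illustration, not the code in peer.c: the field layout of the entry, the `PROV_*` status names, the `sketch_*` function names, and passing Rplist as a plain array are all assumptions made for the example. The PR only states that an entry holds the candidate IP, the source IP, and a status, and that the struct is 20 bytes. Locking and the reputation check are left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROVPEERSLEN 4096   /* capacity, from the parameter table */

/* Hypothetical status names; the PR only names the three states. */
enum { PROV_PENDING = 0, PROV_VERIFIED = 1, PROV_EXPIRED = 2 };

/* Hypothetical layout that happens to total 20 bytes, 4-byte aligned. */
typedef struct {
   uint32_t ip;            /* candidate peer address */
   uint32_t source_ip;     /* peer that advertised the candidate */
   uint32_t next_retry;    /* earliest time of the next attempt */
   uint32_t last_attempt;  /* time of the most recent attempt */
   uint8_t status;
   uint8_t fail_count;
   uint8_t pad[2];
} ProvPeer;

static ProvPeer Provlist[PROVPEERSLEN];
static size_t Provcount;

/* Stand-in for a lookup in the real Rplist. */
static int sketch_in_rplist(const uint32_t *rplist, size_t rplen, uint32_t ip)
{
   size_t i;
   for (i = 0; i < rplen; i++) {
      if (rplist[i] == ip) return 1;
   }
   return 0;
}

/* Returns 1 if the candidate was appended, 0 if it was dropped. */
int sketch_addprovisional(uint32_t ip, uint32_t source_ip,
   const uint32_t *rplist, size_t rplen, uint32_t now)
{
   size_t i;

   assert(Provcount <= PROVPEERSLEN);
   if (Provcount == PROVPEERSLEN) return 0;             /* list is full */
   if (sketch_in_rplist(rplist, rplen, ip)) return 0;   /* already a peer */
   for (i = 0; i < Provcount; i++) {
      if (Provlist[i].ip == ip) return 0;               /* already queued */
   }
   Provlist[Provcount].ip = ip;
   Provlist[Provcount].source_ip = source_ip;
   Provlist[Provcount].next_retry = now;   /* eligible immediately */
   Provlist[Provcount].last_attempt = 0;
   Provlist[Provcount].status = PROV_PENDING;
   Provlist[Provcount].fail_count = 0;
   Provcount++;
   return 1;
}
```

Deduplicating against both lists before appending keeps one advertised IP from occupying several slots, which matters because the list is bounded.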

Stage 2 -- Verification (background thread): A dedicated thread processes provisional entries in batches of 32. For each pending entry whose retry time has passed, it attempts a callserver() handshake. If the handshake succeeds, the entry is marked VERIFIED. If it fails, the fail counter increments and the next retry is scheduled with a linearly increasing backoff (PROVBACKOFF seconds multiplied by the fail count). After 5 failures, the entry is marked EXPIRED.
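A rough sketch of how one verification result might be recorded is shown below. It follows the parameter table's description of PROVBACKOFF (base seconds multiplied by the fail count). The `Entry` type, the `PROV_*` names, and `sketch_record_result()` are hypothetical, and the assumption that no further retry is scheduled once an entry expires is mine, not the PR's.

```c
#include <assert.h>
#include <stdint.h>

#define PROVMAXFAILS 5     /* failures before an entry expires */
#define PROVBACKOFF  300   /* base backoff in seconds */

/* Hypothetical status names. */
enum { PROV_PENDING = 0, PROV_VERIFIED = 1, PROV_EXPIRED = 2 };

/* Reduced, hypothetical entry: only the fields this step touches. */
typedef struct {
   uint32_t next_retry;
   uint8_t status;
   uint8_t fail_count;
} Entry;

/* Record the outcome of one handshake attempt made at time `now`. */
void sketch_record_result(Entry *e, int handshake_ok, uint32_t now)
{
   assert(e != 0);
   if (handshake_ok) {
      e->status = PROV_VERIFIED;
      return;
   }
   e->fail_count++;
   if (e->fail_count >= PROVMAXFAILS) {
      e->status = PROV_EXPIRED;   /* assumed: no retry is scheduled */
   } else {
      /* delay grows with each failure: 300, 600, 900, 1200 seconds */
      e->next_retry = now + (uint32_t) PROVBACKOFF * e->fail_count;
   }
}
```

With these values an unreachable candidate is retried after 300, 600, 900, and 1200 seconds, and the fifth failure expires it.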

Stage 3 -- Harvest (harvest_provisional): Called periodically from the main server loop. Scans for VERIFIED entries, promotes them to Rplist via addrecent(), then compacts the list by removing all EXPIRED entries.
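The promote-and-compact pass could look something like the sketch below. The `Entry` type, the `PROV_*` names, and `sketch_harvest()` are hypothetical, and an output array stands in for the calls to addrecent(). The sketch also assumes that promoted entries leave the provisional list along with expired ones; the PR only states that EXPIRED entries are removed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status names. */
enum { PROV_PENDING = 0, PROV_VERIFIED = 1, PROV_EXPIRED = 2 };

/* Reduced, hypothetical entry. */
typedef struct {
   uint32_t ip;
   uint8_t status;
} Entry;

/* Copies VERIFIED addresses into `out` (stand-in for addrecent()) and
 * compacts the list in place. Returns the new entry count. */
size_t sketch_harvest(Entry *list, size_t count,
   uint32_t *out, size_t out_cap, size_t *n_out)
{
   size_t keep = 0, i;

   assert(list != 0 && out != 0 && n_out != 0);
   *n_out = 0;
   for (i = 0; i < count; i++) {
      if (list[i].status == PROV_VERIFIED && *n_out < out_cap) {
         out[(*n_out)++] = list[i].ip;   /* promoted, then dropped */
         continue;
      }
      if (list[i].status == PROV_EXPIRED) continue;   /* dropped */
      if (keep != i) list[keep] = list[i];   /* slide survivors down */
      keep++;
   }
   return keep;
}
```

A single forward pass preserves the order of the remaining entries, and a VERIFIED entry that does not fit in the output simply stays for the next harvest.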

Race Condition Handling

The provisional list is protected by a RWLock (from the extended-c threading library):

  • Parent thread (main loop): Takes write lock for addprovisional() (append) and harvest_provisional() (promote + compact). These are fast operations -- no blocking I/O under the lock.
  • Verification thread: Takes read lock to scan for candidates, releases it, performs the blocking callserver() attempt (3-second timeout, no lock held), then takes write lock briefly to update the entry's status/fail_count. The lock is never held during network I/O.
  • OpenMP threads in scan_quorum(): Call addprovisional(), which takes the write lock. Because the write lock is exclusive, concurrent callers are serialized and cannot corrupt the list.

The verification thread checks Running and Provrunning flags between every operation and every sleep second, ensuring clean shutdown without deadlock.

Blocking Situation Analysis

  • addprovisional(): Only holds write lock during in-memory array operations. No I/O, no network. Worst case is scanning 4096 entries for dedup + reputation -- microseconds.
  • harvest_provisional(): Same -- in-memory scan and compact under write lock. No I/O.
  • Verification thread: The callserver() call blocks for up to 3 seconds (INIT_TIMEOUT) per peer. With batches of 32, worst case is ~96 seconds per pass. This runs in a dedicated background thread -- never in the main server loop. Between batches, the thread sleeps for 30 seconds (checking Running every second).
  • Main server loop: Zero new blocking. harvest_provisional() is a fast in-memory operation.
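The worst-case figure and the interruptible pause between batches can be sketched as follows. `PROVSLEEP`, the `sketch_*` functions, and the stub tick are hypothetical names. `Running` is a local stand-in for the node's flag, and the one-second step is passed in as a callback so the example can be exercised without actually sleeping.

```c
#include <assert.h>

#define PROVBATCHSIZE 32   /* peers verified per pass */
#define INIT_TIMEOUT  3    /* seconds per handshake, per the text above */
#define PROVSLEEP     30   /* hypothetical name for the 30-second pause */

static volatile int Running = 1;   /* stand-in for the node's flag */

/* Upper bound on one pass when every handshake times out. */
int sketch_worst_case_pass_seconds(void)
{
   return PROVBATCHSIZE * INIT_TIMEOUT;
}

/* Sleeps up to `seconds`, one second at a time, and stops early once
 * Running is cleared. `one_second` stands in for sleep(1). */
int sketch_interruptible_sleep(int seconds, void (*one_second)(void))
{
   int slept = 0;

   assert(one_second != 0);
   while (slept < seconds && Running) {
      one_second();
      slept++;
   }
   return slept;
}

/* Test stub: pretends a shutdown is requested during the fifth second. */
static int Ticks;
static void stub_tick(void)
{
   if (++Ticks == 5) Running = 0;
}
```

Checking the flag once per second bounds shutdown latency during the pause to about one second, rather than the full 30.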

Source Reputation Management

When addprovisional() evaluates whether to accept an IP from a given source, it tallies that source's track record from existing provisional entries:

  • Counts all entries from this source_ip that are EXPIRED (failed verification) and whose last attempt was within the last hour (3600 seconds)
  • Counts all PENDING entries from this source toward the total
  • If the source has >= 10 entries total and >= 80% are recent failures, the new IP is silently dropped

Time-windowed decay: The reputation check only considers failures from the last hour. This is critical because:

  • On a fresh node joining the network, many legitimate peers share stale IP lists accumulated over years. Without decay, every source would quickly hit the threshold and the node would stop accepting peer lists from anyone.
  • With the 1-hour window, old failures age out. A source that shared bad IPs an hour ago gets a fresh chance.
  • A truly malicious source that continuously floods garbage IPs will keep hitting the threshold every hour -- but gets at most 10 entries per hour into the provisional list before being throttled.
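One possible reading of the reputation rule is sketched below. The `Entry` type, the `PROV_*` names, and `sketch_source_is_throttled()` are hypothetical. The sketch assumes the total is made up of recent EXPIRED entries plus PENDING entries, and that VERIFIED entries and aged-out failures are not counted; the description above does not settle either point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROVREPUTHR  10     /* minimum entries before evaluating */
#define PROVREPUFAIL 80     /* failure percentage threshold */
#define PROVREPUTIME 3600   /* reputation window in seconds */

/* Hypothetical status names. */
enum { PROV_PENDING = 0, PROV_VERIFIED = 1, PROV_EXPIRED = 2 };

/* Reduced, hypothetical entry. */
typedef struct {
   uint32_t source_ip;
   uint32_t last_attempt;
   uint8_t status;
} Entry;

/* Returns 1 if new IPs from `source_ip` should be dropped. */
int sketch_source_is_throttled(const Entry *list, size_t count,
   uint32_t source_ip, uint32_t now)
{
   size_t total = 0, recent_fails = 0, i;

   assert(list != 0 || count == 0);
   for (i = 0; i < count; i++) {
      if (list[i].source_ip != source_ip) continue;
      if (list[i].status == PROV_EXPIRED) {
         /* only failures inside the window count against the source */
         if (now - list[i].last_attempt <= PROVREPUTIME) {
            recent_fails++;
            total++;
         }
      } else if (list[i].status == PROV_PENDING) {
         total++;
      }
   }
   if (total < PROVREPUTHR) return 0;   /* not enough history yet */
   /* integer form of: recent_fails / total >= PROVREPUFAIL percent */
   return recent_fails * 100 >= total * PROVREPUFAIL;
}
```

Because the tally is recomputed from the entries on every call, the decay needs no timer: a failure stops counting as soon as its last attempt falls outside the window.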

Tunable Parameters

All defined in types.h alongside existing peer configuration:

| Parameter | Value | Purpose |
|---|---|---|
| PROVPEERSLEN | 4096 | Maximum provisional list entries |
| PROVBATCHSIZE | 32 | Peers verified per thread pass |
| PROVMAXFAILS | 5 | Failures before entry expires |
| PROVBACKOFF | 300 | Base backoff seconds (multiplied by fail count) |
| PROVREPUTHR | 10 | Minimum entries before evaluating source reputation |
| PROVREPUFAIL | 80 | Failure percentage threshold to reject a source |
| PROVREPUTIME | 3600 | Reputation window in seconds (1 hour) |
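As a rough illustration, the parameters might appear in types.h along the lines below, together with two small checks derived from them. The macro names and values come from the table. `PROVPEER_BYTES` and the `sketch_*` helpers are hypothetical additions for the example; the 20-byte figure is the struct size stated under Files Changed.

```c
#include <assert.h>
#include <stddef.h>

#define PROVPEERSLEN   4096   /* maximum provisional list entries */
#define PROVBATCHSIZE  32     /* peers verified per thread pass */
#define PROVMAXFAILS   5      /* failures before entry expires */
#define PROVBACKOFF    300    /* base backoff seconds */
#define PROVREPUTHR    10     /* entries before reputation applies */
#define PROVREPUFAIL   80     /* failure percentage threshold */
#define PROVREPUTIME   3600   /* reputation window in seconds */

#define PROVPEER_BYTES 20     /* hypothetical name for the struct size */

/* Memory held by a full provisional list, in bytes. */
size_t sketch_provlist_bytes(void)
{
   return (size_t) PROVPEERSLEN * PROVPEER_BYTES;
}

/* Sanity checks a build could run once at startup. */
void sketch_check_params(void)
{
   assert(PROVBATCHSIZE <= PROVPEERSLEN);
   assert(PROVREPUFAIL > 0 && PROVREPUFAIL <= 100);
   assert(PROVREPUTHR <= PROVPEERSLEN);
   assert(PROVMAXFAILS > 0 && PROVBACKOFF > 0);
}
```

At these values a full list occupies 81920 bytes (80 KiB), so the memory cost of the feature is small and fixed.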

Behavior Under Normal Conditions

  1. Node starts, completes initial sync via resync()
  2. Verification thread starts after init
  3. During scan_quorum(), peer IPs from responses go to both netplist (immediate scanning) and addprovisional() (long-term verification)
  4. During steady-state refresh_ipl(), peer IPs go only to addprovisional()
  5. Verification thread confirms reachable peers
  6. harvest_provisional() promotes verified peers to Rplist
  7. Rplist gradually fills with confirmed-reachable peers

Behavior Under IP Flooding Attack

A malicious node responds to OP_GET_IPL with 64 fabricated IPs:

  1. All 64 IPs enter the provisional list
  2. Verification thread attempts handshakes -- all fail
  3. After 5 failures each (~75 minutes of backoff), entries are marked EXPIRED
  4. Next harvest compacts them out
  5. Source reputation degrades: 64 expired entries, 100% failure rate
  6. Next time this source sends peer IPs, addprovisional() silently drops them all
  7. After 1 hour, old failures age out, source gets another chance
  8. If source sends garbage again, the cycle repeats -- at most 64 entries per hour of overhead

Impact on node operation: Zero. Rplist is never polluted.
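The "~75 minutes" in step 3 can be checked with a short calculation, assuming the delay after the k-th failure is PROVBACKOFF multiplied by k as described in the parameter table. `sketch_backoff_total()` is a hypothetical helper. Whether a delay follows the final failure is not stated, so both totals are of interest.

```c
#include <assert.h>

#define PROVMAXFAILS 5     /* failures before entry expires */
#define PROVBACKOFF  300   /* base backoff seconds */

/* Sum of the scheduled delays after failures 1..nfails, in seconds. */
int sketch_backoff_total(int nfails)
{
   int total = 0, k;

   assert(nfails >= 0);
   for (k = 1; k <= nfails; k++) {
      total += PROVBACKOFF * k;   /* 300, 600, 900, ... */
   }
   return total;
}
```

Counting all five multiples gives 4500 seconds, which is the 75 minutes quoted above. If nothing is scheduled after the fifth failure, the scheduled waits sum to 3000 seconds (50 minutes), plus handshake timeouts and the pauses between batches.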

Behavior With Stale Network Peer Lists

  1. Fresh node joins, receives peer lists with many stale IPs
  2. Stale IPs go to provisional, most fail verification
  3. Source reputation accumulates failures, temporarily throttles sources with high failure rates
  4. The 1-hour decay window means sources are not permanently blacklisted
  5. Over time, Rplist fills with only confirmed-reachable peers
  6. Stale IPs never enter Rplist -- they fail verification and expire

Files Changed

| File | Change |
|---|---|
| src/types.h | PROVPEER struct (20 bytes, 4-byte aligned), 7 config defines, 3 status constants |
| src/peer.h | 5 function prototypes |
| src/peer.c | Full implementation (~250 lines): intake, harvest, reputation, verification thread, lifecycle |
| src/network.c | scan_quorum() and refresh_ipl() route received peer IPs through addprovisional() |
| src/bin/mochimo.c | Thread start after init, harvest on refresh timer, thread stop on shutdown |
| src/test/peer-provisional.c | Unit test conforming to _assert.h / make test / make coverage conventions |

Testing

Unit test (make test-peer-provisional): Tests basic add, deduplication against provisional list and Rplist, capacity limit (4096 entries), source reputation with good sources, purge, multiple sources with cross-source dedup, harvest compaction, rapid add/harvest cycles (100 iterations), thread start/stop lifecycle, and concurrent add + harvest from separate threads. All assertions use the standard _assert.h framework. Passes via make test and is included in make coverage.

Build verification: Clean compile with -Wall -Werror -Wextra -Wpedantic on GCC 13 (Ubuntu x64). All existing tests unaffected.

What This Does NOT Change

  • Peers that complete a real protocol interaction with our node (incoming OP_FOUND, OP_GET_BLOCK, OP_TX, etc.) are still added directly to Rplist via addrecent() -- they have already proven they are real nodes by talking to us
  • The scan_quorum() working list (netplist) still receives IPs immediately for the current scan -- provisional verification is for long-term Rplist inclusion, not for blocking initial peer discovery
  • No changes to quorum formation, sync, or consensus paths
  • No changes to pink list handling
  • Provisional data is in-memory only -- lost on restart, no disk persistence needed

Introduces a provisional peer list that holds unverified IP addresses
received from network peers. A background thread verifies candidates
by attempting handshakes, and only verified peers are promoted to the
active recent peer list (Rplist). Includes source reputation tracking
with time-windowed decay to mitigate IP flooding attacks while
tolerating the stale peer lists common on the existing network.

New in types.h: PROVPEER struct, configuration defines, status values
New in peer.h: function prototypes for provisional peer management
New in peer.c: addprovisional(), harvest_provisional(), source
  reputation logic, background verification thread, purge
Modified network.c: scan_quorum() and refresh_ipl() now route
  received peer IPs through addprovisional() instead of addrecent()
Modified mochimo.c: thread lifecycle (start/harvest/stop) integrated
  into server init, main loop, and shutdown
New test: src/test/peer-provisional.c (make test-peer-provisional)
@adequatelimited
Collaborator Author

Here's the placeholder PR for this new feature. Will revisit it after the remaining audit-fixes are complete. @chrisdigity Would love your input on this.

@adequatelimited
Collaborator Author

Note: Clearing EXPIRED entries from the provisional list may contradict the reputation threshold calculation. If they are cleared immediately, they are no longer available when we calculate whether a source is a bad actor. Some rework is needed there to determine when a source has a "bad" reputation, but the bulk of the feature is here.
