Support for Parallel Replication#1556

Open
vazois wants to merge 370 commits into dev from vazois/mmrt-dev

Conversation


@vazois vazois commented Feb 11, 2026

Multi-Log Parallel Replication Feature

This PR introduces multi-log-based Append-Only File (AOF) support to Garnet, improving write throughput and enabling optimized parallel replication replay. The feature leverages multiple physical TsavoriteLog instances to shard write operations and to parallelize log scanning, shipping, and replay across multiple connections and iterators. While designed primarily for cluster-mode replication, the feature can also be used in standalone mode to improve performance when AOF is enabled.

Feature Requirements

1. Sharded AOF Architecture

  • Improves AOF write-throughput through key-based sharding across distinct physical TsavoriteLog instances.
  • Accelerates replica synchronization through parallel log scanning and shipping across the network.
  • Full backward compatibility with existing single-log deployments.
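As a rough illustration of the key-based sharding described above, the sketch below hashes each key to one of several physical sublogs. All names (`ShardedAof`, `enqueue`, `sublog_for`) are invented for this sketch and are not Garnet's API; this is a Python stand-in, not the C# implementation.

```python
import zlib

class ShardedAof:
    """Illustrative key-based sharding across physical sublogs (not Garnet's API)."""

    def __init__(self, physical_sublog_count: int):
        # One in-memory list stands in for each physical TsavoriteLog instance.
        self.sublogs = [[] for _ in range(physical_sublog_count)]

    def sublog_for(self, key: bytes) -> int:
        # A stable hash of the key selects the physical sublog,
        # so all writes to the same key land in the same sublog.
        return zlib.crc32(key) % len(self.sublogs)

    def enqueue(self, key: bytes, record: bytes) -> None:
        self.sublogs[self.sublog_for(key)].append(record)

aof = ShardedAof(physical_sublog_count=4)
aof.enqueue(b"user:42", b"SET user:42 v1")
aof.enqueue(b"user:42", b"SET user:42 v2")  # same key -> same sublog
```

Because the mapping is a pure function of the key, a replica can replay each sublog independently while preserving per-key ordering.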

2. Flexible Parallel Replay with Tunable Task Granularity

  • Introduces virtual sublog abstraction to allow for parallel replay within a given physical sublog.
  • Minimizes inter-task coordination to maximize parallel execution efficiency.

3. Read Consistency Protocol

  • Per-session prefix consistency through the use of timestamp-based sequence numbers.
  • Sketch based key-level replay status tracking for efficient and lightweight freshness validation.
  • Version-based prefix-consistency across replica reconfiguration operations.
  • Ensures monotonically increasing sequence numbers across failovers through offset tracking during replica promotion.

4. Transaction Support

  • Coordinates multi-exec transactions across sublogs to maintain ACID properties during parallel replay.
  • Preserves consistent commit ordering per session through timestamp-based sequence numbers.

5. Fast Prefix-Consistent Recovery

  • Multi-sublog prefix-consistent recovery within the persisted commit boundaries.
  • Intra-page parallelism during recovery using multiple replay tasks.

Newly Introduced Configuration Parameters

| Parameter | Purpose |
| --- | --- |
| `AofPhysicalSublogCount` | Number of physical TsavoriteLog instances |
| `AofReplayTaskCount` | Replay tasks per physical sublog at the replica |
| `AofRefreshPhysicalSublogTailFrequencyMs` | Background task frequency for advancing idle sublog timestamps |

Implementation Plan

Phase 1: Core Infrastructure

  • 1.1 Implement AofHeader extensions to eliminate single log overhead.

    • ShardedHeader for standalone operations.
    • TransactionHeader for coordinated operations.
  • 1.2 Implement GarnetLog abstraction layer.

    • SingleLog wrapper for legacy single log.
    • ShardedLog implementation for multi-log.
  • 1.3 SequenceNumberGenerator class.

    • Generate monotonically increasing sequence number using timestamps.
    • Ensure monotonicity at failover and recovery by using starting offset.
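A minimal Python sketch of the generator described in 1.3, assuming a monotonic nanosecond clock and a starting offset recovered at failover; the class and method names are illustrative, not the actual Garnet types.

```python
import threading
import time

class SequenceNumberGenerator:
    """Illustrative timestamp-based monotonic sequence numbers (assumed names)."""

    def __init__(self, starting_offset: int = 0):
        # starting_offset is the maximum sequence number observed at
        # recovery/failover; every new number must strictly exceed it.
        self._last = starting_offset
        self._lock = threading.Lock()

    def next(self) -> int:
        with self._lock:
            now = time.monotonic_ns()
            # Take the clock when it is ahead; otherwise bump by one so the
            # sequence stays strictly increasing even if the clock stalls
            # or the recovered offset is ahead of wall time.
            self._last = max(self._last + 1, now)
            return self._last
```

Seeding the generator with the recovered offset is what preserves monotonicity across failovers, as the requirement in section 3 states.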

Phase 2: Primary Replication Stream

  • 2.1 AofSyncDriver class.

    • Single instance AofSyncDriver per attached replica.
    • Multiple instances of AofSyncTask per physical sublog.
    • Use dedicated AdvanceTime background task per attached replica.
  • 2.2 AofSyncTask class.

    • Independent log iterators per sublog
    • Network page shipping per sublog
    • Error handling and connection teardown
  • 2.3 AdvanceTime background task.

    • Primary monitors log changes by comparing the last known tail address to the current tail address.
    • Primary associates the current tail address snapshot with a sequence number (timestamp) that is strictly larger than all sequence numbers assigned until that moment and notifies the replica.
    • Replica maintains an advance time background task that updates sublog time using the information from the primary's signal.
    • Primary advances last known tail address to the observed tail address.
    • The system reaches equilibrium when writes are quiesced, and no more signals are sent unless a new change is detected.
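One iteration of the AdvanceTime check above can be sketched as follows. This is a Python stand-in with invented names (`sublog`, `next_seqno`, `signal_replica`) for the primary-side state, sequence-number generator, and replica notification.

```python
import itertools

def advance_time_once(sublog: dict, next_seqno, signal_replica) -> bool:
    """One primary-side AdvanceTime iteration (illustrative sketch)."""
    current_tail = sublog["tail"]
    if current_tail == sublog["last_known_tail"]:
        return False  # quiesced: no new writes, so no signal is sent
    seqno = next_seqno()  # strictly larger than all numbers assigned so far
    signal_replica(current_tail, seqno)  # replica advances sublog time from this
    sublog["last_known_tail"] = current_tail
    return True

signals = []
counter = itertools.count(1)
sub = {"tail": 128, "last_known_tail": 64}
advance_time_once(sub, lambda: next(counter), lambda tail, s: signals.append((tail, s)))
advance_time_once(sub, lambda: next(counter), lambda tail, s: signals.append((tail, s)))  # no change: no-op
```

The second call illustrates the equilibrium property: with no new writes, the tail matches the last known tail and no further signal is produced.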

Phase 3: Replica Replay Stream

  • 3.1 ReplicaReplayDriver class.

    • Per-physical-sublog enqueue, scan and replay coordination
    • Manages ReplicaReplayTask for parallel replay within a single physical sublog.
  • 3.2 ReplicaReplayTask class.

    • Record filtering by task affinity.
    • Coordinated update of virtual sublog replay state to enable read prefix consistency.
  • 3.3 Standalone operation replay

    • Each operation executes within its appropriate context (BasicContext or TransactionalContext).
    • The virtual sublog replay state is updated prior to replay to maintain prefix consistency for read operations.
  • 3.4 Multi-exec transaction replay

    • Transaction operations are distributed across replay tasks based on key affinity.
    • Upon encountering the TxnCommit marker, each participating task acquires exclusive locks for its assigned keys.
    • The associated virtual sublog replay state gets updated following the standalone operation replay.
    • All participating tasks synchronize at a barrier before commit, which releases locks and makes results visible.
    • The commit marker advances time prior to execution, ensuring timestamp consistency while locks are still held.
  • 3.5 Custom transaction procedure replay

    • Similar to multi-exec transaction with the exception of having a single thread execute the custom procedure.
    • Virtual sublog replay state gets updated prior to lock acquisition.
    • Exclusive lock acquisition ensures that transaction partial results are not exposed to readers.
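The lock/barrier/commit sequence of the multi-exec replay in 3.4 can be sketched roughly as below. This is an illustrative Python reduction (Garnet's actual replay tasks are C#); each task locks and applies only its assigned keys, then all participants synchronize at a barrier before releasing locks.

```python
import threading

def replay_task(ops, locks, barrier, apply):
    """One participating replay task's share of a multi-exec transaction (sketch)."""
    for key, _ in ops:
        locks[key].acquire()      # exclusive lock for each assigned key
    for key, value in ops:
        apply(key, value)         # apply this task's operations under lock
    barrier.wait()                # all participants reach the commit point
    for key, _ in ops:
        locks[key].release()      # commit: release locks, results become visible

store = {}
locks = {k: threading.Lock() for k in ("a", "b")}
barrier = threading.Barrier(2)  # two participating replay tasks
t1 = threading.Thread(target=replay_task,
                      args=([("a", 1)], locks, barrier, store.__setitem__))
t2 = threading.Thread(target=replay_task,
                      args=([("b", 2)], locks, barrier, store.__setitem__))
t1.start(); t2.start()
t1.join(); t2.join()
```

Because key sets are disjoint across tasks, no lock-ordering deadlock can arise, and the barrier guarantees that no task exposes partial results before every participant has applied its share.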

Phase 4: Read Consistency Protocol

  • 4.1 ReadConsistencyManager class

    • VirtualSublogReplayState struct using sketch arrays for key freshness tracking and sequence number frontier computation.
    • Provides APIs for updating sequence numbers at key or virtual sublog granularity.
    • Tracks version to maintain prefix consistency during replica reconfiguration events.
  • 4.2 Session based prefix consistency enforcement

    • Implement ConsistentReadGarnetApi and TransactionalConsistentReadGarnetApi to allow the JIT compiler to optimize operational calls.
    • Define callbacks to enforce consistent read protocol (e.g. ValidateKeySequenceNumber, UpdateKeySequenceNumber).
    • Session-level ReplicaReadSessionContext struct tracks maximumSessionSequenceNumber and related metadata (i.e. sessionVersion, lastHash, lastVirtualSublogIdx) to enforce prefix consistency both in steady state and during recovery.
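A rough sketch of the sketch-array freshness tracking in 4.1: each key hashes to a slot holding the maximum sequence number replayed for any key mapping there, so lookups over-approximate (never under-report) a key's freshness. All names here are invented for illustration.

```python
import zlib

class ReplayStateSketch:
    """Illustrative hash-sketch of per-key replay sequence numbers."""

    def __init__(self, slots: int = 1024):
        self.slots = [0] * slots  # max replayed seqno per hash slot
        self.sublog_seqno = 0     # sublog-level frontier (advance-time signal)

    def _slot(self, key: bytes) -> int:
        return zlib.crc32(key) % len(self.slots)

    def update_key(self, key: bytes, seqno: int) -> None:
        i = self._slot(key)
        self.slots[i] = max(self.slots[i], seqno)

    def key_frontier(self, key: bytes) -> int:
        # The key frontier T_k from the notes: max of the key-slot
        # sequence number and the sublog sequence number.
        return max(self.slots[self._slot(key)], self.sublog_seqno)

s = ReplayStateSketch()
s.update_key(b"k", 5)
s.sublog_seqno = 9
```

Hash collisions can only make a key look fresher than it is, which delays a read but never violates prefix consistency; that is what makes a fixed-size sketch a lightweight substitute for exact per-key tracking.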

Phase 5: Prefix-Consistent Recovery

  • 5.1 Commit operation

    • Occurs in unison across all sublogs. AutoCommit is disabled and commits are triggered at the GarnetLog layer instead of within TsavoriteLog, so that commits can be coordinated across sublogs.
    • Commit adds a cookie tracking the timestamp at which the commit occurred, to enforce prefix-consistent recovery.
  • 5.2 RecoverLogDriver implementation

    • Independent iterators with shared bounds.
    • Record filtering by sequenceNumber < untilSequenceNumber.
    • Build ReadConsistencyManager state at recovery to initialize SequenceNumberGenerator.
    • Allow intra-page parallel recovery using the scan/BulkConsume interface.
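The record filtering in 5.2 amounts to something like the following sketch: only records strictly below the recovered commit boundary are replayed, which is what keeps recovery prefix-consistent across sublogs.

```python
def replay_until(records, until_sequence_number, apply):
    """Replay only records with seqno < untilSequenceNumber (illustrative)."""
    for seqno, record in records:
        if seqno < until_sequence_number:
            apply(record)
        # Records at or beyond the boundary are dropped: they may not
        # have been persisted by every sublog at the recovered commit.

replayed = []
replay_until([(1, "a"), (5, "b"), (9, "c")], until_sequence_number=9, apply=replayed.append)
```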

Phase 6: Testing & Validation

  • 6.1 Replication base tests passing with multi-log enabled
  • 6.2 Replication diskless sync tests passing with multi-log enabled

NOTES

Prefix Consistent Single Key Read Protocol

  • Each session tracks the maximum observed sequence number $T_{ms}$ and only proceeds when the key frontier $T_k$ (max of key and sublog sequence numbers) exceeds that value, guaranteeing visibility of earlier writes.
  • After the read, refresh $T_{ms}$ with the key's latest sequence number; timestamps are strictly increasing, so doing this post-read remains safe even though freshness validation occurred beforehand, and boundary reads never slip through.

Prefix Consistent Batch Read

  • For every key $K_i$ in the batch, ensure $T_{ms} < T_{k_i}$, then compute $T_{max} = \max(T_{k_1}, \ldots, T_{k_n})$ before issuing the batched read.
  • Once the batch returns, verify each key still satisfies $T_{k_i} \leq T_{max}$; if any key advanced beyond $T_{max}$, redo the batch since a concurrent update happened. Because freshness gating blocks boundary reads, caching just $T_{max}$ is sufficient to detect drift.
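The batch protocol can be sketched the same way: gate each key's freshness, capture $T_{max}$, read, then check for drift. Again a Python stand-in with hypothetical callbacks (`key_frontier`, `batch_read`).

```python
def consistent_batch_read(session: dict, keys, key_frontier, batch_read):
    """Prefix-consistent batch read (illustrative sketch)."""
    while True:
        # Freshness gate: every key frontier must exceed T_ms.
        frontiers = {k: key_frontier(k) for k in keys}
        if any(f <= session["T_ms"] for f in frontiers.values()):
            continue  # wait for replay to catch up, then retry
        t_max = max(frontiers.values())
        values = batch_read(keys)
        # Drift check: redo the batch if any key advanced past T_max,
        # since that means a concurrent update raced the batch read.
        if all(key_frontier(k) <= t_max for k in keys):
            session["T_ms"] = max(session["T_ms"], t_max)
            return values

session = {"T_ms": 0}
fronts = {"a": 2, "b": 5}
vals = consistent_batch_read(session, ["a", "b"],
                             key_frontier=fronts.get,
                             batch_read=lambda ks: [k.upper() for k in ks])
```

Caching only the scalar $T_{max}$, rather than every per-key frontier, is what keeps the drift check lightweight.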

TODO

  • Ensure transaction replay releases locks in the event of an exception
  • Add timestamp tracking at primary per physical sublog.
  • Ensure timestamp tracking is consistent with recovery.
  • Ensure commit recovery does not recover on boundaries.
  • Failed Garnet.test.cluster.ClusterReplicationAsyncReplay.ClusterReplicationManualCheckpointing [CI]
  • Failed Garnet.test.cluster.ClusterReplicationTLS.ClusterSRNoCheckpointRestartSecondary(False,False)[CI]
  • Failed Garnet.test.RespAdminCommandsTests.SeSaveRecoverMultipleKeysTest("63k","15k")[CI]
  • Failed Garnet.test.cluster.ClusterMigrateTests(False).ClusterMigrateWrite[CI]
  • Failed Garnet.test.cluster.ClusterReplicationShardedLog.ClusterReplicationShardedLogRecover[CI]
  • Validate special case where maximumSessionSequenceNumber is 0 and FrontierSequenceNumber is also 0.
  • ClusterResetHardDuringDisklessReplicationAttach [CI]
  • ClusterReplicationCheckpointCleanupTest [CI]

TedHartMS and others added 30 commits May 5, 2025 20:15
… for SpanByteAllocator. The AddressType change is a breaking on-disk format change: it shuffles bits around in RecordInfo to add an additional bit adjacent to the old ReadCache bit to mark an address as:

- 00: Reserved
- 11: ReadCache
- 10: InMemory portion of the main log
- 01: On-Disk
* wip

* wip

* wip

* Added unified store session

* Correcting generic typing

* Added MEMORY USAGE + TYPE to unified ops

* Added TTL, EXPIRETIME and EXISTS to unified ops

* implemented DEL in unified ops

* wip - expire & persist (broken)

* wip - adding expire to unified ops

* wip - expire

* add cref to server-side replication inter-node commands

* fix server-side BeginRecoverReplica

* wip

* Fix transaction key locking

* format

* Some test fixes

* Fixing tests

* reverting a couple of unnecessary changes

* Eliminating more multi-context methods from API

* Removed some unnecessary stuff

* Some more cleanup to TransactionManager

* merge tedhar/storage-v2 (ObjectAllocator serialization updates)

* Updating memory usage values

* format

* Handling wrong type ops

* Revert "Updating memory usage values"

This reverts commit 88ba307.

* fix no-object-log case

* Fixes for Tsavorite UTs

* Fixes for Tsavorite UTs (mostly ReadCache, TsavoriteLog, Compaction)

* Tsavorite Iterator work and UT re-enabling

* Fixes to Object iteration, LogRecord.ToString()

* Add RecordMetadata.ETag

* Readding --no-obj config

* fix

* test fix

* Prep for Recovery

* wip

* wip

* fix

* More fixes for UT (mostly Recovery, Migration)

* Moving DELIFEXPIM to unified store

* ObjectLogTail in Recovery, and more UT fixes (Migration record serialization, ReadCache size and tailAddress verification calculations, etc.)

* More Tsavorite recovery tests

* Removing unnecessary isObject flag from record serialization

---------

Co-authored-by: Vasileios Zois <vazois@microsoft.com>
Co-authored-by: TedHartMS <15467143+TedHartMS@users.noreply.github.com>
… correctly initialized SegmentSize for ObjectLogDevice; other fixes to UTs to work with UnifiedStore
@vazois vazois requested a review from Copilot March 3, 2026 19:05

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


Copilot AI left a comment


Pull request overview

Copilot reviewed 165 out of 169 changed files in this pull request and generated 3 comments.




vazois commented Mar 5, 2026

@copilot, investigate why ClusterMigrateWrite would be failing
https://github.com/microsoft/garnet/actions/runs/22695023356/job/65799614188#step:7:712


Copilot AI commented Mar 5, 2026

@vazois I've opened a new pull request, #1609, to work on those changes. Once the pull request is ready, I'll request review from you.

@vazois vazois requested review from TedHartMS and badrishc March 5, 2026 19:07
vazois and others added 6 commits March 10, 2026 15:57
…k-based replica sync (#1633)

In TryReplicateDiskbasedSync, ExecuteClusterInitiateReplicaSync was
sending beginAddress.Span for both the aofBeginAddress and aofTailAddress
parameters. This was introduced in commit 6fb99e5 when converting from
ToByteArray() to Span-based calls.

The primary uses the replica's tail address to compute the AOF sync replay
range. With both parameters being the begin address, the primary couldn't
determine where the replica's AOF actually ended, causing the replica to
never receive AOF records and remain stuck at offset 64 (kFirstValidAofAddress).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>