Merged

Docs #163

2 changes: 1 addition & 1 deletion cmake/sanitize.cmake
@@ -1,5 +1,5 @@
 if (SANITIZER_TYPE STREQUAL "thread")
-    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=thread -g -O1 -fno-omit-frame-pointer")
+    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=thread -g -O1 -fno-omit-frame-pointer -Wno-error=tsan")
     set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fsanitize=thread")
     message(STATUS "Thread Sanitizer enabled")
 else()
12 changes: 5 additions & 7 deletions conanfile.py
@@ -41,14 +41,9 @@ def configure(self):
         if self.options.shared:
             self.options.rm_safe("fPIC")
         if self.settings.build_type == "Debug":
-            if self.options.coverage and self.options.sanitize:
+            if self.options.coverage and self.options.sanitize != 'False':
                 raise ConanInvalidConfiguration("Sanitizer does not work with Code Coverage!")
-            if self.conf.get("tools.build:skip_test", default=False):
-                if self.options.coverage or self.options.sanitize:
-                    raise ConanInvalidConfiguration("Coverage/Sanitizer requires Testing!")
-
-    def configure(self):
-        if self.settings.build_type != "Debug":
+        else:
             self.options['sisl/*'].malloc_impl = 'tcmalloc'
 
     def build_requirements(self):
Expand Down Expand Up @@ -115,6 +110,9 @@ def build(self):
jobs = self.conf.get("tools.build:jobs", default=3)
env = Environment()
env.define("CTEST_PARALLEL_LEVEL", str(jobs))
if self.options.get_safe("sanitize") == "thread":
suppression_file = join(self.source_folder, "src", "test", "tsan_suppressions.txt")
env.define("TSAN_OPTIONS", f"suppressions={suppression_file}:second_deadlock_stack=1")
with env.vars(self).apply():
cmake.test()

45 changes: 45 additions & 0 deletions docs/craft/README.md
@@ -0,0 +1,45 @@
# CRAFT — Client Assisted RAFT

**CRAFT** (Client Assisted RAFT) is the replication protocol for NuBlox 2.0. It separates the
data path from the consensus path: clients broadcast writes directly to all replicas at
client-assigned LSNs, while RAFT is used only for leader election, login synchronization, and
recovery bookkeeping. Write data never flows through the RAFT log.

## Documents

| File | Contents |
|---|---|
| [protocol.md](protocol.md) | Full protocol — leader election, login, IO phase, failure/resync |
| [api.md](api.md) | HomeBlocks C++ CRAFT API (`CraftReplDev` methods) |
| [rpcs.md](rpcs.md) | All 8 RPCs (client↔server and server↔server) |
| [states.md](states.md) | LSN state machines (client view and replica view) |
| [subtasks.md](subtasks.md) | Implementation sub-task breakdown (SDSTOR-22382 children) |

## Glossary

| Term | Definition |
|---|---|
| **CRAFT** | Client Assisted RAFT — the NuBlox 2.0 replication protocol |
| **dLSN** | Data LSN — a monotonically increasing sequence number in the **data journal** of a single partition/replica-set. Per-volume in NuBlox. |
| **gLSN** | Global LSN — monotonically increasing across all partitions of a volume. |
| **rLSN** | RAFT LSN — the index within the RAFT log. Distinct from dLSN. |
| **term** | RAFT term number, incremented on every new client login. Used by replicas to reject stale IOs. |
| **commit_lsn** | Highest dLSN whose data has been applied to the state machine (index + block map). A committed write is readable. |
| **last_append_lsn** | Highest dLSN whose data has been written to the data journal (possibly not yet committed). |
| **Replica Set (RS)** | The set of HomeBlocks nodes that hold copies of one partition. Typically 3 nodes. |
| **Partition** | A contiguous region of a Volume, replicated across one Replica Set. In NuBlox, partition ≈ volume. |
| **CraftReplDev** | New HomeBlocks replication device class (parallel to `ReplDisk`) that implements CRAFT. |
| **CraftConnector** | New HomeBlocks RPC frontend (parallel to `ScstConnector`) that translates NubloxProto RPCs to `CraftReplDev` API calls. |
| **SyncRSCommitLSN** | A RAFT log entry type. On apply, each replica fetches any missing data up to the encoded dLSN and advances `commit_lsn`. |
| **InternalLogin** | A RAFT log entry type. On apply, stores the new `client_token` and enforces single-writer exclusivity. |
| **Missing** | A dLSN slot that a replica knows about (from a peer or from the RAFT log) but has not yet received data for. |
| **Empty** | A dLSN that was never received by any replica and is not discoverable during resync. Treated as a no-op hole. |

## Key design properties

- **Single writer**: only one client at a time owns a partition (enforced by `InternalLogin` RAFT entry).
- **Leaderless data path**: after login, the RAFT leader has no special role for writes or reads.
- **Client drives commit**: replicas do not commit until told by the client (via `commit` RPC or `min_commit_lsn` in a `read` RPC).
- **Server-side resync**: `SyncRSCommitLSN` lets replicas catch up from each other without client involvement.
- **No HomeStore changes needed**: `CraftReplDev` is built entirely on top of existing HomeStore journal/index/block primitives.
- **Full replacement**: `CraftReplDev` replaces the existing solo `ReplDev` for all volumes. There are no non-CRAFT (ReplDisk/solo) volumes in the final design.
219 changes: 219 additions & 0 deletions docs/craft/api.md
@@ -0,0 +1,219 @@
# HomeBlocks CRAFT C++ API

`CraftReplDev` is a new class (parallel to HomeStore's `ReplDisk`) that each CRAFT-mode
volume owns instead of a solo `repl_dev`. It exposes the following methods, which
`CraftConnector` calls 1-to-1 when translating incoming NubloxProto RPCs.

All methods are async/coroutine-style (`async_result<T>` or `async_status`) matching the
existing HomeBlocks convention.

---

## Per-partition in-memory state

```cpp
struct CraftPartitionState {
    int64_t  commit_lsn      {-1};  // highest committed dLSN
    int64_t  last_append_lsn {-1};  // highest appended dLSN (may be uncommitted)
    uint64_t client_token    {0};   // token from last successful InternalLogin
    uint64_t term            {0};   // current RAFT term
};
```

This state is authoritative in memory and recovered from the journal + superblock on restart.

---

## Client-facing API

### `login`

```cpp
struct LoginResult {
    std::vector<replica_endpoint> members;
    int64_t  dLSN;  // starting LSN for new IO
    int64_t  gLSN;  // global (volume-level) LSN
    uint64_t term;
};

async_result<LoginResult>
login(uint64_t client_token, volume_id_t vol_id);
```

Leader-only. Orchestrates the full login sequence:
1. `GetRSCommitLSN` broadcast to all peers (non-RAFT)
2. `FetchData` from an ahead peer if the leader is behind (non-RAFT)
3. Propose `SyncRSCommitLSN(rs_commit_lsn)` via RAFT
4. Propose `InternalLogin(client_token, new_term)` via RAFT
5. Return `LoginResult` after both RAFT entries commit

**Preconditions:** caller is the RAFT leader.
**Postconditions:** all quorum members have `commit_lsn == rs_commit_lsn`; all reject IOs
from any token other than `client_token`.

---

### `write`

```cpp
async_status
write(uint64_t term, int64_t lsn, int64_t glsn,
      lba_t lba, lba_count_t len, sisl::sg_list data);
```

Appends `data` to the data journal at slot `lsn`. Zero-copy required on the hot path.

Steps:
1. Reject if `term != state.term` → `ETERM`.
2. Write `data` to the journal at position `lsn` (may be out of order).
3. `state.last_append_lsn = max(state.last_append_lsn, lsn)`.
4. ACK.

Does **not** apply data to the LBA index; that happens on `commit`.
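
The four steps above can be sketched with an in-memory model. This is only an illustration: `ToyReplica`, `ToyStatus`, and the `std::map` journal are hypothetical stand-ins rather than HomeBlocks types, the payload is copied (so the zero-copy requirement is not modelled), and the assertion that LSNs are non-negative is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical status codes standing in for async_status.
enum class ToyStatus { Ok, ETerm };

// Hypothetical in-memory replica; the real journal is a HomeStore primitive.
struct ToyReplica {
    uint64_t term{0};
    int64_t commit_lsn{-1};
    int64_t last_append_lsn{-1};
    std::map<int64_t, std::string> journal;  // dLSN -> payload

    ToyStatus write(uint64_t io_term, int64_t lsn, std::string data) {
        assert(lsn >= 0);                                   // assumption: -1 is reserved as "nothing yet"
        if (io_term != term) return ToyStatus::ETerm;       // step 1: reject stale term
        journal[lsn] = std::move(data);                     // step 2: slots may arrive out of order
        last_append_lsn = std::max(last_append_lsn, lsn);   // step 3
        return ToyStatus::Ok;                               // step 4: ACK; commit_lsn is untouched
    }
};
```

Note that `commit_lsn` never changes here, which mirrors the rule that a write only becomes readable after a later `commit`.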

---

### `read`

```cpp
async_result<sisl::sg_list>
read(uint64_t term, int64_t min_commit_lsn, lba_t lba, lba_count_t len);
```

If `state.commit_lsn < min_commit_lsn`: commit inline up to `min_commit_lsn` before
serving. Then read from the committed state machine (LBA index → block read).

Rejects if `term != state.term`.

---

### `commit`

```cpp
async_status
commit(uint64_t term, int64_t lsn);
```

Advances `commit_lsn` to `lsn` by applying all journal entries in `(commit_lsn, lsn]` to the
state machine (update LBA index, finalize block map). After this call, the LBAs covered by
those entries are readable.
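
A minimal sketch of `commit`, together with the inline commit that `read` performs when `min_commit_lsn` is ahead of the local `commit_lsn`. All names (`ToyEntry`, `ToyStateMachine`) are hypothetical, the term check is left out, and two behaviours are assumptions rather than documented rules: absent slots in the range are skipped, and a `commit` at or below the current `commit_lsn` is a no-op.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical journal entry: one LBA and a toy integer payload.
struct ToyEntry {
    uint64_t lba;
    int value;
};

struct ToyStateMachine {
    int64_t commit_lsn{-1};
    std::map<int64_t, ToyEntry> journal;  // appended, possibly uncommitted
    std::map<uint64_t, int> lba_index;    // committed, readable state

    // Apply every entry in (commit_lsn, lsn] in dLSN order.
    void commit(int64_t lsn) {
        assert(lsn >= -1);
        if (lsn <= commit_lsn) return;  // assumption: commit_lsn never moves backward
        for (auto it = journal.upper_bound(commit_lsn);
             it != journal.end() && it->first <= lsn; ++it) {
            lba_index[it->second.lba] = it->second.value;
        }
        commit_lsn = lsn;
    }

    // Serve a read, committing inline first if this replica is behind.
    bool read(int64_t min_commit_lsn, uint64_t lba, int& out) {
        if (commit_lsn < min_commit_lsn) commit(min_commit_lsn);
        auto it = lba_index.find(lba);
        if (it == lba_index.end()) return false;  // LBA never written
        out = it->second;
        return true;
    }
};
```

Applying entries in dLSN order means the last write to an LBA wins, regardless of the order in which the slots were appended.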

---

### `keep_alive`

```cpp
async_status
keep_alive(int64_t commit_lsn);
```

Behaves like `commit` and also resets the client-timeout watchdog. The client sends it
periodically, even when idle, so that the server does not trigger `SyncRSCommitLSN`.
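
The watchdog half of `keep_alive` could look roughly like the sketch below. `ToyWatchdog` is a hypothetical name, the timeout value is chosen by the caller, and the clock reading is passed in explicitly so the logic can be exercised without sleeping. What the server does on expiry (proposing `SyncRSCommitLSN`) is not modelled.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical client-timeout watchdog; not a HomeBlocks type.
struct ToyWatchdog {
    using Clock = std::chrono::steady_clock;

    Clock::duration timeout;      // how long the client may stay silent
    Clock::time_point last_seen;  // time of the most recent keep_alive

    // Called from keep_alive (after the commit part has run).
    void on_keep_alive(Clock::time_point now) {
        assert(now >= last_seen);  // steady clocks do not go backward
        last_seen = now;
    }

    // True once the client has been silent for longer than the timeout.
    bool expired(Clock::time_point now) const {
        return now - last_seen > timeout;
    }
};
```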

---

### `get_lsns`

```cpp
struct LSNPair { int64_t commit_lsn; int64_t last_append_lsn; };

async_result<LSNPair>
get_lsns(volume_id_t vol_id);
```

Returns `{commit_lsn, last_append_lsn}` for the local partition. Used by peers via
`GetRSCommitLSN` during login and by the leader during `SyncRSCommitLSN`.

---

### `truncate`

```cpp
async_status
truncate(int64_t lsn);
```

Drops all journal entries with dLSN > `lsn`. Called when a replica discovers that it holds
entries from a previous term that did not reach quorum (a new `InternalLogin` forces
truncation of stale appended entries). Also called during login to clean up followers
whose `last_append_lsn > agreed_dLSN`.
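
A sketch of the truncation rule on a toy journal. `ToyJournal` is hypothetical, and two details are assumptions not stated in this document: the sketch refuses to truncate below `commit_lsn`, and it clamps `last_append_lsn` to `lsn` afterwards.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical in-memory journal keyed by dLSN.
struct ToyJournal {
    int64_t commit_lsn{-1};
    int64_t last_append_lsn{-1};
    std::map<int64_t, std::string> slots;

    // Drop every slot with dLSN > lsn; returns false if the request is refused.
    bool truncate(int64_t lsn) {
        assert(lsn >= -1);
        if (lsn < commit_lsn) return false;                // assumption: committed data is never dropped
        slots.erase(slots.upper_bound(lsn), slots.end());  // upper_bound: first key > lsn
        last_append_lsn = std::min(last_append_lsn, lsn);  // assumption: clamp, never raise
        return true;
    }
};
```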

---

## Internal / RAFT-entry API

### `append` (propose SyncRSCommitLSN)

```cpp
async_status
append(int64_t sync_to, uint64_t client_token);
```

Proposes a `SyncRSCommitLSN` RAFT entry with value `sync_to`. Callable by the leader's
watchdog or by the client-facing `SyncRSCommitLSN` RPC. `client_token` is embedded so
followers can verify the entry belongs to the current session.

---

### `fetch_data` (for peer resync)

```cpp
async_result<std::vector<JournalSlot>>
fetch_data(std::vector<int64_t> lsns);
```

Returns raw journal data for the requested LSNs. Called server-to-server (not from the
client) during `SyncRSCommitLSN` apply when a replica discovers it is behind.

---

### `get_rs_commit_lsn` (for peer query)

```cpp
async_result<LSNPair>
get_rs_commit_lsn();
```

Alias of `get_lsns` exposed to peer servers during the `GetRSCommitLSN` broadcast.

---

## RAFT state machine entries

These are internal RAFT log entry types, not part of the public API.

### `SyncRSCommitLSN`

```
payload: { rs_commit_lsn: int64 }
```

On RAFT apply (each replica):
1. If `last_append_lsn < rs_commit_lsn`: call `fetch_data(missing)` from a peer.
2. `commit_lsn = rs_commit_lsn`.
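
The apply steps above can be sketched as follows. `ToySlots`, `missing_lsns`, and `apply_sync` are hypothetical names, and the peer is modelled as a plain map instead of a `fetch_data` RPC. The sketch makes two assumptions beyond the text: it scans the whole range `(commit_lsn, rs_commit_lsn]` for absent slots (out-of-order writes can leave holes below `last_append_lsn`), and it leaves a slot as a hole when the peer does not have it either.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using ToySlots = std::map<int64_t, std::string>;  // dLSN -> payload

// dLSNs in (commit_lsn, rs_commit_lsn] that the local journal does not hold.
std::vector<int64_t> missing_lsns(const ToySlots& local, int64_t commit_lsn,
                                  int64_t rs_commit_lsn) {
    std::vector<int64_t> out;
    for (int64_t l = commit_lsn + 1; l <= rs_commit_lsn; ++l) {
        if (local.find(l) == local.end()) out.push_back(l);
    }
    return out;
}

// Apply SyncRSCommitLSN on one replica; returns the new commit_lsn.
int64_t apply_sync(ToySlots& local, int64_t commit_lsn, const ToySlots& peer,
                   int64_t rs_commit_lsn) {
    assert(rs_commit_lsn >= -1);
    if (rs_commit_lsn <= commit_lsn) return commit_lsn;  // assumption: never move backward
    for (int64_t l : missing_lsns(local, commit_lsn, rs_commit_lsn)) {
        auto it = peer.find(l);                          // stands in for fetch_data
        if (it != peer.end()) local[l] = it->second;     // otherwise the slot stays a hole
    }
    return rs_commit_lsn;                                // step 2
}
```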

### `InternalLogin`

```
payload: { client_token: uint64, term: uint64 }
```

On RAFT apply (each replica):
1. `state.client_token = client_token`
2. `state.term = term`
3. From this point, reject writes/reads whose `term` field != `state.term`.
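
A small sketch of the apply step and the resulting term gate. `ToySession` is a hypothetical name, and the assertion that each login carries a strictly larger term is an assumption based on the glossary entry for **term**.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-partition session state updated by InternalLogin.
struct ToySession {
    uint64_t client_token{0};
    uint64_t term{0};

    // Runs on every replica when the InternalLogin RAFT entry is applied.
    void apply_internal_login(uint64_t token, uint64_t new_term) {
        assert(new_term > term);  // assumption: terms strictly increase per login
        client_token = token;     // step 1
        term = new_term;          // step 2
    }

    // Step 3: an IO is admitted only if it carries the current term.
    bool admits(uint64_t io_term) const { return io_term == term; }
};
```

After a second login is applied, IOs still carrying the previous term fail the `admits` check, which is how a superseded client is fenced off.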

---

## Replacing the existing API

`CraftReplDev` replaces the existing solo `ReplDev` for all volumes. The old
`async_read` / `async_write` surface in `home_blocks.hpp` (consumed by `ScstConnector`)
is superseded. `CraftConnector` is the new frontend; `ScstConnector` is removed.

| Old API (removed) | CRAFT replacement |
|---|---|
| `async_write(vol, addr, sgs)` | `write(term, lsn, glsn, lba, len, data)` |
| `async_read(vol, addr, sgs)` | `read(term, min_commit_lsn, lba, len)` |
| `async_unmap` (stub) | No equivalent in CRAFT v1 |
| — | `login`, `commit`, `keep_alive`, `truncate`, `fetch_data`, `get_lsns`, `append` |