diff --git a/docs/ru/13-architecture.md b/docs/ru/13-architecture.md new file mode 100644 index 00000000..cd59e81b --- /dev/null +++ b/docs/ru/13-architecture.md @@ -0,0 +1,160 @@ +# Cluster-Mode Architecture + +## Components overview + +In cluster mode, seq-db consists of two main components: + - seq-db store (seq-db instance running with `--mode=store flag`) + - seq-db proxy (seq-db instance running with `--mode=proxy flag`). + +### seq-db store +seq-db store is the stateful storage component, that keeps all the +written documents and handles both reads and writes. +All data written into seq-db eventually makes its way to one or multiple stores. + + +#### Key characteristics +- Deployed as k8s `Statefulset` +- Share-nothing architecture: a seq-db store instance is unaware of any other stores. +- Maintains in-memory and on-disk inverted indexes, allowing search on indexed fields. + + +#### File layout +seq-db store keeps all document data in three file types: + +| File type | Purpose | +|-----------|------------------------------------------------| +| `.docs` | Stores compressed batches of raw log documents | +| `.meta` | Tokenized metadata stream (used for recovery) | +| `.index` | On-disk inverted index | + + +Because the dataset is stored in these three file types, moving or restoring a +shard is straightforward: simply `cp` / `rsync` the directory +to the target node and start the pod. + +Read more about file types and their internal structure [here](internal/fractions.md). + +#### Durability +A write operation is acknowledged only after the payload is safely persisted: + +``` +write, fsync # .meta file +write, fsync # .data file +``` +That is, two write system calls followed by two fsync +calls—guaranteeing the data survives a node +crash or restart before the client receives a success response. +Indexing occurs asynchronously, so it usually takes under 1 +second before the newly written documents are available for search queries. +Note that this value may be slightly higher when bulk load spikes happen + +### seq-db proxy +seq-db proxy is a stateless coordinator for all read & write traffic. +It maintans a user-defined cluster topology, and allows changes in read-write +traffic distribution without changes to the stateful components + + +#### Key characteristics +- Deployed as k8s `Deployment` +- Performs logical replication between stores +- Routes traffic between storage tiers (hot/cold stores) + +seq-db proxy tokenizes every incoming document +and compresses batches with zstd / lz4 +before sending batches to seq-db stores. + +### Read-path & write-path (rf=2) +Let's take a look at an example architecture with 4 seq-db shards and replication-factor=2 +(each log must be stored in two separate seq-db stores). +Note that replicas of shard can be located in different availability zones. + +### Write-path +The write commits only after seq-db proxy receives an ack **from all replicas of the addressed shard**. + +```mermaid +sequenceDiagram + + participant Client + participant Proxy as seq-db proxy + + box Shard1 + participant A as seq-db store
shard1 replica A + participant B as seq-db store
shard1 replica B + end + + box Shard2 + participant C as seq-db store
shard2 replica A + participant D as seq-db store
shard2 replica B + end + + Note over Proxy,B: seq-db proxy chooses a random shard + Client->>Proxy: write(batch1) + Proxy->>A: write(batch1) + Proxy->>B: write(batch1) + + A-->>Proxy: ack + B-->>Proxy: ack + Proxy-->>Client: ack + + Note over Proxy,B: the write is done if acks received
from both replicas of a shard + + Client->>Proxy: write(batch2) + Proxy->>C: write(batch2) + Proxy->>D: write(batch2) + + C-->>Proxy: ack + D-->>Proxy: ack + + Proxy-->>Client: ack +``` + +### Read-path +While the written document must be acknowledged by all replicas +of a shard, +a read is successful when **at least one replica of each shard** returns a response. + +```mermaid +sequenceDiagram + + participant Client + participant Proxy as seq-db proxy + + box Shard1 + participant A as seq-db store
shard1 replica A + participant B as seq-db store
shard1 replica B + end + + box Shard2 + participant C as seq-db store
shard2 replica A + participant D as seq-db store
shard2 replica B + end + + Note over Proxy,C: seq-db proxy chooses
a random replica of each shard + Client->>Proxy: request 1 + Proxy->>A: request 1 + Proxy->>C: request 1 + + A-->>Proxy: response 1 (shard1 replica A) + C-->>Proxy: response 1 (shard2 replica A) + Note over Proxy: seq-db proxy merges the returned responses + Proxy-->>Client: merge(res1_s1rA, res1_s2rA) + + + Client->>Proxy: request 2 + Proxy->>B: request 2 + Proxy->>D: request 2 + + B-->>Proxy: response 2 (shard1 replica B) + D-->>Proxy: response 2 (shard2 replica B) + + Proxy-->>Client: merge(res2_s1rB, res2_s2rB) +``` + +## Notes about replication & consistency +seq-db doesn't have any mechanism to keep replicas consistent between each other. +That is, if a write operation succeeds on a replica of a shard and fails on another replica, the replicas +would be out of sync and won't be (automatically) synced. +The only given guarantee is that a write operation will succeed only having at least RF replicas saved on disk. +This optimization allows seq-db to have a higher than alternatives ingestion throughput +with the obvious price of the possible inconsistencies of retrieval and aggregation queries. +seq-db was designed as a database for logs/traces with this tradeoff in mind. \ No newline at end of file