Merged
312 changes: 284 additions & 28 deletions ARCHITECTURE.md
@@ -1,37 +1,293 @@
# Architecture: @git-stunts/cas
# Architecture: `git-cas`

Content Addressable Store (CAS) for Git.
This document is the high-level map of the shipped `git-cas` system.

## 🧱 Core Concepts
It is intentionally not a full API reference. For command and method details,
see [docs/API.md](./docs/API.md). For crypto and security guidance, see
[SECURITY.md](./SECURITY.md).

### Domain Layer (`src/domain/`)
- **Value Objects**: `Manifest` and `Chunk` represent the structured metadata of an asset.
- **Services**: `CasService` implements streaming chunking, encryption (AES-256-GCM), and manifest generation.
## System Model

### Ports Layer (`src/ports/`)
- **GitPersistencePort**: Defines how blobs and trees are saved to Git.
- **CodecPort**: Defines how manifests are encoded (JSON, CBOR).
`git-cas` uses Git as the storage substrate, not as a user-facing abstraction.

### Infrastructure Layer (`src/infrastructure/`)
- **Adapters**: `GitPersistenceAdapter` implementation using `@git-stunts/plumbing`.
- **Codecs**: `JsonCodec` and `CborCodec`.
At a high level, the system does four things:

## 🚀 Scalability & Limits
1. turns input bytes into chunk blobs stored in Git
2. records how to rebuild those bytes in a manifest
3. emits a Git tree that keeps the manifest and chunk blobs reachable
4. optionally indexes trees by slug through a GC-safe vault ref

- **Chunk Size**: Configurable, default 256KB. Minimum 1KB.
- **Streaming**: Encryption and chunking are fully streamed. Memory usage is constant (O(1)) relative to file size.
- **Manifest Limit**: Currently, all chunk metadata is stored in a single flat `manifest` blob. For extremely large files (>100GB), the manifest itself may become unwieldy (linear growth). Future iterations may require a Merkle Tree structure for the manifest itself.
The same core supports:

## 📂 Directory Structure
- a library facade in [index.js](./index.js)
- a human CLI and TUI under `bin/`
- a machine-facing agent CLI under `bin/agent/`

```text
src/
├── domain/
│   ├── schemas/        # Zod and JSON schemas
│   ├── services/       # CasService
│   └── value-objects/  # Manifest, Chunk
├── infrastructure/
│   ├── adapters/       # GitPersistenceAdapter
│   └── codecs/         # JsonCodec, CborCodec
└── ports/              # GitPersistencePort, CodecPort
```
Those surfaces are different contracts over one shared core.

## Layer Model

### Facade

The public entrypoint is [index.js](./index.js).

`ContentAddressableStore` is a high-level facade that:

- lazily initializes the underlying services
- selects the appropriate crypto adapter for the current runtime
- resolves chunking strategy configuration
- wires persistence, ref, codec, crypto, chunking, and observability adapters
- exposes convenience methods like `storeFile()` and `restoreFile()`

The facade is orchestration glue. It is not the storage engine itself.

### Domain

The domain lives under `src/domain/`.

Current key domain pieces:

- `Manifest` and `Chunk`
  - value objects that describe stored content and chunk metadata
- `CasService`
  - the main content orchestration service
  - handles store, restore, tree creation, manifest reads, inspection, and
    recipient/key operations
- `KeyResolver`
  - resolves key sources, passphrase-derived keys, and envelope recipient DEK
    wrapping and unwrapping
- `VaultService`
  - manages the GC-safe vault ref and its commit-backed slug index
- `rotateVaultPassphrase`
  - coordinates vault-wide passphrase rotation across existing entries
- `CasError`
  - the canonical domain error type with stable codes and metadata

Public API boundary:

- the package entry re-exports `Manifest`, `Chunk`, `CasService`, and
`VaultService`
- `KeyResolver`, `rotateVaultPassphrase`, and `CasError` are internal domain
implementation details, even though they are important architectural pieces

`CasService` is still the central orchestration unit for content flows. That is
current architecture truth, not a future-state claim.

### Ports

The ports live under `src/ports/`.

They define the seams the domain depends on:

- `GitPersistencePort`
  - blob and tree read/write operations
- `GitRefPort`
  - ref resolution, commit creation, and compare-and-swap ref updates
- `CodecPort`
  - manifest encoding and decoding
- `CryptoPort`
  - hashing, encryption, decryption, random bytes, and KDF operations
- `ChunkingPort`
  - strategy interface for fixed-size and content-defined chunking
- `ObservabilityPort`
  - metrics, logs, and spans without binding the domain to Node event APIs
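
To make the idea of a chunking seam concrete, here is a minimal sketch of two strategies behind one interface. The `chunk(bytes)` method name, the factory names, and the marker-based cut rule are all assumptions made for illustration; the marker rule is only a toy stand-in for content-defined chunking and is not a rolling-hash CDC, nor the shipped `ChunkingPort` contract.

```javascript
// Hypothetical strategy interface: chunk(bytes) yields Buffer slices in order.

// Fixed-size strategy: cut every `chunkSize` bytes.
const fixedChunker = (chunkSize) => ({
  *chunk(bytes) {
    for (let offset = 0; offset < bytes.length; offset += chunkSize) {
      yield bytes.subarray(offset, offset + chunkSize);
    }
  },
});

// Toy content-defined strategy: cut after every byte equal to `marker`.
// Boundaries depend on the content, not on absolute offsets.
const markerChunker = (marker) => ({
  *chunk(bytes) {
    let start = 0;
    for (let i = 0; i < bytes.length; i += 1) {
      if (bytes[i] === marker) {
        yield bytes.subarray(start, i + 1);
        start = i + 1;
      }
    }
    if (start < bytes.length) yield bytes.subarray(start);
  },
});

// Caller code depends only on the shared interface, not on a concrete strategy.
function chunkSizes(chunker, bytes) {
  return [...chunker.chunk(bytes)].map((chunk) => chunk.length);
}
```

With the input `aa\nbbbb\nc`, the fixed strategy with a size of 4 cuts at byte offsets, while the marker strategy cuts after each newline, so the caller sees different chunk boundaries without changing its own code.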

### Infrastructure

The infrastructure layer lives under `src/infrastructure/`.

Current shipped adapters include:

- `GitPersistenceAdapter`
- `GitRefAdapter`
- `NodeCryptoAdapter`
- `BunCryptoAdapter`
- `WebCryptoAdapter`
- `JsonCodec`
- `CborCodec`
- `FixedChunker`
- `CdcChunker`
- `SilentObserver`
- `EventEmitterObserver`
- `StatsCollector`

There are also small adapter helpers such as:

- `createCryptoAdapter`
  - runtime-adaptive crypto selection
- `resolveChunker`
  - chunker construction from config
- `FileIOHelper`
  - file-backed convenience helpers for the facade

## Storage Model

### Chunks

Stored content is broken into chunks and written as Git blobs.

The manifest records the authoritative ordered chunk list, including:

- chunk index
- chunk size
- SHA-256 digest
- backing blob OID

The manifest, not the tree layout, is the source of truth for reconstruction
order and repeated chunk occurrences.
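
As a rough illustration of what such a chunk list could look like, the sketch below cuts a buffer into fixed-size chunks and builds one record per chunk. The field names (`index`, `size`, `digest`, `blob`) and both helper functions are assumptions for this example, not the shipped manifest schema. The blob OID calculation follows Git's standard blob hashing for SHA-1 repositories.

```javascript
// Sketch only: fixed-size chunking plus manifest-style chunk records.
// Field names are illustrative assumptions, not the real schema.
const { createHash } = require('node:crypto');

// Git's object ID for a blob in a SHA-1 repository:
// sha1("blob <byte length>\0" + content).
function gitBlobOid(bytes) {
  return createHash('sha1')
    .update(`blob ${bytes.length}\0`)
    .update(bytes)
    .digest('hex');
}

function describeChunks(input, chunkSize) {
  const records = [];
  for (let offset = 0, index = 0; offset < input.length; offset += chunkSize, index += 1) {
    const bytes = input.subarray(offset, offset + chunkSize);
    records.push({
      index,                                                    // position in the ordered list
      size: bytes.length,                                       // chunk size in bytes
      digest: createHash('sha256').update(bytes).digest('hex'), // SHA-256 of the chunk
      blob: gitBlobOid(bytes),                                  // backing blob OID
    });
  }
  return records;
}

const chunkRecords = describeChunks(Buffer.from('abcdefghij'), 4);
```

Here `chunkRecords` holds three records with sizes 4, 4, and 2. Two chunks with identical bytes would share both `digest` and `blob`, which is why the ordered list, rather than the set of blobs, has to carry order and multiplicity.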

### Manifests

Manifests are encoded through the configured codec:

- JSON by default
- CBOR when configured

Small and medium assets use a single manifest blob.

Large assets already use Merkle-style manifests. When chunk count exceeds
`merkleThreshold`, `createTree()` writes:

- a root manifest with `version: 2`
- an empty top-level `chunks` array
- `subManifests` references pointing at additional manifest blobs

`readManifest()` resolves those sub-manifests transparently and reconstructs the
flat logical chunk list for callers.

Merkle manifests are shipped behavior, not future work.
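
The split and the transparent read can be sketched as follows. The names `merkleThreshold`, `version`, `chunks`, and `subManifests` come from the description above; everything else is an assumption for illustration, including the in-memory `blobs` map, the shape of each reference, the `version: 1` value for flat manifests, and the choice to group `merkleThreshold` chunks per sub-manifest.

```javascript
// Sketch only: split a flat chunk list into sub-manifests and flatten it back.
// `blobs` is a hypothetical in-memory stand-in for Git blob storage.
function writeManifest(chunks, merkleThreshold, blobs) {
  if (chunks.length <= merkleThreshold) {
    // Small asset: a single flat manifest (version number assumed).
    return { version: 1, chunks, subManifests: [] };
  }
  const subManifests = [];
  for (let start = 0; start < chunks.length; start += merkleThreshold) {
    const id = `sub-manifest-${subManifests.length}`;
    blobs.set(id, { chunks: chunks.slice(start, start + merkleThreshold) });
    subManifests.push({ id, startIndex: start });
  }
  // Large asset: the root keeps no chunks of its own, only references.
  return { version: 2, chunks: [], subManifests };
}

// Resolve sub-manifests in order and return the flat logical chunk list.
function readChunks(manifest, blobs) {
  if (manifest.subManifests.length === 0) return manifest.chunks;
  return manifest.subManifests.flatMap((ref) => blobs.get(ref.id).chunks);
}
```

A caller of `readChunks` sees the same ordered list in both cases, which is the property the transparent resolution is meant to preserve.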

### Trees

`createTree()` emits a Git tree that keeps the asset reachable.

For non-Merkle assets the tree contains:

- `manifest.<ext>`
- one blob entry per unique chunk digest, in first-seen order

For Merkle assets the tree contains:

- `manifest.<ext>`
- `sub-manifest-<n>.<ext>` blobs
- one blob entry per unique chunk digest, in first-seen order

Chunk blobs are deduplicated at the tree-entry level by digest. The manifest
still remains authoritative for repeated-chunk order and multiplicity.
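
A minimal sketch of that tree-entry deduplication is shown below. It assumes chunk records with `digest` and `blob` fields and assumes the entry is named after the digest; the real entry naming is not stated here.

```javascript
// Sketch only: one tree entry per unique chunk digest, in first-seen order.
// Entry naming by digest is an assumption made for illustration.
function uniqueChunkEntries(chunks) {
  const seen = new Set();
  const entries = [];
  for (const chunk of chunks) {
    if (seen.has(chunk.digest)) continue; // repeated chunk: already reachable
    seen.add(chunk.digest);
    entries.push({ name: chunk.digest, oid: chunk.blob });
  }
  return entries;
}
```

For a chunk sequence with digests `a, b, a, c`, this yields entries for `a, b, c`, while the manifest still lists all four occurrences.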

### Vault

The vault is a GC-safe slug index rooted at `refs/cas/vault`.

It is implemented as a commit chain. Each vault commit points to a tree
containing:

- one tree entry per stored slug, mapped to that asset's tree OID
- `.vault.json` metadata for vault configuration

`VaultService` owns:

- slug validation
- vault initialization
- add, update, list, resolve, remove, and history-oriented state reads
- compare-and-swap ref updates with retry on conflict
- vault metadata validation

Vault metadata can include passphrase-derived encryption configuration and
related counters, but the vault still fundamentally acts as the durable
slug-to-tree index for stored assets.

## Core Flows

### Store

The store path looks like this:

1. resolve key source or recipient envelope settings
2. optionally gzip the input stream
3. choose a chunking strategy
4. optionally encrypt the processed stream
5. write chunk blobs to Git
6. build a manifest
7. optionally emit a Git tree and add it to the vault

Important current behavior:

- encryption and recipient envelope setup are mutually exclusive
- CDC is supported, but encryption removes CDC dedupe benefits because
ciphertext is pseudorandom
- observability ports receive metrics and warnings throughout the flow
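
The dedupe point can be illustrated with a short sketch: encrypting the same plaintext twice under the same key but with fresh random nonces produces different ciphertext, so content digests no longer match. This assumes AES-256-GCM with a 12-byte random nonce; the helper names are hypothetical and this is not the project's actual encryption path.

```javascript
// Sketch only: why identical plaintext does not dedupe after encryption.
const { createCipheriv, createHash, randomBytes } = require('node:crypto');

function encryptChunk(key, plaintext) {
  const iv = randomBytes(12); // fresh nonce for every encryption
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, ciphertext, tag: cipher.getAuthTag() };
}

const sha256Hex = (bytes) => createHash('sha256').update(bytes).digest('hex');

const key = randomBytes(32);
const plaintext = Buffer.from('identical chunk contents');
const firstRun = encryptChunk(key, plaintext);
const secondRun = encryptChunk(key, plaintext);

// Same plaintext, but the ciphertext digests differ (with overwhelming probability).
const ciphertextDigestsMatch =
  sha256Hex(firstRun.ciphertext) === sha256Hex(secondRun.ciphertext);
```

Because a content-addressed store keys blobs by their bytes, the two encrypted results land as two distinct blobs even though the underlying data is the same.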

### Restore

The restore path:

1. reads a manifest from a tree or receives one directly
2. resolves decryption key material if needed
3. reads and verifies chunk blobs by SHA-256 digest
4. either streams plaintext chunks directly or buffers for decrypt/decompress
5. returns bytes or writes them to disk through the facade helper

For unencrypted and uncompressed assets, restore can operate as true chunk
streaming. Encrypted or compressed restores currently use a buffered path with
explicit size guards.
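
The verify-then-stream step for the plain case can be sketched like this. The `readBlob` callback, the record fields, and the error message are assumptions for illustration; the shipped code reports failures through its own error type.

```javascript
// Sketch only: stream chunks in manifest order, verifying each by SHA-256.
const { createHash } = require('node:crypto');

const sha256Hex = (bytes) => createHash('sha256').update(bytes).digest('hex');

// `readBlob(oid)` is a hypothetical function returning the blob bytes.
function* restoreChunks(manifestChunks, readBlob) {
  for (const entry of manifestChunks) {
    const bytes = readBlob(entry.blob);
    if (sha256Hex(bytes) !== entry.digest) {
      throw new Error(`chunk ${entry.index} failed SHA-256 verification`);
    }
    yield bytes; // only verified bytes reach the caller
  }
}
```

Because the generator yields one verified chunk at a time, a caller can write each chunk out without holding the whole asset in memory, which matches the streaming case described above.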

### Vault Mutation

Vault mutation is separate from the core chunk store.

`VaultService` updates `refs/cas/vault` through compare-and-swap semantics,
creating a new commit for each successful mutation and retrying on conflicts.

That keeps slug resolution durable across `git gc` while leaving the content
store itself in ordinary Git objects.
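
The compare-and-swap loop can be sketched with an in-memory ref and commit map. `MemoryRef`, `mutateVault`, the commit shape, and the retry limit are hypothetical names and values chosen for this example; they are not the `VaultService` or `GitRefPort` API.

```javascript
// Sketch only: optimistic vault update with compare-and-swap and retry.
class MemoryRef {
  constructor() {
    this.value = null; // null means the ref does not exist yet
  }
  read() {
    return this.value;
  }
  // Move the ref only if it still points at `expected`.
  compareAndSwap(expected, next) {
    if (this.value !== expected) return false;
    this.value = next;
    return true;
  }
}

function mutateVault(ref, commits, mutate, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    const parent = ref.read();
    const entries = parent === null ? {} : commits.get(parent).entries;
    const commit = { parent, entries: mutate({ ...entries }) };
    const oid = `commit-${commits.size + 1}`;
    commits.set(oid, commit); // a losing commit simply stays unreachable
    if (ref.compareAndSwap(parent, oid)) return oid;
    // Another writer moved the ref: re-read the head and try again.
  }
  throw new Error('vault update conflicted too many times');
}
```

Each successful call adds exactly one commit to the chain, and a writer that loses the race rebuilds its change on top of the new head instead of overwriting it.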

## Runtime Model

`git-cas` targets multiple JavaScript runtimes.

The core architecture is designed so the domain does not care whether it is
running on Node, Bun, or a Web Crypto-capable environment. Runtime differences
are isolated in the infrastructure adapters and selected by the facade or CLI
bootstrapping code.

The repo enforces this with a real Node, Bun, and Deno test matrix.
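
As a hedged illustration of runtime-adaptive selection, the sketch below picks an adapter family from the globals a runtime exposes. The function name, the returned labels, and the mapping of Deno to a Web Crypto adapter are assumptions; the real `createCryptoAdapter` logic may differ.

```javascript
// Sketch only: choose a crypto adapter family from runtime globals.
// `globals` is injectable so the logic can be exercised without each runtime.
function detectCryptoRuntime(globals = globalThis) {
  // Check Bun first: Bun also reports a Node version via process.versions.node.
  if (typeof globals.Bun !== 'undefined') return 'bun';
  // Assumption: Deno is served by a Web Crypto based adapter.
  if (typeof globals.Deno !== 'undefined') return 'web';
  if (globals.process?.versions?.node) return 'node';
  if (globals.crypto?.subtle) return 'web';
  throw new Error('no supported crypto runtime detected');
}
```

Keeping the detection in one small function means the domain never inspects runtime globals itself; it only receives whichever adapter was selected.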

## Honest Pressure Points

The main architectural pressure point today is `CasService`.

It already benefits from some meaningful extractions:

- `KeyResolver`
- `VaultService`
- `rotateVaultPassphrase`
- chunker and crypto adapter factories
- file I/O helpers

But it still owns a broad content-orchestration surface:

- store and restore
- manifest and tree handling
- lifecycle inspection helpers
- recipient mutation and key rotation

That makes `CasService` a good candidate for future decomposition work, but the
split has not been done yet.

## Reading This With Other Docs

Use this document for the current system shape.

Use these docs for adjacent truth:

- [README.md](./README.md)
  - positioning, feature overview, and release highlights
- [docs/API.md](./docs/API.md)
  - library and CLI reference
- [SECURITY.md](./SECURITY.md)
  - crypto and security guidance
- [docs/THREAT_MODEL.md](./docs/THREAT_MODEL.md)
  - threat model, assets, and trust boundaries
- [WORKFLOW.md](./WORKFLOW.md)
  - current planning and delivery model
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -13,6 +13,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **`git cas agent rotate`** — added a machine-facing rotation flow so Relay can rotate recipient keys by slug or detached tree OID and expose the resulting tree and vault side effects explicitly.
- **`git cas agent vault rotate`** — added a machine-facing vault passphrase rotation flow so Relay can rotate encrypted vault state with explicit commit, KDF, and rotated/skipped-entry results.
- **`git cas agent vault init|remove`** — added machine-facing vault lifecycle commands so Relay can initialize encrypted or plaintext vaults and remove entries without scraping human CLI output.
- **Threat model doc** — added [docs/THREAT_MODEL.md](./docs/THREAT_MODEL.md) as the canonical statement of attacker models, trust boundaries, exposed metadata, and explicit non-goals.
- **Workflow model** — added [WORKFLOW.md](./WORKFLOW.md), explicit legends/backlog/invariants directories, and a cycle-first planning model for fresh work.
- **Review automation baseline** — added `.github/CODEOWNERS` with repo-wide ownership for `@git-stunts`.
- **Release runbook** — added `docs/RELEASE.md` and linked it from `CONTRIBUTING.md` as the canonical patch-release workflow.
@@ -23,6 +24,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Changed

- **Planning lifecycle clarified** — live backlog items now exclude delivered work, archive directories now hold retired backlog history and reserved retired design space, landed cycle docs use explicit landed status, and the design/backlog indexes now reflect current truth instead of stale activity.
- **Architecture map repaired** — [ARCHITECTURE.md](./ARCHITECTURE.md) now describes the shipped system instead of an older flat-manifest-only model, including Merkle manifests, the extracted `VaultService` and `KeyResolver`, current ports/adapters, and the real storage layout for trees and the vault.
- **Architecture navigation clarified** — [ARCHITECTURE.md](./ARCHITECTURE.md) now distinguishes the public package boundary from internal domain helpers and links directly to [docs/THREAT_MODEL.md](./docs/THREAT_MODEL.md) as adjacent truth.
- **GitHub Actions runtime maintenance** — CI and release workflows now run on `actions/checkout@v6` and `actions/setup-node@v6`, clearing the Node 20 deprecation warnings from GitHub-hosted runners.
- **Ubuntu-based Docker test stages** — the local/CI Node, Bun, and Deno test images now build on `ubuntu:24.04`, copying runtime binaries from the official upstream images instead of inheriting Debian-based runtime images directly, and the final test commands now run as an unprivileged `gitstunts` user.
- **Test conventions expanded** — `test/CONVENTIONS.md` now documents Git tree filename ordering, Docker-only integration policy, pinned integration `fileParallelism: false`, and direct-argv subprocess helpers.