diff --git a/doc/plans/canonical-tar.md b/doc/plans/canonical-tar.md new file mode 100644 index 00000000..1249077e --- /dev/null +++ b/doc/plans/canonical-tar.md @@ -0,0 +1,186 @@ +# Canonical Tar Format + +This document defines a canonical, reproducible tar serialization for composefs filesystem trees. This is a prerequisite for pushing images after an [incremental pull](incremental-pulls.md) and complements the [standardized EROFS metadata](standardized-erofs-meta.md) work. + +## Motivation + +In the [incremental pull](incremental-pulls.md) model, a composefs-aware client fetches only the content objects it doesn't already have, using the EROFS metadata as a table of contents. The client does not download or store the original tar layer bytes. To push this image to another registry, or to verify the OCI `diff_id` if needed, the client must be able to regenerate a byte-identical tar stream from the EROFS metadata and local object store. + +Without a canonical tar format, the regenerated tar will almost certainly differ from the original (different header encoding, different entry ordering, different padding), producing different digests. + +## Conceptual Model + +The canonical tar format is defined as a mapping from composefs dumpfile to tar. The dumpfile is a human-readable textual format that represents a complete filesystem tree and can be converted to/from EROFS. By defining dumpfile-to-tar, we complete a triangle of deterministic conversions: + +``` +dumpfile ──→ canonical tar + ↑ │ + │ ↓ + └── EROFS ←────┘ + (via standardized-erofs-meta.md) +``` + +A client that has an EROFS can convert to dumpfile, then to canonical tar. A builder that has a tar can convert to dumpfile, then to EROFS. + +## Specification + +### Header Format: pax (POSIX.1-2001) + +The canonical format uses pax extended headers exclusively. 
pax supports long filenames, large file sizes, nanosecond timestamps, arbitrary xattrs, and large uid/gid values without the ambiguities of GNU extensions.
+
+Each entry consists of:
+1. *(If pax records are needed)* A pax extended header entry (type `x`) followed by its data blocks
+2. The ustar header entry followed by any content data blocks
+
+The pax extended header entry's name is `PaxHeaders.0/<name>` where `<name>` is the entry's filename component (truncated to 100 bytes if necessary).
+
+### Global Header
+
+The archive begins with a single pax global extended header (typeflag `g`) containing one record:
+
+```
+composefs.canonical-tar=1
+```
+
+This allows any client to detect canonical tar format by reading the first entry. Non-composefs tools will ignore the unknown key. No other global extended headers are permitted in the archive.
+
+### Entry Ordering
+
+Entries appear in depth-first pre-order with children sorted by filename using byte-wise comparison. This matches the ordering produced by iterating a `BTreeMap`, which is the in-memory representation used by composefs.
+
+Example:
+```
+./
+./a/
+./a/x
+./a/y
+./b/
+./b/z
+./c
+```
+
+The root directory entry comes first. Directories are emitted before their children.
+
+### Path Encoding
+
+All paths are relative to the archive root, prefixed with `./`. Directories have a trailing `/`. For example, the dumpfile path `/usr/bin/sh` becomes `./usr/bin/sh` in the tar stream; the dumpfile path `/usr/lib/` becomes `./usr/lib/`.
+
+Paths that fit within 100 bytes are stored entirely in the ustar `name` field. Paths longer than 100 bytes use a pax `path` record; the ustar `name` field is filled with a truncated form and the ustar `prefix` field is left empty. The ustar prefix/name split is never used, as different implementations split at different `/` boundaries, making it a source of non-reproducibility.
+
+### Ustar Header Fields
+
+All header fields use the ustar format (magic `ustar\0`, version `00`).
+
+| Field | Size | Encoding | Notes |
+|-------|------|----------|-------|
+| name | 100 | Bytes, null-terminated | See path encoding above |
+| mode | 8 | Octal, zero-padded, null-terminated | Permission bits only (no file-type bits). E.g. `0000755\0` |
+| uid | 8 | Octal, zero-padded, null-terminated | Values > 2,097,151 overflow to pax |
+| gid | 8 | Octal, zero-padded, null-terminated | Values > 2,097,151 overflow to pax |
+| size | 12 | Octal, zero-padded, null-terminated | File content size. 0 for directories, symlinks, devices, fifos. Values > 8,589,934,591 (i.e. ≥ 8 GiB) overflow to pax |
+| mtime | 12 | Octal, zero-padded, null-terminated | Seconds since epoch. Values > 8,589,934,591 overflow to pax |
+| chksum | 8 | Octal, zero-padded, null-terminated + space | Unsigned sum of all header bytes with chksum field treated as spaces |
+| typeflag | 1 | ASCII | See entry types below |
+| linkname | 100 | Bytes, null-terminated | Symlink/hardlink target; longer targets use pax `linkpath` |
+| magic | 6 | `ustar\0` | |
+| version | 2 | `00` | |
+| uname | 32 | Empty (null-filled) | Not stored in EROFS; omitted |
+| gname | 32 | Empty (null-filled) | Not stored in EROFS; omitted |
+| devmajor | 8 | Octal, zero-padded, null-terminated | For block/char devices only; 0 otherwise |
+| devminor | 8 | Octal, zero-padded, null-terminated | For block/char devices only; 0 otherwise |
+| prefix | 155 | Empty (null-filled) | Never used; long paths use pax `path` instead |
+
+Unused header bytes are zero-filled.
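
The field table above can be read as a recipe for a single 512-byte header block. The following is a minimal sketch in Python, not the composefs or tar-core implementation: the function names `octal_field` and `build_ustar_header` are hypothetical, and the byte offsets are the standard ustar offsets implied by the field sizes in the table.

```python
BLOCK = 512


def octal_field(value: int, width: int) -> bytes:
    """Encode `value` as zero-padded octal plus a trailing NUL, `width` bytes total."""
    digits = width - 1
    if not 0 <= value < 8 ** digits:
        # In the canonical format such a value would move to a pax record instead.
        raise OverflowError(f"{value} does not fit in {digits} octal digits")
    return format(value, f"0{digits}o").encode("ascii") + b"\0"


def build_ustar_header(name: bytes, mode: int, uid: int, gid: int, size: int,
                       mtime: int, typeflag: bytes, linkname: bytes = b"",
                       devmajor: int = 0, devminor: int = 0) -> bytes:
    """Build one ustar header block (hypothetical helper, for illustration only)."""
    if len(name) > 100 or len(linkname) > 100:
        raise ValueError("long names need a pax path/linkpath record")
    header = bytearray(BLOCK)  # zero-filled: uname, gname and prefix stay empty

    def put(offset: int, data: bytes) -> None:
        header[offset:offset + len(data)] = data

    put(0, name)                       # name, 100 bytes
    put(100, octal_field(mode, 8))     # permission bits only
    put(108, octal_field(uid, 8))
    put(116, octal_field(gid, 8))
    put(124, octal_field(size, 12))
    put(136, octal_field(mtime, 12))
    put(148, b" " * 8)                 # chksum counted as spaces while summing
    put(156, typeflag)
    put(157, linkname)
    put(257, b"ustar\0")               # magic
    put(263, b"00")                    # version
    put(329, octal_field(devmajor, 8))
    put(337, octal_field(devminor, 8))
    checksum = sum(header)             # unsigned sum of all 512 bytes
    put(148, format(checksum, "06o").encode("ascii") + b"\0 ")
    return bytes(header)
```

For example, `build_ustar_header(b"./usr/bin/sh", 0o755, 0, 0, 1234, 0, b"0")` yields a block whose mode field is `0000755\0`, matching the example in the table. The overflow check in `octal_field` is where a real generator would switch to a pax record rather than fail.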
+
+### Entry Types
+
+| Dumpfile entry | typeflag | Notes |
+|----------------|----------|-------|
+| Regular file | `0` | Content follows header |
+| Directory | `5` | Size 0, path has trailing `/` |
+| Symlink | `2` | Target in linkname (or pax `linkpath`) |
+| Hardlink | `1` | Target in linkname as relative `./`-prefixed path |
+| Block device | `4` | devmajor/devminor set |
+| Char device | `3` | devmajor/devminor set |
+| FIFO | `6` | |
+
+### Pax Extended Headers
+
+Pax records are used only when a value does not fit in the ustar header: either it overflows a field, or (as with xattrs and sub-second timestamps) it has no ustar representation. The canonical format does not unconditionally emit pax headers for values that fit in ustar fields.
+
+Pax records are emitted in the following order when present:
+
+1. `path` (if the path exceeds the 100-byte ustar `name` field)
+2. `linkpath` (if linkname exceeds 100 bytes)
+3. `size` (if > 8,589,934,591, i.e. ≥ 8 GiB)
+4. `uid` (if > 2,097,151)
+5. `gid` (if > 2,097,151)
+6. `mtime` (if > 8,589,934,591, or if sub-second precision is needed)
+7. `SCHILY.xattr.*` records, sorted by full key name (byte-wise)
+
+Each pax record is formatted as `<length> <key>=<value>\n` per POSIX.1-2001. The length field is the total byte count of the record including itself.
+
+#### Xattr Encoding
+
+Extended attributes are encoded as `SCHILY.xattr.<name>` pax records. Values are binary-safe (the pax record length field handles arbitrary bytes). Xattr records are sorted by the full key string (`SCHILY.xattr.security.selinux` before `SCHILY.xattr.user.foo`), using byte-wise comparison.
+
+The following xattrs are NOT included in the canonical tar, as they are composefs implementation details:
+- `trusted.overlay.metacopy`
+- `trusted.overlay.redirect`
+- `user.overlay.metacopy`
+- `user.overlay.redirect`
+
+#### Timestamp Precision
+
+If the dumpfile timestamp has a non-zero nanosecond component, the `mtime` pax record is emitted as `<seconds>.<nanoseconds>` (nanoseconds without trailing zeros). If the timestamp is integer seconds and fits in the ustar mtime field, no pax record is emitted.
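
The self-inclusive length field is the subtle part of pax record encoding, so a small sketch may help. It is written in Python under stated assumptions: the helper names `pax_record`, `pax_mtime` and `xattr_records` are hypothetical, and left-padding the nanosecond value to nine digits before stripping trailing zeros is an assumption (needed for the value to read as a decimal fraction), not something the text above states.

```python
EXCLUDED_XATTRS = {
    "trusted.overlay.metacopy",
    "trusted.overlay.redirect",
    "user.overlay.metacopy",
    "user.overlay.redirect",
}


def pax_record(key: str, value: bytes) -> bytes:
    """Format one pax record; the length prefix counts its own digits."""
    body = b" " + key.encode("utf-8") + b"=" + value + b"\n"
    total = len(body)
    while True:
        # Adding the digits of the length can itself add a digit, so iterate
        # until the total is stable.
        candidate = len(body) + len(str(total))
        if candidate == total:
            break
        total = candidate
    return str(total).encode("ascii") + body


def pax_mtime(seconds: int, nanoseconds: int) -> str:
    """Render an mtime value with trailing zeros removed from the fraction."""
    if nanoseconds == 0:
        return str(seconds)
    fraction = f"{nanoseconds:09d}".rstrip("0")  # assumed 9-digit padding
    return f"{seconds}.{fraction}"


def xattr_records(xattrs: dict) -> bytes:
    """Emit SCHILY.xattr.* records sorted byte-wise by full key name."""
    items = [("SCHILY.xattr." + name, value)
             for name, value in xattrs.items()
             if name not in EXCLUDED_XATTRS]
    items.sort(key=lambda item: item[0].encode("utf-8"))
    return b"".join(pax_record(key, value) for key, value in items)
```

Here `pax_record("path", b"x")` produces `9 path=x\n`: the body is 8 bytes and the single length digit brings the total to 9. A record whose body is 9 bytes needs a two-digit length and therefore totals 11, which is why the loop is required.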
+
+### Content and Padding
+
+File content is the raw bytes from the object store (for external files, identified by fsverity digest) or the inline bytes (for files ≤ 64 bytes).
+
+Content is followed by zero-padding to the next 512-byte block boundary. The padding bytes are all zero.
+
+### End of Archive
+
+The archive ends with two consecutive 512-byte blocks of zeros, per POSIX.
+
+### Hardlink Handling
+
+When the dumpfile contains hardlinks (multiple paths sharing the same leaf ID), the first path encountered in depth-first sorted order is emitted as a regular entry with full content. Subsequent paths referencing the same leaf are emitted as hardlink entries (typeflag `1`) with the first path as the linkname target.
+
+The hardlink target path uses the same `./`-prefixed encoding as all other paths.
+
+### Whiteout Representation
+
+For per-layer (non-merged) tars, OCI whiteouts are represented as standard whiteout entries:
+
+- **File deletion**: a zero-length regular file named `.wh.<name>` in the parent directory
+- **Opaque directory**: a zero-length regular file named `.wh..wh..opq` in the directory
+
+Whiteout entries appear in sorted order alongside regular entries. Their mode is `0000644`, uid/gid are 0, mtime is 0.
+
+For merged/flattened tars, whiteouts do not appear (they have already been processed).
+
+## Compression
+
+This specification defines the uncompressed tar byte stream only. Compression (gzip, zstd, composefs-chunked framing) is a separate concern. The composefs-chunked format described in [incremental-pulls.md](incremental-pulls.md) applies zstd frame boundaries on top of this canonical ordering without changing the entry order or content.
+
+## Implementation Notes
+
+The [tar-core](https://github.com/composefs/tar-core) crate provides the building blocks for producing canonical tar output. It supports both pax and GNU extension modes, deterministic numeric encoding, and pax record construction.
The canonical tar generator would use tar-core's `EntryBuilder` in pax mode (`ExtensionMode::Pax`), calling `build_pax_data()` to emit extended headers only when ustar fields overflow.
+
+tar-core does not impose entry ordering; the caller (composefs) controls the order by walking the dumpfile/EROFS tree in sorted depth-first order.
+
+## Relationship to Other Specs
+
+The dumpfile is the canonical filesystem representation that bridges tar and EROFS. This spec defines dumpfile to tar; [standardized-erofs-meta.md](standardized-erofs-meta.md) defines dumpfile to EROFS. Together they enable round-trip conversion.
+
+The OCI layer format (`application/vnd.oci.image.layer.v1.tar`) requires a standards-compliant tar stream. A canonical tar produced by this specification is a valid OCI layer. The `diff_id` is the SHA-256 of the uncompressed canonical tar stream.
+
+## References
+
+- [Incremental pulls](incremental-pulls.md): the primary consumer of canonical tar
+- [Standardized EROFS metadata](standardized-erofs-meta.md): the other direction of the round-trip
+- [tar-core](https://github.com/composefs/tar-core): sans-IO tar library used by composefs
+- [OCI image layer spec](https://github.com/opencontainers/image-spec/blob/main/layer.md): OCI tar layer requirements
+- [POSIX.1-2001 pax format](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html): pax extended header specification
diff --git a/doc/plans/incremental-pulls.md b/doc/plans/incremental-pulls.md
new file mode 100644
index 00000000..34fdf6fb
--- /dev/null
+++ b/doc/plans/incremental-pulls.md
@@ -0,0 +1,136 @@
+# Incremental Pulls via EROFS-alongside
+
+Status: Provisional
+
+There are two large things missing from OCI:
+
+- dm-verity-like integrity
+- standard incremental fetching and deltas
+
+The composefs artifact model fixes the first. This proposal builds on top of the composefs artifact, giving a model for incremental fetches.
+ +## Core proposal + +Existing approaches to incremental container image pulls (zstd:chunked and eStargz) embed a JSON table of contents (TOC) inside the compressed layer blob. The client reads the TOC, determines which file chunks it already has locally, and fetches missing chunks via HTTP range requests. + +The two formats handle diff_id verification differently. zstd:chunked also embeds tar-split reconstruction data in the blob, allowing the client to reassemble the exact original uncompressed tar stream and verify its SHA-256 digest against the OCI `diff_id`. eStargz does *not* include tar-split, which means it cannot verify the diff_id at all; clients must set `insecure_allow_unpredictable_image_contents` to use it. This is a significant practical limitation of eStargz. + +Composefs changes this picture fundamentally. In erofs-alongside mode, the EROFS metadata image (shipped as a separate OCI artifact) already contains the complete filesystem tree with fsverity digests for every content object. Composefs-based clients know the objects they already have, and using the metadata EROFS can compute which ones they are missing. + +All that is needed then is a mapping between the fsverity digests and the location in the tar stream. + +When the EROFS is trusted (via kernel fsverity signature or the OCI manifest signature chain covering the composefs digest), the `diff_id` verification becomes redundant: the composefs digest already cryptographically covers the complete filesystem tree. This eliminates the need for tar-split metadata entirely and simplifies the pull, verification, and push paths. 
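
The client-side computation described in this section (work out which objects are missing, then map each one to a location in the layer blob) can be sketched as follows. This is a minimal illustration in Python, not the composefs implementation: `plan_fetch`, the plain-dict offset map, and the `(layer_index, byte_offset, compressed_size)` tuple layout are assumptions made for the example.

```python
def plan_fetch(erofs_digests, local_store, offset_map):
    """Split the missing objects into range-fetchable and unmapped ones.

    erofs_digests: iterable of fsverity digests referenced by the EROFS metadata
    local_store:   set of digests already present in the local object store
    offset_map:    dict mapping digest -> (layer_index, byte_offset, compressed_size)
    """
    # Anything the EROFS references that we do not already hold must be fetched.
    missing = sorted(set(erofs_digests) - set(local_store))

    ranges = []    # objects with a known location in a layer blob
    unmapped = []  # objects with no mapping; these need some fallback fetch
    for digest in missing:
        location = offset_map.get(digest)
        if location is None:
            unmapped.append(digest)
        else:
            ranges.append((digest, location))
    return ranges, unmapped
```

With `erofs_digests = ["aa", "bb", "cc"]`, a local store holding only `"aa"`, and an offset map containing only `"bb"`, the function reports `"bb"` as range-fetchable and `"cc"` as unmapped. No tar parsing is involved at any point, which is the property the proposal relies on.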
+
+### Comparison with existing approaches
+
+| Aspect | zstd:chunked | eStargz | composefs incremental |
+|--------|-------------|---------|----------------------|
+| TOC format | JSON in zstd skippable frame | JSON in gzip member | EROFS metadata (separate OCI artifact) |
+| TOC reuse | Discarded after pull | Discarded after pull | Mounted directly by the kernel |
+| Tar-split | Embedded in blob | Not available | Not needed |
+| diff_id verification | Yes (via tar-split) | No (`insecure_allow_unpredictable_image_contents`) | Redundant (composefs digest covers the tree) |
+| Content digests | SHA-256 | SHA-256 | fsverity (SHA-256 or SHA-512 Merkle tree) |
+| Dedup granularity | Sub-file chunks (~64 KiB, rolling checksum) | Per-file | Whole files (by fsverity digest) |
+| Kernel integration | None (userspace only) | None (userspace only) | EROFS + overlayfs + fsverity |
+| Push after incremental pull | Reconstruct via tar-split | Cannot reconstruct original tar | Canonical tar generation (see below) |
+
+## Design
+
+### Layer Format: composefs-chunked
+
+A composefs-chunked layer is a valid `tar+zstd` blob that any OCI client can pull and decompress normally. The difference is in how the zstd compression is structured internally: large files are compressed as independent zstd frames, making them individually addressable via byte offset.
+
+Tar entries are in **canonical order**, the same deterministic ordering defined by the [canonical tar format](canonical-tar.md). This is essential: a client that does an incremental pull must be able to regenerate byte-identical tar for push, so the entry ordering cannot be compression-driven.
+
+The zstd frame boundaries are an overlay on top of the canonical ordering. For files above a size threshold (e.g. 4 KiB), the compressor closes and restarts the zstd frame around the file's payload, making it independently decompressible.
Files below the threshold are simply compressed together with their neighbors in whatever order they naturally appear. The threshold aligns with the filesystem block size. + +Files ≤ 64 bytes are already inline in the EROFS metadata (`INLINE_CONTENT_MAX`) and are never fetched from the tar layer during an incremental pull, regardless of framing. + +Unlike zstd:chunked, there are no trailing skippable frames (no embedded JSON TOC, no tar-split data). The EROFS in the composefs artifact serves as the TOC. + +Unlike zstd:chunked, there is no sub-file content-defined chunking. Composefs deduplicates at the whole-file level (by fsverity digest), so rolling-checksum chunk boundaries provide no dedup benefit. This simplifies the format and the offset map. + +### Offset Map + +The offset map tells the client where each individually-framed file lives within the compressed layer blob. It is stored as an additional layer in the composefs OCI artifact, with media type `application/vnd.composefs.v1.offset-map`. + +For each individually-compressed file, the map contains: + +``` +{ fsverity_digest, layer_index, byte_offset, compressed_size } +``` + +- `fsverity_digest`: the fsverity digest of the file content (matches the EROFS inode's content reference) +- `layer_index`: position in the image manifest's `layers` array (0-indexed) +- `byte_offset`: byte offset of the payload zstd frame within the compressed blob +- `compressed_size`: size of the compressed zstd frame in bytes + +Only files above the individually-framed threshold have entries in the offset map. Files below the threshold that a client needs must be fetched by downloading the surrounding range or falling back to a full layer fetch (acceptable since these files are small by definition). + +The format should be compact. A sorted array of fixed-size records (digest + u32 layer index + u64 offset + u64 size) works well and enables binary search by digest. 
For a layer with 10,000 individually-framed files using SHA-512 fsverity digests, the offset map is roughly 10,000 × (64 + 4 + 8 + 8) = ~820 KiB uncompressed, which compresses well. + +### Pull Protocol + +**Full pull (non-composefs client).** The layer is a valid tar+zstd blob. Pull, decompress, extract. Standard OCI behavior, no awareness of composefs needed. + +**Incremental pull (composefs-aware client):** + +1. Fetch the composefs artifact (EROFS layers + offset map + optional signatures) +2. Walk the EROFS metadata to extract the set of fsverity digests for all non-inline content objects +3. Query the local object store: which of these digests do we already have? +4. For missing digests, look up byte ranges in the offset map +5. Merge adjacent/nearby ranges to reduce HTTP requests (same optimization as zstd:chunked) +6. Issue HTTP range requests against the layer blob(s) to fetch missing objects +7. Decompress each frame independently, write to the object store, enable fsverity +8. Verify each object: the computed fsverity digest must match what the EROFS references + +No tar reassembly, no diff_id verification, no tar-split. Trust is rooted in the EROFS (signed or digest-verified via the manifest chain), and each content object is independently verified by its fsverity digest. + +### Push After Incremental Pull + +An incrementally-pulled image does not have the original tar layer bytes stored locally. To push the image to another registry, the client must regenerate the tar layer. For the pushed image to be identical to the original (same layer digests, same manifest), this regeneration must be deterministic. + +This requires a **canonical tar format**: a well-defined, reproducible mapping from filesystem metadata (EROFS or dumpfile) + content objects to a tar byte stream. See [canonical-tar.md](canonical-tar.md) for this specification. 
+ +With a canonical tar: +- The original image builder produces the tar using the canonical format +- An incrementally-pulling client can regenerate byte-identical tar from EROFS + object store +- The pushed image has the same layer digests and diff_id as the original +- The canonical tar can also be used to lazily verify the diff_id if needed, without storing tar-split + +### Composefs Artifact Integration + +The offset map is an additional layer in the existing composefs OCI artifact. In erofs-alongside mode with incremental pull support, the artifact layers are ordered: + +1. N EROFS metadata layers (one per image layer, `application/vnd.composefs.v1.erofs+zstd` or `application/vnd.composefs.v1.erofs`) +2. N offset map layers (one per image layer, `application/vnd.composefs.v1.offset-map`) +3. *(Optional)* Signature layers (`application/vnd.composefs.signature.v1+pkcs7`) + +Each offset map layer carries a `composefs.offset-map.type: "layer"` annotation and a `composefs.offset-map.layer-index` annotation identifying which manifest layer it corresponds to. + +Layers that are not composefs-chunked (e.g. standard tar+gzip layers in a mixed image) simply have no offset map entry. A missing offset map for a layer means the client must fall back to a full fetch for that layer. + +## Security Considerations + +**Trust model.** The EROFS is the root of trust for the filesystem tree. Each content object fetched via range request is verified independently by computing its fsverity digest and comparing it to the EROFS reference. An attacker who controls the registry cannot serve incorrect content without detection, since the fsverity digest is a Merkle tree hash that the kernel enforces on every read after `FS_IOC_ENABLE_VERITY`. + +**No tar-split, no diff_id.** By not verifying the diff_id, we are explicitly trusting the composefs digest chain rather than the OCI config's `rootfs.diff_ids`. 
This is a stronger verification (fsverity Merkle tree of the complete filesystem vs. flat SHA-256 of an opaque tar stream) but it does mean that a composefs-aware client and a non-composefs client may disagree if the tar and EROFS are inconsistent. The erofs-alongside consistency check (tar vs EROFS semantic comparison) at image seal time prevents this. + +**Offset map integrity.** The offset map is part of the composefs artifact, which is covered by the artifact's manifest digest and optionally by signatures. A tampered offset map could point to wrong byte ranges, but the client verifies each fetched object's fsverity digest, so tampered offsets result in verification failure, not incorrect data. + +## Future Directions + +**Registry-level compression.** The [OCI distribution-spec proposal for registry-level compression](https://github.com/opencontainers/distribution-spec/issues/235) would allow registries to handle compression/decompression, serving uncompressed byte ranges from compressed blobs. This would eliminate the need for independent zstd framing entirely; the client could request raw byte ranges of uncompressed file content. The offset map would then contain offsets into the *uncompressed* tar stream, which are easier to compute (they fall out of tar generation directly). + +**Sub-file chunking.** The current design operates at whole-file granularity. For images with very large files that change incrementally between versions (e.g. RPM databases, locale archives), sub-file content-defined chunking could reduce transfer sizes. The offset map format is extensible to support multiple entries per file. This is deferred as a non-goal for the initial design. + +**Cross-layer dedup.** The composefs object store already deduplicates across layers (objects are stored by fsverity digest). The incremental pull protocol naturally benefits from this: if layer A and layer B share a file, pulling layer A populates the object store, and layer B's pull skips that file. 
No additional mechanism is needed. + +## References + +- [OCI sealing specification](oci-sealing-spec.md): erofs-alongside mode and composefs artifacts +- [Canonical tar format](canonical-tar.md): reproducible tar generation for push after incremental pull +- [Standardized EROFS metadata](standardized-erofs-meta.md): canonical EROFS generation (separate concern) +- [composefs/composefs#294](https://github.com/composefs/composefs/issues/294): original design discussion +- [zstd:chunked implementation](https://github.com/containers/storage/tree/main/pkg/chunked): reference for partial pull mechanics +- [OCI distribution-spec #235](https://github.com/opencontainers/distribution-spec/issues/235): registry-level compression proposal diff --git a/doc/plans/oci-sealing-spec.md b/doc/plans/oci-sealing-spec.md index 98d000bf..11b03684 100644 --- a/doc/plans/oci-sealing-spec.md +++ b/doc/plans/oci-sealing-spec.md @@ -8,164 +8,547 @@ Container images need cryptographic verification that efficiently covers the ent Hence verifying the integrity of an individual file would require re-synthesizing the entire tarball (using tar-split or equivalent) and computing its digest. +## Related projects + +- **[containerd EROFS snapshotter](https://github.com/containerd/containerd/blob/main/docs/snapshotters/erofs.md)**: Converts OCI layers to EROFS blobs with optional fsverity protection. Supports `enable_fsverity = true` to enable fs-verity on layer blobs. Uses reproducible builds with erofs-utils 1.8+ (`-T0 --mkfs-time`). dm-verity integration is planned but not yet implemented. + ## Solution The core primitive of composefs is fsverity, which allows incremental online verification of individual files. The complete filesystem tree metadata is itself stored as a file which can be verified in the same way. The critical design question is how to embed the composefs digest within OCI image metadata such that external signatures can efficiently cover the entire filesystem tree. 
-## Design Goals +## Core Design -The OCI sealing specification aims to provide efficient verification where a signature on an OCI manifest cryptographically covers the entire filesystem tree without re-hashing content. The specification defines standardized metadata locations for composefs digests and supports future format evolution without breaking existing images. +"composefs digest" here means the fsverity digest of the EROFS metadata file. fsverity is configurable based on digest algorithm (SHA-256 or SHA-512 currently) and block size (4k or 64k). -Incremental verification must be supported, enabling verification of individual layers or the complete flattened filesystem. The design accommodates both registry-provided sealed images and client-side sealing workflows while maintaining backward compatibility with existing OCI tooling and registries. +For standardized short form of the combination, a string of the form `fsverity-${DIGEST}-${BLOCKSIZEBITS}` is used. The `fsverity-` prefix makes clear this is an fsverity Merkle tree digest, not a simple hash: -## Core Design +- `fsverity-sha256-12` (SHA-256, 4k block size, 2^12) +- `fsverity-sha512-12` (SHA-512, 4k block size) +- `fsverity-sha256-16` (SHA-256, 64k block size, 2^16) +- `fsverity-sha512-16` (SHA-512, 64k block size) -### Composefs Digest Storage +Digests are encoded as lowercase hexadecimal. -The composefs fsverity digest is stored as a label in the OCI image config: +### EROFS Provisioning Modes -```json -{ - "config": { - "Labels": { - "containers.composefs.fsverity": "sha256:a3b2c1d4e5f6..." - } - } -} +There are two modes for how the EROFS metadata image is obtained by a client. erofs-alongside is the primary mode and the focus of this specification. canonical-EROFS is a future evolution that builds on it. + +#### EROFS-alongside mode (primary) + +In this mode, the EROFS metadata image is built server-side as part of a composefs OCI artifact which is also stored on the registry. 
It's important to emphasize that this process can happen independent of the image build; it operates similarly to a signature. Clients unaware of composefs work as before. + +This is the primary mode because: + +- It works today without cross-implementation EROFS standardization — the exact EROFS bytes are authored by the image publisher, so there is no need for multiple implementations to agree on a bit-for-bit identical layout. +- EROFS is a natural metadata format for incremental pulls and content-addressed object stores (see [Incremental Pulls](#incremental-pulls-via-erofs-alongside) in Future Directions). Any incremental fetch mechanism needs a separate metadata format, and EROFS — natively supported by the Linux kernel with multiple userspace parsers — is a strong fit. +- The EROFS here is just metadata; the tar layer is still required for content. + +EROFS metadata layers in the artifact may optionally be compressed with zstd for wire transfer. See [EROFS Layer Compression](#erofs-layer-compression) below for details. See [Composefs Artifact Structure](#composefs-artifact-structure) below for more information about the layout. + +To prevent the "representational ambiguity" problem — what happens when the tar layer and the prebuilt EROFS disagree — the client MUST verify consistency: + +1. Fetch the composefs artifact and verify that it has a 1-to-1 correspondence with the source image manifest: each layer in the manifest must have exactly one matching EROFS metadata entry in the artifact (identified by position). A mismatch in count is a fatal error. +2. For each layer, verify the metadata correspondence between the tar layer and the EROFS: + - Parse the tarball to extract a filesystem tree representation (file paths, modes, ownership, xattrs, and fsverity content digests) + - Walk the corresponding EROFS metadata to extract the same representation + - Compare the two — they must agree on all filesystem metadata and content references. 
Any disagreement is a fatal error. + +This consistency check operates at the semantic filesystem level, not at the EROFS byte level. It does not require a canonical EROFS specification, but it does require agreement on how tar entries map to filesystem metadata (see [doc/oci.md](../oci.md) for OCI-to-composefs conversion decisions). + +**Security consideration: parsing untrusted EROFS.** In this mode, the EROFS image is data fetched from a registry. When fsverity signatures are present, the EROFS signature is verified before mount — trust in the EROFS is trust in the publisher, the same as any signed artifact. However, the userspace consistency check (step 2 above) still parses the EROFS before signature verification, and in the unsigned/digest-only case, the EROFS is entirely attacker-controlled at parse time. This is an attack surface distinction from canonical-EROFS mode, where the EROFS is locally generated from trusted inputs. + +To mitigate this, EROFS parsing code — both userspace and in-kernel — should be written in memory-safe languages or otherwise hardened. The composefs-rs userspace parser is written in Rust. The Linux kernel's EROFS implementation is fuzz-tested via syzbot and has been hardened over multiple release cycles. Implementations SHOULD validate EROFS structural integrity (superblock magic, bounds checks, inode consistency) before performing the semantic consistency check or mounting. + +#### Canonical-EROFS mode (future) + +This mode is not yet usable — it is blocked on the EROFS standardization work described in [standardized-erofs-meta.md](standardized-erofs-meta.md). + +In this mode, no EROFS metadata is shipped on the wire. The client and server generate the EROFS using a standardized canonical process: + +``` +tar layer → dumpfile → EROFS metadata ``` -The config represents the container's identity rather than transport metadata. Manifests are transport artifacts that can vary across different distribution mechanisms. 
Adding the composefs label creates a new config and thus a new manifest, establishing the sealed image as a distinct artifact. This means sealing an image produces a new image with a different config digest, where the original unsealed image and sealed image coexist as separate artifacts that registries treat as distinct versions. +This requires a finalized canonical EROFS specification that guarantees byte-for-byte identical output across implementations given identical input. Without this guarantee, fsverity digests computed by different implementations would not match, and signatures would fail to verify. + +In this mode, the composefs digest annotations on the image manifest (or in the composefs artifact) serve as the sole reference. The client generates the EROFS, computes its fsverity digest, and verifies it matches the annotation. No EROFS bytes need to be stored on the registry. + +Canonical-EROFS is best understood as a future tightening of erofs-alongside: once a canonical EROFS specification is defined, erofs-alongside artifacts could be required to use the canonical layout. This would allow clients to verify the EROFS against the tar layer by regenerating it locally, without needing to parse the shipped EROFS at all. In effect, the shipped EROFS would become a cache of a deterministic computation. -### Digest Type +#### Digest-only mode (future, requires canonical-EROFS) -The primary digest is the fs-verity digest of the EROFS image containing the merged, flattened filesystem. This digest provides fast verification at mount time through kernel fs-verity checks and is deterministic: the same input layers always produce the same EROFS digest. The digest covers the complete filesystem tree including all metadata such as permissions, timestamps, and extended attributes. +Once canonical-EROFS is available, a further simplification becomes possible: **no composefs artifact at all**. 
The composefs digest is placed directly on the image manifest layer annotations (see [Composefs Digest Storage](#composefs-digest-storage)), and the client generates the canonical EROFS locally, verifying its fsverity digest against the annotation. -### Merged Filesystem Representation +This is the cleanest end state — the OCI image carries only standard tar layers with a composefs digest annotation, and composefs is purely a client-side optimization. No separate artifact, no EROFS on the wire, no signatures beyond whatever already covers the manifest (cosign, sigstore, etc.). -The config label contains the digest of the merged, flattened filesystem. This represents the final filesystem state after extracting all layers in order, applying whiteouts (`.wh.` files), merging directories where the most-derived layer wins for metadata, and building the final composefs EROFS image. +This mode is a natural consequence of canonical-EROFS and does not require additional specification beyond what is already defined for manifest annotations and canonical EROFS generation. + +### Recommended default algorithm + +The suggested default is `fsverity-sha512-12` - this maximizes compatibility as +not every system can support higher page sizes, and also maximizes security (there are +post-quantum crypto arguments against SHA-256). + +### Composefs Digest Storage -### Per-Layer Digests (Future Extension) +Composefs digests — the fsverity digests of EROFS metadata images — can be stored as annotations. This is most relevant in canonical-EROFS mode, where the digest is the primary mechanism for verifying a locally-generated EROFS. In erofs-alongside mode, the EROFS metadata itself is shipped in the composefs artifact and the digest can be computed from it directly, so annotations serve mainly as a convenience for discovery. -Per-layer composefs digests may be added as manifest annotations: +Digests can appear in two locations: + +1. 
**Composefs artifact** (primary): As annotations on the composefs artifact layers. This is the recommended approach because it allows signing existing unmodified OCI images — the original manifest is never touched. + +2. **Manifest annotations** (optional): As annotations on the image manifest layers. This is a convenience for tools that want to verify composefs digests without fetching a separate artifact. When both are present, they MUST agree. + +When using manifest annotations, in [the manifest](https://github.com/opencontainers/image-spec/blob/main/manifest.md), +each layer may have an annotation with a composefs digest. + +```json +{ + "layers": [ + { + "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", + "digest": "sha256:9834876dcfb05cb167a5c24953eba58c4ac89b1adf57f28f2f9d09af107ee8f0", + "size": 32654, + "annotations": { + "composefs.layer.fsverity-sha512-12": "3abb6677af34ac57c0ca5828fd94f9d886c26ce59a8ce60ecf6778079423dccff1d6f19cb655805d56098e6d38a1a710dee59523eed7511e5a9e4b8ccb3a4686" + } + }, + { + "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", + "digest": "sha256:3c3a4604a545cdc127456d94e421cd355bca5b528f4a9c1905b15da2eb4a4c6b", + "size": 16724, + "annotations": { + "composefs.layer.fsverity-sha512-12": "63e22ec2fbeebabf005e58fbfb0eee607c4aa417045a68a0cc63767b048e3559268d35e72f367d3b2dbd5dbddf12fc4397762ba149260b3795a0391713bddcd7" + } + }, + { + "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", + "digest": "sha256:ec4b8955958665577945c89419d1af06b5f7636b4ac3da7f12184802ad867736", + "size": 73109, + "annotations": { + "composefs.layer.fsverity-sha512-12": "2b59d179d9815994f687383a886ea34109889756efca5ab27318cc67ce2a21261d12fa6fee6b8c716f72214ead55ee0d789d6c35cff977d40ef5728ba9188a80" + } + } + ] +} +``` + +Additionally, an optional merged digest may be provided on the **final layer only**, representing the *flattened* merged filesystem tree of the complete stack of all layers. 
The rationale is that it makes it easier for a runtime to avoid the overhead of individual mounts if it chooses to do so. This is especially suitable for e.g. a "base image" whose stack of mounts would commonly be shared with higher level applications. ```json { - "manifests": [ - { - "layers": [ - { - "digest": "sha256:...", - "annotations": { - "containers.composefs.layer.fsverity": "sha256:..." - } - } - ] + "layers": [ + { + "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", + "digest": "sha256:9834876dcfb05cb167a5c24953eba58c4ac89b1adf57f28f2f9d09af107ee8f0", + "size": 32654, + "annotations": { + "composefs.layer.fsverity-sha512-12": "3abb6677af34ac57c0ca5828fd94f9d886c26ce59a8ce60ecf6778079423dccff1d6f19cb655805d56098e6d38a1a710dee59523eed7511e5a9e4b8ccb3a4686" + } + }, + { + "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", + "digest": "sha256:3c3a4604a545cdc127456d94e421cd355bca5b528f4a9c1905b15da2eb4a4c6b", + "size": 16724, + "annotations": { + "composefs.layer.fsverity-sha512-12": "63e22ec2fbeebabf005e58fbfb0eee607c4aa417045a68a0cc63767b048e3559268d35e72f367d3b2dbd5dbddf12fc4397762ba149260b3795a0391713bddcd7", + "composefs.merged.fsverity-sha512-12": "d015f70f8bee6cf6453dd5b771eec18994b861c646cec18e2a9dfdec93f631fbb9030e60cfc82b552d33b9a134312a876ef4e519bffe3ef872aefbd84e6198b3" + } } ] } ``` -Per-layer digests enable incremental verification during pull, create caching opportunities where shared layers have known composefs digests, and enable runtime choice between flattened versus layered mounting strategies. +Note: The `composefs.merged.fsverity-sha512-12` annotation appears only on the final layer and represents the complete flattened filesystem of all layers merged together. + +#### Whiteout Handling in Merged Filesystem + +The merged EROFS represents a fully flattened filesystem and is designed to be mounted directly, not stacked with other EROFS layers via overlayfs. 
During the merge process, OCI whiteouts (`.wh.*` files and opaque directory markers) are fully processed: files and directories marked for deletion in upper layers are removed from the merged result. The final merged EROFS contains no whiteout entries — it is a clean, whiteout-free snapshot of the complete filesystem tree as it would appear after all layers are applied. + +### EROFS Layer Compression + +EROFS metadata layers in the composefs artifact MAY be compressed with zstd for wire transfer. Compression is indicated by the media type suffix, following the same convention as OCI tar layers (`tar+gzip`, `tar+zstd`): + +- `application/vnd.composefs.v1.erofs` — uncompressed EROFS metadata (the blob is a raw EROFS image) +- `application/vnd.composefs.v1.erofs+zstd` — zstd-compressed EROFS metadata (the blob is a zstd-compressed EROFS image) + +The compressed blob is a single zstd frame containing the complete EROFS image. Consumers decompress the blob before storing locally. The decompressed result is identical to what would have been shipped with the uncompressed media type. + +Producers SHOULD use `application/vnd.composefs.v1.erofs+zstd` when pushing EROFS metadata layers. Consumers MUST accept both compressed and uncompressed variants. + +All layers within a single composefs artifact SHOULD use the same compression (either all compressed or all uncompressed), but consumers MUST handle mixed compression within an artifact by inspecting each layer's media type individually. + +The `composefs.digest` annotation on each layer always refers to the fsverity digest of the *uncompressed* EROFS image, regardless of whether the layer is compressed on the wire. This ensures digest stability: the same EROFS image produces the same composefs digest whether pushed compressed or uncompressed. 
This is analogous to how OCI tar layers have both a `digest` (of the compressed blob, used for registry fetching) and a `diff_id` (of the uncompressed tar, used for identity) — except that `composefs.digest` is a fsverity Merkle tree digest rather than a flat hash, giving it the additional property of enabling continuous kernel-enforced verification. + +#### Motivation + +EROFS metadata compresses well with zstd, achieving a consistent 2.7-3.6:1 ratio across tested container images. For full image pulls the savings are modest (EROFS metadata is typically 0.4-1.3% of content size), but for incremental pull scenarios the compressed size becomes significant. When a client already has most content objects locally and only needs to fetch a small delta, the EROFS metadata — which describes the *complete* filesystem tree — can exceed the content delta in size. Compression reduces this cost substantially: for example, a 55 MiB merged EROFS for a desktop image compresses to ~15 MiB. + +See the [incremental pulls design](incremental-pulls.md) for the full context on why EROFS transfer size matters for update scenarios. + +### Signatures + +#### Linux kernel fsverity signatures (recommended) + +The primary signature mechanism is Linux kernel [fsverity built-in signature verification](https://docs.kernel.org/filesystems/fsverity.html#built-in-signature-verification). The kernel's `FS_IOC_ENABLE_VERITY` ioctl accepts a PKCS#7 signature that is verified against the `.fs-verity` keyring. This provides a clear chain of trust: the same component that controls data access (the kernel) also validates the signature. The kernel additionally integrates with the [IPE](https://docs.kernel.org/admin-guide/LSM/ipe.html) (Integrity Policy Enforcement) subsystem. -### Trust Chain +The recommended delivery mechanism for these signatures is a separate OCI artifact using the Referrer pattern, described below. This enables signing existing unmodified OCI images. 
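As an illustration of how a PKCS#7 signature reaches the kernel, the following is a minimal sketch of calling `FS_IOC_ENABLE_VERITY`. The helper names `build_enable_arg` and `enable_verity` are hypothetical. The ioctl number and argument layout are assumed to follow `struct fsverity_enable_arg` from `<linux/fsverity.h>`, with no salt. The file must live on a filesystem with fsverity support and must not have any open writable descriptors.

```python
import ctypes
import fcntl
import struct

# _IOW('f', 133, struct fsverity_enable_arg); the struct is 128 bytes long.
FS_IOC_ENABLE_VERITY = 0x40806685
FS_VERITY_HASH_ALG_SHA512 = 2  # value of the kernel UAPI constant


def build_enable_arg(hash_algorithm, block_size, sig_ptr, sig_size):
    """Pack struct fsverity_enable_arg (version 1, no salt)."""
    return struct.pack(
        "=IIIIQIIQ11Q",
        1,               # version
        hash_algorithm,  # hash_algorithm
        block_size,      # block_size
        0,               # salt_size
        0,               # salt_ptr
        sig_size,        # sig_size
        0,               # __reserved1
        sig_ptr,         # sig_ptr
        *([0] * 11),     # __reserved2
    )


def enable_verity(path, signature,
                  hash_algorithm=FS_VERITY_HASH_ALG_SHA512, block_size=4096):
    """Enable fsverity on `path`, passing a DER-encoded PKCS#7 signature."""
    # Keep the buffer alive until the ioctl returns; the kernel reads from it.
    sig_buf = ctypes.create_string_buffer(signature, len(signature)) if signature else None
    sig_ptr = ctypes.addressof(sig_buf) if sig_buf is not None else 0
    arg = build_enable_arg(hash_algorithm, block_size, sig_ptr, len(signature))
    with open(path, "rb") as f:  # the descriptor must be read-only
        fcntl.ioctl(f.fileno(), FS_IOC_ENABLE_VERITY, arg)
```

The defaults correspond to `fsverity-sha512-12`; a caller using another algorithm identifier would pass the matching hash constant and block size.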
-The trust chain for composefs-verified OCI images flows from external signatures through the manifest to the complete filesystem: +Signatures MAY also be embedded as manifest annotations using a `.signature` suffix on digest annotations (e.g. `composefs.layer.fsverity-sha512-12.signature` with base64-encoded PKCS#7), though this requires modifying the image manifest. + +#### Digest-only verification (alternative) + +Kernel-based signing is not required. An implementation may instead rely on external trust in the composefs digests themselves — for example, by trusting the OCI manifest (verified via cosign/sigstore/GPG) and treating the composefs digest annotations as authoritative. In this model: ``` External signature (cosign/sigstore/GPG) ↓ signs -OCI Manifest (includes config descriptor) - ↓ digest reference -OCI Config (includes containers.composefs.fsverity label) - ↓ fsverity digest -Composefs EROFS image - ↓ contains -Complete merged filesystem tree +OCI Manifest (includes composefs digest annotations) + ↓ +Composefs EROFS image (verified against digest) + ↓ +Complete filesystem tree ``` -## Verification Process +The userspace tooling performing this verification must be trusted. A key benefit of composefs is that verification of large data is on-demand and continuous via the kernel's fsverity — the composefs digest covers the complete filesystem tree, so verifying it is cheap even though the underlying data may be large. + +#### Replacing diff_id validation + +The OCI image specification requires a `diff_id` in the [image config](https://github.com/opencontainers/image-spec/blob/main/config.md) for each layer, which is the digest of the uncompressed tar stream. This is expensive to validate after extraction and provides no path to continual kernel-enforced verification. With composefs, validating `diff_id` becomes redundant: the composefs digest already cryptographically covers the complete filesystem tree derived from the layer. 
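To make the cost of `diff_id` validation concrete, the following is a minimal sketch of the computation for a `tar+gzip` layer. The function name `compute_diff_id` is hypothetical. The point is that the entire layer must be decompressed and hashed as one stream before anything can be trusted, and the result gives no protection once the files are on disk.

```python
import gzip
import hashlib
import io


def compute_diff_id(compressed_layer, chunk_size=1 << 20):
    """Return the diff_id ("sha256:<hex>") for a gzip-compressed layer stream."""
    hasher = hashlib.sha256()
    # The digest covers the *uncompressed* tar bytes, so every byte of the
    # layer has to pass through the decompressor and the hash.
    with gzip.GzipFile(fileobj=compressed_layer, mode="rb") as tar_stream:
        while True:
            chunk = tar_stream.read(chunk_size)
            if not chunk:
                break
            hasher.update(chunk)
    return "sha256:" + hasher.hexdigest()
```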
+ +#### Composefs Artifact Structure + +Composefs data — signatures and optionally prebuilt EROFS metadata (erofs-alongside mode) — is stored as a separate OCI artifact, discoverable via the OCI referrer pattern. This follows the same approach as cosign: the composefs artifact references the sealed image through the `subject` field and can be found via the `/referrers` API. + +Signature layers are raw PKCS#7 DER-encoded blobs — exactly the format expected by `FS_IOC_ENABLE_VERITY`. No JSON wrapping or base64 encoding. Prebuilt EROFS layers (when present) are EROFS images, optionally zstd-compressed (see [EROFS Layer Compression](#erofs-layer-compression)). + +##### Artifact Manifest + +The composefs artifact is an OCI image manifest following the [artifacts guidance](https://github.com/opencontainers/image-spec/blob/main/artifacts-guidance.md) pattern (empty config, content in layers): + +The provisioning mode is indicated by the `artifactType`: + +- `application/vnd.composefs.erofs-alongside.v1` — the artifact contains prebuilt EROFS metadata layers alongside optional signatures +- `application/vnd.composefs.canonical.v1` *(future)* — the artifact contains only signatures; the client generates the EROFS locally + +This allows clients to discover which mode is available via the referrers API filtered by `artifactType`. + +**EROFS-alongside example** (prebuilt EROFS on registry, zstd-compressed): + +```json +{ + "schemaVersion": 2, + "mediaType": "application/vnd.oci.image.manifest.v1+json", + "artifactType": "application/vnd.composefs.erofs-alongside.v1", + "config": { + "mediaType": "application/vnd.oci.empty.v1+json", + "digest": "sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a", + "size": 2 + }, + "layers": [ + { + "mediaType": "application/vnd.composefs.v1.erofs+zstd", + "digest": "sha256:fff...", + "size": 2816, + "annotations": { + "composefs.erofs.type": "layer", + "composefs.digest": "3abb6677af34ac57...layer-1-composefs-digest..." 
+ } + }, + { + "mediaType": "application/vnd.composefs.v1.erofs+zstd", + "digest": "sha256:ggg...", + "size": 1408, + "annotations": { + "composefs.erofs.type": "layer", + "composefs.digest": "63e22ec2fbeeba...layer-2-composefs-digest..." + } + }, + { + "mediaType": "application/vnd.composefs.v1.erofs+zstd", + "digest": "sha256:hhh...", + "size": 4096, + "annotations": { + "composefs.erofs.type": "merged", + "composefs.digest": "d015f70f8bee6c...merged-composefs-digest..." + } + }, + { + "mediaType": "application/vnd.composefs.signature.v1+pkcs7", + "digest": "sha256:aaa...", + "size": 456, + "annotations": { + "composefs.signature.type": "manifest", + "composefs.digest": "ab12...manifest-fsverity-digest..." + } + }, + { + "mediaType": "application/vnd.composefs.signature.v1+pkcs7", + "digest": "sha256:bbb...", + "size": 789, + "annotations": { + "composefs.signature.type": "config", + "composefs.digest": "cd34...config-fsverity-digest..." + } + }, + { + "mediaType": "application/vnd.composefs.signature.v1+pkcs7", + "digest": "sha256:ccc...", + "size": 1234, + "annotations": { + "composefs.signature.type": "layer", + "composefs.digest": "3abb6677af34ac57...layer-1-composefs-digest..." + } + }, + { + "mediaType": "application/vnd.composefs.signature.v1+pkcs7", + "digest": "sha256:ddd...", + "size": 1234, + "annotations": { + "composefs.signature.type": "layer", + "composefs.digest": "63e22ec2fbeeba...layer-2-composefs-digest..." + } + }, + { + "mediaType": "application/vnd.composefs.signature.v1+pkcs7", + "digest": "sha256:eee...", + "size": 1234, + "annotations": { + "composefs.signature.type": "merged", + "composefs.digest": "d015f70f8bee6c...merged-composefs-digest..." 
+ } + } + ], + "subject": { + "mediaType": "application/vnd.oci.image.manifest.v1+json", + "digest": "sha256:5b0bcabd1ed22e9fb1310cf6c2dec7cdef19f0ad69efa1f392e94a4333501270", + "size": 7682 + }, + "annotations": { + "composefs.algorithm": "fsverity-sha512-12" + } +} +``` + +**Canonical-EROFS example** *(future — not yet usable)*: + +```json +{ + "schemaVersion": 2, + "mediaType": "application/vnd.oci.image.manifest.v1+json", + "artifactType": "application/vnd.composefs.canonical.v1", + "config": { + "mediaType": "application/vnd.oci.empty.v1+json", + "digest": "sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a", + "size": 2 + }, + "layers": [ + { + "mediaType": "application/vnd.composefs.signature.v1+pkcs7", + "digest": "sha256:aaa...", + "size": 456, + "annotations": { + "composefs.signature.type": "manifest", + "composefs.digest": "ab12...manifest-fsverity-digest..." + } + }, + { + "mediaType": "application/vnd.composefs.signature.v1+pkcs7", + "digest": "sha256:bbb...", + "size": 789, + "annotations": { + "composefs.signature.type": "config", + "composefs.digest": "cd34...config-fsverity-digest..." + } + }, + { + "mediaType": "application/vnd.composefs.signature.v1+pkcs7", + "digest": "sha256:ccc...", + "size": 1234, + "annotations": { + "composefs.signature.type": "layer", + "composefs.digest": "3abb6677af34ac57...layer-1-composefs-digest..." + } + }, + { + "mediaType": "application/vnd.composefs.signature.v1+pkcs7", + "digest": "sha256:ddd...", + "size": 1234, + "annotations": { + "composefs.signature.type": "layer", + "composefs.digest": "63e22ec2fbeeba...layer-2-composefs-digest..." + } + }, + { + "mediaType": "application/vnd.composefs.signature.v1+pkcs7", + "digest": "sha256:eee...", + "size": 1234, + "annotations": { + "composefs.signature.type": "merged", + "composefs.digest": "d015f70f8bee6c...merged-composefs-digest..." 
+ } + } + ], + "subject": { + "mediaType": "application/vnd.oci.image.manifest.v1+json", + "digest": "sha256:5b0bcabd1ed22e9fb1310cf6c2dec7cdef19f0ad69efa1f392e94a4333501270", + "size": 7682 + }, + "annotations": { + "composefs.algorithm": "fsverity-sha512-12" + } +} +``` -Verification begins by fetching the manifest from the registry and verifying the external signature on the manifest. The config descriptor is extracted from the manifest, and the config is fetched and verified to match the descriptor digest. The `containers.composefs.fsverity` label is extracted from the config, and the composefs image is mounted with fsverity verification. The kernel verifies the EROFS matches the expected fsverity digest. +##### Layer Ordering -The security property is that signature verification happens once, while filesystem verification is delegated to kernel fs-verity with lazy or eager verification depending on mount options. +Each layer carries annotations that identify its role. Signature layers use `composefs.signature.type`; EROFS metadata layers (erofs-alongside mode only) use `composefs.erofs.type`. Both carry `composefs.digest` with the fsverity digest. This makes the artifact self-contained — a consumer can verify composefs digests using only the composefs artifact and the image layers, without requiring composefs annotations on the original image manifest. -## Metadata Schema +The layers MUST appear in this order: -### Config Labels +1. **(erofs-alongside only)** N EROFS metadata entries with `composefs.erofs.type: "layer"` — one per manifest layer, in manifest order. Each is an EROFS metadata image, either raw or zstd-compressed per its media type. +2. **(erofs-alongside only)** Zero or one EROFS metadata entry with `composefs.erofs.type: "merged"` — the flattened merged EROFS for the complete image. +3. **(Optional)** One signature with `composefs.signature.type: "manifest"` — signature for the sealed image manifest, stored as a file with fsverity +4. 
**(Optional)** One signature with `composefs.signature.type: "config"` — signature for the image config, stored as a file with fsverity +5. N signature entries with `composefs.signature.type: "layer"` — one per manifest layer, in manifest order. Each signature is applied to the EROFS blob via `FS_IOC_ENABLE_VERITY`. +6. Zero or one signature with `composefs.signature.type: "merged"` — if present, this is the signature for the merged EROFS representing the complete flattened filesystem. -The image config contains the following labels: +Position within each group determines which source object the entry corresponds to. The number of `layer`-type entries (both EROFS and signature) MUST equal the number of layers in the source manifest. When an erofs-alongside EROFS layer and its corresponding signature layer both carry `composefs.digest`, they MUST agree. -The `containers.composefs.fsverity` label (string) contains the fsverity digest of the merged composefs EROFS in the format `:` where algorithm is `sha256` or `sha512`. +This design enables signing existing unmodified OCI images: compute composefs digests for each layer, sign them, and push the composefs artifact as a referrer. The original image is never touched. -The `containers.composefs.version` label (string, optional) contains the seal format version such as `1.0`. +##### Signature Format -### Descriptor Annotations +Each layer blob is a raw PKCS#7 signature encoded using [DER](https://en.wikipedia.org/wiki/X.690#DER_encoding) (Distinguished Encoding Rules, ITU-T X.690) over the kernel's `fsverity_formatted_digest`: -A descriptor may have the following annotation: +```c +struct fsverity_formatted_digest { + char magic[8]; /* "FSVerity" */ + __le16 digest_algorithm; + __le16 digest_size; + __u8 digest[]; +}; +``` -The `containers.composefs.layer.fsverity` annotation (string, optional) contains the fsverity digest of that individual layer. 
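The byte string that the PKCS#7 signature covers can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the function names `parse_algorithm` and `formatted_digest` are hypothetical, the numeric constants are the values of `FS_VERITY_HASH_ALG_SHA256` and `FS_VERITY_HASH_ALG_SHA512` in `<linux/fsverity.h>`, and the trailing number in an identifier such as `fsverity-sha512-12` is assumed to be the base-2 logarithm of the block size.

```python
import struct

# Values of the kernel UAPI constants from <linux/fsverity.h>.
FS_VERITY_HASH_ALG_SHA256 = 1
FS_VERITY_HASH_ALG_SHA512 = 2

# hash name -> (kernel constant, digest size in bytes)
_HASHES = {
    "sha256": (FS_VERITY_HASH_ALG_SHA256, 32),
    "sha512": (FS_VERITY_HASH_ALG_SHA512, 64),
}


def parse_algorithm(identifier):
    """Split e.g. "fsverity-sha512-12" into (constant, digest size, block size)."""
    prefix, hash_name, log_block_size = identifier.split("-")
    if prefix != "fsverity" or hash_name not in _HASHES:
        raise ValueError(f"unsupported algorithm: {identifier}")
    if log_block_size not in ("12", "16"):
        raise ValueError(f"unsupported block size in: {identifier}")
    algorithm, digest_size = _HASHES[hash_name]
    return algorithm, digest_size, 1 << int(log_block_size)


def formatted_digest(identifier, digest_hex):
    """Serialize struct fsverity_formatted_digest for a composefs digest."""
    algorithm, digest_size, _block_size = parse_algorithm(identifier)
    digest = bytes.fromhex(digest_hex)
    if len(digest) != digest_size:
        raise ValueError("digest length does not match the algorithm")
    # magic[8], __le16 digest_algorithm, __le16 digest_size, then the digest.
    return struct.pack("<8sHH", b"FSVerity", algorithm, digest_size) + digest
```

The block size does not appear in the serialized structure; it is already bound into the fsverity digest itself.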
+Composefs algorithm identifiers map to kernel constants with no salt: +- `fsverity-sha512-12` → `FS_VERITY_HASH_ALG_SHA512`, 4096-byte blocks +- `fsverity-sha256-12` → `FS_VERITY_HASH_ALG_SHA256`, 4096-byte blocks +- `fsverity-sha512-16` → `FS_VERITY_HASH_ALG_SHA512`, 65536-byte blocks +- `fsverity-sha256-16` → `FS_VERITY_HASH_ALG_SHA256`, 65536-byte blocks -### Label versus Annotation Semantics +All entries in a single composefs artifact MUST use the same algorithm. The algorithm is declared in the `composefs.algorithm` annotation on the composefs artifact manifest (e.g. `fsverity-sha512-12`). -Config labels store the authoritative digest because the config represents container identity while the manifest is a transport artifact. Labels are part of the container specification and create a new artifact (sealed image) rather than mutating metadata. Manifest annotations are retained for discovery purposes, allowing registries to identify sealed images without parsing configs and enabling clients to optimize pull strategies. +For manifest and config signatures, the fsverity digest is computed over the exact JSON bytes as stored in the registry. These files are stored locally with fsverity enabled so that reads are kernel-verified. -## Verification Modes +##### Discovery and Verification -### Eager Verification +Discovery uses the standard [OCI Distribution Spec referrers API](https://github.com/opencontainers/distribution-spec/blob/main/spec.md#listing-referrers): +``` +GET /v2/<name>/referrers/<digest>?artifactType=application/vnd.composefs.erofs-alongside.v1 +GET /v2/<name>/referrers/<digest>?artifactType=application/vnd.composefs.canonical.v1 +``` -Eager verification occurs during image pull. The composefs image is immediately created and its digest is verified against the config label. This makes the container ready to mount immediately after pull and is suitable for boot scenarios where operations should be read-only.
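The referrers queries above take the repository name and the digest of the subject image manifest. A minimal sketch of building such a request URL follows; the function name `referrers_url` and the host `registry.example` are hypothetical, and HTTPS is assumed.

```python
from urllib.parse import quote, urlencode


def referrers_url(registry, repository, subject_digest, artifact_type):
    """Build the referrers API URL for one composefs artifact type."""
    # Repository names may contain "/" path separators, which stay unescaped.
    path = f"/v2/{quote(repository)}/referrers/{quote(subject_digest, safe=':')}"
    # The artifactType filter is a query parameter, so it is percent-encoded.
    query = urlencode({"artifactType": artifact_type})
    return f"https://{registry}{path}?{query}"
```

A registry that supports filtering returns only matching descriptors; a client should still check the `artifactType` of each returned descriptor, since filtering is optional for registries.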
+Verification depends on the mode: -### Lazy Verification +**EROFS-alongside** (`artifactType: application/vnd.composefs.erofs-alongside.v1`): +1. Check `subject` matches the sealed image manifest digest +2. Extract EROFS metadata layers from the artifact, decompressing if zstd-compressed (determined by media type) +3. Fetch and unpack each tar layer; generate a canonical in-memory metadata representation (e.g. composefs dumpfile) from the tar and compare against the EROFS metadata — disagreement is fatal +4. The EROFS metadata is used directly (no local generation needed) +5. If signature layers are present, apply them via `FS_IOC_ENABLE_VERITY` to the EROFS files +6. If the source manifest has composefs digest annotations, verify they match the artifact's `composefs.digest` values -Lazy verification defers composefs creation until first mount. The pull operation stores layers and config but doesn't build the composefs image. On mount, the composefs image is built and verified against the label. This mode is suitable for application containers where many images may be pulled but only some are actually used. +**Canonical-EROFS** *(future)* (`artifactType: application/vnd.composefs.canonical.v1`): +1. Check `subject` matches the sealed image manifest digest +2. Read `composefs.digest` annotations from signature layers (or from the source manifest annotations) to learn the expected fsverity digests +3. Generate the EROFS locally from the tar layers using the canonical process +4. Compute the fsverity digest of the locally generated EROFS and verify it matches the expected digest +5. If signature layers are present, apply them via `FS_IOC_ENABLE_VERITY` to the EROFS files -## Security Model +In both modes, the kernel handles PKCS#7 validation when signatures are used — failed verification prevents reading the file. 
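The `subject` check and the annotation cross-check from the erofs-alongside steps above can be sketched as follows. This is an illustration only: the function name `check_artifact_against_manifest` is hypothetical, both manifests are assumed to be already parsed into dictionaries, and the tar-versus-EROFS consistency check and signature handling are out of scope.

```python
def check_artifact_against_manifest(image_manifest, image_digest, artifact):
    """Return the per-layer composefs digests after cross-checking them."""
    if artifact["subject"]["digest"] != image_digest:
        raise ValueError("artifact subject does not reference this image")

    algorithm = artifact["annotations"]["composefs.algorithm"]
    annotation_key = f"composefs.layer.{algorithm}"

    # Digests of the EROFS "layer" entries, in artifact (= manifest) order.
    artifact_digests = [
        layer["annotations"]["composefs.digest"]
        for layer in artifact["layers"]
        if layer.get("annotations", {}).get("composefs.erofs.type") == "layer"
    ]

    image_layers = image_manifest["layers"]
    if len(artifact_digests) != len(image_layers):
        raise ValueError("artifact does not describe every manifest layer")

    for index, (layer, expected) in enumerate(zip(image_layers, artifact_digests)):
        # Manifest annotations are optional, but when present they MUST agree.
        annotated = layer.get("annotations", {}).get(annotation_key)
        if annotated is not None and annotated != expected:
            raise ValueError(f"composefs digest mismatch for layer {index}")
    return artifact_digests
```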
-### Registry-Provided Sealed Images +``` +External CA/Keystore + ↓ issues certificate for .fs-verity keyring +PKCS#7 signatures (from artifact layers) + ↓ applied via FS_IOC_ENABLE_VERITY to each file +Manifest JSON, Config JSON, EROFS layer blobs + ↓ kernel fsverity enforcement on every read +Runtime file access +``` -For images sealed by the registry or vendor, the seal is computed during the build process and the seal label is embedded in the published config. An external signature covers the manifest. Clients verify the chain: signature → manifest → config → composefs. Trust is placed in the image producer and the signature key. +##### Implementation Considerations -### Client-Sealed Images +Kernel-level signature verification depends on Linux kernel fsverity (CONFIG_FS_VERITY, CONFIG_FS_VERITY_BUILTIN_SIGNATURES). Signature validation and file access enforcement are handled by the Linux kernel. -For images sealed locally by the client, the client pulls an image that may be unsigned and computes the seal locally. The client stores the sealed config in its local repository. On boot or mount, the client can re-fetch the manifest from the network to verify freshness. Trust is placed in the network fetch (TLS) and local verification. +When signatures are present, the manifest and config signature entries MUST also be present — there is no reason to sign individual layers without also signing the manifest and config that reference them. The merged entry remains optional. -## Attack Mitigation +In erofs-alongside mode, the EROFS `layer` group MUST always be present (that is the primary purpose of the artifact). Signature layers are optional — an erofs-alongside artifact without signatures is valid and supports digest-only verification. This is the expected common case: a composefs artifact is attached to an existing image to provide EROFS metadata, without requiring the publisher to have signing keys. 
-### Digest Mismatch +In canonical-EROFS mode, the composefs artifact exists only to carry signatures (the EROFS is generated locally). If an implementation uses digest-only verification, it does not need a composefs artifact at all — the `composefs.layer.*` annotations on the image manifest are sufficient. -If a config label doesn't match the actual EROFS, the mount operation fails the fsverity check. Verification APIs can detect this condition before mounting. +Clients that pull images with composefs artifacts are expected to also store the artifact locally alongside the image (it's just a small amount of metadata), and to attach the signatures to the corresponding files at the Linux kernel level. This enables offline verification and allows fsverity signatures to be applied when files are later accessed. However, local storage of the artifact is not strictly required — a client could re-fetch the artifact from the registry when needed, or operate in digest-only mode where the composefs digests themselves are trusted without kernel signature verification. -### Signature Bypass +Implementations should focus on erofs-alongside mode, which works today. Once the canonical EROFS specification is finalized, implementations SHOULD support both modes. -Any attempt to modify the config label without updating the signature fails because the signature covers the manifest, which covers the config digest. Any config change produces a new digest, breaking the signature chain. 
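A client that supports both modes needs a policy for choosing between artifacts returned by the referrers API. The following sketch assumes a simple preference order (erofs-alongside first, since it works today); the function name `select_composefs_artifact` is hypothetical, and the input is assumed to be the parsed image index returned by the referrers endpoint.

```python
PREFERRED_ARTIFACT_TYPES = [
    "application/vnd.composefs.erofs-alongside.v1",
    "application/vnd.composefs.canonical.v1",
]


def select_composefs_artifact(referrers_index):
    """Pick the descriptor of the most preferred composefs artifact, if any."""
    descriptors = referrers_index.get("manifests", [])
    for artifact_type in PREFERRED_ARTIFACT_TYPES:
        for descriptor in descriptors:
            if descriptor.get("artifactType") == artifact_type:
                return descriptor
    # No composefs artifact: fall back to digest-only mode or a plain pull.
    return None
```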
+##### Media Types -### Rollback Attack +- `application/vnd.composefs.erofs-alongside.v1`: Artifact type for erofs-alongside composefs artifacts (EROFS metadata + optional signatures) +- `application/vnd.composefs.canonical.v1`: Artifact type for canonical-EROFS composefs artifacts (signatures only) +- `application/vnd.composefs.v1.erofs`: Layer media type for uncompressed prebuilt EROFS metadata images (erofs-alongside only) +- `application/vnd.composefs.v1.erofs+zstd`: Layer media type for zstd-compressed prebuilt EROFS metadata images (erofs-alongside only). See [EROFS Layer Compression](#erofs-layer-compression). +- `application/vnd.composefs.signature.v1+pkcs7`: Layer media type for PKCS#7 DER signature blobs -For application containers, re-fetching the manifest on boot checks for freshness. For host systems, embedding the manifest in the boot artifact prevents rollback. +## Storage model -### Layer Confusion +It is recommended to store the config, manifest and unpacked layers. -Per-layer fsverity annotations allow verification before merging. Implementations that maintain digest maps can link layer SHA256 digests to fsverity digests. +In erofs-alongside mode, the prebuilt EROFS is fetched from the registry and stored directly. In canonical-EROFS mode, the EROFS is generated locally on-demand or cached (indexed by manifest digest). In either case, the composefs artifact itself should be stored locally to enable offline signature verification. ## Relationship to Booting with composefs OCI sealing is independent from but complementary to composefs boot verification (UKI, BLS, etc.). These are separate mechanisms operating at different stages of the system lifecycle with different trust models. -OCI sealing provides runtime verification of container images distributed through registries. The trust chain typically flows from external signatures (cosign, GPG) through OCI manifests to composefs digests. 
+It is expected that boot-sealed images would *also* be OCI sealed, although this is not strictly required. + +### Bootable composefs UKI and kernel command line -Boot verification is designed to be rooted in extant hardware mechanisms such as Secure Boot. The composefs digest is embedded directly in boot artifacts (UKI `.cmdline` section, BLS entry `options` field) and verified during early boot by the initramfs. +In the default implemented model, the UKI's kernel command line includes the fsverity digest of a slightly modified EROFS (without `/boot`, among other things). This currently relies on canonical-EROFS mode since the digest must match between what the UKI embeds at build time and what the client generates at boot time. -These mechanisms work together in a complete workflow where a sealed OCI image can be pulled from a registry, verified through OCI sealing, and then used to build a boot artifact with the composefs digest embedded for boot verification. However, each mechanism operates independently with its own trust anchor and threat model. +With erofs-alongside mode, it would also be possible to instead load signing keys into the kernel's `.fs-verity` keyring from the initramfs (these may be the same keys used for application images, or different ones), and use the composefs artifact signature scheme for mounting the root filesystem from the initramfs. This would remove the dependency on canonical EROFS generation for boot. ## Future Directions +### Incremental Pulls via EROFS-alongside + +In erofs-alongside mode, the EROFS metadata contains fsverity digests of all content objects, so the client can determine which objects it already has locally and only fetch the missing ones from the tar layer. The EROFS effectively acts as a table of contents: it is a metadata format that is natively supported by the Linux kernel and has multiple userspace parsers.
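The "table of contents" role can be sketched as a simple set computation. The names below are hypothetical: `referenced_objects` stands for the `(path, fsverity digest)` pairs an EROFS parser would yield, and `local_objects` for the digests already present in the local object store.

```python
def plan_fetch(referenced_objects, local_objects):
    """Return {digest: [paths]} for content objects that still need fetching."""
    missing = {}
    for path, digest in referenced_objects:
        if digest in local_objects:
            continue  # already in the object store, nothing to download
        # Several paths may share one object; it only needs fetching once.
        missing.setdefault(digest, []).append(path)
    return missing
```

Each fetched object would then be verified independently by computing its fsverity digest and comparing it with the key under which it was requested.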
+ +A key advantage over existing approaches (zstd:chunked, eStargz) is that the composefs digest eliminates the need to verify the OCI `diff_id`, which in turn eliminates the need for tar-split metadata. The tar layer becomes purely a content delivery mechanism — each fetched object is verified independently by its fsverity digest against the trusted EROFS. + +To push an incrementally-pulled image, the client must regenerate the tar layer deterministically. This requires a canonical tar format — see [canonical-tar.md](canonical-tar.md). + +See [incremental-pulls.md](incremental-pulls.md) for the full design, including the composefs-chunked layer format, offset map structure, and pull protocol. + ### Dumpfile Digest as Canonical Identifier -The fsverity digest ties implementations to a specific EROFS format. A dumpfile digest (SHA256 of the composefs dumpfile format) would enable format evolution. This would be stored as an additional label `containers.composefs.dumpfile.sha256` alongside the fsverity digest. +The fsverity digest ties implementations to a specific EROFS format; for more details on this, see [this issue](https://github.com/composefs/composefs/issues/198). A dumpfile digest (classic SHA or fsverity digest) of the composefs dumpfile format would enable format evolution. -The dumpfile format is format-agnostic, meaning the same dumpfile can generate different EROFS versions. This simplifies standardization since the dumpfile format is simpler than EROFS and provides future-proofing to migrate to composefs-over-squashfs or other formats. +This would also be stored as an annotation: -The challenge is that verification becomes slower as it requires parsing a saved EROFS from disk to dumpfile format. Caching the dumpfile digest to fsverity digest mapping introduces complexity and security implications. 
A use case split might apply dumpfile digests to application containers (for format flexibility) while using fsverity digests for host boot (for speed with minimal skew). +```json +{ + "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", + "digest": "sha256:9834876dcfb05cb167a5c24953eba58c4ac89b1adf57f28f2f9d09af107ee8f0", + "size": 32654, + "annotations": { + "composefs.layer.fsverity-sha512-12": "3abb6677af34ac57c0ca5828fd94f9d886c26ce59a8ce60ecf6778079423dccff1d6f19cb655805d56098e6d38a1a710dee59523eed7511e5a9e4b8ccb3a4686", + "composefs.layer.fsverity-sha512-12.signature": "MIIBkgYJKo...base64-encoded-pkcs7...", + "composefs.dumpfile.sha512": "62d4b68bc4d336ff0982b93832d9a1f1d40206b49218299e5ac2e50f683d23f17bb99a1f3805339232abebd702eeda204827cfde244bf833e42b67a2fe632dc0" + } +} +``` + +A downside though is that because the mapping from the tar layer to the EROFS was not pre-computed server side, there is no way to attach a kernel-native signature. However, it does still allow efficient validation of the complete filesystem tree, given only the saved metadata (e.g. tar-split or splitstream) in combination with the fsverity digests of content. ### Integration with zstd:chunked @@ -173,10 +556,6 @@ Both zstd:chunked and composefs add new digests to OCI images. The zstd:chunked Adding fsverity to zstd:chunked TOC entries would allow using the TOC digest as a canonical composefs identifier. This would support a direct TOC → dumpfile → composefs pipeline, with a single metadata format serving both zstd:chunked and composefs use cases. -### Three-Digest Model - -To support both flattened and layered mounting strategies, three digests could be stored per image: a base image digest, a derived layers digest, and a flattened digest. This would enable mounting a single flattened composefs for speed, mounting base and derived separately to avoid metadata amplification, or verifying the base from upstream while only rebuilding derived layers. 
This aligns with the existing `org.opencontainers.image.base.digest` standard. - ## References **Design discussion**: [composefs/composefs#294](https://github.com/composefs/composefs/issues/294) @@ -192,8 +571,7 @@ To support both flattened and layered mounting strategies, three digests could b **Standards**: - [OCI Image Specification](https://github.com/opencontainers/image-spec) -- [Canonical JSON](https://wiki.laptop.org/go/Canonical_JSON) ## Contributors -This specification synthesizes ideas from Colin Walters (original design proposals and iteration), Allison Karlitskaya (implementation and practical refinements), and Alexander Larsson (security model and non-root mounting insights). Significant assistance from Claude Sonnet 4.5 was used in synthesis. +This specification synthesizes ideas from Colin Walters (original design proposals and iteration), Allison Karlitskaya (implementation and practical refinements), Alexander Larsson (security model and non-root mounting insights), and Giuseppe Scrivano (across the board) with assistance from Claude Sonnet 4.5 and Claude Opus 4. diff --git a/doc/plans/standardized-erofs-meta.md b/doc/plans/standardized-erofs-meta.md new file mode 100644 index 00000000..0413a01a --- /dev/null +++ b/doc/plans/standardized-erofs-meta.md @@ -0,0 +1,83 @@ +# Standardized EROFS Metadata Serialization + +This document outlines the goal of standardizing how composefs serializes filesystem trees to EROFS metadata images. + +## Relationship to OCI Sealing Modes + +The [OCI sealing specification](oci-sealing-spec.md) defines two EROFS provisioning modes. This standardization work is specifically required for **canonical-EROFS mode**, where the client generates the EROFS locally and must produce a byte-identical result to what the server (or any other implementation) would generate. + +**EROFS-alongside mode** does not require this standardization because the publisher ships the exact EROFS bytes to clients. 
EROFS-alongside can be used today without solving the problems described here. + +However, even in erofs-alongside mode, a canonical dumpfile representation is valuable for the consistency check between the tar layer and the prebuilt EROFS (see erofs-alongside verification in the OCI sealing spec). + +## Goal + +Standardize how a filesystem tree, expressed canonically as a composefs dumpfile (or equivalent representation), is serialized to EROFS metadata. This enables reproducible EROFS generation across implementations and is a prerequisite for canonical-EROFS mode in the OCI sealing specification. + +## Conceptual Model + +The canonical transformation model is: + +``` +tar layer → dumpfile → EROFS metadata +``` + +Even when implementations optimize by going directly from tar to EROFS for efficiency, the canonical model remains tar → dumpfile → EROFS. This means: + +1. Two implementations processing the same tar layer should produce equivalent dumpfiles +2. Two implementations processing the same dumpfile MUST produce byte-identical EROFS images +3. Therefore, two implementations processing the same tar layer should produce byte-identical EROFS images + +The dumpfile serves as the canonical intermediate representation that defines the filesystem tree independent of serialization format. + +## Why This Matters + +- **Canonical-EROFS OCI sealing**: Canonical-EROFS mode in the OCI sealing specification depends entirely on this standardization. Without it, fsverity digests computed by different implementations would not match, and signatures would fail to verify. 
+- **Reproducible EROFS generation**: Given identical inputs, composefs-c, composefs-rs, and any future implementations must produce byte-for-byte identical EROFS images +- **Ecosystem compatibility**: Container runtimes, build tools, and registries can use different implementations interchangeably +- **UKI boot**: The sealed UKI boot model embeds a composefs digest in the kernel command line, which must match the EROFS generated at boot time — this is inherently a canonical-EROFS use case + +Note: EROFS-alongside mode provides an alternative path that avoids these requirements, at the cost of shipping EROFS metadata on the registry. See [oci-sealing-spec.md](oci-sealing-spec.md) for a comparison. + +## Current State + +This standardization is a work in progress: + +- **[composefs/composefs#423](https://github.com/composefs/composefs/discussions/423)**: Discussion on compatible EROFS output across implementations +- **[composefs-rs PR #225](https://github.com/composefs/composefs-rs/pull/225)**: Initial reimplementation of composefs-c in Rust, with compatible EROFS output as a key goal + +## Open Questions + +The following details need to be standardized (future work): + +### EROFS Format Options +- EROFS format version and feature flags +- Block size (currently 4096) +- Compression settings (composefs uses uncompressed metadata) + +### Inode Representation +- Compact vs extended inode format +- Inode numbering scheme +- Handling of hardlinks (inode sharing) + +### Metadata Ordering +- Inode table ordering (depth-first? breadth-first? by path?) 
+- Directory entry ordering within directories +- Xattr key ordering within an inode +- Shared xattr table construction algorithm + +### Content Handling +- Inline data threshold (currently ~64 bytes for external, but exact cutoff matters) +- External file references via overlay metacopy xattrs +- Symlink target storage + +### OCI-Specific Concerns +- Whiteout representation (should not appear in final EROFS — processed during merge) +- Root inode metadata normalization (copying from `/usr`) +- Timestamp precision (seconds only, matching tar limitations) + +## References + +- [Splitstream binary format](../splitstream.md) — related binary format for storing tar data +- [OCI sealing specification](oci-sealing-spec.md) — depends on reproducible EROFS generation +- [EROFS documentation](https://docs.kernel.org/filesystems/erofs.html) — kernel filesystem documentation