Version: 1.0 Copyright (c) 2026 BitConcepts, LLC. All rights reserved.
Commercial licensing: info@bitconcepts.tech
All multi-byte integers are little-endian unless noted otherwise.
Magic: "CP" (0x43 0x50)
Offset Size Field
────── ──── ─────────────────────────────
0 2 magic ("CP")
2 1 version (1)
3 2 flags (reserved, 0x0000)
5 1 backend_id
6 4 original_size (LE u32)
10 2 dag_descriptor_len (LE u16)
12 N dag_descriptor (N = dag_descriptor_len)
12+N ... compressed_payload
Minimum header: 12 bytes.
0x00— Raw (passthrough)0x01— Zstd0x02— Brotli0x03— Gzip (RFC 1952, deflate)0x04— LZMA (xz format)
When dag_descriptor_len > 0, the descriptor encodes the transform chain:
Offset Size Field
────── ──── ─────────────────────
0 1 transform_count
1 N transform_ids (1 byte each)
1+N ... per-transform: meta_len (LE u16) + meta_bytes
Transform IDs are defined in cpac-dag/src/registry.rs.
Magic: "CPBL" (0x43 0x50 0x42 0x4C)
Offset Size Field
────── ──── ─────────────────────────────
0 4 magic ("CPBL")
4 1 version (1)
5 4 block_count (LE u32)
9 8 original_size (LE u64)
17 4×N block_size_table (LE u32 per block)
17+4N ... block_payloads (concatenated)
Each block payload is a complete CP frame, independently decompressible. Default block size: 1 MiB. Auto-engaged for inputs > 256 KiB.
Magic: "TP" (0x54 0x50)
Offset Size Field
────── ──── ─────────────────────────────
0 2 magic ("TP")
2 1 version (1)
3 1 transform_count
4 N transform_ids (1 byte each)
4+N ... per-transform: param_len (LE u16) + params
... ... payload (transformed data)
CPAC supports 11 transforms with SIMD acceleration:
0x01— Delta (params: stride as LE u16)0x02— ZigZag (params: none)0x03— Transpose (params: element_width as LE u16)0x04— ROLZ (params: none, self-framed)0x05— FloatSplit (params: none, self-framed)0x06— FieldLZ (params: none, self-framed)0x07— RangePack (params: min, max as LE u64)0x08— Tokenize (params: none, self-framed)0x09— PrefixStrip (params: prefix_len as LE u16)0x0A— Dedup (params: none, self-framed)0x0B— ParseInt (params: none, self-framed)
SIMD runtime dispatch (best to worst): AVX-512 → AVX2 → SSE4.1 → SSE2 → NEON → scalar
Transforms are applied in order during compression and reversed during decompression. If no TP magic is present, data is treated as raw.
Magic: "CS" (0x43 0x53)
Offset Size Field
────── ──── ─────────────────────────────
0 2 magic ("CS")
2 1 version (1)
3 4 num_blocks (LE u32)
7 8 original_size (LE u64)
15 4 block_size (LE u32)
19 ... per-block: compressed_len (LE u32) + block_data
Each block is independently compressed via the standard CP pipeline. Supports both sequential and parallel decompression.
Magic: "CPHE" (0x43 0x50 0x48 0x45)
Offset Size Field
────── ──── ─────────────────────────────
0 4 magic ("CPHE")
4 1 version (1)
5 32 ephemeral_x25519_public
37 2 mlkem_ciphertext_len (LE u16)
39 M mlkem_ciphertext (M = mlkem_ciphertext_len)
39+M 1 aead_nonce_len
40+M K aead_nonce (K = aead_nonce_len)
40+M+K ... aead_ciphertext (ChaCha20-Poly1305)
Key derivation: HKDF-SHA256 over concatenation of X25519 shared secret
and ML-KEM-768 shared secret, with salt "CPHE-hybrid-salt" and info
"CPHE-hybrid-v1".
Magic: "CPAR" (0x43 0x50 0x41 0x52)
Offset Size Field
────── ──── ─────────────────────────────
0 4 magic ("CPAR")
4 1 version (1)
5 1 flags (reserved, 0x00)
6 4 num_entries (LE u32)
10 ... entries (sequential)
Each entry:
Offset Size Field
────── ──── ─────────────────────────────
0 2 path_len (LE u16)
2 P path (UTF-8, forward slashes)
2+P 8 original_size (LE u64)
10+P 8 compressed_size (LE u64)
18+P 1 flags (reserved)
19+P 8 timestamp (Unix epoch seconds, LE u64)
27+P C compressed_data (C = compressed_size, CP frame)
CPAC includes comprehensive test infrastructure:
Transform-specific golden vectors ensure deterministic behavior across:
- 11 transforms × 3 backends (Zstd, Brotli, Raw) = 33 golden vector sets
- Generated via
tests/generate_transform_goldens.rs - Stored in
tests/goldens/directory
Validation suite (tests/determinism.rs) ensures:
- Empty input handling
- Single-byte reproducibility
- All-zeros compression
- Random data consistency
- Cross-thread determinism
- Multi-iteration stability
CPAC is validated against industry-standard corpora:
- Canterbury Corpus (11 files, ~3 MB)
- Silesia Corpus (12 files, ~211 MB)
- Calgary Corpus (18 files, ~3 MB)
- Large Canterbury (6 files, ~10 MB)
- enwik8/enwik9 (Wikipedia text)
- Maximum Compression (44 files, ~400 MB)
- SQuash Corpus (17 files, ~34 GB)
Corpus management via YAML configs with automatic HTTP/ZIP/TAR.GZ downloads.
- 289+ total tests: regression, golden vectors, property-based, determinism
- 33 transform golden vectors: all transforms × all backends
- 6 determinism tests: cross-thread, multi-iteration, edge cases
- Zero test failures maintained across all platforms
Runtime CPU detection with fallback chain:
- x86_64: AVX-512 → AVX2 → SSE4.1 → SSE2 → scalar
- aarch64: NEON → scalar
SIMD kernels implemented for:
- Delta encode/decode
- ZigZag encode/decode
- Transpose (element-wise)
Additional kernels (NEON aarch64): crates/cpac-transforms/src/simd/neon.rs
Automated PGO build system (scripts/pgo-build.ps1):
- Instrumented build with PGO flags
- Training run on representative corpus
- Optimized rebuild using profile data
- 5-15% performance improvement typical
Multi-platform CI/CD (.github/workflows/ci.yml):
- Platforms: Ubuntu (x86_64), Windows (x86_64), macOS (x86_64/aarch64)
- Tests: 289+ test suite, clippy (deny warnings), rustfmt
- Coverage: cargo-tarpaulin with codecov upload
- Matrix: stable/nightly Rust versions
Automatic corpus management (crates/cpac-engine/src/corpus.rs):
- HTTP downloads with progress bars (indicatif)
- ZIP/TAR.GZ extraction
- YAML corpus configuration (serde_yaml)
- Feature-gated (
downloadfeature, optional dependencies) - 18+ curated benchmark datasets
CPAC supports five entropy coding backends:
- Implementation:
zstd-safecrate binding to libzstd - Compression levels: 1-22 (default: 3)
- Features: dictionary compression, streaming, parallel threads
- Use case: general-purpose, fast decompression
- Implementation:
brotlicrate (pure Rust) - Compression levels: 0-11 (default: 11)
- Features: quality/window size tuning, streaming
- Use case: maximum compression ratio, web content
- Implementation:
flate2crate (deflate algorithm, RFC 1952) - Compression levels: 0-9 (default: 6)
- Features: streaming, CRC32 checksums
- Use case: compatibility, widespread tool support, fast decompression
- Implementation:
xz2crate (wraps liblzma) - Compression presets: 0-9 (default: 6)
- Features: high compression ratios, dictionary-based, full level control
- Use case: maximum compression, archival, slow decompression acceptable
- Implementation:
xz2crate (wraps liblzma, XZ container format) - Compression presets: 0-9 (default: 6 at all CPAC levels)
- Features: LZMA2 algorithm with XZ framing, CRC checksums
- Use case: archival, maximum compression with standardized container
- Implementation:
lz4_flexcrate (pure Rust) - Features: block-level compression with prepended size
- Use case: maximum throughput, real-time compression/decompression
- Implementation:
snapcrate (pure Rust) - Features: single-speed compression, no level control
- Use case: lowest latency, hot-path data, real-time streaming
- Implementation:
lzhamcrate (FFI to C++) - Compression levels: 0-4 (FASTEST to UBER)
- Features: high compression ratios, dictionary-based
- Use case: high-ratio archival, offline compression
- Implementation: vendored C source via
cccrate - Compression levels: 10-49 across 4 modes (fastLZ4, LIZv1, +Huffman variants)
- Features: LZ4-derivative with optional Huffman entropy coding
- Use case: tunable speed/ratio tradeoff
- Implementation:
libz-ng-syscrate (FFI to zlib-ng) - Compression levels: 1-9
- Features: gzip-compatible output with optimized zlib-ng engine
- Use case: gzip-compatible compression with better performance
- Implementation: delegates to Zstd (future: OpenZL C++ DAG via FFI)
- Compression levels: same as Zstd (1-19)
- Features: CPAC datacenter compression initiative
- Use case: datacenter workloads, managed compression
- Passthrough mode (no compression)
- Used for pre-compressed data or incompressible streams
- Zero overhead for already-compressed inputs
- Use case: transform-only pipelines, testing
All backends support:
- Single-threaded and parallel modes
- Memory-mapped I/O (files > 64 MB)
- Streaming with progress callbacks
- Compression/decompression verification
CPAC uses DAG (Directed Acyclic Graph) profiles to compose transform chains:
- Auto — SSR-based auto-selection (default)
- Fast — minimal transforms, optimize for speed
- Balanced — moderate transform chain
- Max — maximum compression, all applicable transforms
- Text — text-optimized (tokenize, prefix-strip, dedup)
- Transform chains specified in
cpac-dag/src/profiles.rs - Auto-selection via SSR analysis (
cpac-ssr) - Runtime compilation and execution
- Transform metadata stored in DAG descriptor (CP frame offset 12+N)
CPAC includes specialized handlers for common data types:
- CSV — column-aware compression, header detection
- JSON — structure-aware, key deduplication
- XML — tag folding, attribute optimization
- YAML — indentation normalization
- Logs — timestamp extraction, pattern recognition
Handlers automatically engaged via file extension or SSR analysis.
CAS infers data constraints for optimization:
- Range constraints: min/max value detection
- Enum detection: categorical value sets
- Constant detection: repeated values
- Monotonic detection: sorted sequences
- Functional dependencies: column correlations
Used by transform auto-selection and RangePack transform.
- v1.0 (2026-03-02) — Comprehensive specification with all features, backends, transforms, testing infrastructure, performance tools, corpus management