CPAC Development Ledger

Session-by-session record of significant changes, investigations, and decisions.

Session 28 — 2026-03-12 (Security Fixes + CI Hardening)

Focus

Resolve all Dependabot and CodeQL code-scanning alerts on the repository.

Dependabot Alerts (3 moderate — ml-dsa)

All three ml-dsa vulnerabilities were already patched — the dependency was at 0.1.0-rc.7, newer than all three patched versions (rc.2, rc.4, rc.5). Dependabot couldn't verify this because Cargo.lock was gitignored.

Fix: Removed Cargo.lock from .gitignore and committed the lockfile. Dependabot can now resolve versions and auto-close the alerts.

Patched vulnerabilities:

Timing side-channel in ML-DSA decomposition (patched in rc.2)
UseHint off-by-two error when r0 equals zero (patched in rc.5)
Signature verification accepts repeated hint indices (patched in rc.4)
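
The version reasoning above can be checked mechanically. The following is a minimal sketch, assuming the patched releases share the 0.1.0 core with the installed 0.1.0-rc.7 (the ledger records only the rc numbers). rc_number and is_patched are hypothetical helpers for illustration, not part of Cargo or Dependabot.

```rust
/// Parse the release-candidate number from a version string shaped like
/// "0.1.0-rc.7". Returns None for any other shape.
fn rc_number(version: &str) -> Option<u32> {
    let (_, rc) = version.split_once("-rc.")?;
    rc.parse().ok()
}

/// Some(true) when `installed` is at or past `patched`.
/// Returns None when the two versions do not share the same
/// major.minor.patch core, because this sketch only compares rc numbers.
fn is_patched(installed: &str, patched: &str) -> Option<bool> {
    let (core_installed, _) = installed.split_once("-rc.")?;
    let (core_patched, _) = patched.split_once("-rc.")?;
    if core_installed != core_patched {
        return None;
    }
    Some(rc_number(installed)? >= rc_number(patched)?)
}
```

Under that assumption, rc.7 compares as patched against rc.2, rc.4, and rc.5 alike, which matches the conclusion that all three alerts were stale.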

CodeQL Code-Scanning Alerts (9 findings — missing workflow permissions)

All 9 alerts were for missing permissions blocks in GitHub Actions workflows.

ci.yml — Added top-level permissions: contents: read (least privilege)
release.yml — Already had permissions: contents: write on the develop branch

Files Modified

.gitignore — Removed Cargo.lock exclusion
Cargo.lock — Committed to repository (5,140 lines)
.github/workflows/ci.yml — Added permissions block

Session 27 — 2026-03-12 (Benchmarking + Profile Tuning)

Focus

Run full-corpus benchmarks covering all 6 ratio-improvement phases. Fix profile timeouts on large Silesia files, run a targeted retry, and clean up disk space.

Benchmark Results — Balanced Profile (773/777 OK)

Ran benchmark-all with profile_balanced.yaml. 4 Silesia files timed out (nci, samba, webster, mozilla) at the default 900 s timeout.

Key corpus averages (best ratio per file):

loghub2_2k: 16.63× (Brotli@11 most common best backend; Zstd Best 15.25×)
nasa_logs: 8.56×
canterbury: 5.84×
silesia (excl. timed-out): 4.30×
calgary: 4.03×
enwik8: 3.75×
cloud_configs: 3.63×
kodak: 1.08× (near-incompressible images)

Silesia Retry (12/12 OK)

Created profile_silesia_retry.yaml with timeout: 3600 and large_file_threshold: 15 MB. All 12 Silesia files completed:

nci: 20.68×
samba: 5.74×
webster: 4.94×
mozilla: 3.83×

Profile Changes

profile_balanced.yaml — timeout 900 → 3600 s, large_file_threshold 50 → 15 MB
profile_silesia_retry.yaml — new profile targeting Silesia corpus only

Disk Cleanup

Removed target/debug (~33 GB freed, ~30.8 GB now free).

Files Modified

benches/cpac/profiles/profile_balanced.yaml — timeout + threshold
benches/cpac/profiles/profile_silesia_retry.yaml — new

Session 26 — 2026-03-11 (Phases 3–6: Dictionary, Conditioned BWT, Backend Selection, CAS Bridge)

Focus

Implement the remaining 4 phases (Phases 3–6) of the Compression Ratio Improvement Plan in a single session. All 4 phases pass presubmit (build + test + clippy).

Phase 3 — Auto-Dictionary for Parallel Blocks

CPBL v3 wire format — Extends v2 with a shared zstd dictionary: dict_len(4B) + dict_data in the CPBL header. V1/v2 remain readable.
Dictionary training — compress_parallel() collects the first N blocks (min 3, max 8, max 64 KB total dict size) and trains a zstd dictionary via cpac-dict. Dict is stored once in the CPBL header and applied to all blocks via compress_with_dict() / decompress_with_dict().
Dependency — Added cpac-dict to cpac-engine/Cargo.toml.
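
The sample limits and the dict_len(4B) + dict_data section described above can be sketched as follows. This is an illustration only: the constant names, the helper functions, and the little-endian byte order are assumptions, and the actual training call into cpac-dict is not shown.

```rust
// Limits taken from the ledger entry; the names are hypothetical.
const MIN_SAMPLE_BLOCKS: usize = 3;
const MAX_SAMPLE_BLOCKS: usize = 8;
const MAX_DICT_SIZE: usize = 64 * 1024;

/// Pick the leading blocks to use as dictionary training samples.
/// Returns None when there are too few blocks to train on.
fn pick_dict_samples(blocks: &[Vec<u8>]) -> Option<&[Vec<u8>]> {
    if blocks.len() < MIN_SAMPLE_BLOCKS {
        return None;
    }
    Some(&blocks[..blocks.len().min(MAX_SAMPLE_BLOCKS)])
}

/// Append the dictionary section: a 4-byte length, then the dictionary bytes.
/// Little-endian is an assumption; the ledger does not state the byte order.
fn write_dict_section(dict: &[u8], out: &mut Vec<u8>) -> Result<(), String> {
    if dict.len() > MAX_DICT_SIZE {
        return Err(format!("dictionary too large: {} bytes", dict.len()));
    }
    out.extend_from_slice(&(dict.len() as u32).to_le_bytes());
    out.extend_from_slice(dict);
    Ok(())
}

/// Split a buffer into (dictionary, remaining bytes).
/// Returns None when the buffer is shorter than the declared length.
fn read_dict_section(input: &[u8]) -> Option<(&[u8], &[u8])> {
    let header = input.get(..4)?;
    let len = u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
    let end = 4usize.checked_add(len)?;
    let dict = input.get(4..end)?;
    Some((dict, &input[end..]))
}
```

Storing the dictionary once ahead of the block payloads is what lets every block reference it without paying for it again.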

Phase 4 — Conditioned BWT Composition

New transform — ConditionedBwtTransform (ID = 26) in cpac-transforms/src/conditioned_bwt.rs. Partitions input via cpac_conditioning::partition(), applies BWT + MTF + RLE0 per qualifying stream. Reassembles with a length-prefixed partition table.
Registry — Registered in TransformRegistry::with_builtins() in cpac-dag/src/registry.rs (now 26 transforms total).
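
For readers unfamiliar with the MTF stage named above, here is a minimal sketch of textbook move-to-front coding. It is not the ConditionedBwtTransform code; the function names are hypothetical, and the BWT, RLE0, and partition-table steps are left out.

```rust
/// Move-to-front encode: each byte is replaced by its current index in a
/// recency list, and is then moved to the front of that list.
fn mtf_encode(input: &[u8]) -> Vec<u8> {
    let mut table: Vec<u8> = (0u8..=255).collect();
    input
        .iter()
        .map(|&b| {
            let idx = table
                .iter()
                .position(|&t| t == b)
                .expect("table holds every byte value");
            let byte = table.remove(idx);
            table.insert(0, byte);
            idx as u8
        })
        .collect()
}

/// Move-to-front decode: the inverse of `mtf_encode`.
fn mtf_decode(input: &[u8]) -> Vec<u8> {
    let mut table: Vec<u8> = (0u8..=255).collect();
    input
        .iter()
        .map(|&idx| {
            let byte = table.remove(idx as usize);
            table.insert(0, byte);
            byte
        })
        .collect()
}
```

Repeated bytes, which BWT output tends to contain in long runs, become runs of zeros after MTF. That is the property a zero-run-length stage such as RLE0 relies on.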

Phase 5 — Per-Block Backend Selection

Fix — Replaced hardcoded Track::Track2 with per-block track derived from block_config.cached_ssr in compress_parallel(). Each block now runs auto_select_backend() using its own SSR analysis rather than a single file-level decision.
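
The difference between one file-level decision and a decision per block can be sketched as below. Everything here is a stand-in: the Track enum mirrors only the two names used in this ledger, and toy_select is a hypothetical rule, not the logic of auto_select_backend() or the SSR analysis.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Track {
    Track1,
    Track2,
}

/// Hypothetical selector: all-ASCII data goes to Track1, anything else to Track2.
fn toy_select(data: &[u8]) -> Track {
    if data.iter().all(|b| b.is_ascii()) {
        Track::Track1
    } else {
        Track::Track2
    }
}

/// Run `select` once per block instead of once for the whole input.
fn select_per_block<T>(
    data: &[u8],
    block_size: usize,
    select: impl Fn(&[u8]) -> T,
) -> Vec<T> {
    assert!(block_size > 0, "block_size must be non-zero");
    data.chunks(block_size).map(|block| select(block)).collect()
}
```

On a heterogeneous input (one text block followed by one binary block), the per-block call yields a different track for each block, while a single call over the whole buffer reports only one.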

Phase 6 — CAS Bridge for MSN Fields

TypedColumns — New struct TypedColumns + MsnResult::typed_columns() in cpac-msn/src/lib.rs. Exposes MSN-extracted fields as typed columns (numeric, string, timestamp, boolean) for downstream CAS analysis.
CAS constraint bridge — compress_parallel() calls typed_columns() on MSN results, feeds columns into CAS constraint inference, and applies per-column transforms when the cost model accepts.
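
The idea of exposing extracted fields as typed columns can be illustrated with a small inference sketch. It is an assumption-laden stand-in for TypedColumns: the enum and infer_column are hypothetical, timestamp detection is omitted, and the real struct layout in cpac-msn is not shown in this ledger.

```rust
#[derive(Debug, PartialEq)]
enum TypedColumn {
    Numeric(Vec<f64>),
    Boolean(Vec<bool>),
    Text(Vec<String>),
}

/// Classify a column of raw field values.
/// A column is Boolean or Numeric only when every value parses as that type;
/// otherwise it stays Text. Empty columns are treated as Text.
fn infer_column(values: &[&str]) -> TypedColumn {
    if values.is_empty() {
        return TypedColumn::Text(Vec::new());
    }
    let bools: Option<Vec<bool>> = values.iter().map(|v| v.parse::<bool>().ok()).collect();
    if let Some(bools) = bools {
        return TypedColumn::Boolean(bools);
    }
    let nums: Option<Vec<f64>> = values.iter().map(|v| v.parse::<f64>().ok()).collect();
    if let Some(nums) = nums {
        return TypedColumn::Numeric(nums);
    }
    TypedColumn::Text(values.iter().map(|v| v.to_string()).collect())
}
```

A typed view like this is what lets a downstream stage reason about value ranges or constant columns instead of opaque strings.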

Files Modified

cpac-engine/src/parallel.rs — CPBL v3 dict, per-block backend, CAS bridge
cpac-engine/src/lib.rs — dict-aware compress path wiring
cpac-engine/Cargo.toml — cpac-dict dependency
cpac-transforms/src/conditioned_bwt.rs — new: ConditionedBwtTransform
cpac-transforms/src/lib.rs — module registration
cpac-dag/src/registry.rs — registered transform ID 26
cpac-msn/src/lib.rs — TypedColumns, typed_columns()

Validation

Build: shell.ps1 build ✓
Tests: full workspace (all suites) ✓
Clippy: shell.ps1 clippy (0 warnings) ✓

Session 25 — 2026-03-11 (Phase 2: MSN Cross-Block Metadata Deduplication)

Focus

Implement Phase 2 of the Compression Ratio Improvement Plan: store MSN metadata once in the CPBL header instead of duplicating it in every parallel block frame.

Implementation

Type system — Added msn_metadata_external: bool to CompressConfig and msn_applied: bool to CompressResult in cpac-types/src/lib.rs. Updated all CompressResult construction sites (engine + streaming).
Engine compress() — When msn_metadata_external=true and MSN applies, the per-block frame is CP v1 (no inline metadata), original_size is set to the residual length, and msn_applied=true signals the caller.
CPBL v2 wire format — New format in parallel.rs adds: shared_meta_len(4B) after the v1 header, plus block_flags(1B×N) and shared_metadata between the block size table and payloads. V1 emitted when no MSN metadata (backward compatible).
Compress path — MSN probe in compress_parallel() now sets msn_metadata_external=true on the block config, collects per-block msn_applied flags, and writes CPBL v2 with shared metadata.
Decompress path — decompress_parallel() accepts both v1 and v2. For v2, decodes shared metadata once, then reconstructs MSN-flagged blocks via metadata.with_residual() + cpac_msn::reconstruct().
Block size cap — When MSN is enabled, block size is capped at MAX_DOMAIN_EXTRACT_SIZE (8 MB) so per-block MSN extraction stays within domain handler limits. The probe sample is truncated as well.
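
The v1/v2 choice and the per-block flag bytes described above can be sketched as follows. The flag bit value and the helper names are hypothetical; the ledger states only that v2 carries one flag byte per block and that v1 is emitted when there is no MSN metadata.

```rust
/// Hypothetical flag bit: the block payload is an MSN residual and needs
/// reconstruction from the shared metadata.
const FLAG_MSN_APPLIED: u8 = 0b0000_0001;

/// Pick the container version: v2 only when shared metadata exists and at
/// least one block actually used it; otherwise fall back to v1.
fn choose_version(msn_applied: &[bool], shared_meta: &[u8]) -> u8 {
    if !shared_meta.is_empty() && msn_applied.iter().any(|&applied| applied) {
        2
    } else {
        1
    }
}

/// Encode one flag byte per block (block_flags, 1 byte x N).
fn encode_block_flags(msn_applied: &[bool]) -> Vec<u8> {
    msn_applied
        .iter()
        .map(|&applied| if applied { FLAG_MSN_APPLIED } else { 0 })
        .collect()
}

/// Indices of the blocks the decoder must reconstruct via shared metadata.
fn blocks_needing_reconstruct(flags: &[u8]) -> Vec<usize> {
    flags
        .iter()
        .enumerate()
        .filter(|(_, flag)| (**flag & FLAG_MSN_APPLIED) != 0)
        .map(|(index, _)| index)
        .collect()
}
```

Keeping the flags per block means a file can mix MSN residual blocks with plain blocks while still decoding the shared metadata only once.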

Files Modified

cpac-types/src/lib.rs — New config/result fields
cpac-engine/src/lib.rs — External MSN path in compress()
cpac-engine/src/parallel.rs — CPBL v2 format, compress + decompress
cpac-streaming/src/lib.rs — Updated CompressResult construction
cpac-engine/tests/phase2_msn_dedup.rs — New: 4 roundtrip tests (JSON v2, YAML v1, binary v1, XML v2)

Validation

Build: shell.ps1 build ✓
Tests: full workspace (all suites including new Phase 2 tests) ✓
Clippy: shell.ps1 clippy (0 warnings) ✓

Key Discovery

Adaptive block sizing could produce blocks larger than MSN domain handlers accept (BLOCK_SIZE_LARGE=32 MB > MAX_DOMAIN_EXTRACT_SIZE=8 MB). The MSN extraction silently returned not_applied on oversized blocks, causing the parallel path to emit CPBL v1 even when MSN would have succeeded. Fixed by capping block size at the domain limit when MSN is enabled.
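
The cap amounts to taking the minimum of the adaptive size and the domain limit. A minimal sketch, assuming the sizes are binary megabytes (the ledger writes "MB") and using a hypothetical helper name:

```rust
const MIB: usize = 1024 * 1024;
// Values from the ledger; treating "MB" as MiB is an assumption.
const BLOCK_SIZE_LARGE: usize = 32 * MIB;
const MAX_DOMAIN_EXTRACT_SIZE: usize = 8 * MIB;

/// Block size actually used for the parallel path.
/// With MSN enabled, the adaptive size is clamped to the domain limit so
/// that no block is silently skipped by the extractors.
fn effective_block_size(adaptive: usize, msn_enabled: bool) -> usize {
    if msn_enabled {
        adaptive.min(MAX_DOMAIN_EXTRACT_SIZE)
    } else {
        adaptive
    }
}
```

With MSN enabled, a 32 MB adaptive block is reduced to 8 MB, while sizes already under the limit pass through unchanged.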

Session 24 — 2026-03-11 (Phase 1: Fix Parallel Smart Transform Roundtrip)

Focus

Execute Phase 1 of the Compression Ratio Improvement Plan: enable smart transforms (primarily BWT) on the parallel compression path.

Investigation Findings

Original bug no longer reproduces — The "corrupted output" reported in Sessions 21/22 was caused by an earlier pipeline issue that has since been fixed by other session changes. The skip_expensive_transforms = true guard in compress_parallel() prevented the bug from manifesting but also killed all ratio improvement from transforms.
BWT roundtrips correctly at parallel block sizes — Tested BWT on 4 MB and 17 MB blocks (single-stream and parallel) with full roundtrip verification. BWT metadata is only 4 bytes (the original index), well within the u16 DAG descriptor limit.
Normalize u16 hypothesis (H4) confirmed but moot — The normalize transform generates hundreds of KB to MB of metadata on large blocks (one diff per whitespace removal). The u16 guard at normalize.rs:317 correctly bails out, and the smart_preprocess cost check would also reject it because uncompressed metadata overhead exceeds savings. A future phase can add inline descriptor compression to make normalize viable.
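
The u16 limit discussed in these findings can be made concrete with a short sketch of a length-prefixed metadata encoder that bails out instead of truncating. The function name and the little-endian byte order are assumptions; this is not the code at normalize.rs:317.

```rust
/// Encode one step's metadata with a 2-byte length prefix.
/// Returns None when the metadata is too long for a u16 length
/// (more than 65,535 bytes), so the caller can skip the transform.
fn encode_step_metadata(meta: &[u8]) -> Option<Vec<u8>> {
    if meta.len() > u16::MAX as usize {
        return None;
    }
    let len = meta.len() as u16;
    let mut out = Vec::with_capacity(2 + meta.len());
    out.extend_from_slice(&len.to_le_bytes());
    out.extend_from_slice(meta);
    Some(out)
}
```

A 4-byte BWT index fits with room to spare, whereas metadata of several hundred KB is rejected, which is the bail-out behavior described for normalize.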

Fix Applied

Removed block_config.skip_expensive_transforms = true from compress_parallel() in cpac-engine/src/parallel.rs. BWT now runs on parallel sub-blocks where the analyzer recommends it (≥ 16 MB blocks, ascii_ratio > 0.85, entropy < 5.5).
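
The three conditions quoted above can be written as a single predicate. This is a sketch under assumptions: the entropy is taken to be order-0 Shannon entropy in bits per byte, ascii_ratio is taken to be the fraction of bytes below 0x80, and the function names are hypothetical rather than the analyzer's real API.

```rust
/// Fraction of bytes that are ASCII (below 0x80); 0.0 for empty input.
fn ascii_ratio(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    data.iter().filter(|b| b.is_ascii()).count() as f64 / data.len() as f64
}

/// Order-0 Shannon entropy in bits per byte; 0.0 for empty input.
fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// True when a block meets all three thresholds from the ledger.
/// `min_block` is a parameter so the rule can be exercised on small inputs;
/// the ledger's value is 16 MB.
fn bwt_recommended(data: &[u8], min_block: usize) -> bool {
    data.len() >= min_block && ascii_ratio(data) > 0.85 && shannon_entropy(data) < 5.5
}
```

Repetitive English text passes all three checks, while a buffer that cycles through every byte value fails on both ascii_ratio (0.5) and entropy (8 bits per byte).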

Files Modified

cpac-engine/src/parallel.rs — Removed skip_expensive_transforms override
cpac-engine/tests/phase1_bwt_parallel.rs — New: 2 roundtrip tests at 17 MB block size (plain text + JSON) verifying smart transforms work
docs/ROADMAP.md — Updated known issues: marked parallel roundtrip as RESOLVED

Validation

Build: shell.ps1 build ✓
Tests: full workspace (95 cpac-msn + 77 cpac-engine + all integration) ✓
Clippy: shell.ps1 clippy (0 warnings) ✓
Phase 1 investigation tests: 2 new tests pass at 17 MB block size ✓

Expected Impact

+15–45% compression ratio on large text files (≥32 MB) that trigger the parallel path. Verified on synthetic test data; real-world corpus benchmarks pending.

Session 23 — 2026-03-11 (MSN Large-File Regression Fix + Ratio Improvement Plan)

Focus

Fix the Silesia large-file MSN regression (double-copy on passthrough, XML O(N×K) blowup, no size limits on domain extractors). Investigate compression ratio improvement opportunities and non-Rust component impact.

MSN Regression Fix (3 root causes)

Double-copy on passthrough — MsnResult::passthrough(data) cloned all data, then the engine's bypass path cloned again (2× wasted allocation for non-matching files). Fix: added MsnResult::not_applied() zero-copy sentinel.
No size limits — All 19 domain extractors ran extract() on arbitrarily large buffers. Fix: added MSN_MAX_EXTRACT_SIZE (16 MB) top-level guard, MAX_DOMAIN_EXTRACT_SIZE (8 MB) per-domain guard, XML-specific 2 MB guard.
XML extraction O(N×tags) — 4× String::replace() per tag on full string, then savings gate rejected the result (all work wasted). Fix: 2 MB size guard short-circuits before expensive work.
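
The combination of a no-copy "not applied" result and early size guards can be sketched as below. The limits come from this entry, read as binary megabytes by assumption. The Extraction and Domain types, the XML constant name, and the extract function are hypothetical stand-ins for MsnResult and the domain extractors.

```rust
const MIB: usize = 1024 * 1024;
const MSN_MAX_EXTRACT_SIZE: usize = 16 * MIB; // top-level guard
const MAX_DOMAIN_EXTRACT_SIZE: usize = 8 * MIB; // per-domain guard
const XML_MAX_EXTRACT_SIZE: usize = 2 * MIB; // hypothetical name for the XML guard

#[derive(Debug, PartialEq, Eq)]
enum Extraction {
    /// Nothing was extracted. The caller keeps using its own buffer,
    /// so no copy of the input is made.
    NotApplied,
    /// Extraction succeeded and produced a residual.
    Applied { residual: Vec<u8> },
}

#[derive(Debug, Clone, Copy)]
enum Domain {
    Xml,
    Json,
}

fn domain_limit(domain: Domain) -> usize {
    match domain {
        Domain::Xml => XML_MAX_EXTRACT_SIZE,
        Domain::Json => MAX_DOMAIN_EXTRACT_SIZE,
    }
}

/// Check every size guard before doing any expensive work.
fn extract(data: &[u8], domain: Domain) -> Extraction {
    if data.len() > MSN_MAX_EXTRACT_SIZE || data.len() > domain_limit(domain) {
        return Extraction::NotApplied;
    }
    // Stand-in for the real extraction; it just keeps the input as residual.
    Extraction::Applied { residual: data.to_vec() }
}
```

Because the guards run first, an oversized XML buffer returns immediately without any of the string-replacement work that the savings gate would later have thrown away.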

Files Modified (19 files)

cpac-msn/src/lib.rs — not_applied(), MSN_MAX_EXTRACT_SIZE, MAX_DOMAIN_EXTRACT_SIZE
cpac-engine/src/lib.rs — replaced passthrough(data) with not_applied()
cpac-msn/src/domains/text/{xml,json,csv,yaml}.rs — per-domain size guards
cpac-msn/src/domains/logs/{syslog,apache,http,java,json_log,bgl,healthapp,proxifier,hpc,w3c,openstack}.rs — per-domain size guards
cpac-msn/src/domains/binary/avro.rs — size guard + CpacError import fix
cpac-msn/tests/msgpack_plain_text.rs — updated for not_applied() contract

Ratio Improvement Plan Created

Formal 6-phase plan: "CPAC Compression Ratio Improvement Plan"

Phase 1: Fix parallel smart transform roundtrip (P0, +15–45% on large text)
Phase 2: MSN cross-block metadata deduplication (P1, +0.5–2%)
Phase 3: Auto-dictionary for parallel blocks (P1, +3–8%)
Phase 4: Conditioning + BWT composition (P2, +2–10% hypothesis)
Phase 5: Per-block backend selection (P2, +1–5% on heterogeneous)
Phase 6: CAS bridge for MSN fields (P3, +5–20% on structured data)

Non-Rust Component Assessment

Identified 6 statically linked C/C++ entropy codecs (zstd, lz4, xz, lzham, lizard, zlib-ng) + 2 pure Rust codecs (brotli, snappy). None are pipeline bottlenecks — FFI overhead is negligible. Python (cpac.py) is build-only. Actual bottlenecks are in pure Rust (smart_preprocess trials, BWT screening, MSN string operations).

Validation

Build: shell.ps1 build ✓
Tests: cargo test -p cpac-msn (95 pass) ✓
Tests: cargo test -p cpac-engine (77 + all integration suites) ✓
Clippy: shell.ps1 clippy (0 warnings) ✓

Session 22 — 2026-03-10 (Bug Fix Planning + Session Save)

Focus

Document the parallel + smart transforms roundtrip bug for handoff to a clean session. Deep-dive into the compress/decompress parallel architecture to formulate root cause hypotheses.

Key Analysis

Architecture Trace

Traced the full parallel compress/decompress pipeline:

compress_parallel() splits data into blocks, each block independently runs the full CPAC pipeline (SSR → MSN → smart transforms → entropy → frame)
Each compressed block is a self-contained CPAC frame with its own DAG descriptor
decompress_parallel() extracts blocks, decompresses each independently, concatenates results
Individual transforms (BWT chain, normalize) roundtrip correctly even at 5 MB
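
The split, per-block processing, and concatenation traced above can be sketched with scoped threads. This is a toy model, not compress_parallel(): the block codec is an identity stand-in, one thread is spawned per block for simplicity, and all function names are hypothetical.

```rust
use std::thread;

/// Stand-in for the per-block pipeline; the real one produces a CPAC frame.
fn compress_block(block: &[u8]) -> Vec<u8> {
    block.to_vec()
}

/// Stand-in for per-block decompression.
fn decompress_block(frame: &[u8]) -> Vec<u8> {
    frame.to_vec()
}

/// Split the input into fixed-size blocks and process each independently.
/// Frames are returned in block order.
fn compress_blocks(data: &[u8], block_size: usize) -> Vec<Vec<u8>> {
    assert!(block_size > 0, "block_size must be non-zero");
    thread::scope(|scope| {
        let handles: Vec<_> = data
            .chunks(block_size)
            .map(|block| scope.spawn(move || compress_block(block)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("block worker panicked"))
            .collect::<Vec<Vec<u8>>>()
    })
}

/// Decompress each frame independently and concatenate the results.
fn decompress_blocks(frames: &[Vec<u8>]) -> Vec<u8> {
    frames.iter().flat_map(|frame| decompress_block(frame)).collect()
}
```

Since every block is self-contained, a roundtrip failure in this structure must come from the per-block pipeline or from how blocks are framed, which is what the hypotheses in this session try to separate.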

Root Cause Hypotheses (Ranked)

H4 (HIGH): Normalize transform metadata overflow — on ~2.5 MB text blocks, whitespace-position metadata could reach ~2 MB, far beyond the 65,535-byte maximum of the per-step u16 length prefix in the DAG descriptor wire format. smart_preprocess checks total descriptor size but may not catch per-step overflow.
H2 (MEDIUM): DAG descriptor serialization overflow/truncation at u16 boundary
H5 (MEDIUM): Frame original_size vs post-transform size mismatch
H1 (LOW): Block boundary splitting transform-sensitive patterns
H3 (RULED OUT for test): MSN cached metadata — test uses default enable_msn: false

Investigation Plan

Capture exact error from failing test (size mismatch vs content mismatch)
Isolate which transform (normalize vs bwt_chain) causes the failure
Check serialize_dag_descriptor per-step metadata u16 handling
Check normalize metadata size on ~2.5 MB text blocks
Fix root cause
Validate all tests pass + clippy clean
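
The first step of the plan, telling a size mismatch from a content mismatch, can be sketched as a small diagnostic helper. The type and function names are hypothetical and are not taken from the CPAC test suite.

```rust
#[derive(Debug, PartialEq, Eq)]
enum RoundtripOutcome {
    Match,
    /// The restored buffer has the wrong length.
    SizeMismatch { expected: usize, actual: usize },
    /// Lengths agree but the bytes differ; `first_diff` is the first offset.
    ContentMismatch { first_diff: usize },
}

/// Compare the original input with the decompressed output and report
/// which kind of failure occurred, if any.
fn classify_roundtrip(original: &[u8], restored: &[u8]) -> RoundtripOutcome {
    if original.len() != restored.len() {
        return RoundtripOutcome::SizeMismatch {
            expected: original.len(),
            actual: restored.len(),
        };
    }
    match original.iter().zip(restored).position(|(a, b)| a != b) {
        Some(first_diff) => RoundtripOutcome::ContentMismatch { first_diff },
        None => RoundtripOutcome::Match,
    }
}
```

Knowing the first differing offset also hints at which block is affected, since the offset divided by the block size gives the block index.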

Plan Created

Formal plan document created: "Fix Parallel + Smart Transforms Roundtrip Bug" with full architecture trace, 5 hypotheses, 6 investigation steps, post-fix benchmark plan, and all key file references with line numbers.

No Code Changes

This session was analysis and documentation only.

Session 21 — 2026-03-10 (Transform Roundtrip Investigation)

Focus

Investigate why CPAC's SSR/MSN/smart transforms are NOT producing better compression ratios than standalone codecs in benchmarks.

Key Findings

1. Smart Transforms DO Improve Ratios — But Decompression Is Broken

The bench_file path (forced backend, enable_smart_transforms: true) shows dramatically better ratios on large text files — but fails roundtrip verification:

File	CPAC (Zstd forced)	Standalone zstd-3	Improvement	Verified
silesia/nci	17.07x	11.76x	+45%	NO
silesia/webster	3.96x	3.41x	+16%	NO
silesia/reymont	3.92x	3.40x	+15%	NO
silesia/dickens	2.84x	2.77x	+2.5%	NO
enwik8	2.85x	2.81x	+1.4%	NO

The smart transforms (primarily bwt_chain and normalize) produce excellent forward compression but the reconstructed data doesn't match the original. The decompress path runs (output is correct size) but content is corrupted.
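
The Improvement column in the table above is the relative gain of the CPAC ratio over the standalone ratio. A minimal sketch of that arithmetic, with a hypothetical function name:

```rust
/// Relative improvement of `cpac_ratio` over `baseline_ratio`, in percent.
/// Both arguments are compression ratios (original size / compressed size).
fn improvement_pct(cpac_ratio: f64, baseline_ratio: f64) -> f64 {
    (cpac_ratio / baseline_ratio - 1.0) * 100.0
}
```

For silesia/nci this gives (17.07 / 11.76 - 1) × 100 ≈ 45%, matching the table. The figure says nothing about correctness, which is why the Verified column matters.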

2. MSN IS Working on Log Files

The bench_file_auto path with MSN enabled shows verified ratio improvements on structured log data:

File	T1(SSR/Zstd)	T1(MSN/Zstd)	Improvement	Verified
Thunderbird_2k	10.56x	11.62x	+10.0%	YES
Spark_2k	13.83x	14.46x	+4.5%	YES
Hadoop_2k	22.00x	22.92x	+4.2%	YES
Mac_2k	7.02x	7.21x	+2.7%	YES
OpenStack_2k	11.59x	11.73x	+1.2%	YES
HealthApp_2k	9.65x	9.83x	+1.9%	YES

3. Parallel Path Interaction

The roundtrip bug manifests specifically when:

File > 4 MiB (triggers compress_parallel)
Smart transforms are enabled (default)
Text data with ascii_ratio > 0.80 (triggers normalize + bwt_chain)

Individual transform roundtrip tests pass at 100 KB and 5 MB. The failure occurs in the parallel compression path, likely due to DAG descriptor interaction with block boundaries.

4. compress_parallel Always Reports Track2

compress_parallel() hardcodes track: Track::Track2 in its CompressResult, regardless of actual block content. This means benchmark labels like "T2(SSR/Zstd)" for large text files are misleading — the blocks may actually be Track1.

Tests Added

roundtrip_smart_transforms_large_text — 50 KB text, single-block, smart transforms
roundtrip_bwt_chain_direct_large — 100 KB BWT chain encode/decode
roundtrip_bwt_chain_direct_5mb — 5 MB BWT chain encode/decode
roundtrip_normalize_direct_large — 100 KB normalize encode/decode
roundtrip_smart_transforms_parallel_text — 5 MB+ text through parallel path (FAILS — reproduces the bug)

Next Steps (Priority Order)

Fix parallel + smart transforms roundtrip — The parallel path's interaction with DAG descriptors is producing corrupt output on large text. This blocks all ratio improvement claims.
Make production path (bench_file_auto) leverage transforms — After fix, ensure the auto-route applies transforms that improve ratio.
Re-benchmark with fixed transforms to produce verified ratio wins.

Files Modified

crates/cpac-engine/src/lib.rs — Added 5 new roundtrip tests

Session 20 — 2026-03-10 (Pipeline Validation + Calibration)

Full pipeline validation: 134+ tests passing, 0 errors, 0 warnings. Completed: file reorganization, xz/snappy external benchmarks, benchmark reporting rules, THESIS.md, ROADMAP.md, OpenZL feature parity, zstd-12/zstd-19 baselines, clippy fixes, calibration system, dictionary compression, preset matrix (Turbo/Balanced/Maximum/Archive/MaxRatio).
