Feature Request / Improvement
Parent: #829 (v2 spec completion)
All building blocks for compaction exist:

- `PlanFiles` with delete file matching (scanner.go)
- `ReadTasks` for pre-planned tasks (feat(table): add Scan.ReadTasks for reading pre-planned file scan tasks #781) — materializes rows with position (fix(table): goroutine leak in positionDeleteRecordsToDataFiles #825, fix(table): fix refcount leak in enrichRecordsWithPosDeleteFields #762) and equality (feat(table): equality delete read path in scanner #818) deletes applied
- `WriteRecords` with partitioned fanout (perf(table): optimize partitioned write throughput #622) and rolling file size (feat(table): roll parquet files based on actual compressed size #759)
- `ReplaceDataFilesWithDataFiles`/`AddDataFiles` (feat: add functions for add and replacing data directly with datafiles #723) for atomic commits
- `SlicePacker` for bin-packing (internal/utils.go)

What's missing is the top-level API that wires them together, and the ability to remove delete files in the same commit.

Without compaction, tables under Update/Delete workloads accumulate equality delete files (feat(table): equality delete write path #809, feat(table): equality delete writing for partitioned tables #823) and read performance degrades with every commit.

Key gap: delete file removal

`overwriteFiles.deletedEntries()` at snapshot_producers.go explicitly filters to `EntryContentData` only. After compaction, position/equality delete files that covered the rewritten data files are orphaned in manifests. The overwrite producer needs to handle delete file removal alongside data file replacement.
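One possible shape for the fix, sketched with stand-in types (everything below is hypothetical illustration, not the actual iceberg-go manifest API): alongside the rewritten data entries, the producer would also emit delete-file entries that no longer cover any live data file.

```go
package main

import "fmt"

// Hypothetical stand-ins for manifest entry content types; the real
// producer works over iceberg-go's manifest entries.
type EntryContent int

const (
	EntryContentData EntryContent = iota
	EntryContentPosDeletes
	EntryContentEqDeletes
)

type ManifestEntry struct {
	FilePath string
	Content  EntryContent
}

// entriesToDelete sketches an extended deletedEntries: keep data entries
// that were rewritten (today's behavior), and additionally keep
// position/equality delete entries that became dangling after the rewrite,
// so both are removed in the same commit.
func entriesToDelete(entries []ManifestEntry, rewritten, danglingDeletes map[string]bool) []ManifestEntry {
	var out []ManifestEntry
	for _, e := range entries {
		switch e.Content {
		case EntryContentData:
			if rewritten[e.FilePath] {
				out = append(out, e)
			}
		default:
			// Delete files whose covered data files were all rewritten
			// no longer apply to anything and can be dropped.
			if danglingDeletes[e.FilePath] {
				out = append(out, e)
			}
		}
	}
	return out
}

func main() {
	entries := []ManifestEntry{
		{"data-1.parquet", EntryContentData},
		{"data-2.parquet", EntryContentData},
		{"pos-del-1.parquet", EntryContentPosDeletes},
	}
	rewritten := map[string]bool{"data-1.parquet": true}
	dangling := map[string]bool{"pos-del-1.parquet": true}
	for _, e := range entriesToDelete(entries, rewritten, dangling) {
		fmt.Println("remove:", e.FilePath)
	}
}
```

Deciding which delete files are dangling requires the same file-to-delete matching that `PlanFiles` already performs, which is why reusing the scan planning output for the commit seems natural.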
Nice to have: CLI
```console
$ iceberg compact analyze db.events
Compaction Plan for db.events
  Files scanned:     1,247
  Files to rewrite:  89 (7.1%)
  Compaction groups: 12
  Est. size change:  2.3 GB → 1.8 GB (-22%)

$ iceberg compact run db.events --partial-progress
Compacting db.events...
[1/12] date=2024-01-15: 12 files → 2 files ✓
[2/12] date=2024-01-16: 8 files → 1 file ✓
Done. Rewrote 89 → 15 files. Removed 23 delete files.
```
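The "compaction groups" above would come from bin-packing candidate files so each group approaches the target output size. A minimal first-fit sketch in the spirit of the internal `SlicePacker` (all names here are hypothetical, not the library's API):

```go
package main

import "fmt"

type file struct {
	path string
	size int64
}

// packGroups greedily fills a group until adding the next file would
// exceed targetBytes, then starts a new group. Each group becomes one
// rewrite task; --partial-progress would commit groups independently.
func packGroups(files []file, targetBytes int64) [][]file {
	var groups [][]file
	var cur []file
	var curSize int64
	for _, f := range files {
		if curSize+f.size > targetBytes && len(cur) > 0 {
			groups = append(groups, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, f)
		curSize += f.size
	}
	if len(cur) > 0 {
		groups = append(groups, cur)
	}
	return groups
}

func main() {
	const mb = int64(1 << 20)
	files := []file{
		{"a.parquet", 40 * mb}, {"b.parquet", 60 * mb},
		{"c.parquet", 100 * mb}, {"d.parquet", 30 * mb},
	}
	for i, g := range packGroups(files, 128*mb) {
		fmt.Printf("group %d: %d files\n", i+1, len(g))
	}
}
```

A production planner would additionally sort candidates and respect partition boundaries, since delete files can only be dropped once every data file they cover within a partition has been rewritten.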
Related
Compaction is the v2 approach to read-performance degradation under deletes. The v3 approach (deletion vectors) is tracked in #589 — the puffin reader/writer exists, scanner integration TBD. The two are complementary: DVs reduce write amplification, while compaction is still needed for file consolidation.