feat: table compaction (RewriteDataFiles) #832

@laskoviymishka

Feature Request / Improvement

Parent: #829 (v2 spec completion)
All building blocks for compaction exist:

Key gap: delete file removal

overwriteFiles.deletedEntries() in snapshot_producers.go explicitly filters to EntryContentData only. After compaction, the position and equality delete files that covered the rewritten data files are left orphaned in the manifests. The overwrite producer needs to remove those delete files alongside replacing the data files.
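To make the gap concrete, here is a minimal sketch of the filtering behavior using stand-in types, not the real iceberg-go API. The names `entry`, `contentType`, and `selectDeleted` are hypothetical, and only the idea of a data-only filter comes from the description above. It also assumes the caller already knows which delete files are safe to drop, which is a separate problem the sketch does not address.

```go
package main

import "fmt"

// contentType is a hypothetical stand-in for a manifest entry's content kind.
type contentType int

const (
	contentData contentType = iota
	contentPosDeletes
	contentEqDeletes
)

// entry is a hypothetical, simplified manifest entry.
type entry struct {
	path    string
	content contentType
}

// selectDeleted returns the entries whose path is marked in toRemove.
// With dataOnly=true it mimics a data-only filter: delete files are
// skipped even when they are marked, so they stay behind in the manifests.
func selectDeleted(entries []entry, toRemove map[string]bool, dataOnly bool) []entry {
	var out []entry
	for _, e := range entries {
		if !toRemove[e.path] {
			continue
		}
		if dataOnly && e.content != contentData {
			continue
		}
		out = append(out, e)
	}
	return out
}

func main() {
	entries := []entry{
		{"data-1.parquet", contentData},
		{"data-2.parquet", contentData},
		{"pos-del-1.parquet", contentPosDeletes},
	}
	toRemove := map[string]bool{
		"data-1.parquet":    true,
		"pos-del-1.parquet": true,
	}
	fmt.Println(len(selectDeleted(entries, toRemove, true)))  // data-only filter
	fmt.Println(len(selectDeleted(entries, toRemove, false))) // data and delete files
}
```

With the data-only filter the marked position delete file is never selected, which is the orphaning described above. Relaxing the filter is only the mechanical half of the fix. Deciding which delete files no longer apply to any live data file is left to the real implementation.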

Nice to have: CLI

    $ iceberg compact analyze db.events
    Compaction Plan for db.events
      Files scanned:        1,247
      Files to rewrite:        89   (7.1%)
      Compaction groups:        12
      Est. size change:      2.3 GB → 1.8 GB  (-22%)
    $ iceberg compact run db.events --partial-progress
    Compacting db.events...
      [1/12] date=2024-01-15: 12 files → 2 files ✓
      [2/12] date=2024-01-16:  8 files → 1 file  ✓
    Done. Rewrote 89 → 15 files. Removed 23 delete files.

Related

Compaction is the v2 approach to read-perf degradation under deletes. The v3 approach (deletion vectors) is tracked in #589: the puffin reader/writer exists, and scanner integration is TBD. The two are complementary: DVs reduce write amplification, while compaction is still needed for file consolidation.
