blog: Introducing OLake Fusion #408

Open
siddharth-chevella wants to merge 8 commits into datazip-inc:master from siddharth-chevella:fusion-intro-blog

Conversation

@siddharth-chevella
Collaborator

Blog: Introducing OLake Fusion.

  1. Explaining problems faced while maintaining Iceberg tables.
  2. Explaining what OLake Fusion is and how it solves those problems.

From blog/2026-04-28-olake-fusion-introduction-blog.mdx:

**Object storage costs creep up silently.** Cloud storage doesn't just charge for how much data you store — it also charges per API request. More files means more reads, more listings, more API calls on every operation. You won't notice it until the bill shows up, and by then you've been overpaying for weeks.

![Storage Costs problem diagram](/img/blog/2026/5/storage-cost.webp)
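To make the request-cost creep concrete, here is a back-of-the-envelope sketch. The per-request price, scan frequency, and file counts are illustrative assumptions, not any provider's published rates:

```python
# Back-of-the-envelope estimate of how file count drives object-storage
# API charges. All prices and workload numbers below are illustrative
# assumptions, not published cloud rates.

GET_PRICE_PER_1K = 0.0004  # hypothetical $ per 1,000 GET requests
SCANS_PER_DAY = 200        # hypothetical full-table scans per day

def monthly_get_cost(file_count: int) -> float:
    """Cost of GET requests over 30 days of scans, one GET per file."""
    requests = file_count * SCANS_PER_DAY * 30
    return requests / 1000 * GET_PRICE_PER_1K

# Same total table size, different fragmentation:
fragmented = monthly_get_cost(500_000)  # many small files
compacted = monthly_get_cost(4_000)     # fewer, larger files

print(f"fragmented: ${fragmented:.2f}/month")
print(f"compacted:  ${compacted:.2f}/month")
```

Even with tiny per-request prices, the two bills differ by orders of magnitude because the file count, not the data volume, drives the request count.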
Collaborator
Instead of mentioning "same data size", can we mention "optimal data size"? That sounds better.

Collaborator Author
But the data size remains the same (or is supposed to) in both cases; the only difference is that without compaction the data is fragmented, isn't it?

Collaborator
"Same data size" can also mean small files. In the sense that if you have 200 MB files, all the same size, it's not optimal. Compaction is used to bring the files to an optimal size.

Collaborator Author
I understand what you mean. But this is addressing the total table size, not individual file size, which is why it explicitly says "Same data size, fewer files" and "Same data size, more files". Still, if you think this can be confusing, we can change the image.


AFAIK compaction reduces the overall storage volume as well, though only marginally, given that Parquet files carry their own metadata in the form of headers and footers.

Collaborator Author
@siddharth-chevella, Apr 30, 2026
Makes sense. Altering the text to only keep "Fewer files = lower API costs" and "More files = higher API costs". Please let me know if this is fine.

From blog/2026-04-28-olake-fusion-introduction-blog.mdx:

**Running one-size-fits-all compaction when different situations call for different approaches.** A table that was just heavily fragmented by a burst of CDC events needs something different than a table with a moderate accumulation of small files over several hours. Spark's `rewrite_data_files` doesn't differentiate — it processes whatever files fall within your size bounds, regardless of whether that's the right level of intervention for the current table state.
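To make "the right level of intervention" concrete, here is a hypothetical heuristic that picks a maintenance tier from coarse table statistics. The tier names and every threshold are invented for illustration; this is not Fusion's actual selection logic:

```python
# Hypothetical tier-selection heuristic. The tier names mirror the ones
# discussed in this blog (Lite/Medium/Full), but the thresholds are
# invented for illustration and are not Fusion's real logic.

def choose_tier(file_count: int, avg_file_mb: float, delete_file_count: int) -> str:
    """Pick a maintenance tier from coarse table statistics."""
    if delete_file_count > 1_000 or avg_file_mb < 8:
        return "Full"    # e.g. a CDC burst: heavy fragmentation plus deletes
    if file_count > 10_000 or avg_file_mb < 64:
        return "Medium"  # moderate small-file accumulation
    return "Lite"        # routine touch-up

print(choose_tier(50_000, 4.0, 2_500))  # CDC-hammered table
print(choose_tier(8_000, 96.0, 10))     # healthy table
```

The point is the shape of the decision, not the numbers: a state-aware scheduler can respond to *why* a table is fragmented, whereas a single size-bounded rewrite cannot.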

**Having no visibility into whether compaction is actually working.** Most custom setups log to files or Airflow. You might know that a job ran and didn't fail, but you don't know how many files it consolidated, what the table's health looks like now, or whether the compaction was actually effective. When a query is still slow after compaction, diagnosing why is painful.

Collaborator
> "When a query is still slow after compaction, diagnosing why is painful." I think this will still exist as long as ingestion happens in parallel; we need to schedule compaction jobs better as a solution to this. Visibility into what compaction did to the files is totally different from the query becoming slow, so IMO let's not mix the two.

Collaborator Author
Re: "we need to better schedule compaction jobs as a solution to this": the user would need to know why the query is still slow after compaction, and we provide a solution for that. For example, the Health Score is a meta-level metric that would show the table is still not healthy in that case. After seeing that, the user would know they need to schedule the jobs better (they might have scheduled Lite once a day, or only scheduled Lite on a CDC-heavy table). So the whole emphasis here is on visibility.

Collaborator

So we can mention this explicitly in the point above, right? Let's explain this better there.

Collaborator Author
Changed this sentence.

From blog/2026-04-28-olake-fusion-introduction-blog.mdx:

### Tiered Optimization: The Right Level of Work at the Right Time

The most important thing about Fusion's approach is that it doesn't treat all compaction as the same operation. It offers three optimization tiers that you can schedule independently, each designed for a different kind of table maintenance need.

Collaborator
Let's always address it as "OLake Fusion"; it will be better for SEO and keyword search.

Collaborator Author
Done

From blog/2026-04-28-olake-fusion-introduction-blog.mdx:

This is the kind of visibility that makes the difference between proactively maintaining your tables and reactively debugging performance issues after users are already complaining.

## Setting It Up

Collaborator
Again, why is this section required? This is not a "how to configure Fusion" blog.

Collaborator Author
The title might imply that, but the content doesn't; it emphasizes the simplicity of Fusion.
Will change the title.


### Faster and cheaper than Spark compaction

Compared to Apache Spark's `rewrite_data_files` on comparable infrastructure, Fusion is about **2× faster** end-to-end and lands around **50% lower cost** for the same compaction workload, without giving up table layout quality. Run-by-run timings, query checks, methodology, and the cost breakdown are covered in the [OLake Fusion vs Spark compaction benchmark](https://olake.io/blog/iceberg-compaction-spark-vs-fusion-benchmark/).
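The cost claim follows from the runtime claim: at the same hourly infrastructure rate, halving wall-clock time roughly halves the bill. A sketch with an assumed rate (the 27-minute runtime echoes the benchmark discussion; the Spark runtime is simply doubled):

```python
# Cost = (hourly infra rate) x (wall-clock hours). The hourly rate is
# an assumed placeholder; the runtimes loosely follow the benchmark
# discussion (Fusion ~27 min, Spark roughly 2x longer).

HOURLY_RATE = 3.0  # hypothetical $/hour for comparable infrastructure

def run_cost(minutes: float, rate: float = HOURLY_RATE) -> float:
    """Dollar cost of occupying the infrastructure for the given time."""
    return rate * minutes / 60

fusion_cost = run_cost(27)
spark_cost = run_cost(54)

print(f"Fusion: ${fusion_cost:.2f}, Spark: ${spark_cost:.2f}")
print(f"savings: {1 - fusion_cost / spark_cost:.0%}")
```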
Collaborator
So basically, it's cheaper *because* it's faster. Please reference the benchmarking blog properly: OLake Fusion finishes in 27 minutes, so its cost comes out lower than Spark's.


Fusion comes with observability built in, at two levels.

**Per-run logs and status tracking.** Every time Fusion runs an optimization, it creates a run entry that tracks the optimization type (Lite, Medium, or Full), the table being optimized, the start time, duration, and outcome (Running, Success, Failed, Cancelled, or Skipped). Within each run, you get driver logs covering how the overall optimization progressed, and sub-task logs showing details for each individual piece of work the run was broken into. When something goes wrong, you're not staring at a generic error message — you can see exactly which sub-task failed and what happened.
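As a rough sketch of what such a run entry might look like as a record, with field names and enum values inferred from the description above rather than taken from Fusion's actual data model:

```python
# Sketch of a per-run record like the one described above. Field names
# are inferred from the blog text, not Fusion's actual schema; only the
# outcome values come from the description.
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class Outcome(Enum):
    RUNNING = "Running"
    SUCCESS = "Success"
    FAILED = "Failed"
    CANCELLED = "Cancelled"
    SKIPPED = "Skipped"

@dataclass
class OptimizationRun:
    optimization_type: str    # "Lite", "Medium", or "Full"
    table: str
    started_at: datetime
    duration_seconds: float
    outcome: Outcome

# Hypothetical run entry for a nightly Lite pass:
run = OptimizationRun(
    optimization_type="Lite",
    table="analytics.orders",
    started_at=datetime(2026, 4, 28, 2, 0),
    duration_seconds=312.0,
    outcome=Outcome.SUCCESS,
)
print(run.outcome.value)
```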
Collaborator
Why do we need this here? This is just an intro blog; why do you need so much detail for each parameter? For reference, you can check out other tools' blogs to get a better picture.

From blog/2026-04-28-olake-fusion-introduction-blog.mdx:

# We Built a Better Way to Maintain Apache Iceberg Tables

Apache Iceberg is the right choice for most modern lakehouses. It gives you ACID guarantees, schema evolution, time travel, and genuinely fast analytical queries — without locking you into any single vendor or engine. The adoption numbers back it up: Iceberg has quietly become the default open table format for teams building serious data infrastructure.
You can link the blogs where we have compared the open table formats.

Collaborator Author
Added link to https://olake.io/blog/apache-iceberg-features-benefits/


But here's what nobody tells you when you're getting started: picking the right table format is only half the job. The other half is *keeping those tables healthy*. And that part? It's a lot harder than it looks.

This post is about that second half — specifically, why Iceberg table maintenance tends to become a full-time headache, what teams are doing today to cope with it, and what we built at OLake to actually solve it.
It's not a post but a blog.

Collaborator Author
Updated.


![Query slowdown diagram](/img/blog/2026/5/query-slowdown.webp)

**Metadata becomes a bottleneck on its own.** Iceberg tracks every file through a chain of manifests and snapshots. As file count grows, this metadata tree becomes enormous. Even simple operations — planning a query, committing a write — start taking longer because the system has to parse and resolve a much larger metadata structure before it can do anything useful.
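A toy model of the slowdown: if each manifest holds a bounded number of file entries, the number of manifest files a planner must fetch and parse grows linearly with data-file count. The entries-per-manifest figure below is illustrative:

```python
# Toy model: each manifest file holds a bounded number of data-file
# entries, so the manifest files a planner must fetch and parse grow
# linearly with data-file count. 8,000 entries/manifest is illustrative,
# not an Iceberg constant.

ENTRIES_PER_MANIFEST = 8_000

def manifests_to_parse(data_file_count: int) -> int:
    """Minimum manifest files touched when planning a full-table scan."""
    return -(-data_file_count // ENTRIES_PER_MANIFEST)  # ceiling division

print(manifests_to_parse(10_000))     # compacted table: a couple of manifests
print(manifests_to_parse(2_000_000))  # fragmented table: hundreds
```

Every one of those manifests is a separate object-store read before the first byte of actual data moves, which is why fragmentation taxes planning and commits, not just scans.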
Can you confirm we are running `rewriteManifests`? Without it, even though things are optimized, it would be unfair to say we solve this. Also, as far as I know, we are definitely not running `expire_snapshots`, so not a lot is happening at the metadata level.

Collaborator Author
Updated to drop the mention of snapshots (since Fusion is currently compaction-only) and used plainer wording focused on "manifests grow, so planning and committing slow down".


![Introducing OLake Fusion](/img/blog/2026/5/introducing-olake-fusion.webp)

OLake Fusion is a dedicated Iceberg table maintenance service. It handles compaction for your Iceberg tables on a cron-based schedule you configure — with tiered compaction levels, built-in metrics, and enough observability to actually understand what's happening to your tables.
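To illustrate the per-table, cron-based scheduling idea, a hypothetical schedule map. The mapping shape, table names, and cron expressions are all invented; Fusion's real configuration may look different:

```python
# Hypothetical per-table compaction schedule. The mapping shape, the
# table names, and the cron expressions are invented for illustration;
# Fusion's real configuration format may differ.

SCHEDULES = {
    "analytics.orders":  {"tier": "Lite",   "cron": "0 * * * *"},  # hourly
    "analytics.events":  {"tier": "Medium", "cron": "0 2 * * *"},  # nightly
    "archive.snapshots": {"tier": "Full",   "cron": "0 3 * * 0"},  # weekly
}

def tables_on_tier(tier: str) -> list[str]:
    """List the tables whose schedule uses the given tier."""
    return [t for t, cfg in SCHEDULES.items() if cfg["tier"] == tier]

print(tables_on_tier("Lite"))
```

The design point is that each table gets its own cadence and tier, so a CDC-heavy table can be touched hourly while an archive table runs a weekly deep pass.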
Collaborator
Should we mention here that the cron-based schedule is per table, just to emphasize it? It is a big flexibility. @siddharth-chevella, what are your thoughts?

Collaborator Author
Agree. Made the changes.


![Runs page](/img/docs/iceberg-maintenance/runs-and-logs/runs-page.webp)

**Input vs output for each run.** After each compaction run, Fusion shows metrics for inputs and outputs: counts and sizes for data files and deletes, recorded before versus after each job. You read them straight from the UI instead of reconstructing totals only from unstructured logs.
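Conceptually, the before/after view is a per-metric diff over file counts and sizes. A sketch with invented metric names and numbers:

```python
# Sketch of the before/after comparison a compaction run surfaces.
# Metric names and the sample numbers are illustrative, not Fusion's API.

def summarize(before: dict, after: dict) -> dict:
    """Per-metric delta between pre- and post-run table stats."""
    return {k: after[k] - before[k] for k in before}

before = {"data_files": 12_480, "delete_files": 310, "total_bytes": 52_000_000_000}
after  = {"data_files": 96,     "delete_files": 0,   "total_bytes": 51_400_000_000}

delta = summarize(before, after)
print(delta["data_files"])   # negative: files consolidated
print(delta["delete_files"]) # negative: deletes merged away
```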
Collaborator
"After each job": shouldn't this be "after each compaction run"? There are no "jobs" in Fusion.

Collaborator
@badalprasadsingh left a comment:
Nice blog. Minor comments.

On static/img/blog/2026/5/query-slowdown.webp, small-file-problem.webp, and delete-file-problem.webp:

Collaborator
@badalprasadsingh, May 5, 2026
Also, having "Fewer Delete Files" is not the reason for faster query performance. E.g., a single equality delete file whose values spread across all data files is a big deal.

Make it: "unoptimized files = slow reads", "optimized files = faster reads"

Collaborator Author
We are not using the words "optimization", "optimized", etc. Can you suggest any alternatives?

Collaborator
Can we make it "Many Small Files → Slower Reads" and "Few Small Files → Faster Reads"?

Collaborator Author
Done

5 participants