# blog: Introducing OLake Fusion #408

siddharth-chevella wants to merge 8 commits into datazip-inc:master
Conversation
> **Object storage costs creep up silently.** Cloud storage doesn't just charge for how much data you store — it also charges per API request. More files means more reads, more listings, more API calls on every operation. You won't notice it until the bill shows up, and by then you've been overpaying for weeks.
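The per-request cost effect the excerpt describes can be sketched with some back-of-the-envelope arithmetic. The request prices below are made-up placeholders, not any provider's real rates; the point is only that cost scales with file count, not data volume:

```python
# Rough illustration of how file count drives object-storage API cost.
# Prices are hypothetical placeholders, not any cloud provider's real rates.
GET_PRICE = 0.0000004   # assumed cost per GET request
LIST_PRICE = 0.000005   # assumed cost per LIST request

def scan_cost(num_files: int, scans_per_day: int, days: int = 30) -> float:
    """Cost of a month of full-table scans: one LIST plus one GET per file per scan."""
    per_scan = LIST_PRICE + num_files * GET_PRICE
    return per_scan * scans_per_day * days

# Same data volume, different fragmentation:
fragmented = scan_cost(num_files=50_000, scans_per_day=200)
compacted = scan_cost(num_files=500, scans_per_day=200)
print(f"fragmented: ${fragmented:.2f}/mo, compacted: ${compacted:.2f}/mo")
```

With these illustrative numbers the fragmented table costs roughly two orders of magnitude more per month for the same query workload, which is the "silent creep" the paragraph is warning about.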
> *(image: file count vs. API cost comparison)*
Instead of mentioning "same data size", can we mention "optimal data size"? That sounds better.

But the data size remains the same (or is supposed to) in both cases; the only difference is that without compaction the data is fragmented, isn't it?

"Same data size" can also mean small files. If you have 200 MB files that are all the same size, that's not optimal. Compaction is used to bring files to an optimal size.

I understand what you mean, but this is addressing the total table size, not individual file size, which is why it explicitly says "Same data size, fewer files" and "Same data size, more files". Still, if you think this can be confusing, we can change the image.

AFAIK compaction reduces the overall storage volume as well, though only marginally, given that Parquet files carry their own metadata in headers and footers.

Makes sense. Altering the text to keep only "Fewer files = lower API costs" and "More files = higher API costs". Please let me know if this is fine.
> **Running one-size-fits-all compaction when different situations call for different approaches.** A table that was just heavily fragmented by a burst of CDC events needs something different than a table with a moderate accumulation of small files over several hours. Spark's `rewrite_data_files` doesn't differentiate — it processes whatever files fall within your size bounds, regardless of whether that's the right level of intervention for the current table state.
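The size-bounds behavior the excerpt attributes to `rewrite_data_files` can be sketched as a simple filter. This is an illustrative model of the selection idea, not Spark's actual implementation, and the byte thresholds are assumptions:

```python
# Simplified model of size-bound candidate selection, in the spirit of
# Spark's rewrite_data_files (an illustration, not Spark's actual code).
# Files inside the target bounds are left alone; everything else is
# rewritten, regardless of how the table got into its current state.
MIN_BYTES = 256 * 1024 * 1024   # assumed lower bound
MAX_BYTES = 640 * 1024 * 1024   # assumed upper bound

def select_candidates(file_sizes: list[int]) -> list[int]:
    """Return the sizes of files that fall outside the target bounds."""
    return [s for s in file_sizes if s < MIN_BYTES or s > MAX_BYTES]

mb = 1024 * 1024
sizes = [8 * mb, 8 * mb, 300 * mb, 900 * mb]
print(select_candidates(sizes))  # the two tiny files and the oversized one
```

Note that the filter is purely size-based: a burst of CDC fragmentation and a slow accumulation of small files produce the same candidate list, which is the "one-size-fits-all" problem the paragraph describes.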
> **Having no visibility into whether compaction is actually working.** Most custom setups log to files or Airflow. You might know that a job ran and didn't fail, but you don't know how many files it consolidated, what the table's health looks like now, or whether the compaction was actually effective. When a query is still slow after compaction, diagnosing why is painful.
"When a query is still slow after compaction, diagnosing why is painful": I think this will still exist as long as ingestion happens in parallel; the real solution is to schedule compaction jobs better. Visibility into what compaction did to the files is a totally different concern from the query staying slow, so IMO let's not mix the two.

On "we need to schedule compaction jobs better as a solution to this": the user would first need to know *why* the query is still slow after compaction, and we provide a solution for that. For example, the Health Score is a meta-level metric that would show the table is still not healthy in that case. After seeing it, the user would know they need to schedule the jobs better (they might have scheduled Lite only once a day, or scheduled only Lite on a CDC-heavy table). So the whole emphasis here is on visibility.

Then let's mention that explicitly in the point above and explain it better there.

Changed this sentence.
> ### Tiered Optimization: The Right Level of Work at the Right Time
> The most important thing about Fusion's approach is that it doesn't treat all compaction as the same operation. It offers three optimization tiers that you can schedule independently, each designed for a different kind of table maintenance need.
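The "schedule independently" idea can be sketched as three cron entries evaluated on their own clocks. The cron expressions and tier semantics below are purely illustrative assumptions, not OLake Fusion's actual configuration:

```python
# Hypothetical sketch of independently scheduled tiers; the cron strings
# and tier semantics are illustrative, not OLake Fusion's actual config.
TIER_SCHEDULES = {
    "lite":   "*/30 * * * *",  # frequent, cheap cleanup of fresh small files
    "medium": "0 */6 * * *",   # periodic consolidation across partitions
    "full":   "0 3 * * 0",     # weekly deep rewrite of the whole table
}

def tiers_due(minute: int, hour: int, weekday: int) -> list[str]:
    """Very rough cron matcher for the three illustrative schedules above."""
    due = []
    if minute % 30 == 0:
        due.append("lite")
    if minute == 0 and hour % 6 == 0:
        due.append("medium")
    if minute == 0 and hour == 3 and weekday == 0:
        due.append("full")
    return due

print(tiers_due(minute=0, hour=6, weekday=3))   # ['lite', 'medium']
```

Because each tier has its own schedule, a cheap tier can run often while the expensive one runs rarely, instead of one monolithic job having to compromise between the two.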
Let's always refer to it as "OLake Fusion"; that will be better for SEO and keyword search.
> This is the kind of visibility that makes the difference between proactively maintaining your tables and reactively debugging performance issues after users are already complaining.
> ## Setting It Up
Again, why is this section required? This is not a how-to-configure-Fusion blog.

The title might imply that, but the content doesn't; it emphasizes the simplicity of Fusion. Will change the title.
> ### Faster and cheaper than Spark compaction
> Compared to Apache Spark's `rewrite_data_files` on comparable infrastructure, Fusion is about **2× faster** end-to-end and lands around **50% lower cost** for the same compaction workload — without giving up table layout quality. Run-by-run timings, query checks, methodology, and cost breakdown are covered in the [OLake Fusion vs Spark compaction benchmark](https://olake.io/blog/iceberg-compaction-spark-vs-fusion-benchmark/).
So basically it's cheaper *because* it's faster. Please reference the benchmarking blog properly: OLake Fusion finishes in 27 minutes, which is why its cost comes in below Spark's.
> Fusion comes with observability built in, at two levels.
> **Per-run logs and status tracking.** Every time Fusion runs an optimization, it creates a run entry that tracks the optimization type (Lite, Medium, or Full), the table being optimized, the start time, duration, and outcome (Running, Success, Failed, Cancelled, or Skipped). Within each run, you get driver logs covering how the overall optimization progressed, and sub-task logs showing details for each individual piece of work the run was broken into. When something goes wrong, you're not staring at a generic error message — you can see exactly which sub-task failed and what happened.
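The run-entry shape the excerpt describes could look roughly like the following. Field names and values here are illustrative assumptions, not OLake Fusion's actual schema:

```python
# Hypothetical sketch of a per-run tracking record; field names are
# illustrative, not OLake Fusion's actual data model.
from dataclasses import dataclass, field

@dataclass
class OptimizationRun:
    table: str
    tier: str                    # "Lite", "Medium", or "Full"
    started_at: str              # ISO-8601 start timestamp
    duration_s: float = 0.0
    status: str = "Running"      # Running / Success / Failed / Cancelled / Skipped
    subtask_logs: list[str] = field(default_factory=list)

# A run is created when optimization starts, then updated as it progresses.
run = OptimizationRun(table="orders", tier="Lite", started_at="2024-06-01T03:00:00Z")
run.subtask_logs.append("subtask 0: rewrote 120 files into 3")
run.status, run.duration_s = "Success", 412.5
```

The useful property is that a failure can be pinned to one entry in `subtask_logs` rather than a single opaque error for the whole run.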
Why do we need this here? This is just an intro blog; why so much detail on each parameter? For reference, you can check out other tools' intro blogs to get a better picture.
> # We Built a Better Way to Maintain Apache Iceberg Tables
> Apache Iceberg is the right choice for most modern lakehouses. It gives you ACID guarantees, schema evolution, time travel, and genuinely fast analytical queries — without locking you into any single vendor or engine. The adoption numbers back it up: Iceberg has quietly become the default open table format for teams building serious data infrastructure.
You can link the blogs where we compared the open table formats.

Added a link to https://olake.io/blog/apache-iceberg-features-benefits/
> But here's what nobody tells you when you're getting started: picking the right table format is only half the job. The other half is *keeping those tables healthy*. And that part? It's a lot harder than it looks.
> This post is about that second half — specifically, why Iceberg table maintenance tends to become a full-time headache, what teams are doing today to cope with it, and what we built at OLake to actually solve it.
Updated.
> **Metadata becomes a bottleneck on its own.** Iceberg tracks every file through a chain of manifests and snapshots. As file count grows, this metadata tree becomes enormous. Even simple operations — planning a query, committing a write — start taking longer because the system has to parse and resolve a much larger metadata structure before it can do anything useful.
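A back-of-the-envelope sketch of why file count inflates metadata work: every data file has an entry in some manifest, so a planner must read roughly `files / entries_per_manifest` manifest files before it can prune anything. The entries-per-manifest figure below is an assumption for illustration, not an Iceberg constant:

```python
# Rough model of manifest fan-out: more data files means more manifest
# files to parse during query planning. The 1,000-entries-per-manifest
# figure is an illustrative assumption, not an Iceberg constant.
ENTRIES_PER_MANIFEST = 1_000

def manifests_to_read(num_data_files: int) -> int:
    """Ceiling of files / entries-per-manifest."""
    return -(-num_data_files // ENTRIES_PER_MANIFEST)

print(manifests_to_read(500))      # 1
print(manifests_to_read(500_000))  # 500
```

Compaction attacks this from the data side: fewer data files means fewer manifest entries to track, so the metadata the planner must resolve shrinks along with the file count.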
Can you confirm we are running `rewriteManifests`? Without it the data layout is optimized, but it would be unfair to say we solve the metadata problem. Also, as far as I know, we are definitely not running `expire_snapshots`, so not a lot is happening at the metadata level.

Updated to drop the mention of snapshots (since Fusion is currently compaction-only) and used plainer wording focused on "manifests grow → planning/committing slows down".
> OLake Fusion is a dedicated Iceberg table maintenance service. It handles compaction for your Iceberg tables on a cron-based schedule you configure — with tiered compaction levels, built-in metrics, and enough observability to actually understand what's happening to your tables.
Should we mention here that the cron-based schedule is per table, just to emphasize it? It's a big flexibility win. @siddharth-chevella what are your thoughts?

Agree. Made the changes.
> **Input vs output for each run.** After each compaction run, Fusion shows metrics for inputs and outputs: counts and sizes for data files and deletes, recorded before versus after each job. You read them straight from the UI instead of reconstructing totals from unstructured logs.
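The before/after comparison the excerpt describes amounts to a simple delta over a few counters. The dict shape and numbers below are illustrative assumptions, not OLake Fusion's actual UI payload:

```python
# Sketch of before/after run metrics; the dict shape is illustrative,
# not OLake Fusion's actual schema.
def summarize(before: dict, after: dict) -> dict:
    """Delta of file counts and bytes across one compaction run."""
    return {
        "files_removed": before["data_files"] - after["data_files"],
        "deletes_applied": before["delete_files"] - after["delete_files"],
        "bytes_delta": after["total_bytes"] - before["total_bytes"],
    }

before = {"data_files": 1_200, "delete_files": 40, "total_bytes": 96_000_000_000}
after = {"data_files": 30, "delete_files": 0, "total_bytes": 95_400_000_000}
print(summarize(before, after))
```

Having these three numbers per run is what lets you judge whether a compaction was effective (many files consolidated, deletes applied) instead of merely knowing that it finished without error.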
"After each job": shouldn't this be "after each compaction run"? There are no jobs in Fusion.
**badalprasadsingh** left a comment:

Nice blog. Minor comments.

Also, having "Fewer Delete Files" is not the reason for faster query performance. E.g., a single equality delete file whose values spread across all data files is a big deal. Make it: "unoptimized files = slow reads", "optimized files = faster reads".

We are not using the words "Optimization", "Optimized", etc. Can you suggest any alternatives?

Can we make it "Many Small Files" → Slower Reads, "Few Small Files" → Faster Reads?