113 changes: 113 additions & 0 deletions dev/wiki/apache-datafusion.wikitext
@@ -0,0 +1,113 @@
<!--
Draft Wikipedia article.
-->

{{Short description|Open-source query engine}}
{{Draft topics|technology|software}}
{{Infobox software
| name = Apache DataFusion
| developer = [[Apache Software Foundation]]
| programming language = [[Rust (programming language)|Rust]]
| genre = Query engine
| license = [[Apache License]]
| website = {{URL|https://datafusion.apache.org/}}
}}

'''Apache DataFusion''' is an [[open-source software|open-source]], embeddable analytical query engine written in [[Rust (programming language)|Rust]], built on [[Apache Arrow]]'s columnar memory format.<ref name="sigmod-paper">{{cite journal |last1=Lamb |first1=Andrew |last2=Shen |first2=Yijie |last3=Heres |first3=Daniel |last4=Chakraborty |first4=Jayjeet |last5=Kabak |first5=Mehmet Ozan |last6=Hsieh |first6=Liang-Chi |last7=Sun |first7=Chao |title=Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine |journal=Proceedings of the 2024 International Conference on Management of Data |year=2024 |doi=10.1145/3626246.3653368}}</ref><ref name="intro-docs">{{cite web |title=Introduction |url=https://datafusion.apache.org/user-guide/introduction.html |website=Apache DataFusion |publisher=Apache Software Foundation |access-date=2026-03-22}}</ref> It provides [[SQL]] and DataFrame interfaces for analytical query execution and is designed to be used as a library by developers building databases, query engines, and analytical tools, rather than as a standalone database server.<ref name="sigmod-paper" /><ref name="intro-docs" /> The project originated in 2017, was donated to the [[Apache Arrow]] project in 2019, and became a top-level project of the [[Apache Software Foundation]] in 2024.<ref name="donation-post">{{cite web |title=DataFusion: A Rust-native Query Engine for Apache Arrow |url=https://datafusion.apache.org/blog/2019/02/04/datafusion-donation/ |website=Apache DataFusion Blog |publisher=Apache Software Foundation |date=2019-02-04 |access-date=2026-03-22}}</ref><ref name="asf-tlp">{{cite web |title=Apache Software Foundation Announces New Top-Level Project Apache DataFusion |url=https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion |website=The ASF Blog |publisher=Apache Software Foundation |date=2024-06-11 |access-date=2026-03-22}}</ref> As of March 2026, DataFusion exceeded one 
million monthly downloads on crates.io.<ref name="crates-io">{{cite web |title=datafusion |url=https://crates.io/crates/datafusion |website=crates.io |access-date=2026-03-26}}</ref>
Contributor

There isn't a formal page on Dataframes but there is a stub that refers to Spark, pandas, etc. After this page lands we should add a pointer to it from there. https://en.wikipedia.org/wiki/Dataframe

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: I think extensible analytical query engine is clearer than embeddable analytical query engine. Extensible is what is listed on the landing page for datafusion on apache.org


== History ==

DataFusion was originally authored by Andy Grove, starting in 2017. It was donated to the Apache Arrow project in February 2019.<ref name="donation-post" /> In 2024, a paper describing DataFusion was accepted to the industry track of the [[ACM SIGMOD]] conference.<ref name="sigmod-accepted">{{cite web |title=SIGMOD 2024 Industrial Track: Accepted Papers |url=https://2024.sigmod.org/industrial-list.shtml |website=SIGMOD 2024 |access-date=2026-03-22}}</ref><ref name="sigmod-paper" /> In April 2024, the project graduated from Apache Arrow and became a top-level Apache project.<ref name="asf-tlp" />

== Features ==

DataFusion is an extensible query engine for building data systems. It provides a SQL interface and a DataFrame API for constructing queries programmatically, a [[query plan|query planner]] and rule-based [[query optimization|optimizer]], and a multithreaded vectorized execution engine that processes data in columnar batches rather than row by row.<ref name="sigmod-paper" /><ref name="intro-docs" />
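
The difference between batch-at-a-time columnar processing and row-by-row processing can be illustrated with a simplified sketch in Python. The example is hypothetical and does not use DataFusion's API or its Arrow data structures: the <code>Batch</code> type, the function names, and the sample table are assumptions made for illustration. It computes the same filtered sum in both styles, and only the batched version works on slices of whole columns.

```python
from typing import Dict, Iterator, List

# Hypothetical batch type for this sketch: column name -> list of values.
Batch = Dict[str, List[float]]


def batches(columns: Batch, batch_size: int) -> Iterator[Batch]:
    """Yield fixed-size slices of every column, keeping rows aligned."""
    if not columns:
        return
    num_rows = len(next(iter(columns.values())))
    for start in range(0, num_rows, batch_size):
        yield {
            name: values[start:start + batch_size]
            for name, values in columns.items()
        }


def sum_where_batched(columns: Batch, value_col: str, filter_col: str,
                      threshold: float, batch_size: int = 4) -> float:
    """Sum value_col over rows where filter_col > threshold, one batch at a time."""
    total = 0.0
    for batch in batches(columns, batch_size):
        # One pass over the filter column builds a selection mask for the batch.
        mask = [v > threshold for v in batch[filter_col]]
        # A second pass reads only the value column, guided by the mask.
        total += sum(v for v, keep in zip(batch[value_col], mask) if keep)
    return total


def sum_where_row_by_row(rows: List[Dict[str, float]], value_col: str,
                         filter_col: str, threshold: float) -> float:
    """The same computation, evaluated one row at a time."""
    total = 0.0
    for row in rows:
        if row[filter_col] > threshold:
            total += row[value_col]
    return total


# Sample data (made up for the example), stored column-wise.
table = {
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
    "qty": [1.0, 5.0, 2.0, 7.0, 3.0],
}
# The same data converted to a list of row dictionaries.
rows = [dict(zip(table, values)) for values in zip(*table.values())]

batched_total = sum_where_batched(table, "price", "qty", 2.0, batch_size=2)
row_total = sum_where_row_by_row(rows, "price", "qty", 2.0)
```

Both functions return the same total. In a real vectorized engine, the per-batch loops would typically run as tight, type-specialized kernels over contiguous memory rather than as interpreted list comprehensions.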

The engine reads common analytical file formats natively, including [[Apache Parquet]], [[comma-separated values|CSV]], [[JSON]], [[Apache Avro|Avro]], and Arrow IPC, and uses [[Apache Arrow]]'s columnar memory format throughout execution, avoiding [[serialization]] overhead between stages.<ref name="sigmod-paper" />

DataFusion is designed for in-process embedding: it runs within the host application's process rather than as a separate server, using threads for parallel query execution. Its extension points allow downstream systems to add [[user-defined function|user-defined functions]], custom data sources, custom query languages, and new optimizer rules, enabling developers to build specialized database systems on top of DataFusion's planning and execution components without reimplementing them.<ref name="sigmod-paper" /><ref name="intro-docs" />
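
The general idea of an extension point for user-defined functions can be sketched in Python as follows. This is a toy illustration under stated assumptions, not DataFusion's API: the <code>ToyEngine</code> class, its <code>register_udf</code> and <code>call</code> methods, and the <code>c_to_f</code> function are hypothetical names invented for the example. The host application registers a function under a name, and the engine later looks it up and applies it to a whole column.

```python
from typing import Callable, Dict, List

Column = List[float]


class ToyEngine:
    """Hypothetical host engine with a function registry (not part of DataFusion)."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable[[Column], Column]] = {}

    def register_udf(self, name: str, func: Callable[[Column], Column]) -> None:
        """Register a column-to-column function under the given name."""
        if name in self._functions:
            raise ValueError(f"function {name!r} is already registered")
        self._functions[name] = func

    def call(self, name: str, column: Column) -> Column:
        """Look up a registered function by name and apply it to a column."""
        try:
            func = self._functions[name]
        except KeyError:
            raise KeyError(f"unknown function {name!r}") from None
        return func(column)


def celsius_to_fahrenheit(column: Column) -> Column:
    """Example user-defined function operating on an entire column."""
    return [c * 9.0 / 5.0 + 32.0 for c in column]


engine = ToyEngine()
engine.register_udf("c_to_f", celsius_to_fahrenheit)
converted = engine.call("c_to_f", [0.0, 100.0, -40.0])
```

The engine in this sketch does not need to know what the registered function does, only its name and its column-in, column-out shape.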

== Comparison with related systems ==

DataFusion is frequently compared with other columnar analytical systems including [[DuckDB]], [[Polars (software)|Polars]], and Velox, but these systems differ significantly in scope and intended use.<ref name="composable-dbms">{{cite journal |last1=Pedreira |first1=Pedro |last2=Erling |first2=Orri |last3=Mühleisen |first3=Hannes |last4=Muñoz |first4=Ruben |last5=Khaled |first5=Wael |last6=Dürsch |first6=Peter |title=The Composable Data Management System Manifesto |journal=Proceedings of the VLDB Endowment |volume=16 |issue=10 |year=2023 |doi=10.14778/3603581.3603604}}</ref>

=== DuckDB ===

[[DuckDB]] is an in-process [[online analytical processing|OLAP]] database for direct use by end users, with its own storage format and catalog.<ref name="duckdb-official">{{cite web |title=DuckDB |url=https://duckdb.org/ |website=DuckDB |access-date=2026-03-22}}</ref> DataFusion is a library for building such systems, providing query planning and execution components that other software can embed without a bundled persistent storage format.<ref name="bauplan">{{cite web |title=Duck Hunt: Moving Bauplan from DuckDB to DataFusion |url=https://www.bauplanlabs.com/post/duck-hunt-moving-bauplan-from-duckdb-to-datafusion |website=Bauplan |date=2025-11-05 |access-date=2026-03-22}}</ref>

=== Polars ===

[[Polars (software)|Polars]] is also written in [[Rust (programming language)|Rust]] and uses the [[Apache Arrow]] memory model, but is designed as a self-contained DataFrame library for data manipulation rather than an embeddable query engine for building other systems.<ref name="polars-official">{{cite web |title=Polars |url=https://pola.rs/ |website=Polars |access-date=2026-03-22}}</ref><ref name="faq">{{cite web |title=Frequently Asked Questions |url=https://datafusion.apache.org/user-guide/faq.html |website=Apache DataFusion |publisher=Apache Software Foundation |access-date=2026-03-22}}</ref>

=== Apache Spark ===

[[Apache Spark]] is a distributed analytics framework for processing data at cluster scale.<ref name="spark-sql">{{cite web |title=Spark SQL & DataFrames |url=https://spark.apache.org/sql/ |website=Apache Spark |access-date=2026-03-22}}</ref> DataFusion executes queries within a single process and is aimed at building embedded analytics systems rather than distributed workloads.<ref name="sigmod-paper" /> Apache projects that use DataFusion to accelerate Spark include Apache DataFusion Comet, a native execution plugin for Spark's [[Java virtual machine|JVM]]-based SQL execution engine,<ref name="comet-donation">{{cite web |title=Announcing Apache Arrow DataFusion Comet |url=https://arrow.apache.org/blog/2024/03/06/comet-donation/ |website=Apache Arrow Blog |publisher=Apache Software Foundation |date=2024-03-06 |access-date=2026-03-22}}</ref> and [https://auron.apache.org/ Apache Auron], a Spark accelerator that combines the Apache Arrow-DataFusion library with the Spark distributed computing framework.<ref name="auron-intro">{{cite web |title=Introduction |url=https://auron.apache.org/introduction.html |website=Apache Auron |publisher=Apache Software Foundation |access-date=2026-03-23}}</ref>

=== Velox ===

[https://velox-lib.io/ Velox] is an execution engine library developed at [[Meta Platforms|Meta]].<ref name="velox-vldb">{{cite journal |last1=Pedreira |first1=Pedro |last2=Tan |first2=Wei |last3=Narayanan |first3=Deepak |last4=Chattopadhyay |first4=Bikramjit |last5=Erling |first5=Orri |last6=Melnik |first6=Sergey |last7=Bhagwan |first7=Ranjita |last8=Dumoulin |first8=Franck |title=Velox: Meta's Unified Execution Engine |journal=Proceedings of the VLDB Endowment |volume=15 |issue=12 |year=2022 |doi=10.14778/3554821.3554829}}</ref> Unlike DataFusion, Velox does not include a SQL frontend or query planning framework; it takes an already-optimized query plan as input and handles only execution.<ref name="velox-docs">{{cite web |title=Velox in 10 Minutes |url=https://facebookincubator.github.io/velox/velox-in-10-min.html |website=Velox |access-date=2026-03-22}}</ref>

== Adoption and reception ==

DataFusion has been adopted across a range of analytics and database products. [[Cloudflare]] used DataFusion in its Log Explorer product to execute SQL queries over log data stored in Cloudflare R2.<ref name="cloudflare">{{cite web |title=Cloudflare Log Explorer is now GA, providing native observability and forensics |url=https://blog.cloudflare.com/logexplorer-ga/ |website=The Cloudflare Blog |publisher=Cloudflare |date=2025-06-18 |access-date=2026-03-22}}</ref> [[Palantir Technologies|Palantir]] Lightweight Pipelines are powered by DataFusion.<ref name="palantir-2025">{{cite web |title=Announcements: July 2025 |url=https://www.palantir.com/docs/foundry/announcements/2025-07 |website=Palantir Foundry Documentation |publisher=Palantir Technologies |date=2025-07-29 |access-date=2026-03-22}}</ref><ref name="palantir-2024">{{cite web |title=Announcements: February 2024 |url=https://www.palantir.com/docs/foundry/announcements/2024-02 |website=Palantir Foundry Documentation |publisher=Palantir Technologies |date=February 2024 |access-date=2026-03-22}}</ref> [[InfluxDB]] 3.0 uses DataFusion as part of the FDAP stack: Apache Flight, DataFusion, Arrow, and Parquet.<ref name="influx-fdap">{{cite web |title=Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0 |url=https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/ |website=InfluxData |date=2023-10-25 |access-date=2026-03-22}}</ref> Other users described in public sources include EDB Postgres AI,<ref name="siliconangle-edb">{{cite web |title=Enterprise DB begins rolling AI features into PostgreSQL |url=https://siliconangle.com/2024/05/23/enterprise-db-begins-rolling-ai-features-postgresql/ |website=SiliconANGLE |date=2024-05-23 |access-date=2026-03-22}}</ref> Cube,<ref name="cube-pushdown">{{cite web |title=Query pushdown in Cube's semantic layer |url=https://cube.dev/blog/query-push-down-in-cubes-semantic-layer |website=Cube |date=2024-06-03 
|access-date=2026-03-22}}</ref> Spice AI,<ref name="spice">{{cite web |title=How we use Apache DataFusion at Spice AI |url=https://spice.ai/blog/how-we-use-apache-datafusion-at-spice-ai |website=Spice AI |date=2026-01-17 |access-date=2026-03-22}}</ref> Pydantic Logfire,<ref name="logfire">{{cite web |title=We're changing database |url=https://github.com/pydantic/logfire/issues/408 |website=GitHub |date=2024-08-29 |access-date=2026-03-22}}</ref> and Kamu.<ref name="kamu">{{cite web |title=100X faster ingestion, and FlightSQL support for connecting BI tools |url=https://www.kamu.dev/blog/2023-09-datafusion-flightsql/ |website=Kamu Data |date=2023-09-26 |access-date=2026-03-22}}</ref>
Contributor

I'm biased to want to include a link to rerun but we don't have a blog post calling out DataFusion even though it is all over our repo. Will work on that.

Contributor

> Will work on that.

that is the ideal answer!


In 2024, ''CRN'' included Apache DataFusion in its list of "The 10 Coolest Open-Source Software Tools Of 2024".<ref name="crn">{{cite web |title=The 10 Coolest Open-Source Software Tools Of 2024 |url=https://www.crn.com/news/software/2024/the-10-coolest-open-source-software-tools-of-2024?page=3 |website=CRN |date=2024-11-21 |access-date=2026-03-22}}</ref>

== Language support ==

DataFusion itself is written in Rust. The project also has official Python bindings and community-maintained bindings and tooling for other languages and runtimes.<ref name="readme-related">{{cite web |title=Apache DataFusion |url=https://github.com/apache/datafusion |website=GitHub |publisher=Apache Software Foundation |access-date=2026-03-22}}</ref><ref name="df-contrib-org">{{cite web |title=datafusion-contrib |url=https://github.com/datafusion-contrib |website=GitHub |access-date=2026-03-22}}</ref>

{| class="wikitable"
|+ Language support
! Language / runtime
! Project
! Notes
|-
| [[Rust (programming language)|Rust]]
| Apache DataFusion
| Core implementation
|-
| [[Python (programming language)|Python]]
| [https://github.com/apache/datafusion-python datafusion-python]
| Official Python bindings
|-
| [[Java (programming language)|Java]]
| [https://github.com/datafusion-contrib/datafusion-java datafusion-java]
| Community-maintained Java bindings
|-
| [[C (programming language)|C]]
| [https://github.com/datafusion-contrib/datafusion-c datafusion-c]
| Community-maintained C bindings
|-
| [[Ruby (programming language)|Ruby]]
| [https://github.com/datafusion-contrib/datafusion-ruby datafusion-ruby]
| Community-maintained Ruby bindings
|-
| [[WebAssembly]]
| [https://github.com/datafusion-contrib/datafusion-wasm-bindings datafusion-wasm-bindings]
| Community-maintained WebAssembly bindings
|-
| Browser tooling
| [https://github.com/datafusion-contrib/datafusion-wasm-playground datafusion-wasm-playground], [https://github.com/datafusion-contrib/datafusion-fiddle datafusion-fiddle]
| Interactive playgrounds
|}

== Ecosystem projects ==

Several projects in the broader Apache ecosystem and the community-maintained [https://github.com/datafusion-contrib datafusion-contrib] organization extend DataFusion's capabilities.<ref name="df-contrib-org" />

* [https://github.com/apache/datafusion-comet Apache DataFusion Comet], a plugin donated to the Apache Software Foundation by [[Apple Inc.|Apple]] in 2024, which uses DataFusion to accelerate [[Apache Spark]] workloads as a drop-in replacement for Spark's JVM-based SQL execution engine<ref name="comet-donation" />
* [https://github.com/datafusion-contrib/datafusion-federation datafusion-federation], which allows DataFusion to resolve queries across remote query engines while pushing down as much compute as possible to the remote source
* [https://github.com/datafusion-contrib/datafusion-distributed datafusion-distributed], a library for bringing distributed execution capabilities to DataFusion
* [https://github.com/datafusion-contrib/datafusion-materialized-views datafusion-materialized-views], which provides incremental view maintenance and query rewriting for [[materialized view|materialized views]] in DataFusion
* [https://github.com/datafusion-contrib/datafusion-table-providers datafusion-table-providers], which provides <code>TableProvider</code> implementations for reading data from external systems such as databases and file formats not natively supported by DataFusion

== References ==

{{Reflist}}

== External links ==

* {{Official website|https://datafusion.apache.org/}}
* {{GitHub|apache/datafusion}}
* [https://arrow.apache.org/ Apache Arrow]