add first draft of wikipedia article #21105
gene-bordegaray wants to merge 5 commits into apache:main from
Conversation
alamb
left a comment
Thank you @gene-bordegaray -- this looks great. I left some suggestions on how to make some of this language tighter.
Maybe we can wait a few days more and then submit to the wikipedia editors 🤔
Also, a side note: I wanted to add the DF logo, but my account needs to be verified (I think it will be in a day or two) 😅
dev/wiki/apache-datafusion.wikitext
Outdated
| | website = {{URL|https://datafusion.apache.org/}} | ||
| }} | ||
|
|
||
| '''Apache DataFusion''' is an [[open-source software|open-source]], embeddable analytical query engine written in [[Rust (programming language)|Rust]], built on [[Apache Arrow]]'s columnar memory format.<ref name="sigmod-paper">{{cite journal |last1=Lamb |first1=Andrew |last2=Shen |first2=Yijie |last3=Heres |first3=Daniel |last4=Chakraborty |first4=Jayjeet |last5=Kabak |first5=Mehmet Ozan |last6=Hsieh |first6=Liang-Chi |last7=Sun |first7=Chao |title=Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine |journal=Proceedings of the 2024 International Conference on Management of Data |year=2024 |doi=10.1145/3626246.3653368}}</ref><ref name="intro-docs">{{cite web |title=Introduction |url=https://datafusion.apache.org/user-guide/introduction.html |website=Apache DataFusion |publisher=Apache Software Foundation |access-date=2026-03-22}}</ref> It provides [[SQL]] and DataFrame interfaces for analytical query execution and is designed to be used as a library by developers building databases, query engines, and analytical tools, rather than as a standalone database server.<ref name="sigmod-paper" /><ref name="intro-docs" /> The project originated in 2017, was donated to the [[Apache Arrow]] project in 2019, and became a top-level project of the [[Apache Software Foundation]] in 2024.<ref name="donation-post">{{cite web |title=DataFusion: A Rust-native Query Engine for Apache Arrow |url=https://datafusion.apache.org/blog/2019/02/04/datafusion-donation/ |website=Apache DataFusion Blog |publisher=Apache Software Foundation |date=2019-02-04 |access-date=2026-03-22}}</ref><ref name="asf-tlp">{{cite web |title=Apache Software Foundation Announces New Top-Level Project Apache DataFusion |url=https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion |website=The ASF Blog |publisher=Apache Software Foundation |date=2024-06-11 |access-date=2026-03-22}}</ref> |
It provides [[SQL]] and DataFrame interfaces for analytical query execution and is designed to be used as a library by developers building databases, query engines, and analytical tools, rather than as a standalone database server.
I think we can make this a bit better at introducing DataFusion and what makes it unique. Here's what I think:
Often described as the "LLVM for Databases," [Source 1] Apache DataFusion is a modular, Arrow-native query engine library designed for embedding into custom systems rather than operating as a monolithic standalone server [Source 2 and 3]. This high-performance Rust framework provides a composable foundation, allowing developers to precisely extend query planning and vectorized execution to meet unique architectural requirements. [Source 2 and 3]
Source 1: https://midas.bu.edu/assets/slides/andrew_lamb_slides.pdf (cc @alamb)
Sources 2 and 3 (these are the first two references): {{cite journal |last1=Lamb |first1=Andrew |last2=Shen |first2=Yijie |last3=Heres |first3=Daniel |last4=Chakraborty |first4=Jayjeet |last5=Kabak |first5=Mehmet Ozan |last6=Hsieh |first6=Liang-Chi |last7=Sun |first7=Chao |title=Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine |journal=Proceedings of the 2024 International Conference on Management of Data |year=2024 |doi=10.1145/3626246.3653368}} {{cite web |title=Introduction |url=https://datafusion.apache.org/user-guide/introduction.html |website=Apache DataFusion |publisher=Apache Software Foundation |access-date=2026-03-22}}
I don't know if we should add the "LLVM for databases" line, mostly because its primary source is not the strongest (a slide show) and the phrase doesn't appear in the other sources, such as the SIGMOD paper or other coverage.
I was reviewing the Wikipedia guidelines, and they advise against anything promotional unless it is well cited, so this may get flagged.
I believe the project will continue to grow, so we can write at the end:
Apache DataFusion now sees over one million monthly downloads. [cite crates.io source]
We could also say "As of March 2026, DataFusion saw one million monthly downloads" if we wanted to ensure the statement remained accurate.
Ya I think this is great, definitely with the third party source 👍
ntjohnson1
left a comment
I got sick, so I fell off looking at this. I think this looks great for a first pass, and we should push it to Wikipedia to see what the reviewers say. One note, which I don't know if I have time to address, is that this seems to slightly overemphasize the extensibility perspective.
On a quick read-through I would assume this was only for building infrastructure and could easily miss the SQL/DataFrame API bits. At Rerun I use DataFusion (specifically datafusion-python) quite heavily but don't really know the details of our table provider (since other people build that bit). I suspect our customers will also hit this page, since we generate examples for the DataFrame API in Python (and are generating more SQL examples). https://rerun.io/docs/howto/query-and-transform/dataframe_operations
Mostly just food for thought: there might be two distinct audiences interested in this page, people who build on DataFusion and those who build data products using DataFusion's top-level APIs. (I still think landing the page first makes sense; then I or someone else can potentially try to add a section with more SQL/DataFrame API details.)
| | website = {{URL|https://datafusion.apache.org/}} | ||
| }} | ||
|
|
||
| '''Apache DataFusion''' is an [[open-source software|open-source]], embeddable analytical query engine written in [[Rust (programming language)|Rust]], built on [[Apache Arrow]]'s columnar memory format.<ref name="sigmod-paper">{{cite journal |last1=Lamb |first1=Andrew |last2=Shen |first2=Yijie |last3=Heres |first3=Daniel |last4=Chakraborty |first4=Jayjeet |last5=Kabak |first5=Mehmet Ozan |last6=Hsieh |first6=Liang-Chi |last7=Sun |first7=Chao |title=Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine |journal=Proceedings of the 2024 International Conference on Management of Data |year=2024 |doi=10.1145/3626246.3653368}}</ref><ref name="intro-docs">{{cite web |title=Introduction |url=https://datafusion.apache.org/user-guide/introduction.html |website=Apache DataFusion |publisher=Apache Software Foundation |access-date=2026-03-22}}</ref> It provides [[SQL]] and DataFrame interfaces for analytical query execution and is designed to be used as a library by developers building databases, query engines, and analytical tools, rather than as a standalone database server.<ref name="sigmod-paper" /><ref name="intro-docs" /> The project originated in 2017, was donated to the [[Apache Arrow]] project in 2019, and became a top-level project of the [[Apache Software Foundation]] in 2024.<ref name="donation-post">{{cite web |title=DataFusion: A Rust-native Query Engine for Apache Arrow |url=https://datafusion.apache.org/blog/2019/02/04/datafusion-donation/ |website=Apache DataFusion Blog |publisher=Apache Software Foundation |date=2019-02-04 |access-date=2026-03-22}}</ref><ref name="asf-tlp">{{cite web |title=Apache Software Foundation Announces New Top-Level Project Apache DataFusion |url=https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion |website=The ASF Blog |publisher=Apache Software Foundation |date=2024-06-11 |access-date=2026-03-22}}</ref> As of March 2026, DataFusion exceeded one 
million monthly downloads on crates.io.<ref name="crates-io">{{cite web |title=datafusion |url=https://crates.io/crates/datafusion |website=crates.io |access-date=2026-03-26}}</ref> |
There isn't a formal page on Dataframes but there is a stub that refers to Spark, pandas, etc. After this page lands we should add a pointer to it from there. https://en.wikipedia.org/wiki/Dataframe
NIT: I think "extensible analytical query engine" is clearer than "embeddable analytical query engine". "Extensible" is what is listed on the landing page for DataFusion on apache.org.
|
|
||
| == Adoption and reception == | ||
|
|
||
| DataFusion has been adopted across a range of analytics and database products. [[Cloudflare]] used DataFusion in its Log Explorer product to execute SQL queries over log data stored in Cloudflare R2.<ref name="cloudflare">{{cite web |title=Cloudflare Log Explorer is now GA, providing native observability and forensics |url=https://blog.cloudflare.com/logexplorer-ga/ |website=The Cloudflare Blog |publisher=Cloudflare |date=2025-06-18 |access-date=2026-03-22}}</ref> [[Palantir Technologies|Palantir]] Lightweight Pipelines are powered by DataFusion.<ref name="palantir-2025">{{cite web |title=Announcements: July 2025 |url=https://www.palantir.com/docs/foundry/announcements/2025-07 |website=Palantir Foundry Documentation |publisher=Palantir Technologies |date=2025-07-29 |access-date=2026-03-22}}</ref><ref name="palantir-2024">{{cite web |title=Announcements: February 2024 |url=https://www.palantir.com/docs/foundry/announcements/2024-02 |website=Palantir Foundry Documentation |publisher=Palantir Technologies |date=February 2024 |access-date=2026-03-22}}</ref> [[InfluxDB]] 3.0 uses DataFusion as part of the FDAP stack: Apache Flight, DataFusion, Arrow, and Parquet.<ref name="influx-fdap">{{cite web |title=Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0 |url=https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/ |website=InfluxData |date=2023-10-25 |access-date=2026-03-22}}</ref> Other users described in public sources include EDB Postgres AI,<ref name="siliconangle-edb">{{cite web |title=Enterprise DB begins rolling AI features into PostgreSQL |url=https://siliconangle.com/2024/05/23/enterprise-db-begins-rolling-ai-features-postgresql/ |website=SiliconANGLE |date=2024-05-23 |access-date=2026-03-22}}</ref> Cube,<ref name="cube-pushdown">{{cite web |title=Query pushdown in Cube's semantic layer |url=https://cube.dev/blog/query-push-down-in-cubes-semantic-layer |website=Cube |date=2024-06-03 
|access-date=2026-03-22}}</ref> Spice AI,<ref name="spice">{{cite web |title=How we use Apache DataFusion at Spice AI |url=https://spice.ai/blog/how-we-use-apache-datafusion-at-spice-ai |website=Spice AI |date=2026-01-17 |access-date=2026-03-22}}</ref> Pydantic Logfire,<ref name="logfire">{{cite web |title=We're changing database |url=https://github.com/pydantic/logfire/issues/408 |website=GitHub |date=2024-08-29 |access-date=2026-03-22}}</ref> and Kamu.<ref name="kamu">{{cite web |title=100X faster ingestion, and FlightSQL support for connecting BI tools |url=https://www.kamu.dev/blog/2023-09-datafusion-flightsql/ |website=Kamu Data |date=2023-09-26 |access-date=2026-03-22}}</ref> |
I'm biased to want to include a link to rerun but we don't have a blog post calling out DataFusion even though it is all over our repo. Will work on that.
Which issue does this PR close?
dev/wiki/apache-datafusion.wikitext