@alkis
Contributor

@alkis alkis commented Aug 22, 2024

This is an annotated attempt to use flatbuffers as the metadata format for parquet. The goals are:

  1. flatbuffers "parse" extremely fast compared to thrift, which
    • cuts down on the critical path latency of processing parquet files
    • is fast enough that the O(n) cost of "parsing" the metadata, relative to scanning 1 column, is eliminated
  2. flatbuffers are typically bulkier than thrift, so this PR includes a multitude of optimizations to shrink the size of the flatbuffer metadata
  3. keep the flatbuffer object model similar to that of thrift, to ease migration to the new metadata format
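
To make goals 1 and 2 concrete, here is a minimal toy sketch. It is not code from this PR, and it uses neither the flatbuffers nor the thrift wire format: `ColumnMeta`, `EncodeSequential`, `EncodeFixed`, `ReadSequential` and `ReadFixed` are hypothetical names invented for illustration. The assumption is only that a varint-coded sequential layout must be decoded from the start to reach column i, while a fixed-width layout can be indexed directly at the cost of more bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy stand-in for per-column metadata; NOT the real Parquet schema.
struct ColumnMeta {
  std::uint64_t num_values;
  std::uint64_t total_size;
};

// Append v as a little-endian base-128 varint (7 payload bits per byte).
void PutVarint(std::vector<std::uint8_t>& out, std::uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<std::uint8_t>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<std::uint8_t>(v));
}

// Decode one varint and advance p past it.
std::uint64_t GetVarint(const std::uint8_t*& p) {
  std::uint64_t v = 0;
  int shift = 0;
  while (*p & 0x80) {
    v |= static_cast<std::uint64_t>(*p++ & 0x7F) << shift;
    shift += 7;
  }
  v |= static_cast<std::uint64_t>(*p++) << shift;
  return v;
}

// Compact sequential layout: records have variable width.
std::vector<std::uint8_t> EncodeSequential(const std::vector<ColumnMeta>& cols) {
  std::vector<std::uint8_t> out;
  for (const ColumnMeta& c : cols) {
    PutVarint(out, c.num_values);
    PutVarint(out, c.total_size);
  }
  return out;
}

// Bulkier fixed-width layout: record i lives at byte offset i * sizeof(ColumnMeta).
std::vector<std::uint8_t> EncodeFixed(const std::vector<ColumnMeta>& cols) {
  std::vector<std::uint8_t> out(cols.size() * sizeof(ColumnMeta));
  if (!cols.empty()) std::memcpy(out.data(), cols.data(), out.size());
  return out;
}

// Reaching column i requires decoding columns 0..i first: O(i) work.
// *decoded is incremented once per record that had to be decoded.
ColumnMeta ReadSequential(const std::vector<std::uint8_t>& buf, std::size_t i,
                          std::size_t* decoded) {
  const std::uint8_t* p = buf.data();
  ColumnMeta m{};
  for (std::size_t c = 0; c <= i; ++c) {
    m.num_values = GetVarint(p);
    m.total_size = GetVarint(p);
    ++*decoded;
  }
  return m;
}

// Direct indexing: O(1) work regardless of how many columns precede i.
ColumnMeta ReadFixed(const std::vector<std::uint8_t>& buf, std::size_t i) {
  ColumnMeta m;
  std::memcpy(&m, buf.data() + i * sizeof(ColumnMeta), sizeof(ColumnMeta));
  return m;
}
```

In this toy, reading the last of 1000 columns through `ReadSequential` decodes all 1000 records, while `ReadFixed` touches only one. The fixed-width buffer is larger, which is the size trade-off that goal 2 is meant to offset.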

To run experiments:

mkdir arrow/src/o
cd arrow/src
cmake o --preset ninja-benchmarks
ninja -Co && o/relwithdebinfo/parquet-metadata3-benchmark path-to-footers/*

@alkis alkis requested a review from wgtmac as a code owner August 22, 2024 17:21
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

@corwinjoy

Thanks for posting this!
A few basic initial comments / questions:
To run the build, I think you want the following instructions:

cd arrow/cpp
mkdir o
cd o
cmake .. --preset ninja-benchmarks
cd ..
ninja -Co
o/relwithdebinfo/parquet-metadata3-benchmark path-to-footers/*

Here I split the build and run steps. For now, the benchmark will not run until you generate footers as described in https://github.com/apache/parquet-benchmark/pull/1/files.

@corwinjoy

While running this code, I opened a PR against the parquet-benchmark repository with synthetic footers that can be used to get started (point your path-to-footers/ directory at them):
apache/parquet-benchmark#2

@alkis alkis changed the title from "flatbuffers metadata experiments" to "[GH-43695] flatbuffers metadata experiments" Aug 23, 2024
@alkis alkis changed the title from "[GH-43695] flatbuffers metadata experiments" to "GH-43695: flatbuffers metadata experiments" Aug 23, 2024
@github-actions

⚠️ GitHub issue #43695 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Aug 23, 2024
@kou kou changed the title from "GH-43695: flatbuffers metadata experiments" to "GH-43695: [C++][Parquet] flatbuffers metadata experiments" Aug 23, 2024
@alkis alkis force-pushed the flatbuf3 branch 2 times, most recently from f273033 to 833e1cb September 2, 2024 15:24
@github-actions

Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.

@github-actions github-actions bot added the "Status: stale-warning" label Nov 18, 2025
@alkis
Contributor Author

alkis commented Dec 10, 2025

Closing in favor of #48431

@alkis alkis closed this Dec 10, 2025
Labels

awaiting committer review, Component: C++, Component: Parquet, Status: stale-warning
