bench: add Appian benchmark to the SQL bench matrix #8022
Wires DuckDB's in-tree appian_benchmarks suite into vortex-bench so the same 8 join-heavy queries (~5M rows across 9 LEFT-OUTER-joined views) get the datafusion+duckdb × parquet/vortex/vortex-compact/duckdb treatment that clickbench/tpch/fineweb already get. The workload exercises wide CTE aggregations that the other suites don't.

AppianBenchmark::generate_base_data downloads the upstream .duckdb blob and shells out to duckdb to materialize 9 lowercased Parquet shards, mirroring how realnest/gharchive and public_bi handle their own non-Parquet sources.

The conversion lowercases column names at COPY time so DataFusion's default enable_ident_normalization=true resolves the verbatim camelCase Appian queries (orderItem_quantity, FROM CustomerView, ...) against the schema without per-engine special-casing or query rewriting. This keeps upstream query strings byte-identical, so future q09.sql etc. drop in unchanged.

The CI matrix entry runs appian-nvme at PR time on 5 core engine×format combos (datafusion+duckdb × parquet/vortex plus duckdb:duckdb), with develop fanning out to add vortex-compact for both engines.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
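As a rough illustration of the lowercasing step described in this commit, the sketch below builds a DuckDB `COPY` statement that aliases each column to its lowercase name, and mimics how an unquoted identifier is folded to lowercase by identifier normalization. `lowercase_copy_sql` and `normalize_ident` are hypothetical helpers written for this example, not functions from vortex-bench or DataFusion, and pairing `orderItem_quantity` with `CustomerView` is only for illustration.

```rust
/// Build a DuckDB `COPY` statement that writes `table` to Parquet with every
/// column aliased to its lowercase name.
/// Hypothetical helper for illustration; not the vortex-bench implementation.
fn lowercase_copy_sql(table: &str, columns: &[&str], out_path: &str) -> String {
    let projection = columns
        .iter()
        // Quote both sides so the alias is written exactly as spelled here.
        .map(|c| format!("\"{c}\" AS \"{}\"", c.to_lowercase()))
        .collect::<Vec<_>>()
        .join(", ");
    format!("COPY (SELECT {projection} FROM \"{table}\") TO '{out_path}' (FORMAT PARQUET)")
}

/// Mimic identifier normalization: unquoted identifiers fold to lowercase,
/// double-quoted identifiers keep their spelling (quotes stripped).
/// Simplified model for illustration; not DataFusion's actual code.
fn normalize_ident(ident: &str) -> String {
    match ident.strip_prefix('"').and_then(|s| s.strip_suffix('"')) {
        Some(quoted) => quoted.to_string(),
        None => ident.to_lowercase(),
    }
}
```

With lowercased Parquet columns, an unquoted `orderItem_quantity` in an unmodified query normalizes to `orderitem_quantity` and matches the schema, which is why the query text can stay byte-identical.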
The eight Appian queries were ~340 lines of embedded string literals in appian/mod.rs, which is awkward to read and diff. Pull each one into its own `queries/qXX.sql` file (mirroring the upstream DuckDB layout) and embed via `include_str!` so it stays compile-time with no runtime fs read.

Refreshing from upstream now reduces to dropping new .sql files into `queries/` and adding one line to the `QUERIES` array.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Move the Appian .sql files from `vortex-bench/src/appian/queries/q0N.sql` to `vortex-bench/appian/qN.sql` and load them at runtime through `appian_queries()`, mirroring `tpch_queries()` and `tpcds_queries()`. The prior `include_str!` setup was a workspace-novel pattern; this matches the existing TPC-H convention so reviewers don't have to learn a new one.

Side effect: query indices in bench output are now 1-based (q1..q8) like TPC-H, instead of the 0-based numbering the old `enumerate()` produced. No historical Appian results to break since this is a new benchmark.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
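A minimal sketch of what a 1-based runtime loader of this shape could look like, assuming the `qN.sql` naming and the count of 8 from the commit message. `appian_query_paths` and `load_appian_queries` are hypothetical names for this example; the real `appian_queries()` and `tpch_queries()` signatures are not shown in this PR text and may differ.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Number of Appian queries mentioned in the PR.
const QUERY_COUNT: usize = 8;

/// Return `(query index, file path)` pairs for q1..q8 under `dir`.
/// Indices start at 1, unlike `enumerate()`, which would start at 0.
fn appian_query_paths(dir: &Path) -> Vec<(usize, PathBuf)> {
    (1..=QUERY_COUNT)
        .map(|idx| (idx, dir.join(format!("q{idx}.sql"))))
        .collect()
}

/// Read every query file at runtime; fails on the first file that cannot be read.
fn load_appian_queries(dir: &Path) -> io::Result<Vec<(usize, String)>> {
    appian_query_paths(dir)
        .into_iter()
        .map(|(idx, path)| fs::read_to_string(path).map(|sql| (idx, sql)))
        .collect()
}
```

Under these assumptions, picking up a new upstream query would only require raising the count and adding the file, with no change to the query text.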
Merging this PR will not alter performance
The module-level `//!` doc had `[`TABLES`]` as an intra-doc link to a private const, which `Rust (docs)` CI flags via `-D warnings` -> `rustdoc::private-intra-doc-links`. Replace the bracket link with plain code formatting; the reference is in the same file and a reader can find it visually without the navigation aid.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
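The shape of the fix can be sketched on a small public function whose docs mention a private const. The table names and `table_count` are hypothetical stand-ins; the actual contents of `TABLES` are not shown in this PR text.

```rust
// Private const: public documentation cannot link to it without tripping the
// private intra-doc link lint. The names below are placeholders.
const TABLES: &[&str] = &["CustomerView", "OrderView"];

// The doc comment below names the const with plain backticks only. Wrapping
// the name in square brackets would turn it into an intra-doc link, which is
// the pattern the docs CI rejected.

/// Returns how many tables are listed in `TABLES`.
pub fn table_count() -> usize {
    TABLES.len()
}
```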
Benchmarks: PolarSignals Profiling
Vortex (geomean): 0.995x ➖
datafusion / vortex-file-compressed (0.995x ➖, 1↑ 0↓)
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.985x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.022x ➖, 0↑ 1↓)
datafusion / parquet (0.952x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.962x ➖, 2↑ 0↓)
duckdb / vortex-compact (1.018x ➖, 0↑ 0↓)
duckdb / parquet (0.988x ➖, 1↑ 0↓)
File Sizes: PolarSignals Profiling
No file size changes detected.
File Sizes: FineWeb NVMe
No file size changes detected.
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.983x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.996x ➖, 0↑ 0↓)
datafusion / parquet (1.006x ➖, 0↑ 0↓)
datafusion / arrow (1.002x ➖, 1↑ 1↓)
duckdb / vortex-file-compressed (1.018x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.993x ➖, 0↑ 0↓)
duckdb / parquet (1.006x ➖, 1↑ 1↓)
duckdb / duckdb (1.013x ➖, 0↑ 0↓)
File Sizes: TPC-H SF=1 on NVME
No file size changes detected.
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.007x ➖, 0↑ 3↓)
datafusion / vortex-compact (0.995x ➖, 1↑ 1↓)
datafusion / parquet (1.001x ➖, 2↑ 0↓)
duckdb / vortex-file-compressed (1.000x ➖, 1↑ 0↓)
duckdb / vortex-compact (0.995x ➖, 2↑ 0↓)
duckdb / parquet (0.999x ➖, 0↑ 2↓)
duckdb / duckdb (0.998x ➖, 1↑ 1↓)
File Sizes: TPC-DS SF=1 on NVME
No file size changes detected.
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.203x ➖, 0↑ 3↓)
datafusion / vortex-compact (1.021x ➖, 0↑ 1↓)
datafusion / parquet (1.117x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.145x ➖, 0↑ 2↓)
duckdb / vortex-compact (1.049x ➖, 0↑ 1↓)
duckdb / parquet (1.047x ➖, 0↑ 0↓)
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.971x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.994x ➖, 0↑ 0↓)
duckdb / parquet (0.975x ➖, 0↑ 0↓)
File Sizes: Statistical and Population Genetics
No file size changes detected.
Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.061x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.015x ➖, 0↑ 0↓)
datafusion / parquet (1.022x ➖, 0↑ 0↓)
datafusion / arrow (1.017x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.030x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.001x ➖, 0↑ 0↓)
duckdb / parquet (0.989x ➖, 0↑ 0↓)
duckdb / duckdb (1.029x ➖, 0↑ 0↓)
File Sizes: TPC-H SF=10 on NVME
No file size changes detected.
Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.998x ➖, 1↑ 0↓)
datafusion / parquet (0.998x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.984x ➖, 4↑ 1↓)
duckdb / parquet (1.001x ➖, 0↑ 0↓)
duckdb / duckdb (1.010x ➖, 2↑ 2↓)
File Sizes: Clickbench on NVME
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.196x ➖, 0↑ 6↓)
datafusion / vortex-compact (1.118x ➖, 0↑ 4↓)
datafusion / parquet (1.156x ➖, 0↑ 6↓)
duckdb / vortex-file-compressed (1.114x ➖, 0↑ 1↓)
duckdb / vortex-compact (1.073x ➖, 0↑ 0↓)
duckdb / parquet (1.085x ➖, 0↑ 0↓)
Benchmarks: Appian on NVME
Vortex (geomean): no vortex data
datafusion / vortex-file-compressed (no group data, 0↑ 0↓)
datafusion / parquet (no group data, 0↑ 0↓)
duckdb / vortex-file-compressed (no group data, 0↑ 0↓)
duckdb / parquet (no group data, 0↑ 0↓)
duckdb / duckdb (no group data, 0↑ 0↓)
File Sizes: Appian on NVME
No baseline file sizes available yet.
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.213x ➖, 0↑ 8↓)
datafusion / vortex-compact (1.245x ➖, 0↑ 5↓)
datafusion / parquet (1.178x ➖, 0↑ 7↓)
duckdb / vortex-file-compressed (1.094x ➖, 0↑ 1↓)
duckdb / vortex-compact (1.079x ➖, 0↑ 1↓)
duckdb / parquet (1.072x ➖, 0↑ 0↓)
connortsui20
left a comment
How important is it that we add this now? Ideally I would like to get #7849 in first, but if this is something we really want, I can work around it.
Summary
- Adds the `appian_benchmarks` suite to `vortex-bench`: 8 join-heavy queries against 9 LEFT-OUTER-joined views (~5M rows from the upstream `ads.5M` dataset). The workload exercises wide CTE aggregations that TPC-H, clickbench, and fineweb don't. The layout mirrors TPC-H: `vortex-bench/appian/q{1..8}.sql` plus an `appian_queries()` runtime loader modeled on `tpch_queries()` in `vortex-bench/src/tpch/mod.rs:25`.
- `AppianBenchmark::generate_base_data` downloads the ~593 MB upstream `.duckdb` blob and shells out to `duckdb` to project each of 9 tables into Parquet, lowercasing every column at COPY time. Same pattern as `realnest/gharchive.rs:116` and `public_bi.rs:424`; uses the `duckdb` CLI already required by `data-gen` for the `OnDiskDuckDB` format.
- Lowercasing the columns (see `TableSpec`) lets DataFusion's default `enable_ident_normalization=true` resolve the verbatim camelCase Appian queries (`FROM CustomerView`, `orderItem_quantity`) without per-engine query rewriting. Upstream `qN.sql` files drop in byte-identically. See the module docs in `vortex-bench/src/appian/mod.rs` for the full rationale and the rejected alternatives.
- `appian-nvme` matrix entry in `.github/workflows/sql-benchmarks.yml`: PR runs datafusion+duckdb × {parquet, vortex} + duckdb:duckdb (5 combos); develop fans out to add vortex-compact on both engines. The pattern mirrors `clickbench-nvme`.

Test plan
- `cargo build -p vortex-bench --tests`
- `cargo clippy -p vortex-bench --all-targets --all-features`
- `cargo +nightly fmt --all -- --check`
- `cargo test -p vortex-bench --lib live_dims_match_migrate_for_non_fan_out_suites`
- `yamllint --strict -c .yamllint.yaml .github/workflows/sql-benchmarks.yml`
- `data-gen appian --formats parquet,vortex,vortex-compact` produces 9 shards per format; idempotent re-run skips conversion
- `appian-nvme` matrix entry succeeds on the bench-dedicated runner once the PR opens

🤖 Generated with Claude Code
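To make the matrix arithmetic in the summary concrete, the sketch below enumerates the engine:format combinations for a PR run versus a develop run. `Trigger` and `appian_matrix` are hypothetical names for this example; the real matrix lives in the workflow YAML, not in Rust, and the `engine:format` label style is taken from the `duckdb:duckdb` spelling above.

```rust
/// Which CI event triggered the benchmark run (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Trigger {
    PullRequest,
    Develop,
}

/// Enumerate `engine:format` labels for the Appian matrix.
/// PR runs cover parquet and vortex on both engines plus duckdb:duckdb;
/// develop additionally covers vortex-compact on both engines.
fn appian_matrix(trigger: Trigger) -> Vec<String> {
    let engines = ["datafusion", "duckdb"];
    let mut formats = vec!["parquet", "vortex"];
    if trigger == Trigger::Develop {
        formats.push("vortex-compact");
    }
    let mut combos: Vec<String> = engines
        .iter()
        .flat_map(|e| formats.iter().map(move |f| format!("{e}:{f}")))
        .collect();
    // DuckDB's native format only applies to the duckdb engine.
    combos.push("duckdb:duckdb".to_string());
    combos
}
```

Under these assumptions a PR run yields 2 × 2 + 1 = 5 combinations and a develop run yields 2 × 3 + 1 = 7.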