Flag lukewarm cold runs and account for them in the Combined metric by toschmidt · Pull Request #954 · ClickHouse/ClickBench

toschmidt · 2026-06-22T09:26:48Z

ClickBench's "cold run" is only meaningful when the data is actually read back from the storage device, i.e., the engine is restarted (or never persists between queries) and the OS page cache is dropped before the first run. Systems that instead run an unseen query against a still-warm engine, or managed/remote services whose caches can't be flushed at all, do a lukewarm run and are supposed to carry the lukewarm-cold-run tag.

A number of such systems were never tagged, which lets their warm "cold" numbers compete directly with genuinely cold systems, distorting the combined ranking. This PR flags them and reworks how the Combined metric treats lukewarm (and otherwise inapplicable) components.

Changes (3 commits)

1. `flag lukewarm cold runs`

Adds `lukewarm-cold-run` to systems that do not perform a true cold run:

- Persistent local daemons never restarted between queries (only the page cache is dropped, engine-internal caches survive): impala, firebolt.
- Managed/remote services that can't be restarted or cache-flushed: clickhouse-cloud, motherduck, databricks, snowflake, redshift(+serverless), bigquery, athena(+partitioned), aurora-mysql, aurora-postgresql, alloydb, hologres, supabase, tablespace, tembo-olap, chyt, bytehouse, crunchy-bridge-for-analytics, timescale-cloud, tinybird, singlestore, hydra.

2. `reweight the combined metric for lukewarm and inapplicable metrics`

Unifies the per-metric exclusion rules in one `metricExcludes()` helper (stateless→load, in-memory→cold/combined/load, lukewarm→cold, missing data size→size) and reuses it everywhere:

- Cold Run ranking excludes lukewarm systems by default.
- The Combined per-query cold baseline excludes lukewarm/in-memory systems, so their warm "cold" numbers can't depress the baseline and inflate every true-cold system's cold ratio. The minimum load time and minimum data size likewise skip non-qualifying systems.
- A metric that doesn't apply to a system is dropped and the remaining weights are renormalized, rather than feeding a bogus ratio into the score. For lukewarm systems the cold component contributes nothing and its weight is folded into hot (load 10% / size 10% / hot 80%).
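
The exclusion rules and the filtered cold baseline can be sketched roughly as follows. This is a Python illustration only, not the PR's code: the function names, the `stateless` and `in-memory` tag strings, and the shape of the system records are assumptions made for the example; only the `lukewarm-cold-run` tag and the rule mapping come from the description above.

```python
def metric_excludes(tags, data_size):
    """Return the set of metrics a system does not qualify for.

    Mirrors the mapping in the description: stateless -> load,
    in-memory -> cold/combined/load, lukewarm -> cold,
    missing data size -> size. Tag spellings other than
    "lukewarm-cold-run" are assumed for illustration.
    """
    excluded = set()
    if "stateless" in tags:
        excluded.add("load")
    if "in-memory" in tags:
        excluded |= {"cold", "combined", "load"}
    if "lukewarm-cold-run" in tags:
        excluded.add("cold")
    if data_size is None:
        excluded.add("size")
    return excluded


def cold_baseline(systems):
    """Per-query minimum cold time over systems that qualify for 'cold'.

    Each system is a hypothetical dict with "tags", "data_size" and a
    "cold" list holding one time per query. Returns None when no
    system qualifies (for example an all-lukewarm selection).
    """
    qualifying = [
        s for s in systems
        if "cold" not in metric_excludes(s["tags"], s.get("data_size"))
    ]
    if not qualifying:
        return None
    query_count = len(qualifying[0]["cold"])
    return [min(s["cold"][q] for s in qualifying) for q in range(query_count)]
```

With this shape, a lukewarm system's fast "cold" times never enter the per-query minimum, so they cannot lower the baseline that true-cold systems are measured against.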

3. `show individual hot/cold/load/storage ratios in the combined view`

Expands the Combined score cell to also list the four component ratios that feed it (hot, cold, load, storage), each as the relative ×N ratio (n/a when a component doesn't apply).

[Screenshot of the ClickBench page]

ClickBench's "cold run" is only truly cold when the data must be read back from the storage device: the engine is restarted (or never persists between queries) AND the OS page cache is dropped before the first try. A system that instead runs an unseen query against a live engine with warm internal caches does a "lukewarm" run and is supposed to carry the "lukewarm-cold-run" tag.

Several systems that do lukewarm runs were never tagged; this adds the tag to them. Newly flagged:

- Persistent local daemons that are never restarted between queries, so only the page cache is dropped while engine-internal caches survive: impala, firebolt.
- Managed/remote services that cannot be restarted or have their caches flushed at all: clickhouse-cloud, motherduck, databricks, snowflake, redshift(+serverless), bigquery, athena(+partitioned), aurora-mysql, aurora-postgresql, alloydb, hologres, supabase, tablespace, tembo-olap, chyt, bytehouse, crunchy-bridge-for-analytics, timescale-cloud, tinybird, singlestore, hydra.

The tag is added to the displayed result files and, so future runs stay flagged, to each system's metadata source: template.json (impala, firebolt) and the inline JSON generators (clickhouse-cloud/collect-results.sh, motherduck/benchmark.py, databricks/benchmark.py).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The Combined score is a weighted geomean of load (10%), data size (10%), cold (20%) and hot (60%) ratios. That unfairly penalizes systems for metrics that don't apply to them, and lets lukewarm "cold" numbers (really warm queries) distort the cold component.

Unify the per-metric exclusion rules in a single metricExcludes() helper (stateless from load, in-memory from cold/combined/load, lukewarm from cold, missing data size from size) and reuse it everywhere:

- Cold Run metric: lukewarm systems are excluded from the ranking by default.
- Combined per-query baseline: the cold-run minimum excludes lukewarm / in-memory systems, so their warm "cold" numbers can't depress the baseline and inflate every true-cold system's cold ratio. min load time / min data size likewise exclude systems that don't qualify.
- Combined score: a metric that doesn't apply to a system is dropped and the remaining weights are renormalized, instead of feeding a bogus ratio. Lukewarm systems keep a cold component of 0 with its weight folded into hot (load 10% / size 10% / hot 80%); a stateless engine that still reports a load time (e.g. Polars (Parquet)) drops the load component; etc. The cold term is guarded so an all-lukewarm selection (empty cold baseline) can't poison the score with NaN.

The Combined view still shows only the single overall score; the per-component breakdown is added in a follow-up commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
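
As a rough illustration of the reweighting described in this commit message, the following Python sketch computes a weighted geometric mean over only the applicable components. The weights are the ones stated above; the function name, its parameters and the handling of unusable ratios are assumptions for the example, and the PR's actual implementation may differ.

```python
import math

# Base weights as stated in the commit message.
BASE_WEIGHTS = {"load": 0.10, "size": 0.10, "cold": 0.20, "hot": 0.60}


def combined_score(ratios, excluded=frozenset(), lukewarm=False):
    """Weighted geomean of component ratios, skipping inapplicable ones.

    ratios:   dict of metric name -> relative ratio (1.0 = best).
    excluded: metrics that do not apply to this system.
    lukewarm: if True, the cold weight is folded into hot (10/10/80).
    Returns None when no component is usable.
    """
    weights = dict(BASE_WEIGHTS)
    if lukewarm:
        # Cold contributes nothing; hot absorbs its 20%.
        weights["hot"] += weights.pop("cold")

    usable = {}
    for metric, weight in weights.items():
        ratio = ratios.get(metric)
        if metric in excluded or ratio is None:
            continue
        # Guard: a NaN or non-positive ratio (e.g. from an empty cold
        # baseline) is skipped instead of poisoning the whole score.
        if not math.isfinite(ratio) or ratio <= 0:
            continue
        usable[metric] = weight

    total = sum(usable.values())
    if total == 0:
        return None
    # Dividing by the total renormalizes the remaining weights.
    log_sum = sum(w * math.log(ratios[m]) for m, w in usable.items())
    return math.exp(log_sum / total)
```

For example, a stateless system with `excluded={"load"}` is scored on size, cold and hot with weights 0.1, 0.2 and 0.6 rescaled to sum to 1, so a reported load time has no effect on its score.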

The Combined metric collapses load, data size, cold and hot into a single weighted-geomean "×N" score, which hides how a system earned it.

In the Combined view, expand the score cell to also list the four component ratios that feed it: hot, cold, load and storage, each as the relative "×N" ratio (a component that doesn't apply to the system, e.g. cold for a lukewarm engine or load for a stateless one, shows "n/a"). Each ratio is padded to a fixed width with non-breaking spaces so the columns line up in the monospace cell (6 fits "×12.34" for hot/cold/load, 5 fits "×1.23" for storage), and the overall score is shown in bold. Other metric views are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
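
The fixed-width padding described in this commit message could look something like the Python sketch below. The widths (6 and 5) and the "n/a" placeholder come from the message; the function names, the two-decimal formatting, the choice to pad on the right, and the label layout are assumptions for illustration only.

```python
NBSP = "\u00a0"  # non-breaking space, so HTML does not collapse the padding

# Widths from the commit message: 6 fits "×12.34", 5 fits "×1.23".
WIDTHS = {"hot": 6, "cold": 6, "load": 6, "storage": 5}


def format_ratio(ratio, width):
    """Render one ratio as "×N.NN" (or "n/a"), padded to a fixed width."""
    text = "n/a" if ratio is None else f"×{ratio:.2f}"
    # Assumed: padding goes on the right; the source does not say which side.
    return text.ljust(width, NBSP)


def component_cell(ratios):
    """Join the four padded component ratios into one monospace line."""
    parts = [
        f"{name} {format_ratio(ratios.get(name), width)}"
        for name, width in WIDTHS.items()
    ]
    return " ".join(parts)
```

Because every entry occupies the same number of characters regardless of its value, the labels line up from row to row in a monospace cell.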

toschmidt and others added 3 commits June 22, 2026 11:23

toschmidt wants to merge 3 commits into ClickHouse:main from umbra-db:schmidt/lukewarm-cold-runs
