Flag lukewarm cold runs and account for them in the Combined metric #954

Open
toschmidt wants to merge 3 commits into ClickHouse:main from umbra-db:schmidt/lukewarm-cold-runs

@toschmidt toschmidt commented Jun 22, 2026

Contributor

ClickBench's "cold run" is only meaningful when the data is actually read back from the storage device, i.e., the engine is restarted (or never persists between queries) and the OS page cache is dropped before the first run. Systems that instead run an unseen query against a still-warm engine, or managed/remote services whose caches can't be flushed at all, do a lukewarm run and are supposed to carry the lukewarm-cold-run tag.

A number of such systems were never tagged, so their warm "cold" numbers compete directly with those of genuinely cold systems and distort the combined ranking. This PR flags them and reworks how the Combined metric treats lukewarm (and otherwise inapplicable) components.

Changes (3 commits)

1. flag lukewarm cold runs

Adds lukewarm-cold-run to systems that do not perform a true cold run:

  • Persistent local daemons that are never restarted between queries (only the page cache is dropped; engine-internal caches survive): impala, firebolt.
  • Managed/remote services that can't be restarted or cache-flushed: clickhouse-cloud, motherduck, databricks, snowflake, redshift(+serverless), bigquery, athena(+partitioned), aurora-mysql, aurora-postgresql, alloydb, hologres, supabase, tablespace, tembo-olap, chyt, bytehouse, crunchy-bridge-for-analytics, timescale-cloud, tinybird, singlestore, hydra.

2. reweight the combined metric for lukewarm and inapplicable metrics

Unifies the per-metric exclusion rules in one metricExcludes() helper (stateless→load, in-memory→cold/combined/load, lukewarm→cold, missing data size→size) and reuses it everywhere:

  • Cold Run ranking excludes lukewarm systems by default.
  • The Combined per-query cold baseline excludes lukewarm/in-memory systems, so their warm "cold" numbers can't depress the baseline and inflate every true-cold system's cold ratio. The minimum load time and minimum data size baselines likewise skip non-qualifying systems.
  • A metric that doesn't apply to a system is dropped and the remaining weights are renormalized, rather than feeding in a bogus ratio. Lukewarm systems keep a cold component of 0, with its weight folded into hot (load 10% / size 10% / hot 80%).
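
The exclusion rules above could be sketched roughly as follows. This is an illustration, not the code in the PR: only the name metricExcludes() and the lukewarm-cold-run tag come from the description, while the signature, the `SystemResult` shape, the `stateless` / `in-memory` tag spellings, and the `coldBaseline()` helper are assumptions made for the example.

```typescript
// Hypothetical sketch; the real metricExcludes() may take different arguments.
type Metric = "hot" | "cold" | "load" | "size" | "combined";

// Assumed, simplified result shape (one cold time instead of per-query times).
interface SystemResult {
  system: string;
  tags: string[];
  data_size?: number | null; // may be missing
  cold: number;              // illustrative cold-run time in seconds
}

// Returns true when `metric` should NOT be ranked or used as a baseline for `r`.
function metricExcludes(metric: Metric, r: SystemResult): boolean {
  const has = (tag: string): boolean => r.tags.indexOf(tag) !== -1;
  switch (metric) {
    case "load":
      return has("stateless") || has("in-memory"); // assumed tag names
    case "cold":
      return has("in-memory") || has("lukewarm-cold-run");
    case "combined":
      return has("in-memory");
    case "size":
      return r.data_size === undefined || r.data_size === null;
    default:
      return false; // hot runs always apply
  }
}

// Cold baseline over qualifying systems only; null when no system qualifies.
function coldBaseline(results: SystemResult[]): number | null {
  const times = results
    .filter((r) => !metricExcludes("cold", r))
    .map((r) => r.cold);
  return times.length > 0 ? Math.min(...times) : null;
}
```

With one true-cold system at 2.0 s and one lukewarm system at 0.1 s, the baseline in this sketch stays at 2.0 s, so the lukewarm number no longer inflates the true-cold system's ratio.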

3. show individual hot/cold/load/storage ratios in the combined view

Expands the Combined score cell to also list the four component ratios that feed it (hot, cold, load, storage), each as the relative ×N ratio (n/a when a component doesn't apply).

[Screenshot: ClickBench, a Benchmark For Analytical DBMS]

toschmidt and others added 3 commits June 22, 2026 11:23
ClickBench's "cold run" is only truly cold when the data must be read
back from the storage device: the engine is restarted (or never
persists between queries) AND the OS page cache is dropped before the
first try. A system that instead runs an unseen query against a live
engine with warm internal caches does a "lukewarm" run and is supposed
to carry the "lukewarm-cold-run" tag. Several systems that do lukewarm
runs were never tagged; this adds the tag to them.

Newly flagged:
- Persistent local daemons that are never restarted between queries, so
  only the page cache is dropped while engine-internal caches survive:
  impala, firebolt.
- Managed/remote services that cannot be restarted or have their caches
  flushed at all: clickhouse-cloud, motherduck, databricks, snowflake,
  redshift(+serverless), bigquery, athena(+partitioned), aurora-mysql,
  aurora-postgresql, alloydb, hologres, supabase, tablespace,
  tembo-olap, chyt, bytehouse, crunchy-bridge-for-analytics,
  timescale-cloud, tinybird, singlestore, hydra.

The tag is added to the displayed result files and, so future runs stay
flagged, to each system's metadata source: template.json (impala,
firebolt) and the inline JSON generators
(clickhouse-cloud/collect-results.sh, motherduck/benchmark.py,
databricks/benchmark.py).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Combined score is a weighted geomean of load (10%), data size (10%),
cold (20%) and hot (60%) ratios. That unfairly penalizes systems for
metrics that don't apply to them, and lets lukewarm "cold" numbers
(really warm queries) distort the cold component.
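
Written out, the weighted geomean described above is as follows. The symbols $r_{\text{load}}$, $r_{\text{size}}$, $r_{\text{cold}}$ and $r_{\text{hot}}$ are notation introduced here for each component's ratio; they are not names from the code.

$$
\text{Combined} = r_{\text{load}}^{0.1} \cdot r_{\text{size}}^{0.1} \cdot r_{\text{cold}}^{0.2} \cdot r_{\text{hot}}^{0.6}
$$

Because the exponents sum to 1, a system whose four ratios are all 1 scores exactly 1.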

Unify the per-metric exclusion rules in a single metricExcludes()
helper (stateless from load, in-memory from cold/combined/load, lukewarm
from cold, missing data size from size) and reuse it everywhere:

- Cold Run metric: lukewarm systems are excluded from the ranking by
  default.
- Combined per-query baseline: the cold-run minimum excludes lukewarm /
  in-memory systems, so their warm "cold" numbers can't depress the
  baseline and inflate every true-cold system's cold ratio. min load
  time / min data size likewise exclude systems that don't qualify.
- Combined score: a metric that doesn't apply to a system is dropped and
  the remaining weights are renormalized, instead of feeding a bogus
  ratio. Lukewarm systems keep a cold component of 0 with its weight
  folded into hot (load 10% / size 10% / hot 80%); a stateless engine
  that still reports a load time (e.g. Polars (Parquet)) drops the load
  component; etc. The cold term is guarded so an all-lukewarm selection
  (empty cold baseline) can't poison the score with NaN.
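
A minimal sketch of the reweighting described in the list above, under stated assumptions: the function name `combinedScore`, its parameters, and the choice to return null when no component applies are hypothetical and not taken from the PR. Only the base weights, the lukewarm fold of cold into hot, and the renormalization come from the text.

```typescript
type Component = "load" | "size" | "cold" | "hot";

const COMPONENTS: Component[] = ["load", "size", "cold", "hot"];

// Base weights from the commit message: load 10%, size 10%, cold 20%, hot 60%.
const BASE_WEIGHTS: Record<Component, number> = { load: 0.1, size: 0.1, cold: 0.2, hot: 0.6 };

// Hypothetical helper: weighted geomean over the components that apply.
function combinedScore(
  ratios: Record<Component, number>,
  lukewarm: boolean,
  notApplicable: Component[] = []
): number | null {
  const weights: Record<Component, number> = {
    load: BASE_WEIGHTS.load,
    size: BASE_WEIGHTS.size,
    cold: BASE_WEIGHTS.cold,
    hot: BASE_WEIGHTS.hot,
  };

  // Lukewarm: the cold weight is folded into hot (10% / 10% / 0% / 80%).
  if (lukewarm) {
    weights.hot += weights.cold;
    weights.cold = 0;
  }
  // Any other inapplicable metric is simply dropped.
  for (const c of notApplicable) {
    weights[c] = 0;
  }

  let total = 0;
  let logSum = 0;
  for (const c of COMPONENTS) {
    const w = weights[c];
    const r = ratios[c];
    // Guard: skip zero-weight terms and non-finite ratios, so that
    // 0 * log(NaN) from an empty baseline cannot turn the score into NaN.
    if (w === 0 || !isFinite(r) || r <= 0) continue;
    total += w;
    logSum += w * Math.log(r);
  }
  // Dividing by `total` renormalizes the remaining weights to sum to 1.
  return total > 0 ? Math.exp(logSum / total) : null;
}
```

In this sketch a lukewarm system with a hot ratio of 2 and load and size ratios of 1 scores 2^0.8 even if its cold ratio is NaN, and a stateless system that drops load is scored with weights 0.1 / 0.2 / 0.6 rescaled by their sum of 0.9.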

The Combined view still shows only the single overall score; the
per-component breakdown is added in a follow-up commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Combined metric collapses load, data size, cold and hot into a
single weighted-geomean "×N" score, which hides how a system earned it.

In the Combined view, expand the score cell to also list the four
component ratios that feed it: hot, cold, load and storage, each as the
relative "×N" ratio (a component that doesn't apply to the system, e.g.
cold for a lukewarm engine or load for a stateless one, shows "n/a").
Each ratio is padded to a fixed width with non-breaking spaces so the
columns line up in the monospace cell (6 fits "×12.34" for
hot/cold/load, 5 fits "×1.23" for storage), and the overall score is
shown in bold. Other metric views are unchanged.
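
The fixed-width padding could look roughly like the sketch below. The function name `formatRatio` is hypothetical, and padding on the left (right-aligning the value) is an assumption; the text above only states that non-breaking spaces are used and gives the widths.

```typescript
const NBSP = "\u00a0"; // non-breaking space, so HTML does not collapse the padding

// Hypothetical formatter: "×N.NN" or "n/a", padded on the left to `width` characters.
function formatRatio(ratio: number | null, width: number): string {
  let out = ratio === null ? "n/a" : "×" + ratio.toFixed(2);
  while (out.length < width) {
    out = NBSP + out;
  }
  return out;
}
```

Here `formatRatio(12.34, 6)` fills the width exactly, while `formatRatio(null, 6)` yields "n/a" preceded by three non-breaking spaces.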

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>