Flag lukewarm cold runs and account for them in the Combined metric #954
Open
toschmidt wants to merge 3 commits into
ClickBench's "cold run" is only truly cold when the data must be read back from the storage device: the engine is restarted (or never persists between queries) AND the OS page cache is dropped before the first try. A system that instead runs an unseen query against a live engine with warm internal caches does a "lukewarm" run and is supposed to carry the "lukewarm-cold-run" tag. Several systems that do lukewarm runs were never tagged; this adds the tag to them.

Newly flagged:

- Persistent local daemons that are never restarted between queries, so only the page cache is dropped while engine-internal caches survive: impala, firebolt.
- Managed/remote services that cannot be restarted or have their caches flushed at all: clickhouse-cloud, motherduck, databricks, snowflake, redshift(+serverless), bigquery, athena(+partitioned), aurora-mysql, aurora-postgresql, alloydb, hologres, supabase, tablespace, tembo-olap, chyt, bytehouse, crunchy-bridge-for-analytics, timescale-cloud, tinybird, singlestore, hydra.

The tag is added to the displayed result files and, so future runs stay flagged, to each system's metadata source: template.json (impala, firebolt) and the inline JSON generators (clickhouse-cloud/collect-results.sh, motherduck/benchmark.py, databricks/benchmark.py).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
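As an illustration of the tagging step, here is a minimal sketch of adding the tag to a metadata object without duplicating it. It assumes the metadata is a JSON object with a "tags" list; the helper name `add_lukewarm_tag` and the sample metadata are hypothetical and not taken from the PR.

```python
import json

LUKEWARM_TAG = "lukewarm-cold-run"


def add_lukewarm_tag(metadata: dict) -> dict:
    """Return a copy of `metadata` whose "tags" list contains the tag exactly once."""
    tags = list(metadata.get("tags", []))  # copy, so the input is not mutated
    if LUKEWARM_TAG not in tags:
        tags.append(LUKEWARM_TAG)
    return {**metadata, "tags": tags}


# Hypothetical metadata, shaped like a minimal template.json (assumed layout).
template = json.loads('{"system": "ExampleDB", "tags": ["column-oriented"]}')
tagged = add_lukewarm_tag(template)
print(json.dumps(tagged))
```

Making the helper idempotent means re-running a result generator would not append the tag a second time.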
The Combined score is a weighted geomean of load (10%), data size (10%), cold (20%) and hot (60%) ratios. Applying all four weights to every system unfairly penalizes systems for metrics that don't apply to them, and lets lukewarm "cold" numbers (really warm queries) distort the cold component.

Unify the per-metric exclusion rules in a single metricExcludes() helper (stateless from load, in-memory from cold/combined/load, lukewarm from cold, missing data size from size) and reuse it everywhere:

- Cold Run metric: lukewarm systems are excluded from the ranking by default.
- Combined per-query baseline: the cold-run minimum excludes lukewarm / in-memory systems, so their warm "cold" numbers can't depress the baseline and inflate every true-cold system's cold ratio. The minimum load time and minimum data size likewise exclude systems that don't qualify.
- Combined score: a metric that doesn't apply to a system is dropped and the remaining weights are renormalized, instead of feeding a bogus ratio. Lukewarm systems keep a cold component of 0 with its weight folded into hot (load 10% / size 10% / hot 80%); a stateless engine that still reports a load time (e.g. Polars (Parquet)) drops the load component; and so on. The cold term is guarded so an all-lukewarm selection (empty cold baseline) can't poison the score with NaN.

The Combined view still shows only the single overall score; the per-component breakdown is added in a follow-up commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
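The reweighting can be sketched as follows. This is one reading of the commit message, not the PR's actual code: the function name `combined_score`, the `lukewarm` flag, and the use of `None` for an inapplicable metric are assumptions. Lukewarm systems fold the cold weight into hot; any other missing or non-finite component is dropped and the remaining weights are renormalized.

```python
import math

# Weights quoted in the commit message.
WEIGHTS = {"load": 0.10, "size": 0.10, "cold": 0.20, "hot": 0.60}


def combined_score(ratios, lukewarm=False):
    """Weighted geometric mean of per-metric ratios.

    `ratios` maps a metric name to its ratio, or to None when the metric does
    not apply to the system (an assumed convention for this sketch).
    """
    weights = dict(WEIGHTS)
    ratios = dict(ratios)
    if lukewarm:
        # Fold the cold weight into hot: load 10% / size 10% / hot 80%.
        weights["hot"] += weights.pop("cold")
        ratios.pop("cold", None)
    # Keep only components with a usable ratio; this also guards against NaN.
    usable = {
        metric: weight
        for metric, weight in weights.items()
        if ratios.get(metric) is not None
        and math.isfinite(ratios[metric])
        and ratios[metric] > 0
    }
    total = sum(usable.values())
    if total == 0:
        return None
    log_sum = sum(weight * math.log(ratios[metric]) for metric, weight in usable.items())
    return math.exp(log_sum / total)  # dividing by `total` renormalizes the weights
```

For example, a stateless engine with `{"load": None, "size": 1.0, "cold": 1.0, "hot": 8.0}` ends up with a hot weight of 0.6 / 0.9 = 2/3, so its score is 8^(2/3) = 4.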
The Combined metric collapses load, data size, cold and hot into a single weighted-geomean "×N" score, which hides how a system earned it.

In the Combined view, expand the score cell to also list the four component ratios that feed it: hot, cold, load and storage, each as the relative "×N" ratio (a component that doesn't apply to the system, e.g. cold for a lukewarm engine or load for a stateless one, shows "n/a"). Each ratio is padded to a fixed width with non-breaking spaces so the columns line up in the monospace cell (6 fits "×12.34" for hot/cold/load, 5 fits "×1.23" for storage), and the overall score is shown in bold. Other metric views are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
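A minimal sketch of the fixed-width formatting described above. The widths (6 and 5) and the "n/a" placeholder come from the commit message; the function names, the two-decimal format, and padding on the left rather than the right are assumptions.

```python
NBSP = "\u00a0"  # non-breaking space, so the padding is not collapsed in HTML

# Column widths from the commit message: 6 fits "×12.34", 5 fits "×1.23".
WIDTHS = {"hot": 6, "cold": 6, "load": 6, "storage": 5}


def format_ratio(ratio, width):
    """Render a ratio as "×N" (or "n/a"), left-padded to `width` with NBSPs."""
    text = "n/a" if ratio is None else f"×{ratio:.2f}"
    return text.rjust(width, NBSP)


def format_breakdown(ratios):
    """Join the four components in a fixed order, one padded cell per component."""
    return " ".join(
        f"{name}{NBSP}{format_ratio(ratios.get(name), width)}"
        for name, width in WIDTHS.items()
    )


print(format_breakdown({"hot": 1.5, "cold": None, "load": 12.34, "storage": 1.23}))
```

Ratios wider than the column (for example "×123.45") are left unpadded rather than truncated in this sketch.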
ClickBench's "cold run" is only meaningful when the data is actually read back from the storage device, i.e., the engine is restarted (or never persists between queries) and the OS page cache is dropped before the first run. Systems that instead run an unseen query against a still-warm engine, or managed/remote services whose caches can't be flushed at all, do a lukewarm run and are supposed to carry the `lukewarm-cold-run` tag.

A number of such systems were never tagged, which lets their warm "cold" numbers compete directly with genuinely cold systems, distorting the combined ranking. This PR flags them and reworks how the Combined metric treats lukewarm (and otherwise inapplicable) components.
Changes (3 commits)

1. flag lukewarm cold runs

Adds `lukewarm-cold-run` to systems that do not perform a true cold run: `impala`, `firebolt`.

2. reweight the combined metric for lukewarm and inapplicable metrics

Unifies the per-metric exclusion rules in one `metricExcludes()` helper (stateless→load, in-memory→cold/combined/load, lukewarm→cold, missing data size→size) and reuses it everywhere, including the `min` load time / data size computations, which skip non-qualifying systems.

3. show individual hot/cold/load/storage ratios in the combined view

Expands the Combined score cell to also list the four component ratios that feed it (`hot`, `cold`, `load`, `storage`), each as the relative `×N` ratio (`n/a` when a component doesn't apply).
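The exclusion rules from commit 2 can be sketched as a small lookup plus a baseline that ignores excluded systems. This is an illustrative Python sketch, not the PR's `metricExcludes()` implementation: the tag names "stateless" and "in-memory", the field names `tags`, `data_size` and `cold_times`, and both function names are assumptions.

```python
LUKEWARM_TAG = "lukewarm-cold-run"

# Which tags disqualify a system from which metric, per the PR description.
# The tag names "stateless" and "in-memory" are assumed for this sketch.
EXCLUDING_TAGS = {
    "load": {"stateless", "in-memory"},
    "cold": {"in-memory", LUKEWARM_TAG},
    "combined": {"in-memory"},
    "size": set(),
    "hot": set(),
}


def metric_excludes(system, metric):
    """Return True when `metric` should not be computed or ranked for `system`."""
    if metric == "size" and not system.get("data_size"):
        return True  # missing data size
    return bool(EXCLUDING_TAGS.get(metric, set()) & set(system.get("tags", [])))


def cold_baseline(systems, query_index):
    """Fastest cold time for one query among systems that do true cold runs."""
    times = [
        s["cold_times"][query_index]
        for s in systems
        if not metric_excludes(s, "cold") and s["cold_times"][query_index] is not None
    ]
    return min(times) if times else None  # None signals an empty baseline


# Hypothetical systems: only the first performs a true cold run.
systems = [
    {"name": "true-cold", "tags": [], "data_size": 100, "cold_times": [2.0, 3.0]},
    {"name": "lukewarm", "tags": [LUKEWARM_TAG], "data_size": 100, "cold_times": [0.1, 0.2]},
    {"name": "memory", "tags": ["in-memory"], "cold_times": [0.05, 0.05]},
]
print(cold_baseline(systems, 0))
```

Returning `None` for an empty baseline lets the caller skip the cold component instead of dividing by a missing value.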