feat: add granular repartition metrics #21152

gene-bordegaray wants to merge 2 commits into apache:main
Conversation
I have some concerns about these low-level (kernel-profiling) metrics, so I'm sharing a few suggestions. (Not trying to block this, given it is useful for solving real problems; just offering additional perspectives and possible improvements.) Metrics are typically used for query tuning at the application level, while these low-level ones are mainly for internal debugging and are used less frequently. They may also introduce execution overhead that is hard to observe, and they add maintenance overhead. In general, it might be better to keep metrics that directly help application tuning, are frequently used, or are difficult to capture with external profilers. I suspect some of these can be observed directly with profilers/flamegraphs; maybe they can be simplified? Additionally, we could consider introducing a new analyze level for metrics like these.
run benchmarks
🤖 Benchmark completed (GKE): tpch — base (merge-base) vs. branch

🤖 Benchmark completed (GKE): tpcds — base (merge-base) vs. branch

🤖 Benchmark completed (GKE): clickbench_partitioned — base (merge-base) vs. branch
Yeah, I agree, and I am trying to be wary of this as well. I don't see any changes on the benches, but it may be hard to catch micro-regressions. I am using these metrics and they are quite useful (and very convenient compared to a profiling tool like xctrace). Maybe they could be kept at the 'dev' level of analyze output?
Which issue does this PR close?
Rationale for this change
See issue
What changes are included in this PR?
Added these metrics:
- fetch_time: unchanged
- repartition_time: now the end-to-end total repartition time
- route_time: the time to distribute row indices to output partitions
- batch_build_time: the time to build the record batches
- channel_wait_time: per output partition, the time waiting for channel capacity / send(...) to complete
- spill_write_time: per output partition, the time writing spilled batches
- spill_read_wait_time: per output partition, the time the consumer side waits for a spilled batch to become readable

Are these changes tested?
Added smoke tests
I have also run tpch, tpch10, tpcds, clickbench_1, and clickbench_partitioned on both this branch and main and see no regressions. (I renamed my branch to separate the metrics and the microbench I created into two PRs; I will publish the microbench if these metrics are merged.)
Bench Results
Are there any user-facing changes?
Users will see new metrics in EXPLAIN ANALYZE output; this is not an API change.