From bf95ce86d6a2b086e54bdb1cded833c20aa619ef Mon Sep 17 00:00:00 2001
From: "Yiqun (Ethan) Zhang"
Date: Tue, 24 Feb 2026 00:10:33 -0600
Subject: [PATCH] Add PBench 1.2.1 blog post

---
 website/blog/2026-02-24-pbench-1.2.md | 505 ++++++++++++++++++++++++++
 1 file changed, 505 insertions(+)
 create mode 100644 website/blog/2026-02-24-pbench-1.2.md

diff --git a/website/blog/2026-02-24-pbench-1.2.md b/website/blog/2026-02-24-pbench-1.2.md
new file mode 100644
index 000000000..8f8611989
--- /dev/null
+++ b/website/blog/2026-02-24-pbench-1.2.md
@@ -0,0 +1,505 @@
+---
+title: "PBench 1.2.1: End-to-End Benchmarking and Performance Testing for Presto"
+author: Ethan Zhang
+authorURL: https://github.com/ethanyzhang
+---
+
+Benchmarking a distributed SQL engine like Presto involves much more than running a few queries and recording wall-clock times. Real-world performance evaluation demands multi-phase test execution, concurrent workloads, production traffic replay, and deep offline analysis. [PBench](https://github.com/prestodb/pbench) is a purpose-built benchmarking toolkit for Presto that handles all of this through a declarative, composable stage system. With the 1.2.1 release, PBench becomes significantly more robust and flexible: it adds parallel stream execution, dynamic query generation, richer script integration, and a fully data-driven cluster configuration generator, while continuing to provide a unified interface for standard benchmarks, A/B testing, and production workload analysis. In this post, we'll walk through implementing a spec-compliant TPC-DS benchmark with PBench, performing deep offline analysis with `pbench loadjson`, and extending the same building blocks to real-world workflows like schema capture, A/B testing, and traffic replay.
+
+## What's New in PBench 1.2.1
+
+The key new capabilities that enable the workflows described in this post:
+
+- **Parallel stream execution**: the `stream_count` parameter runs N parallel instances of a stage, each with a deterministically derived random seed, mapping directly to the TPC-DS throughput test model
+- **Directory expansion in `query_files`**: entries can now point to directories, expanded to contained SQL files at execution time (after `pre_stage_scripts`), enabling dynamic query generation workflows like `dsqgen`
+- **`no_random_duplicates`**: shuffled random execution that cycles through all queries before repeating, useful for throughput tests requiring full query coverage
+- **Shell script environment variables**: `PBENCH_STAGE_ID`, `PBENCH_OUTPUT_DIR`, `PBENCH_QUERY_FILE`, `PBENCH_QUERY_ID`, etc. are injected into all script hooks
+- **Data-driven [`genconfig`](https://github.com/prestodb/pbench/wiki/Generating-Benchmark-Configurations)**: generalized to use generic maps and templates with arithmetic/string functions, so adding new cluster configuration fields requires only JSON and template changes
+
+The full release notes are available for [1.2](https://github.com/prestodb/pbench/releases/tag/v1.2) and [1.2.1](https://github.com/prestodb/pbench/releases/tag/v1.2.1).
+
+## Implementing a Full TPC-DS Benchmark with PBench
+
+The [TPC-DS specification](https://www.tpc.org/tpcds/) defines a multi-phase benchmark: data loading, a sequential power test, a concurrent throughput test, data maintenance operations, and a second throughput test. PBench's [DAG-based stage system](https://github.com/prestodb/pbench/wiki/Configuring-PBench) maps naturally to this structure. Let's walk through how to implement the complete TPC-DS benchmark lifecycle.
+
+### The Stage DAG
+
+PBench benchmarks are defined as JSON stage files. Each stage specifies queries to run, session settings, and optionally a `next` field that points to child stages.
+Child stages execute in parallel after the parent completes, forming a DAG. Settings like `catalog`, `schema`, and `session_params` are inherited by child stages unless overridden.
+
+The overall DAG looks like this:
+
+```
+data_load (CREATE TABLE, CTAS or INSERT SELECT, ANALYZE)
+  └→ power_test (99 queries, sequential)
+       └→ throughput_test_1 (N parallel streams)
+            └→ data_maintenance (INSERT/DELETE operations)
+                 └→ throughput_test_2 (N parallel streams)
+```
+
+### Phase 1: Data Loading
+
+In TPC-DS, raw data is pre-generated as CSV flat files by the `dsdgen` tool; this happens outside PBench. The loading phase creates tables in your target format and populates them. A typical approach is to create external tables on the CSV files, then use CREATE TABLE AS SELECT (CTAS) or INSERT SELECT to load the data into Iceberg or Hive Parquet tables with proper type casting, partitioning schemes, and compression:
+
+```json
+{
+  "id": "data_load",
+  "description": "Create tables, load from CSV source, and gather statistics",
+  "catalog": "iceberg",
+  "schema": "tpcds_sf1000_parquet",
+  "query_files": [
+    "./ddl/",
+    "./data_loading/",
+    "./analyze/"
+  ],
+  "abort_on_error": true,
+  "next": ["power_test.json"]
+}
+```
+
+Since `query_files` supports directories as of PBench 1.2.1, each directory is expanded to its contained SQL files in sorted order. Here `ddl/` holds the CREATE TABLE statements, `data_loading/` holds the INSERT SELECT or CTAS statements that populate the tables from the CSV-backed source, and `analyze/` holds the ANALYZE statements for gathering column statistics. The directories are processed in the order listed, so the natural separation also gives you the correct execution sequence.
+
+When benchmarking across different table formats (Iceberg vs. Hive), partitioning schemes, or compression methods, the number of DDL scripts and configurations can quickly become unwieldy.
+PBench includes [`genddl`](https://github.com/prestodb/pbench/wiki/Command-Reference#pbench-genddl) and [`genconfig`](https://github.com/prestodb/pbench/wiki/Generating-Benchmark-Configurations) helper commands that generate these scripts and cluster configurations from templates, reducing the chance of human error when managing many variations.
+
+### Phase 2: Power Test
+
+The power test runs all 99 TPC-DS queries sequentially on a single stream. There are two approaches depending on your use case.
+
+**For development and regression testing**, you can use a fixed set of pre-generated queries with known expected row counts. This is useful during iterative development where you want fast feedback on correctness:
+
+```json
+{
+  "id": "power_test",
+  "description": "TPC-DS Power Test: 99 pre-generated queries with row count validation",
+  "query_files": [
+    "queries/query_01.sql",
+    "queries/query_02.sql",
+    "...",
+    "queries/query_99.sql"
+  ],
+  "cold_runs": 1,
+  "save_json": true,
+  "expected_row_counts": {
+    "tpcds_sf1000": [100, 2520, 9, "..."]
+  },
+  "next": ["throughput_test_1.json"]
+}
+```
+
+The `expected_row_counts` field validates that each query returns the correct number of rows for the given scale factor, catching silent correctness regressions. For row-by-row correctness checking, set `save_output: true` to write the full query result to disk; you can then use `pbench cmp` to diff outputs between runs. Setting `save_json: true` captures the full Presto query JSON for each query, which we can later load into a database for deep analysis of the detailed metrics (more on this below).
+
+**For a spec-compliant TPC-DS run**, queries are generated on the fly by `dsqgen` with a specific random seed, producing a distinct query set for each seed.
+We use a pre-stage script to invoke `dsqgen` and point `query_files` at the output directory:
+
+```json
+{
+  "id": "power_test",
+  "description": "TPC-DS Power Test: queries generated by dsqgen",
+  "pre_stage_scripts": ["./scripts/generate_power_queries.sh"],
+  "query_files": ["./generated_queries/power/"],
+  "cold_runs": 1,
+  "save_json": true,
+  "next": ["throughput_test_1.json"]
+}
+```
+
+The `generate_power_queries.sh` script invokes `dsqgen` to produce the 99 queries into the directory that PBench will discover:
+
+```bash
+#!/bin/bash
+SCALE_FACTOR=1000
+SEED=12345
+OUTPUT_DIR="./generated_queries/power"
+
+mkdir -p "$OUTPUT_DIR"
+dsqgen \
+  -DIRECTORY ../query_templates \
+  -INPUT ../query_templates/templates.lst \
+  -SCALE "$SCALE_FACTOR" \
+  -RNGSEED "$SEED" \
+  -DIALECT presto \
+  -OUTPUT_DIR "$OUTPUT_DIR"
+```
+
+Since `query_files` supports directories as of PBench 1.2.1, the generated SQL files are automatically discovered and executed in sorted order after the pre-stage script completes. In this mode there are no expected row counts: the queries are fresh from the generator and the focus is on performance measurement rather than regression checking.
+
+### Phase 3: Throughput Test
+
+The TPC-DS throughput test runs N concurrent query streams, each executing all 99 queries in a different permutation order (defined in the spec's Appendix D). There are two ways to model this in PBench.
+
+**Option A: Explicit streams via DAG.** Define each stream as a separate stage file with its own query ordering, and fan them out from a parent stage:
+
+```json
+{
+  "id": "throughput_test_1",
+  "description": "TPC-DS Throughput Test: 4 concurrent streams",
+  "next": [
+    "streams/stream_01.json",
+    "streams/stream_02.json",
+    "streams/stream_03.json",
+    "streams/stream_04.json"
+  ]
+}
+```
+
+Each stream file (e.g., [`stream_01.json`](https://github.com/prestodb/pbench/blob/main/benchmarks/tpc-ds/streams/stream_01.json)) lists the 99 queries in the spec-defined order for that stream and sets `start_on_new_client: true` so each stream gets its own Presto session. A JSON object cannot repeat the `next` key, so the hand-off to data maintenance is declared in the stream files rather than in the parent; this assumes PBench starts a stage that several parents point to only after all of them have finished:
+
+```json
+{
+  "start_on_new_client": true,
+  "query_files": [
+    "../queries/query_96.sql",
+    "../queries/query_07.sql",
+    "../queries/query_75.sql",
+    "..."
+  ],
+  "next": ["../data_maintenance.json"]
+}
+```
+
+PBench ships with [21 pre-built stream orderings](https://github.com/prestodb/pbench/tree/main/benchmarks/tpc-ds/streams) matching the TPC-DS Appendix D specification.
+
+**Option B: `stream_count` for randomized throughput.** New in PBench 1.2.1, if you don't need the exact spec-defined orderings, you can use `stream_count` to spin up N parallel instances of a single stage:
+
+```json
+{
+  "id": "throughput_test_1",
+  "description": "4 concurrent random streams, no duplicates within each stream",
+  "stream_count": 4,
+  "random_execution": true,
+  "randomly_execute_until": "99",
+  "no_random_duplicates": true,
+  "query_files": ["queries/query_01.sql", "...", "queries/query_99.sql"],
+  "next": ["data_maintenance.json"]
+}
+```
+
+Each stream gets a deterministic seed derived from the base seed (`seed + stream_index * 1000`), so the entire run is reproducible from a single `--seed` value. The `no_random_duplicates` flag ensures each stream cycles through all 99 queries before repeating.
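The seed derivation and the no-duplicates behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not PBench's implementation: PBench is written in Go and its random generator will produce different orderings. The names `stream_seed` and `stream_order` are hypothetical, the zero-based stream index is an assumption, and only the `seed + stream_index * 1000` formula comes from the text.

```python
import random
from typing import List


def stream_seed(base_seed: int, stream_index: int) -> int:
    """Derive a per-stream seed with the formula quoted in the post."""
    return base_seed + stream_index * 1000


def stream_order(queries: List[str], base_seed: int, stream_index: int, total: int) -> List[str]:
    """Return `total` queries for one stream, reshuffling only after a full pass.

    Illustrates the idea behind no_random_duplicates: every query appears
    once before any query is repeated.
    """
    rng = random.Random(stream_seed(base_seed, stream_index))
    order: List[str] = []
    while len(order) < total:
        batch = list(queries)  # copy so the caller's list is left untouched
        rng.shuffle(batch)
        order.extend(batch)
    return order[:total]


# Hypothetical file names; stream indexes are assumed to start at 0.
queries = [f"query_{i:02d}.sql" for i in range(1, 100)]
streams = [stream_order(queries, base_seed=42, stream_index=i, total=99) for i in range(4)]
```

Because each stream's generator is seeded from the base seed alone, calling `stream_order` again with the same arguments yields the same ordering, which is what makes a multi-stream run repeatable from one seed value.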
+
+### Phase 4: Data Maintenance
+
+The TPC-DS spec includes data maintenance operations (INSERTs and DELETEs) between throughput tests. The refresh data is generated by `dsdgen`, so we use a `pre_stage_scripts` hook to generate the refresh flat files and create external staging tables on them, then execute the maintenance SQL:
+
+```bash
+#!/bin/bash
+# scripts/generate_refresh_data.sh
+SCALE_FACTOR=1000
+UPDATE_SET=1
+REFRESH_DIR="./refresh_data"
+
+mkdir -p "$REFRESH_DIR"
+dsdgen -SCALE "$SCALE_FACTOR" -UPDATE "$UPDATE_SET" -DIR "$REFRESH_DIR"
+```
+
+This produces new-row flat files and delete-key files for each affected fact table. The script can also create external staging tables pointing to these files (or that SQL can be part of the queries). The maintenance queries then reference the staging tables; the composite delete keys are cast to `varchar` before concatenation because `||` joins strings, not integers:
+
+```json
+{
+  "id": "data_maintenance",
+  "description": "TPC-DS Data Maintenance: generate refresh data, then INSERT/DELETE",
+  "pre_stage_scripts": ["./scripts/generate_refresh_data.sh"],
+  "queries": [
+    "INSERT INTO catalog_sales SELECT * FROM catalog_sales_staging",
+    "INSERT INTO catalog_returns SELECT * FROM catalog_returns_staging",
+    "DELETE FROM catalog_sales WHERE CAST(cs_item_sk AS varchar) || '-' || CAST(cs_order_number AS varchar) IN (SELECT CAST(cs_item_sk AS varchar) || '-' || CAST(cs_order_number AS varchar) FROM catalog_sales_delete)",
+    "DELETE FROM catalog_returns WHERE CAST(cr_item_sk AS varchar) || '-' || CAST(cr_order_number AS varchar) IN (SELECT CAST(cr_item_sk AS varchar) || '-' || CAST(cr_order_number AS varchar) FROM catalog_returns_delete)"
+  ],
+  "next": ["throughput_test_2.json"]
+}
+```
+
+### Phase 5: Second Throughput Test
+
+The second throughput test is identical in structure to the first, running after data maintenance to measure performance on the modified dataset.
+Simply define another throughput stage with the same query set:
+
+```json
+{
+  "id": "throughput_test_2",
+  "description": "Post-maintenance throughput test",
+  "stream_count": 4,
+  "random_execution": true,
+  "randomly_execute_until": "99",
+  "no_random_duplicates": true,
+  "query_files": ["queries/query_01.sql", "...", "queries/query_99.sql"]
+}
+```
+
+### Running the Full Benchmark
+
+With the stage DAG defined, the entire multi-phase benchmark is launched with a single command:
+
+```bash
+# --mysql records per-query metrics for analysis and scoring
+pbench run \
+  --server-url http://presto-coordinator:8080 \
+  --name "tpcds_sf1000_%t" \
+  --output-path ./results \
+  --mysql mysql_config.json \
+  benchmarks/tpc-ds/sf1k.json benchmarks/tpc-ds/tpcds_full.json
+```
+
+The `sf1k.json` file sets the scale-factor-specific schema, and `tpcds_full.json` defines the DAG. PBench merges these, inheriting the schema across all stages. The `--mysql` flag records per-query metrics (duration, row count, success/failure, etc.) into a MySQL database, enabling you to analyze slow queries and compute the TPC-DS performance score from the recorded timings. Results are also written to local CSV files.
+
+## Offline Analysis with `pbench loadjson`
+
+Running benchmarks is only half the story; understanding *why* performance differs between runs requires deep query-level analysis. This is where `pbench loadjson` and the Presto event listener come in.
+
+### Capturing Query Details
+
+When PBench runs with `save_json: true`, it captures the full Presto query JSON (from the `/v1/query/{id}` API) for every query execution. These JSON files contain the complete query plan, operator statistics, stage-level metrics, and timing breakdowns.
+
+Independently, Presto's [event listener](https://prestodb.io/docs/current/develop/event-listener.html) can be configured to write query completion events to disk as JSON files.
+This captures *all* queries on a cluster, not just those from PBench, providing a complete picture of cluster activity during benchmarks.
+
+### Loading into a Database
+
+`pbench loadjson` processes these JSON files and loads them into MySQL tables for structured analysis:
+
+```bash
+pbench loadjson \
+  --mysql mysql_config.json \
+  --name "tpcds_sf1000_analysis" \
+  ./results/tpcds_sf1000/
+```
+
+When `--mysql` is provided to `pbench run`, run metadata (run name, timing, query durations, success/failure) is automatically recorded to the database. The detailed per-query metrics (operator stats, stage stats, query plans) come from a separate path: either `save_json: true` in PBench or a Presto [event listener](https://prestodb.io/docs/current/develop/event-listener.html) plugin that writes query JSON to disk. `pbench loadjson` then parses these JSON files and loads them into the database; it writes to the database only when `--mysql` is provided.
+
+This populates five tables (DDL is available in [`event_listener_ddl.sql`](https://github.com/prestodb/pbench/blob/main/stage/event_listener_ddl.sql)):
+
+- **`presto_query_creation_info`**: query text, catalog, schema, session properties, user, and resource group for each query
+- **`presto_query_statistics`**: top-level execution metrics: wall time, CPU time, peak memory, input/output rows and bytes, queued time, and failure info
+- **`presto_query_stage_stats`**: per-stage breakdown (each Presto execution stage, not PBench stages): CPU time, I/O, memory, and GC statistics
+- **`presto_query_operator_stats`**: per-operator metrics: CPU and wall time, memory reservations, input/output rows for each scan, join, aggregation, etc.
+- **`presto_query_plans`**: query plan in both text and JSON format for plan comparison across runs
+
+Together these tables let you drill down from a slow query all the way to the specific operator and stage responsible.
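As a rough illustration of that drill-down, the sketch below ranks the operators of one slow query by wall time. It uses an in-memory SQLite table as a stand-in for the MySQL database, and the table is cut down to four columns (`query_id`, `operator_type`, `wall_nanos`, `input_rows`) borrowed from the comparative queries in this post. The sample rows and the `slowest_operators` helper are made up for the example; the real schema lives in `event_listener_ddl.sql`.

```python
import sqlite3


def slowest_operators(conn, query_id, limit=3):
    """Rank the operator types of one query by total wall time in seconds."""
    cursor = conn.execute(
        """
        SELECT operator_type,
               SUM(wall_nanos) / 1e9 AS wall_sec,
               SUM(input_rows) AS input_rows
        FROM presto_query_operator_stats
        WHERE query_id = ?
        GROUP BY operator_type
        ORDER BY wall_sec DESC
        LIMIT ?
        """,
        (query_id, limit),
    )
    return cursor.fetchall()


# Stand-in database: SQLite instead of MySQL, reduced to four assumed columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE presto_query_operator_stats ("
    "query_id TEXT, operator_type TEXT, wall_nanos INTEGER, input_rows INTEGER)"
)
# Fabricated sample rows, for illustration only.
conn.executemany(
    "INSERT INTO presto_query_operator_stats VALUES (?, ?, ?, ?)",
    [
        ("q1", "TableScanOperator", 4_000_000_000, 1_000_000),
        ("q1", "HashBuilderOperator", 1_500_000_000, 200_000),
        ("q1", "TableScanOperator", 2_000_000_000, 500_000),
        ("q2", "AggregationOperator", 9_000_000_000, 10),
    ],
)
top = slowest_operators(conn, "q1")
```

Grouping by operator type first keeps the result short enough to read at a glance; the same query can be narrowed further by stage once the heaviest operator is known.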
+
+### Comparative Analysis
+
+With results from multiple PBench runs loaded into the same database, you can perform comparative analysis using SQL:
+
+```sql
+-- Compare query durations between two runs
+SELECT
+  a.query_file,
+  a.duration_ms AS baseline_ms,
+  b.duration_ms AS candidate_ms,
+  ROUND((b.duration_ms - a.duration_ms) / a.duration_ms * 100, 1) AS pct_change
+FROM pbench_queries a
+JOIN pbench_queries b
+  ON a.query_file = b.query_file
+  AND a.sequence_no = b.sequence_no
+WHERE a.run_id = (SELECT id FROM pbench_runs WHERE run_name = 'baseline_run')
+  AND b.run_id = (SELECT id FROM pbench_runs WHERE run_name = 'candidate_run')
+ORDER BY pct_change DESC;
+```
+
+```sql
+-- Find operators with the largest wall-time increase in the candidate build.
+-- Query IDs are unique per execution, so each run is aggregated by operator
+-- type first and the two aggregates are joined (run 1 = baseline, run 2 = candidate).
+SELECT
+  b.operator_type,
+  (b.wall_nanos - a.wall_nanos) / 1e9 AS wall_time_increase_sec,
+  b.input_rows - a.input_rows AS input_rows_increase
+FROM (
+  SELECT operator_type, SUM(wall_nanos) AS wall_nanos, SUM(input_rows) AS input_rows
+  FROM presto_query_operator_stats
+  WHERE query_id IN (SELECT query_id FROM pbench_queries WHERE run_id = 1)
+  GROUP BY operator_type
+) a
+JOIN (
+  SELECT operator_type, SUM(wall_nanos) AS wall_nanos, SUM(input_rows) AS input_rows
+  FROM presto_query_operator_stats
+  WHERE query_id IN (SELECT query_id FROM pbench_queries WHERE run_id = 2)
+  GROUP BY operator_type
+) b
+  ON a.operator_type = b.operator_type
+ORDER BY wall_time_increase_sec DESC;
+```
+
+This turns PBench + event listener data into a queryable performance analysis platform. You can identify exactly which operators regressed, which stages are bottlenecked, and how resource consumption changed between Presto versions.
+
+## Beyond Standard Benchmarks: Real-World Performance Testing
+
+Standard benchmarks like TPC-DS are valuable, but production workloads often behave very differently. PBench provides three tools for testing with real-world query patterns.
+
+### Capturing Production Schemas with `pbench save`
+
+Before you can replay production queries against a test cluster, you need to reproduce the schema.
+`pbench save` exports table metadata, column statistics, and partition information from a live cluster:
+
+```bash
+pbench save \
+  --server-url http://production-presto:8080 \
+  --catalog hive \
+  --schema production_db \
+  --output-path ./saved_schemas \
+  --parallel 8 \
+  customers orders transactions
+```
+
+This generates one JSON file per table, named `{catalog}_{schema}_{table}.json` (e.g., `hive_production_db_orders.json`). Each file captures the full DDL and column-level statistics:
+
+```json
+{
+  "name": "orders",
+  "catalog": "hive",
+  "schema": "production_db",
+  "ddl": "CREATE TABLE hive.production_db.orders (\n \"order_id\" bigint,\n \"customer_id\" bigint,\n \"order_date\" date,\n \"total_amount\" decimal(12,2),\n \"status\" varchar,\n \"region\" varchar\n)\nWITH (\n format = 'PARQUET',\n partitioned_by = ARRAY['region']\n)",
+  "columnStats": [
+    {
+      "column_name": "order_id",
+      "distinct_values_count": 5000000,
+      "nulls_fraction": 0,
+      "low_value": "1",
+      "high_value": "5000000",
+      "data_type": "bigint"
+    },
+    {
+      "column_name": "total_amount",
+      "distinct_values_count": 48923,
+      "nulls_fraction": 0.002,
+      "low_value": "0.99",
+      "high_value": "9999.99",
+      "data_type": "decimal(12,2)"
+    },
+    {
+      "column_name": "region",
+      "data_size": 42000000,
+      "distinct_values_count": 12,
+      "nulls_fraction": 0,
+      "data_type": "varchar",
+      "extra": "partition key"
+    },
+    {"row_count": 5000000}
+  ],
+  "rowCount": 5000000
+}
+```
+
+This gives you the complete DDL to recreate the table on a test cluster, along with column statistics (cardinality, null fractions, min/max values, data sizes) that inform the query optimizer.
+For bulk capture, you can pass a CSV file listing all tables:
+
+```bash
+pbench save \
+  --server-url http://production-presto:8080 \
+  -f tables.csv \
+  --output-path ./saved_schemas
+```
+
+### A/B Testing with `pbench forward`
+
+`pbench forward` monitors a source Presto cluster and mirrors every incoming query to one or more target clusters in real time. This enables transparent A/B testing between Presto versions or configurations without modifying any client applications:
+
+```bash
+pbench forward \
+  --server-url http://current-presto:8080 \
+  --server-url http://candidate-presto:8080 \
+  --poll-interval 5s \
+  --exclude "^(EXPLAIN|DESCRIBE|SHOW)" \
+  --schema-mapping prod_schema,test_schema \
+  --output-path ./forward_results \
+  --name "v0.286_vs_v0.287"
+```
+
+This forwards all queries from the current cluster to the candidate, excluding `EXPLAIN`, `DESCRIBE`, and `SHOW` statements. The `--schema-mapping` flag handles cases where the test cluster uses a different schema name. You can add multiple `--server-url` targets to test more than two configurations simultaneously.
+
+`pbench forward` also supports query rewriting via regex patterns, useful when table names or function names differ between versions:
+
+```bash
+pbench forward \
+  --server-url http://source:8080 \
+  --server-url http://target:8080 \
+  --replace "old_udf\((.*?)\)" "new_udf(\1)" \
+  --replace "legacy_table" "migrated_table"
+```
+
+### Traffic Replay with `pbench replay`
+
+While `forward` mirrors live traffic, `replay` replays recorded traffic from a CSV file. This is useful for reproducible testing: record a production workload once, then replay it against multiple cluster configurations:
+
+```bash
+pbench replay \
+  --server-url http://test-presto:8080 \
+  --parallel 150 \
+  --name "peak_hour_replay" \
+  --output-path ./replay_results \
+  workload_capture.csv
+```
+
+The CSV file contains query metadata captured from production (query text, original timing, catalog, schema, session properties).
+PBench replays queries with the original inter-query timing to simulate realistic load patterns, and respects the parallelism limit to avoid overwhelming the test cluster.
+
+### An End-to-End Real-World Testing Workflow
+
+Here's how these tools come together for a complete Presto version upgrade validation:
+
+**Step 1: Capture the baseline.** Save production schemas and run standard benchmarks against the current version:
+
+```bash
+# Export schemas
+pbench save --server-url http://current:8080 \
+  --catalog hive --schema prod -f all_tables.csv \
+  --output-path ./schemas
+
+# Run TPC-DS baseline
+pbench run --server-url http://current:8080 \
+  --name "baseline_tpcds_%t" --mysql mysql.json \
+  --output-path ./results \
+  benchmarks/tpc-ds/sf1k.json benchmarks/tpc-ds/tpcds_full.json
+```
+
+**Step 2: Live A/B testing.** Forward production traffic to the candidate cluster and collect real-world comparison data:
+
+```bash
+pbench forward \
+  --server-url http://current:8080 \
+  --server-url http://candidate:8080 \
+  --exclude "^(EXPLAIN|DESCRIBE|SHOW)" \
+  --output-path ./forward_results \
+  --name "upgrade_ab_test"
+```
+
+**Step 3: Replay and compare.** Take a recorded peak-hour workload and replay it against both clusters:
+
+```bash
+pbench replay --server-url http://current:8080 \
+  --name "peak_replay_baseline" workload.csv
+
+pbench replay --server-url http://candidate:8080 \
+  --name "peak_replay_candidate" workload.csv
+```
+
+**Step 4: Run TPC-DS on the candidate.** Run the same standard benchmark on the new version:
+
+```bash
+pbench run --server-url http://candidate:8080 \
+  --name "candidate_tpcds_%t" --mysql mysql.json \
+  --output-path ./results \
+  benchmarks/tpc-ds/sf1k.json benchmarks/tpc-ds/tpcds_full.json
+```
+
+**Step 5: Deep analysis.** Load all query JSON files into the database and compare:
+
+```bash
+# Load the saved query JSON from both runs
+pbench loadjson --mysql mysql.json \
+  --name "baseline" ./results/baseline_tpcds/
+
+pbench loadjson --mysql mysql.json \
+  --name "candidate" ./results/candidate_tpcds/
+```
+
+Now you can query the MySQL database to compare operator-level metrics, identify regressions, and validate that the candidate version meets your performance bar, all without writing a single line of custom tooling.
+
+**Step 6: Validate correctness.** Use `pbench cmp` to diff query outputs between the two runs:
+
+```bash
+pbench cmp \
+  ./results/baseline_tpcds/ \
+  ./results/candidate_tpcds/ \
+  --output-path ./diffs
+```
+
+This generates diffs for any queries that returned different results, catching correctness regressions that performance metrics alone would miss.
+
+## Getting Started
+
+Check out the [PBench 1.2.1 release](https://github.com/prestodb/pbench/releases/tag/v1.2.1) for pre-built binaries for macOS and Linux, or build from source:
+
+```bash
+git clone https://github.com/prestodb/pbench.git
+cd pbench
+make install
+```
+
+The [PBench Wiki](https://github.com/prestodb/pbench/wiki) has comprehensive documentation on configuring stages, setting up result databases, and writing benchmark suites.
+
+Whether you're running standard TPC-DS benchmarks, validating a Presto version upgrade with production traffic, or building a continuous performance regression pipeline, PBench 1.2.1 provides the building blocks to do it declaratively and reproducibly. We welcome contributions; check out the [GitHub repository](https://github.com/prestodb/pbench) to get involved.